Toloka

Type of site	Platform
Founded	2014
Owner	Nebius Group
Founder(s)	Olga Megorskaya
Industry	Artificial intelligence; Information technology
URL	toloka.ai

Toloka, based in Amsterdam, is a crowdsourcing and generative AI services provider.^[1]

The company supports artificial intelligence development from training through evaluation and provides services related to generative artificial intelligence and large language models.^[3]^[4]

History

Toloka was founded in 2014 by Olga Megorskaya, a member of the board of directors of Yandex, as a crowdsourcing and microtasking platform.^[5] Its primary purpose was data labeling to improve machine learning and search algorithms.

As generative AI evolved, the platform adapted to provide expert data labeling to developers of generative AI applications.^[6]

In 2024, the company's Russian operations were sold to Russian investors.^[4]^[7]

Services

Generative AI

In the generative AI domain, Toloka provides services such as model fine-tuning, reinforcement learning from human feedback, evaluation, and ad hoc datasets, all of which require large volumes of annotation by highly skilled experts.

Machine learning

On Toloka, trainers are tasked with identifying the presence or absence of objects in content, as specified by algorithms.^[5]^[8] They also assess chatbot responses within given dialogues for relevance and engagement.^[9] Additionally, translation verification tasks involve evaluating the accuracy of translations from multiple annotators. For the fine-tuning of large language models (LLMs), experts are required to generate and provide context-based prompts that can be single-turn or multi-turn, serving various domains and purposes.
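As an illustration only: the cited sources do not describe how Toloka combines answers, but when several annotators judge the same item (for example, whether an object is present or absent), a common baseline is majority voting. The following is a minimal Python sketch under that assumption; the function name `majority_vote`, the task and annotator identifiers, and the labels are all hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Aggregate (task_id, annotator_id, label) triples into one label per task.

    Returns a dict mapping task_id -> (winning_label, share_of_votes).
    Ties go to the label that was seen first for that task.
    """
    votes = defaultdict(Counter)
    for task_id, _annotator_id, label in judgments:
        votes[task_id][label] += 1

    result = {}
    for task_id, counter in votes.items():
        # most_common(1) returns the label with the highest count
        label, count = counter.most_common(1)[0]
        result[task_id] = (label, count / sum(counter.values()))
    return result


# Hypothetical judgments: three annotators label two images.
judgments = [
    ("img1", "a", "present"),
    ("img1", "b", "present"),
    ("img1", "c", "absent"),
    ("img2", "a", "absent"),
    ("img2", "b", "absent"),
    ("img2", "c", "absent"),
]
aggregated = majority_vote(judgments)
```

The vote share returned alongside each label gives a rough agreement score; items with a low share could be sent to additional annotators. More elaborate aggregation models weight annotators by estimated reliability, which this sketch does not attempt.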

Natural language processing

In the natural language processing (NLP) domain, Toloka facilitates optical character recognition and classification, sentiment analysis, named-entity recognition, and search relevance evaluation. It also provides transcription and classification of audio data.^[5]

Annotators

Toloka mainly works with domain experts, such as physicists, scientists, lawyers, and software engineers, to develop specialized data for models targeting niche tasks.^[1] Toloka also works with freelancers, referred to as "Tolokers," who annotate and create data for diverse applications.^[1] They perform tasks such as labeling personally identifiable information for AI projects, translating content, summarizing information, and transcribing audio to text.^[1]

Upon completion of each task, the performer receives a reward based on the volume of images, videos, and unstructured text processed.^[5]

Research

In May 2019, Toloka's research team began publishing datasets for non-commercial and academic purposes to support the scientific community and attract researchers to Toloka. These datasets are aimed at researchers in fields such as linguistics, computer vision, testing of result aggregation models, and chatbot training.^[10]

Toloka research has been showcased at a range of conferences, including the Conference on Neural Information Processing Systems (NeurIPS),^[10] the International Conference on Machine Learning (ICML)^[11] and the International Conference on Very Large Data Bases (VLDB).^[12]

In February 2024, Toloka conducted a tutorial at the AAAI Conference on Artificial Intelligence on aligning large language models to low-resource languages.^[13]

The company participated in BigCode, a joint scientific initiative led by Hugging Face and ServiceNow, where it served as the primary data partner.^[14]

Controversies

Enabling arrests of protesters via facial recognition software (March 2024)

In March 2024, Toloka's Russian division was criticized for helping develop the facial recognition software used by Russia to track and arrest protesters after the death of Alexei Navalny.^[15] The company's Russian operations were sold in July 2024.

References

[1] Shrivastava, Rashi (July 24, 2024). "The Internet Isn't Big Enough To Train AI. One Fix? Fake Data". Forbes.
[2] Sacolick, Isaac (April 8, 2024). "How to test large language models". InfoWorld.
[3] "AI development from training to evaluation". Bloomberg News. July 16, 2024.
[4] Sawers, Paul (July 21, 2024). "From Yandex's ashes comes Nebius, a 'startup' with plans to be a European AI compute leader". TechCrunch.
[5] Woodie, Alex (April 27, 2021). "Toloka Expands Data Labeling Service". Datanami.
[6] Baidakova, Daria (September 29, 2021). "Data-Labeling Instructions: Gateway to Success in Crowdsourcing and Enduring Impact on AI". Data Science Central.
[7] "Yandex founder to build AI business in Europe after Russia exit". Financial Times. July 16, 2024.
[8] Bussler, Frederik (December 7, 2021). "Data labeling will fuel the AI revolution". VentureBeat.
[9] Gandharv, Kumar (April 29, 2021). "Why Are Data Labelling Firms Eyeing Indian Market?". Analytics India Magazine.
[10] "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". FE News. November 18, 2021.
[11] "Toloka". icml.cc.
[12] "VLDB 2021 Challenge". crowdscience.ai.
[13] "The 38th Annual AAAI Conference on Artificial Intelligence". AAAI Conference on Artificial Intelligence.
[14] "BigCode Governance Card". arXiv.
[15] "Dutch Yandex subsidiary helping Russia with facial recognition software". NL Times. March 27, 2024.

External links

Official website
