Data annotation, or the process of adding labels to images, text, audio, and other forms of data samples, is typically a key step in the development of AI systems. The vast majority of systems learn to make predictions by associating labels with specific data samples, such as the caption “bear” with a photo of a black bear. A system trained on many labeled examples of different types of contracts, for example, would eventually learn to distinguish between those contracts and even extrapolate to contracts it had never seen before.
The problem is that annotation is a laborious, manual process that has historically been assigned to gig workers on platforms like Amazon Mechanical Turk. But with the growing interest in AI – and the data used to train that AI – an entire industry has grown up around annotation and labeling tools.
Dataloop, one of several startups vying to gain a foothold in the nascent market, today announced it has raised $33 million in a Series B funding round led by Nokia Growth Partners (NGP) Capital and Alpha Wave Global. Dataloop develops software and services to automate aspects of data preparation, with the goal of saving time in the AI system development process.
“I worked at Intel for over 13 years, and that’s where I met Dataloop’s second co-founder and CPO, Avi Yashar,” Dataloop CEO Eran Shlomo told TechCrunch in an email interview. “With Avi, I left Intel and founded Dataloop. Nir [Buschi], our CBO, joined us as the third co-founder, after having held management positions [at] technology companies and [led] business and go-to-market in venture-backed startups.”
Dataloop initially focused on data annotation for computer vision and video analytics. But in recent years, the company has added new tools for text, audio, form, and document data and allowed customers to integrate custom data applications developed in-house.
Among the most recent additions to the Dataloop platform are data management dashboards for unstructured data. (As opposed to structured data, or data organized in a standardized format, unstructured data is not organized according to a common pattern or schema.) Each provides tools for data versioning and metadata search, as well as a query language for querying datasets and visualizing sample data.
“All AI models are learned from humans through the process of data labeling. The labeling process is essentially a knowledge encoding process in which a human teaches the rules to the machine using sample data positives and negatives,” Shlomo said. “The primary goal of every AI application is to create the ‘data flywheel effect’ using its customers’ data: a better product leads to more users, more data and therefore a better product.”
Dataloop competes with heavyweights in the data annotation and labeling space, including Scale AI, which has raised over $600 million in venture capital. Labelbox is another major rival, having recently raised more than $110 million in a funding round led by SoftBank. Beyond the realm of startups, tech giants including Google, Amazon, Snowflake, and Microsoft offer their own data annotation services.
Dataloop must be doing something right. Shlomo says the company currently has “hundreds” of customers in retail, agriculture, robotics, autonomous vehicles and construction, though he declined to reveal revenue figures.
An open question is whether Dataloop’s platform solves some of the major challenges that exist in data labeling today. Last year, a paper published by MIT revealed that data labeling tends to be very inconsistent, which could affect the accuracy of AI systems. A growing body of academic research suggests that annotators introduce their own biases when labeling data – for example, labeling sentences in African American English (a modern dialect spoken primarily by Black Americans) as more toxic than their General American English equivalents. These biases often manifest themselves in unfortunate ways; think about moderation algorithms that are more likely to ban Black users than white users.
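One common way to put a number on that kind of inconsistency is an inter-annotator agreement statistic. The sketch below computes Cohen's kappa for two annotators who labeled the same items. It is an illustration only: it is not taken from the MIT paper or from Dataloop's platform, and the function name `cohens_kappa` and the sample toxicity labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight sentences.
annotator_1 = ["toxic", "ok", "ok", "toxic", "ok", "ok", "toxic", "ok"]
annotator_2 = ["toxic", "ok", "toxic", "toxic", "ok", "ok", "ok", "ok"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

A kappa of 1 means perfect agreement and a value near 0 means the annotators agree about as often as chance would predict. In this made-up sample the two annotators match on six of eight items, yet kappa lands well below 1 because much of that overlap is expected by chance.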
Data labelers are also notoriously underpaid. Annotators who contributed captions to ImageNet, one of the best-known open-source computer vision datasets, reportedly earned a median wage of $2 per hour.
Shlomo says it’s up to the companies using Dataloop’s tools to drive change, not necessarily Dataloop itself.
“We view the underpayment of annotators as a market failure. Data annotation shares many qualities with software development, one of which is the impact of talent on productivity,” Shlomo said. “[As for bias,] bias in AI starts with the question the AI developer chooses to ask and the instructions they provide to labeling companies. We call it the ‘primary bias.’ For example, you will never be able to identify color bias unless you ask for skin color in your labeling recipe. The primary bias issue is something that industry and regulators should address. Technology alone will not solve the problem.”
To date, Dataloop, which has 60 employees, has raised $50 million in venture capital. The company plans to increase its workforce to 80 employees by the end of the year.