to optimize data curation for AI.
All machine learning models are bound by one critical factor: the quality of the data on which they are trained.
Curating data to improve machine learning and AI models is a challenging task. A 2021 MIT research study found systemic issues in how training data was labeled, leading to inaccurate outcomes in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 prior investigations into data labeling found that 41% of models were using datasets that had been labeled by humans.
Among the vendors trying to tackle the challenge of optimizing data curation for AI is a Swiss startup, Lightly. Founded in 2019, the company announced this week that it has raised $3 million in a seed round of funding. Lightly isn’t looking to be a data-labeling vendor, however. The company is instead looking to curate data with a self-supervised model of machine learning that could eventually reduce the need for data labeling altogether.
“I am still amazed at how much of machine learning’s work is manual, tedious and not automated,” Matthias Heller, cofounder at Lightly, said to VentureBeat. “People believe machine learning is so advanced. But machine learning and deep learning are still very young technologies and much of the infrastructure and tooling is only now .”
A growing market for data curation and data labeling
There is no shortage of money or vendors in the market for optimizing machine learning data through data curation and data labeling.
For example, Defined.ai, which was known as DefinedCrowd before rebranding in 2021, has raised $78 million to date to help advance its data curation vision.
Grand View Research forecasts that the data labeling market will reach $8.2 billion by 2028, with a projected compound annual growth rate of 24.6% between 2021 and 2028. VentureBeat’s own list of the top data labeling software vendors includes Appen’s Figure Eight, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop and V7’s Darwin.
Other popular vendors include Labelbox and the open-source Label Studio. Both can be integrated with Lightly’s technology. Lightly’s open approach means that users can integrate the company’s technology with any labeling vendor.
How the self-supervised model works
Three years ago, Heller and his cofounder Igor Susmelj were working on a machine learning project which required them to label their data.
“We were always curious if the data we were labeling actually improves the model,” Heller said.
That question led to Lightly, which also maintains a number of open-source projects. The primary project is the Lightly library, which provides a self-supervised approach to machine learning on images.
There are many ways to train machine learning models on data, Heller explained. In a supervised approach, such as with com