Is it possible to train an AI on data generated only by another AI? It might sound like a harebrained idea, but it's one that's been around for quite some time, and as new, real data becomes increasingly hard to come by, it's been gaining traction.
Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is said to be sourcing synthetic training data from o1, its "reasoning" model, for the upcoming Orion.
But why does AI need data in the first place, and what kind of data does it need? And can that data really be replaced by synthetic data?
The role of annotations
AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions, such as that "to whom" in an email typically precedes "it may concern."
Annotations, typically text labeling the meaning or parts of the data these systems ingest, are a key piece of these examples. They serve as guideposts, "teaching" a model to distinguish between things, places, and ideas.
Consider a photo-classifying model trained on lots of images of kitchens labeled with the word "kitchen." As it trains, the model begins to associate "kitchen" with general characteristics of kitchens, such as containing fridges and countertops. After training, given a photo of a kitchen it hadn't seen before, the model should identify it as a kitchen. (Of course, if the kitchen photos were labeled "cow," it would identify them as cows, which underlines the importance of good annotation.)
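To make the role of labels concrete, here's a minimal toy sketch (hypothetical data, with scikit-learn standing in for a real image model): the classifier learns whatever the annotations say, so swapping the "kitchen" labels for "cow" flips its predictions.

```python
# Toy illustration: a model learns whatever its labels claim.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in "image features": kitchen-like examples cluster in one region, parks in another.
kitchens = rng.normal(loc=1.0, scale=0.3, size=(100, 8))
parks = rng.normal(loc=-1.0, scale=0.3, size=(100, 8))
X = np.vstack([kitchens, parks])

y_good = np.array(["kitchen"] * 100 + ["park"] * 100)  # correct annotations
y_bad = np.array(["cow"] * 100 + ["park"] * 100)       # kitchens mislabeled as cows

new_kitchen = rng.normal(loc=1.0, scale=0.3, size=(1, 8))  # unseen kitchen-like example

print(LogisticRegression().fit(X, y_good).predict(new_kitchen))  # -> ['kitchen']
print(LogisticRegression().fit(X, y_bad).predict(new_kitchen))   # -> ['cow']
```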
The hunger for AI, and the need to provide the training data that developing it requires, has exploded the market for annotation services. Dimension Market Research estimates it's worth $838.2 million today and will be worth a whopping $10.34 billion within the next ten years. There's no precise estimate of how many people do labeling work, but a 2022 paper pegs the number in the "millions."
Companies large and small rely on workers employed by data annotation firms to label AI training sets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge, such as math expertise. Others are grueling: annotators in developing countries earn only a few dollars an hour on average, with no benefits or guarantees of future gigs.
A drying data well
So there are humanistic reasons to seek out alternatives to human-generated labels. But there are also practical ones.
Humans can only label so fast. Annotators also have biases that can show up in their annotations and, consequently, in any models trained on them. They make mistakes, or get tripped up by labeling instructions. And paying humans to do things is expensive.
Data is expensive in general, for that matter. Shutterstock charges AI vendors tens of millions of dollars to access its archives, while Reddit has made hundreds of millions licensing data to Google, OpenAI, and others.
Data is also getting harder to acquire.
Most models are trained on vast repositories of publicly available data, and owners are increasingly locking that data up out of fear that it will be plagiarized or that they won't receive credit or attribution for it. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper. And roughly 25% of data from "high-quality" sources has been restricted from the major datasets used to train models, one recent study found.
Should the current trend of blocking access continue, the research group Epoch AI projects that developers will run out of data to train generative AI models between 2026 and 2032. That, combined with fears of copyright lawsuits and objectionable material making their way into open datasets, has forced an uncomfortable reckoning for AI vendors.
Synthetic alternatives
At first glance, synthetic data appears to solve all of these problems. Need annotations? Generate 'em. More example data? No problem. The sky's the limit.
And to a large extent, this is true.
"If 'data is the new oil,' synthetic data pitches itself as biofuel, creatable without the negative externalities of the real thing," Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. "You can take a small starting set of data and simulate and extrapolate new entries from it."
The AI industry has run with the idea.
This month, enterprise-focused generative AI company Writer launched a model, Palmyra X 004, which it said was trained nearly entirely on synthetic data. It cost just $700,000 to build, Writer claims, compared to estimates of $4.6 million for a comparably-sized OpenAI model.
Microsoft's Phi open models were trained partly on synthetic data, for example. So were Google's Gemma models. Nvidia this summer unveiled a family of models designed to generate synthetic training data, and AI startup Hugging Face recently released what it says is the largest AI training dataset of synthetic text.
Synthetic data generation has become a full-fledged industry, one that could be worth $2.34 billion by 2030. Gartner predicts that 60% of the data used for AI and analytics projects this year will be synthetically generated.
Luca Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in a form that would otherwise be hard to acquire through scraping, or even content licensing. For example, while training its video generator Movie Gen, Meta used Llama 3 to create captions for footage in the training data, which humans then refined to add more detail, like descriptions of the lighting.
Along those lines, OpenAI said it fine-tuned GPT-4o using synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa.
"Synthetic data models can be used to quickly expand upon human intuition of which data is needed to achieve a specific model behavior," Soldaini said.
Synthetic risks
Synthetic data is no panacea, however. It suffers from the same "garbage in, garbage out" problem as all AI. Models create synthetic data, and if the data used to train these models has biases and limitations, their outputs will be similarly tainted. For instance, groups poorly represented in the base data will be just as poorly represented in the synthetic data.
"The problem is, you can only do so much," Keyes said. "Say you only have 30 Black people in a dataset. Extrapolating out might help, but if those 30 people are all middle-class, or all light-skinned, that's what the 'representative' data will all look like."
A 2023 paper by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose "quality or diversity progressively decrease." Sampling bias, or poor representation of the real world, causes a model's diversity to worsen after a few generations of training, according to the researchers (although they also found that mixing in a bit of real-world data helps to mitigate this).
Keyes sees additional risks in complex models such as OpenAI's o1, which he thinks could produce harder-to-spot hallucinations in their synthetic data. Those hallucinations, in turn, could reduce the accuracy of models trained on that data, especially when the hallucinations' sources aren't easy to identify.
"Complex models hallucinate; the data generated by complex models contain hallucinations," Keyes continued. "And, in the case of a model like o1, even the creators can't say for sure why artefacts appear."
Compounding hallucinations can result in gibberish-spewing models. A study published in the journal Nature shows how models trained on error-ridden data generate even more error-ridden data, creating a feedback loop that degrades future generations of models. The researchers found that models lose their grasp of more esoteric knowledge over generations, becoming more generic and often producing answers irrelevant to the questions they're asked.
A follow-up study found that other types of models, such as image generators, aren't immune to this kind of collapse.
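The mechanism is easy to see in miniature. Below is a toy simulation (not the Nature study's actual setup) in which each "generation" is trained only on data sampled from the previous one; rare categories that happen not to appear in a finite sample vanish and never come back, so the distribution grows more generic with every round.

```python
import numpy as np

rng = np.random.default_rng(0)

categories = np.arange(1000)
true_probs = np.ones(1000)
true_probs[10:] = 0.01            # 10 common categories, 990 rare ("esoteric") ones
true_probs /= true_probs.sum()

probs = true_probs
for gen in range(10):
    sample = rng.choice(categories, size=5_000, p=probs)  # data generated by the current model
    counts = np.bincount(sample, minlength=1000)
    probs = counts / counts.sum()                         # "retrain" the next model on that data
    print(f"gen {gen + 1}: {np.count_nonzero(probs)} categories survive")
```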
Soldaini agrees that "raw" synthetic data isn't to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it safely, he says, requires thoroughly reviewing, curating, and filtering it, and ideally pairing it with fresh, real data, just as you would with any other dataset.
Failing to do so could eventually lead to model collapse, where a model becomes less "creative" and more biased in its outputs, to the point that its functionality is seriously compromised. Although this process could be identified and arrested before it gets that far, it is a risk.
"Researchers have to analyze the produced data, refine the process of how generation occurs and find controls for filtering out the bad data points," Soldaini said. "Synthetic data pipelines are not a self-improving machine; their output has to be very carefully inspected and improved before it's good enough to be used for training.".
OpenAI CEO Sam Altman has argued that, eventually, AI systems will be able to generate synthetic data good enough to train themselves effectively. However — assuming that's even possible, which it may not be — the technology isn't yet there. To date, no major AI lab has released a model trained on synthetic data alone.
At least for the foreseeable future, it looks like we'll need humans in the loop somewhere to ensure that a model's training does not go awry.