Child Sex Abuse Material Was Found In a Major AI Dataset. Researchers Aren’t Surprised.
December 20, 2023

Over 1,000 images of sexually abused children have been discovered inside the largest dataset used to train image-generating AI, shocking everyone except for the people who have warned about this exact sort of thing for years.
The dataset was created by LAION, a non-profit organization behind the massive image datasets used by generative AI systems like Stable Diffusion. Following a report from researchers at Stanford University, 404 Media reported that LAION confirmed the presence of child sexual abuse material (CSAM) in the dataset, called LAION-5B, and scrubbed it from their online channels.
The LAION-5B dataset contains links to 5 billion images scraped from the internet.
AI ethics researchers have long warned that the massive scale of AI training datasets makes it effectively impossible to filter them or to audit the AI models that use them. But tech companies, eager to claim their piece of the growing generative AI market, have largely ignored these concerns, building their products on top of AI models trained on these massive datasets. Stable Diffusion, one of the most commonly used text-to-image generation systems, for example, is based on LAION data. And various other AI tools incorporate parts of LAION's datasets in addition to other sources.
This, AI ethics researchers say, is the inevitable result of apathy.
“Not surprising, [to be honest]. We found numerous disturbing and illegal content in the LAION dataset that didn’t make it into our paper,” wrote Abeba Birhane, the lead author of a recent paper examining the enormous datasets, in a tweet responding to the Stanford report. “The LAION dataset gives us a [glimpse] into corp datasets locked in corp labs like those in OpenAI, Meta, & Google. You can be sure, those closed datasets—rarely examined by independent auditors—are much worse than the open LAION dataset.”
LAION told 404 Media that they were taking the dataset down “temporarily” in order to remove the CSAM the researchers identified. But AI experts say the damage is already done.
“It’s sad but really unsurprising,” Sasha Luccioni, an AI and data ethics researcher at Hugging Face who co-authored the paper with Birhane, told Motherboard. “Pretty much all image generation models used some version of [LAION]. And you can’t remove stuff that’s already been trained on it.”
The issue, said Luccioni, is that these massive troves of data aren’t being properly analyzed before they’re used, and the scale of the datasets makes filtering out unwanted material extremely difficult. In other words, even if LAION manages to remove specific unwanted material after it’s discovered, the sheer size of the data means it’s virtually impossible to ensure you’ve gotten rid of all of it—especially if no one cares enough to even try before a product goes to market.
“Nobody wants to work on data because it’s not sexy,” said Luccioni. “Nobody appreciates data work. Everyone just wants to make models go brrr.” (“Go brrr” is a meme referring to a hypothetical money-printing machine.)
AI ethics researchers have warned for years about the dangers of AI models and datasets that contain racist and sexist text and images pulled from the internet, with study after study demonstrating how these biases result in automated systems that replicate and amplify discrimination in areas such as healthcare, housing, and policing. The LAION dataset is another example of this “garbage-in, garbage-out” dynamic, where datasets filled with explicit, illegal, or offensive material become entrenched in the AI pipeline, resulting in products and software that inherit all of the same issues and biases.
These harms can be mitigated by fine-tuning systems after the fact to try to prevent them from generating harmful or unwanted outputs. But researchers like Luccioni warn that these technological tweaks don’t actually address the root cause of the problem.
“I think we need to rethink the way we collect and use datasets in AI, fundamentally,” said Luccioni. “Otherwise it’s just technological fixes that don’t solve the underlying issue.”