Massive AI Dataset Back Online After Being ‘Cleaned’ of Child Sexual Abuse Material

· Aug 30, 2024 at 2:03 PM

LAION-5B is back, with thousands of links removed after research last year that showed it contained instances of abusive content.

One of the largest open-source datasets powering generative AI image models is back online after apparently being “cleaned” of child sexual abuse material, following research last year that showed it contained more than 1,000 instances of abusive content.

In December 2023, researchers at the Stanford Internet Observatory found that the LAION-5B machine learning dataset contained more than 3,000 suspected instances of child sexual abuse material (CSAM), one-third of which were externally validated. This meant that anyone who downloaded LAION-5B was in possession of links to CSAM, and that any model trained on the dataset was trained on abusive material.

“We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction,” the Stanford Internet Observatory paper said. “While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims.”

Following the publication of that study, LAION—the Large-scale Artificial Intelligence Open Network, a non-profit organization that creates open-source tools for machine learning—took the dataset down from its own site and Hugging Face, where it was hosted. An investigation by 404 Media showed LAION leadership was aware of the possibility that CSAM could end up in the organization’s datasets: “I guess distributing a link to an image such as child porn can be deemed illegal,” LAION lead engineer Richard Vencu wrote in response to a researcher asking in Discord about how LAION handles potential illegal data that might be included in 5B. “We tried to eliminate such things but there’s no guarantee all of them are out.”

Jenia Jitsev, scientific lead and co-founder at LAION, wrote on Twitter/X on Friday that the re-released dataset is the “first web-scale, text-link to images pair dataset to be thoroughly cleaned of links to suspected CSAM know to our partners IWF and C3P.”

Re-LAION-5B is thus first web-scale, text-link to images pair dataset to be thoroughly cleaned of links to suspected CSAM know to our partners IWF and C3P. We would like to express gratitude to the efforts our partners put into the common work on improving open dataset safety.
— Jenia Jitsev 🏳️‍🌈 🇺🇦 (@JJitsev) August 30, 2024

The newly reuploaded datasets, called Re-LAION-5B research and Re-LAION-5B research-safe, were completed in partnership with the Internet Watch Foundation, the Canadian Centre for Child Protection, and the Stanford Internet Observatory, according to a LAION blog announcement published Friday.

“In all, 2236 links were removed after matching with the lists of link and image hashes provided by our partners. These links also subsume 1008 links found by the Stanford Internet Observatory report in Dec 2023. Note: A substantial fraction of these links known to IWF and C3P are most likely dead (as organizations make continual efforts to take the known material down from public web), therefore this number is an upper bound for links leading to potential CSAM,” the announcement states.

About the author

Sam Cole is writing from the far reaches of the internet, about sexuality, the adult industry, online culture, and AI. She's the author of How Sex Changed the Internet and the Internet Changed Sex.
