
Massive AI Dataset Back Online After Being ‘Cleaned’ of Child Sexual Abuse Material

LAION-5B is back, with thousands of links removed after research last year that showed it contained instances of abusive content. 

One of the largest open-source datasets powering generative AI image models is back online after apparently being “cleaned” of child sexual abuse material, following research last year that showed it contained 1,008 verified instances of abusive content. 

In December 2023, researchers at the Stanford Internet Observatory found that the LAION-5B machine learning dataset contained more than 3,000 suspected instances of child sexual abuse material (CSAM), one-third of which were externally validated. This meant that anyone who downloaded LAION-5B was in possession of links to CSAM, and that any model trained on the dataset was trained on abusive material.


“We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction,” the Stanford Internet Observatory paper said. “While the amount of CSAM present does not necessarily indicate that the presence of CSAM drastically influences the output of the model above and beyond the model’s ability to combine the concepts of sexual activity and children, it likely does still exert influence. The presence of repeated identical instances of CSAM is also problematic, particularly due to its reinforcement of images of specific victims.”

Following the publication of that study, LAION—the Large-scale Artificial Intelligence Open Network, a non-profit organization that creates open-source tools for machine learning—took the dataset down from its own site and Hugging Face, where it was hosted. An investigation by 404 Media showed LAION leadership was aware of the possibility that CSAM could end up in the organization’s datasets: “I guess distributing a link to an image such as child porn can be deemed illegal,” LAION lead engineer Richard Vencu wrote in response to a researcher asking on Discord how LAION handles potentially illegal data that might be included in 5B. “We tried to eliminate such things but there’s no guarantee all of them are out.” 

Jenia Jitsev, scientific lead and co-founder at LAION, wrote on Twitter/X on Friday that the re-released dataset is the “first web-scale, text-link to images pair dataset to be thoroughly cleaned of links to suspected CSAM known to our partners IWF and C3P.”  

The newly reuploaded datasets, called Re-LAION-5B research and Re-LAION-5B research-safe, were completed in partnership with the Internet Watch Foundation, the Canadian Centre for Child Protection, and the Stanford Internet Observatory, according to a LAION blog announcement published Friday. 

“In all, 2236 links were removed after matching with the lists of link and image hashes provided by our partners. These links also subsume 1008 links found by the Stanford Internet Observatory report in Dec 2023. Note: A substantial fraction of these links known to IWF and C3P are most likely dead (as organizations make continual efforts to take the known material down from public web), therefore this number is an upper bound for links leading to potential CSAM,” the announcement states.
