The org behind the dataset used to train Stable Diffusion claims it has removed CSAM

LAION, the German research org that created the data used to train Stable Diffusion, among other generative AI models, has released a new dataset that it claims has been “thoroughly cleaned of known links to suspected child sexual abuse material (CSAM).”

The new dataset, Re-LAION-5B, is a re-release of an older dataset, LAION-5B, with “fixes” implemented on recommendations from the nonprofit Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection and the now-defunct Stanford Internet Observatory. It’s available for download in two versions, Re-LAION-5B Research and Re-LAION-5B Research-Safe (which also removes additional NSFW content). Both were filtered for thousands of links to known and “likely” CSAM, LAION says.

“LAION has been committed to removing illegal content from its datasets from the very beginning and has implemented appropriate measures to achieve this from the outset,” LAION wrote in a blog post. “LAION strictly adheres to the principle that illegal content is removed ASAP after it becomes known.”

It’s important to note that LAION’s datasets don’t contain images, and never did. Rather, they’re indexes of links to images and image alt text that LAION curated, all of which came from a different dataset of scraped sites and web pages, the Common Crawl.

The release of Re-LAION-5B comes after a December 2023 investigation by the Stanford Internet Observatory, which found that LAION-5B, specifically a subset called LAION-400M, included at least 1,679 links to illegal images scraped from social media posts and popular adult websites. According to the report, LAION-400M also contained links to “a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.”

While the Stanford co-authors of the report noted that it would be difficult to remove the offending content and that the presence of CSAM doesn’t necessarily influence the output of models trained on the dataset, LAION said it would temporarily take LAION-5B offline.

The Stanford report recommended that models trained on LAION-5B “should be deprecated and distribution ceased where feasible.” Perhaps relatedly, AI startup Runway recently took down its Stable Diffusion 1.5 model from the AI hosting platform Hugging Face; we’ve reached out to the company for more information. (Runway in 2022 partnered with Stability AI, the company behind Stable Diffusion, to help train the original Stable Diffusion model.)

Of the new Re-LAION-5B dataset, which contains around 5.5 billion text-image pairs and was released under an Apache 2.0 license, LAION says that the metadata can be used by third parties to clean existing copies of LAION-5B by removing the matching illegal content.
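For readers wondering what such a cleanup might look like in practice, here is a minimal sketch, not LAION’s own tooling. It assumes an older copy is available as a list of records and that an entry should be kept only if its link also appears in the cleaned release. The `url` and `text` field names, the function name and the toy data are all hypothetical; the article does not describe the actual metadata layout.

```python
from typing import Iterable, Iterator


def keep_only_cleaned(
    old_rows: Iterable[dict],
    cleaned_urls: set[str],
    url_key: str = "url",  # hypothetical field name, not confirmed by the source
) -> Iterator[dict]:
    """Yield rows of an old copy whose link also appears in the cleaned release."""
    for row in old_rows:
        if row.get(url_key) in cleaned_urls:
            yield row


# Toy data for illustration only; these are not real dataset entries.
old_copy = [
    {"url": "https://example.com/a.jpg", "text": "a cat"},
    {"url": "https://example.com/b.jpg", "text": "a dog"},
    {"url": "https://example.com/c.jpg", "text": "a bird"},
]
cleaned_release = {"https://example.com/a.jpg", "https://example.com/c.jpg"}

filtered = list(keep_only_cleaned(old_copy, cleaned_release))
print([row["text"] for row in filtered])
```

Treating the cleaned release as an allowlist means anything missing from it is dropped from the old copy, which errs on the side of removing too much rather than too little.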

LAION stresses that its datasets are intended for research — not commercial — purposes. But, if history is any indication, that won’t dissuade some organizations. Beyond Stability AI, Google once used LAION datasets to train its image-generating models.

“In all, 2,236 links [to suspected CSAM] were removed after matching with the lists of link and image hashes provided by our partners,” LAION continued in the post. “These links also subsume 1008 links found by the Stanford Internet Observatory report in December 2023 … We strongly urge all research labs and organizations who still make use of old LAION-5B to migrate to Re-LAION-5B datasets as soon as possible.”
