OpenAI accidentally deleted potential evidence in NY Times copyright lawsuit (updated)

Lawyers for The New York Times and Daily News, which are suing OpenAI for allegedly scraping their works to train its AI models without permission, say OpenAI engineers accidentally deleted data potentially relevant to the case.

Earlier this fall, OpenAI agreed to provide two virtual machines so that counsel for The Times and Daily News could perform searches for their copyrighted content in its AI training sets. (Virtual machines are software-based computers that exist within another computer’s operating system, often used for the purposes of testing, backing up data, and running apps.) In a letter, attorneys for the publishers say that they and experts they hired have spent over 150 hours since November 1 searching OpenAI’s training data.

But on November 14, OpenAI engineers erased all the publishers’ search data stored on one of the virtual machines, according to the letter, which was filed in the U.S. District Court for the Southern District of New York late Wednesday.

OpenAI tried to recover the data — and was mostly successful. However, because the folder structure and file names were “irretrievably” lost, the recovered data “cannot be used to determine where the news plaintiffs’ copied articles were used to build [OpenAI’s] models,” per the letter.

“News plaintiffs have been forced to recreate their work from scratch using significant person-hours and computer processing time,” counsel for The Times and Daily News wrote. “The news plaintiffs learned only yesterday that the recovered data is unusable and that an entire week’s worth of its experts’ and lawyers’ work must be re-done, which is why this supplemental letter is being filed today.”

The plaintiffs’ counsel make clear that they have no reason to believe the deletion was intentional. But they say the incident underscores that OpenAI “is in the best position to search its own datasets” for potentially infringing content using its own tools.

An OpenAI spokesperson declined to provide a statement.

But late Friday, November 22, counsel for OpenAI filed a response to the letter sent by lawyers for The Times and Daily News on Wednesday. In their response, OpenAI’s attorneys unequivocally denied that OpenAI deleted any evidence, and instead suggested that the plaintiffs were to blame for a system misconfiguration that led to a technical issue.

“Plaintiffs requested a configuration change to one of several machines that OpenAI has provided to search training datasets,” OpenAI’s counsel wrote. “Implementing plaintiffs’ requested change, however, resulted in removing the folder structure and some file names on one hard drive — a drive that was supposed to be used as a temporary cache … In any event, there is no reason to think that any files were actually lost.”

In this case and others, OpenAI has maintained that training models using publicly available data — including articles from The Times and Daily News — is fair use. In other words, in creating models like GPT-4o, which “learn” from billions of examples of e-books, essays, and more to generate human-sounding text, OpenAI believes that it isn’t required to license or otherwise pay for the examples — even if it makes money from those models.

That being said, OpenAI has inked licensing deals with a growing number of news publishers, including the Associated Press, Business Insider owner Axel Springer, Financial Times, People parent company Dotdash Meredith, and News Corp. OpenAI has declined to make the terms of these deals public, but one content partner, Dotdash, is reportedly being paid at least $16 million per year.

OpenAI has neither confirmed nor denied that it trained its AI systems on any specific copyrighted works without permission.

Update: Added OpenAI’s response to the allegations.
