YouTuber files class action suit over OpenAI’s scrape of creators’ transcripts

A YouTube creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company trained its generative AI models on millions of transcripts from YouTube videos without notifying or compensating the videos’ owners.

In a complaint filed Friday in the U.S. District Court for the Northern District of California, attorneys for David Millette, a YouTube user based in Massachusetts, allege that OpenAI surreptitiously transcribed Millette’s and other creators’ videos to train the models that power the company’s chatbot platform, ChatGPT, and other generative AI tools and products. By collecting this data, OpenAI “profited significantly” from the creators’ work, the complaint alleges, while violating copyright law and YouTube’s terms of service, which prohibit the use of videos for apps independent of its service.

“As [OpenAI’s] AI products become more sophisticated through the use of training data sets, they become more valuable to prospective and current users, who purchase subscriptions to access [OpenAI’s] AI products,” the complaint reads. “Much of the material in OpenAI’s training data sets, however, comes from works that were copied by OpenAI without consent, without credit, and without compensation.”

Millette, represented by the law firm Bursor & Fisher, is seeking a jury trial and over $5 million in damages for all YouTube users whose data might’ve been swept up in OpenAI’s training.

Generative AI models like OpenAI’s have no real intelligence. Fed an enormous number of examples (e.g. movies, voice recordings, essays and so on), models “learn” how likely data is to occur based on patterns, including the context of any surrounding data.

Most models are trained on data sourced from public websites and data sets around the web. Companies argue that fair use shields their efforts to scrape data indiscriminately and use it for training commercial models. Many copyright holders disagree, however, and they’re filing suits aimed at halting the practice.

Video transcriptions have become a key training data ingredient as other data wells dry up, so to speak.

More than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler, according to data from Originality.AI. And around 25% of data from “high-quality” sources has been restricted from the major data sets used to train AI models, a study by MIT’s Data Provenance Initiative found. Should the current access-blocking trend continue, the research group Epoch AI predicts that developers will run out of data to train generative AI models between 2026 and 2032.

In April, The New York Times reported that OpenAI created its first speech recognition model, Whisper, for the purpose of transcribing audio from videos to collect additional training data. An OpenAI team that included the company’s president, Greg Brockman, transcribed more than a million hours of video from YouTube using Whisper, according to The Times, and used the transcripts to train OpenAI’s text-generating and -analyzing model GPT-4.

Some OpenAI staffers discussed how such a move might go against YouTube’s rules, per The Times.

In July, Proof News reported that companies including Anthropic, Apple, Salesforce and Nvidia used a data set called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train generative AI models. Many YouTube creators whose subtitles were swept up in The Pile weren’t aware of and didn’t consent to this; Apple later released a statement saying that it didn’t intend to use those models to power any AI features in its products.

Google, YouTube’s parent company, has also sought to use transcripts to train its models.

Last year, Google broadened its terms of service (ToS) partly to allow the company to tap more user data for generative AI model training. Under the old ToS, it wasn’t clear whether Google could use YouTube data to build products beyond the video platform. Not so under the new terms, which loosen the reins considerably. 

We’ve reached out to OpenAI and Google for comment on the class action suit and will update this piece if they respond.

It’s been a rough start to the month for OpenAI.

Tesla and X CEO Elon Musk on Monday filed a new suit against OpenAI and CEO Sam Altman accusing the company of abandoning its original nonprofit mission by reserving some of its most sophisticated tech for commercial customers. Musk made the same claims in a February lawsuit against OpenAI, but the new suit alleges that OpenAI is engaging in racketeering activity, as well.



Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
