MLCommons and Hugging Face team up to release massive speech data set for AI research

MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research.

The data set, called Unsupervised People’s Speech, contains more than a million hours of audio spanning at least 89 different languages. MLCommons says it was motivated to create it by a desire to support R&D in “various areas of speech technology.”

“Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,” the organization wrote in a blog post Thursday. “We anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.”

It’s an admirable goal, to be sure. But AI data sets like Unsupervised People’s Speech can carry risks for the researchers who choose to use them.

Biased data is one of those risks. The recordings in Unsupervised People’s Speech came from Archive.org, the nonprofit perhaps best known for the Wayback Machine web archival tool. Because many of Archive.org’s contributors are English-speaking — and American — almost all of the recordings in Unsupervised People’s Speech are in American-accented English, per the readme on the official project page.

That means that, without careful filtering, AI systems such as speech recognition and voice synthesis models trained on Unsupervised People’s Speech could inherit that skew toward American-accented English. They might, for example, struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.
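One simple form such filtering could take is capping how many hours any single language contributes to a training subset. The sketch below is illustrative only. The record fields `language` and `hours` and the function names are assumptions rather than the data set's actual schema, and the language labels would have to come from metadata or a separate language-identification step that is not shown.

```python
import random
from collections import defaultdict


def cap_per_language(records, max_hours_per_language, seed=0):
    """Return a subset of records in which no language exceeds the hour cap.

    Each record is a dict with hypothetical keys "language" and "hours".
    """
    rng = random.Random(seed)

    # Group recordings by their (assumed) language label.
    by_language = defaultdict(list)
    for record in records:
        by_language[record["language"]].append(record)

    kept = []
    for items in by_language.values():
        rng.shuffle(items)  # avoid always keeping the same leading recordings
        total = 0.0
        for item in items:
            # Skip any recording that would push this language over the cap.
            if total + item["hours"] > max_hours_per_language:
                continue
            kept.append(item)
            total += item["hours"]
    return kept


def hours_by_language(records):
    """Sum the hours of audio per language label."""
    totals = defaultdict(float)
    for record in records:
        totals[record["language"]] += record["hours"]
    return dict(totals)
```

Capping reduces how much the dominant language outweighs the others, but it does not add data for underrepresented languages or accents, so it addresses only part of the skew described above.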

Unsupervised People’s Speech might also contain recordings from people unaware that their voices are being used for AI research purposes — including commercial applications. While MLCommons says that all recordings in the data set are public domain or available under Creative Commons licenses, there’s the possibility mistakes were made.

According to an MIT analysis, hundreds of publicly available AI training data sets lack licensing information and contain errors. Creator advocates including Ed Newton-Rex, the CEO of AI ethics-focused nonprofit Fairly Trained, have made the case that creators shouldn’t be required to “opt out” of AI data sets because of the onerous burden opting out imposes on these creators.

“Many creators (e.g. Squarespace users) have no meaningful way of opting out,” Newton-Rex wrote in a post on X last June. “For creators who can opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them — many would simply not realize they could opt out.”

MLCommons says that it’s committed to updating, maintaining, and improving the quality of Unsupervised People’s Speech. But given the potential flaws, it’d behoove developers to exercise serious caution.



Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sports, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
