Harvard and Google to release 1 million public-domain books as AI training dataset

AI training data carries a big price tag, one that deep-pocketed tech firms are best placed to pay. That is why Harvard University plans to release a dataset of roughly 1 million public-domain books spanning genres, languages, and authors including Dickens, Dante, and Shakespeare, all of which are no longer copyright-protected because of their age.

The new dataset isn’t available yet, and it’s not clear when or how it will be released. However, it contains books derived from Google’s longstanding book-scanning project, Google Books, and thus Google will be involved in releasing “this treasure trove far and wide.”

Harvard first teased the Institutional Data Initiative (IDI) back in March, outlining its plans to create a “trusted conduit for legal data for AI.” Little had been heard from it since, until its formal launch today, which came with confirmation that the IDI has financial backing from Microsoft and OpenAI.

The IDI’s executive director, Greg Leppert, says the dataset is designed to “level the playing field” by making such a large corpus available to anyone, from research labs to AI startups, who wants to train large language models (LLMs).




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes about health, sports, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
