Harvard and Google to release 1 million public-domain books as AI training dataset


AI training data carries a big price tag, one that suits deep-pocketed tech firms best. This is why Harvard University plans to release a dataset of roughly 1 million public-domain books, spanning genres, languages, and authors including Dickens, Dante, and Shakespeare, all of which are no longer copyright-protected due to their age.

The new dataset isn’t available yet, and it’s not clear when or how it will be released. However, it contains books derived from Google’s longstanding book-scanning project, Google Books, and thus Google will be involved in releasing “this treasure trove far and wide.”

Harvard first teased the Institutional Data Initiative (IDI) back in March, outlining its plans to create a “trusted conduit for legal data for AI.” Little was heard from it, however, until its formal launch today, which came with confirmation that the IDI has financial backing from Microsoft and OpenAI.

The IDI’s executive director, Greg Leppert, says the dataset is designed to “level the playing field” by opening up such a huge dataset to anyone, from research labs to AI startups, who wants to train large language models (LLMs).




Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
