Harvard and Google to release 1 million public-domain books as AI training dataset

AI training data carries a big price tag, one that only deep-pocketed tech firms can easily afford. This is why Harvard University plans to release a dataset of roughly 1 million public-domain books spanning genres, languages, and authors including Dickens, Dante, and Shakespeare. The books are old enough that they are no longer protected by copyright.

The new dataset isn’t available yet, and it’s not clear when or how it will be released. It does, however, contain books derived from Google’s longstanding book-scanning project, Google Books, so Google will be involved in releasing “this treasure trove far and wide.”

Harvard first teased the Institutional Data Initiative (IDI) back in March, outlining its plans to create a “trusted conduit for legal data for AI.” Little had been heard from it until its formal launch today, which came with confirmation that the IDI has financial backing from Microsoft and OpenAI.

The IDI’s executive director, Greg Leppert, says the dataset is designed to “level the playing field” by opening up such a large collection to anyone, from research labs to AI startups, who wants to train large language models (LLMs).


Lisa Holden
Lisa Holden is a news writer for LinkDaddy News. She writes health, sport, tech, and more. Some of her favorite topics include the latest trends in fitness and wellness, the best ways to use technology to improve your life, and the latest developments in medical research.
