Best in Technology

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

By News Room | 12 December 2024 | 3 Mins Read

In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line. The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, and the company has pledged its support.

However IDI’s dataset is released, it will be joining a host of similar projects, startups, and initiatives that promise to give companies access to substantial and high-quality AI training materials without the risk of running into copyright issues. Firms like Calliope Networks and ProRata have emerged to issue licenses and design compensation schemes that get creators and rights holders paid for providing AI training data.

There are also other new public-domain projects. Last spring, the French AI startup Pleias rolled out its own public-domain dataset, Common Corpus, which contains an estimated 3 to 4 million books and periodical collections, according to project coordinator Pierre-Carl Langlais. Backed by the French Ministry of Culture, the Common Corpus has been downloaded over 60,000 times this month alone on the open source AI platform Hugging Face. Last week, Pleias announced that it is releasing its first set of large language models trained on this dataset, which Langlais told WIRED constitute the first models “ever trained exclusively on open data and compliant with the [EU] AI Act.”

Efforts are underway to create similar image datasets as well. AI startup Spawning released its own this summer called Source.Plus, which contains public-domain images from Wikimedia Commons as well as a variety of museums and archives. Several significant cultural institutions, like the Metropolitan Museum of Art, have long made their own archives accessible to the public as standalone projects.

Ed Newton-Rex, a former executive at Stability AI who now runs a nonprofit that certifies ethically trained AI tools, says the rise of these datasets shows that there’s no need to steal copyrighted materials to build high-performing, high-quality AI models. OpenAI previously told lawmakers in the United Kingdom that it would be “impossible” to create products like ChatGPT without using copyrighted works. “Large public domain datasets like these further demolish the ‘necessity defense’ some AI companies use to justify scraping copyrighted work to train their models,” Newton-Rex says.

But he still has reservations about whether the IDI and projects like it will actually change the training status quo. “These datasets will only have a positive impact if they’re used, probably in conjunction with licensing other data, to replace scraped copyrighted work. If they’re just added to the mix, one part of a dataset that also includes the unlicensed life’s work of the world’s creators, they’ll overwhelmingly benefit AI companies,” he says.
