Best in Technology

Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content

By News Room | 20 March 2024 | 4 Mins Read

In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.

A group of researchers backed by the French government have released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built in a different way to the AI industry’s contentious norm.

“There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image generation startup Stability AI because he disagreed with its policy of scraping content without permission.

Fairly Trained offers a certification to companies willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain. When the nonprofit launched, some critics pointed out that it hadn’t yet identified a large language model that met those requirements.
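
To make that criterion concrete, here is a minimal sketch of how a training pipeline might filter documents by provenance before training. It is an illustration only, not Fairly Trained's certification process or any company's actual pipeline; the `Document` class, the provenance labels, and `filter_by_provenance` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical provenance labels mirroring the three categories named above:
# data the company owns, data it has licensed, and public domain data.
ALLOWED_PROVENANCE = {"owned", "licensed", "public_domain"}


@dataclass
class Document:
    doc_id: str
    text: str
    provenance: str  # assumed label, e.g. "owned", "licensed", "public_domain", "scraped"


def filter_by_provenance(docs):
    """Return only the documents whose provenance label is in the allowed set."""
    return [d for d in docs if d.provenance in ALLOWED_PROVENANCE]


# Toy corpus with made-up entries.
corpus = [
    Document("a", "Text of a statute.", "public_domain"),
    Document("b", "Contract template.", "licensed"),
    Document("c", "Blog post copied from the web.", "scraped"),
]

kept = filter_by_provenance(corpus)
print([d.doc_id for d in kept])  # ['a', 'b']
```

In practice the hard part is assigning the provenance label reliably in the first place; the filter itself is the easy step.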

Today, Fairly Trained announced it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal tech consultancy startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.

The company’s cofounder Jillian Bommarito says the decision to train KL3M in this way stemmed from the company’s “risk-averse” clients like law firms. “They’re concerned about the provenance, and they need to know that output is not based on tainted data,” she says. “We’re not relying on fair use.” The clients were interested in using generative AI for tasks like summarizing legal documents and drafting contracts, but didn’t want to get dragged into lawsuits about intellectual property as OpenAI, Stability AI, and others have been.

Bommarito says that 273 Ventures hadn’t worked on a large language model before but decided to train one as an experiment. “Our test to see if it was even possible,” she says. The company has created its own training data set, the Kelvin Legal DataPack, which includes thousands of legal documents reviewed to comply with copyright law.

Although the dataset is tiny (around 350 billion tokens, or units of data) compared to those compiled by OpenAI and others that have scraped the internet en masse, Bommarito says the KL3M model performed far better than expected, something she attributes to how carefully the data had been vetted beforehand. “Having clean, high-quality data may mean that you don’t have to make the model so big,” she says. Curating a dataset can help make a finished AI model specialized to the task it’s designed for. 273 Ventures is now offering spots on a waitlist to clients who want to purchase access to this data.

Clean Sheet

Companies looking to emulate KL3M may have more help in the future in the form of freely available infringement-free datasets. On Wednesday, researchers released what they claim is the largest available AI dataset for language models composed purely of public domain content. Common Corpus, as it is called, is a collection of text roughly the same size as the data used to train OpenAI’s GPT-3 text generation model and has been posted to the open source AI platform Hugging Face.

The dataset was built from sources like public domain newspapers digitized by the US Library of Congress and the National Library of France. Pierre-Carl Langlais, project coordinator for Common Corpus, calls it a “big enough corpus to train a state-of-the-art LLM.” In the lingo of big AI, the dataset contains 500 billion tokens. OpenAI’s most capable model is widely believed to have been trained on several trillion.

© 2025 Best in Technology. All Rights Reserved.