OpenAI Reportedly Transcribed 1 Million Hours of YouTube Videos to Train GPT-4

OpenAI reportedly transcribed more than one million hours of YouTube videos to train GPT-4, according to The New York Times on Saturday. The report comes just days after YouTube CEO Neal Mohan said in a Bloomberg interview that transcribing YouTube videos for AI training would be a “clear violation” of the platform’s policies.

“When a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of services is going to be abided by,” said Mohan in an interview with Bloomberg last week. “But it does not allow for things like transcripts or video bits to be downloaded.”

The New York Times report, citing sources, alleges that OpenAI team members, including President Greg Brockman, personally helped collect the YouTube videos. The article details how OpenAI, like many tech companies, is struggling to collect enough data to train massive AI models. OpenAI allegedly used Whisper, its AI transcription software, to collect more data to train GPT-4, the latest and greatest model underlying ChatGPT.

OpenAI and Google did not immediately respond to Gizmodo’s requests for comment.

The New York Times report could have massive implications for OpenAI and Google’s ongoing battle at the forefront of generative AI development. Google is unlikely to go quietly if OpenAI is using its content to make ChatGPT even greater. However, the company has made no such allegations yet. In a statement to The Verge this weekend, a Google spokesperson merely said he’s “seen unconfirmed reports” about OpenAI’s training.

YouTube’s terms of service prohibit any user from downloading its content, including through botnets or scrapers, unless they have clear permission from the company. YouTube also prohibits using its content for any purpose “independent” of its service.

OpenAI’s Chief Technology Officer, Mira Murati, said she was “not sure” whether YouTube videos were used to train her company’s text-to-video AI model Sora when asked by The Wall Street Journal in March. The New York Times report mentions nothing about Sora, or about the use of actual YouTube video clips. However, her hesitancy to answer the question directly has fueled further speculation.

The New York Times itself is in a copyright battle with OpenAI at the moment. OpenAI and Meta are also being sued by a number of authors and content houses for training their AI on copyrighted works.

If these reports are true, they could raise entirely new questions about copyright law in the AI world. Most copyright complaints around AI have been brought by small publishers, but Google could add some real weight behind this fight if it chooses to join. It would also present a way for Google to slow down OpenAI, which is undoubtedly winning the AI race at the moment.
