OpenAI reportedly used over 1 million hours of YouTube videos to train its AI models

The New York Times has reported that OpenAI downloaded and used more than one million hours of YouTube videos to train its AI models. Google, which shares a parent company, Alphabet, with YouTube, was reportedly aware of OpenAI's actions but did not intervene because it also uses YouTube videos to train its own AI models.

How Tech Giants Cut Corners to Harvest Data for AI - The New York Times

Google reportedly lets OpenAI transcribe a million hours of YouTube videos to train GPT-4 - Neowin

OpenAI transcribed over a million hours of YouTube videos to train GPT-4 - The Verge

OpenAI and Google reportedly used transcriptions of YouTube videos to train their AI models

According to The New York Times, by 2021 OpenAI had exhausted the supply of 'trusted English text' on the Internet and needed new text to develop its next AI model.

To that end, OpenAI developed 'Whisper,' a speech recognition model capable of highly accurate transcription.

OpenAI announces high-performance transcription AI 'Whisper', which supports Japanese and can transcribe tongue twisters and lyrics with high accuracy - GIGAZINE

They then used Whisper to transcribe YouTube videos to obtain training material for the AI.

Although the use of YouTube videos was debated within OpenAI, President Greg Brockman personally helped collect the data, and the resulting transcripts were used to train GPT-4.

'GPT-4' announced, top 10% of bar exams & extremely high performance in Japanese & image processing and programming possible, described as 'as shocking as the first iPhone' - GIGAZINE

Some people at Google were reportedly aware of OpenAI's actions but did nothing, because Google also needed training material and was itself training AI models on YouTube videos.

In July 2023, Google changed its privacy policy to allow content such as Google Docs and Google Sheets to be used for training.

Google announces that it will scrape everything published online for AI purposes - GIGAZINE

Researchers have warned that training data will run out by 2026, but in practice AI developers are already struggling to find enough of it.

What is the 'data shortage problem' that will cause data used for AI training to run out by 2026? - GIGAZINE

In an email to The Verge, OpenAI spokesperson Lindsay Held said that OpenAI curates unique datasets to maintain its global research competitiveness. Held added that the company draws on a number of sources, including public data and private data from partnerships, and is considering generating its own synthetic data.

Meanwhile, Matt Bryant, a Google spokesman, said, 'Our robots.txt and terms of service prohibit the unauthorized scraping or downloading of YouTube content.'

YouTube CEO Neal Mohan has also made it clear that using YouTube data to train AI is against the rules.

YouTube CEO says 'Using AI for training is against the rules' and 'What's important is that creators succeed on YouTube' - GIGAZINE

in Software, Web Service, Posted by logc_nt