OpenAI reportedly used more than 1 million hours of YouTube videos to train its AI models

The New York Times has reported that OpenAI downloaded and used more than one million hours of YouTube videos to train its AI models. Google, which like YouTube is owned by Alphabet, was reportedly aware of what OpenAI was doing but did not act because it also uses YouTube videos to train its own AI models.

According to The New York Times, OpenAI had exhausted the supply of 'trusted English text' on the Internet by 2021 and needed new text to develop its next AI model.

To that end, OpenAI developed 'Whisper,' a speech recognition model capable of highly accurate transcription.

The company then used Whisper to transcribe YouTube videos and obtain training material for its AI.

Although there was some discussion within OpenAI about using YouTube videos, president Greg Brockman personally helped collect the data, and the resulting transcripts were used to train GPT-4.

Some people at Google were reportedly aware of OpenAI's actions but did nothing, because Google, which also needed training material, was itself training AI models on YouTube videos.

In July 2023, Google changed its privacy policy to allow content such as Google Docs and Google Sheets to be used for training.

Researchers have warned that training data will run out by 2026, but in reality the supply is already severely strained.

In an email to The Verge, OpenAI spokesperson Lindsay Held said that OpenAI curates unique datasets to maintain its global research competitiveness. Held added that the company uses a number of sources, including public data and private data obtained through partnerships, and is considering generating its own synthetic data.

Meanwhile, Matt Bryant, a Google spokesman, said, 'Our robots.txt and terms of service prohibit the unauthorized scraping or downloading of YouTube content.'

YouTube CEO Neal Mohan has also made it clear that using YouTube data to train AI is against the rules.

