It turns out that OpenAI used over 1 million hours of YouTube videos to train its AI models
The New York Times has reported that OpenAI downloaded and used more than one million hours of YouTube videos to train its AI models. Google, which like YouTube is part of Alphabet, was reportedly aware of OpenAI's actions but took no action because it also uses YouTube videos to train its own AI models.
How Tech Giants Cut Corners to Harvest Data for AI - The New York Times
Google reportedly lets OpenAI transcribe a million hours of YouTube videos to train GPT-4 - Neowin
OpenAI transcribed over a million hours of YouTube videos to train GPT-4 - The Verge
https://www.theverge.com/2024/4/6/24122915/openai-youtube-transcripts-gpt-4-training-data-google
OpenAI and Google reportedly used transcriptions of YouTube videos to train their AI models
https://www.engadget.com/openai-and-google-reportedly-used-transcriptions-of-youtube-videos-to-train-their-ai-models-163531073.html
According to The New York Times, OpenAI had run out of 'trusted English text' on the Internet by 2021 and needed new text to develop its next AI model.
To that end, OpenAI developed 'Whisper,' which enables highly accurate transcription.
OpenAI announces high-performance transcription AI 'Whisper', which supports Japanese and can transcribe tongue twisters and lyrics with high accuracy - GIGAZINE
They then used Whisper to transcribe YouTube videos to obtain training material for the AI.
Although the use of YouTube videos was debated within OpenAI, company president Greg Brockman personally helped collect the data, and the result was GPT-4.
Some people at Google reportedly knew about OpenAI's actions but took no action, because Google, itself in need of training material, was also training its own AI models on YouTube videos.
In July 2023, Google changed its privacy policy to allow content such as Google Docs and Google Sheets to be used for training.
Researchers have pointed out that training data will run out by 2026, but in reality the supply is already stretched thin.
What is the 'data shortage problem' that will cause data used for AI training to run out by 2026? - GIGAZINE
In an email to The Verge, OpenAI spokesperson Lindsay Held said that OpenAI curates unique datasets to maintain its global research competitiveness, that it uses a number of sources, including publicly available data and non-public data obtained through partnerships, and that it is considering generating its own synthetic data.
Meanwhile, Matt Bryant, a Google spokesman, said, 'Our robots.txt and terms of service prohibit the unauthorized scraping or downloading of YouTube content.'
YouTube CEO Neal Mohan has also made it clear that using YouTube data to train AI is against the rules.
in Software, Web Service, Posted by logc_nt