It turns out that Apple, NVIDIA, Anthropic and others used YouTube video subtitles without permission to train AI



Proof News, an IT media outlet, has reported that companies including Apple, Anthropic, and NVIDIA used subtitles from over 170,000 videos posted on YouTube without permission to train their AI.

Apple, Nvidia, Anthropic Used Thousands of Swiped YouTube Videos to Train AI

https://www.proofnews.org/apple-nvidia-anthropic-used-thousands-of-swiped-youtube-videos-to-train-ai/



YouTube creators surprised to find Apple and others trained AI on their videos | Ars Technica
https://arstechnica.com/ai/2024/07/apple-was-among-the-companies-that-trained-its-ai-on-youtube-videos/



Proof News has conducted a detailed investigation into AI training data, focusing in particular on a dataset called 'The Pile' created by the non-profit AI research organization EleutherAI. The dataset includes data from the European Parliament, the English version of Wikipedia, a large number of emails from Enron employees that were made public as part of an investigation, and 'YouTube Subtitles,' a collection of YouTube video subtitles.

The YouTube Subtitles collection is a compilation of 489 million words from 173,536 videos published by over 48,000 channels, totaling 5.7GB in size. The channels include those of hugely popular YouTubers such as MrBeast and PewDiePie, as well as channels promoting conspiracy theories such as the flat Earth theory.



Proof News examined research papers and public information from various AI companies and noted that companies including Apple, Anthropic, NVIDIA, Salesforce, Bloomberg, and Databricks have used The Pile to train their own AI.

YouTube has a huge number of videos uploaded to it, so it is often used to train AI. In April 2024, it was reported that OpenAI had downloaded over 1 million hours of videos from YouTube and used them to train AI.

It turns out that OpenAI used over 1 million hours of YouTube videos to train AI models - GIGAZINE



However, YouTube CEO Neal Mohan stated that 'if you use YouTube videos to train, you are breaking the rules.'

YouTube CEO says 'Using AI for training is against the rules' and 'What's important is that creators succeed on YouTube' - GIGAZINE



Google spokesperson Jack Malon told Proof News that 'we have taken steps to prevent unauthorized scraping for many years,' but declined to comment on other companies using YouTube video captions as training data.

Anthropic spokesperson Jennifer Martinez said, 'The Pile contains a small portion of YouTube captions,' acknowledging that the company used YouTube caption data for training purposes. However, she added, 'YouTube's terms cover direct use of their platform, not use of The Pile. Any indications of potential violations of YouTube's terms of service should be directed to the creators of The Pile.'

'We used The Pile to build AI models for academic and research purposes. The Pile is a publicly available dataset,' said Caiming Xiong, vice president of AI research at Salesforce.

An NVIDIA representative declined to comment when contacted by Proof News, while Apple, Bloomberg, and Databricks did not respond to requests for comment.

Continued
Apple refutes reports that YouTube subtitles were used for AI training, saying they are not used in production AI including 'Apple Intelligence' - GIGAZINE



in Software, Web Service, Posted by log1i_yk