Apr 08, 2024 16:00:00

Meta reportedly discussed scraping copyrighted works to strengthen its AI, even if it meant being sued

The New York Times reported, citing records of confidential meetings, that Meta executives and lawyers were 'considering using copyrighted content to train AI, even if it meant risking litigation.'

Four Takeaways on the Race to Amass Data for AI - The New York Times

https://www.nytimes.com/2024/04/06/technology/ai-data-tech-takeaways.html

How tech giants cut corners to harvest data for AI - The Economic Times
https://economictimes.indiatimes.com/tech/technology/how-tech-giants-cut-corners-to-harvest-data-for-ai/articleshow/109093168.cms

Training an AI model requires a huge amount of data. For example, OpenAI's GPT-3 was trained on roughly 300 billion tokens of text, drawn from a dataset that included 410 billion tokens of webpage text collected since 2007, as well as scans of books and social media posts.

Huge amounts of data are constantly being generated on the Internet, and the total amount of data acquired and consumed worldwide is estimated to exceed 180 zettabytes (1 zettabyte is 1 trillion gigabytes) per year by 2025. Even so, AI development is using up suitable material faster than it is produced, and a 2022 paper estimated that the high-quality data usable for training AI will run out by 2026.

In an effort to win the intensifying 'AI arms race,' companies are scrambling to collect data, even if that means ignoring the rules. In an article published on April 6, 2024, The New York Times reported that 'OpenAI used videos to train its AI in violation of YouTube's terms of service' and that 'YouTube's parent company, Google, engages in similar behavior itself and has therefore tolerated OpenAI's violations.'

It turns out that OpenAI used over 1 million hours of YouTube videos to train AI models - GIGAZINE

In a separate article, The New York Times reported that it had obtained recordings of Meta executives discussing with lawyers how to obtain the vast amounts of data needed to develop the company's AI.

Ahmad Al-Dahle, Meta's vice president of generative AI, who was growing anxious about ChatGPT's lead, met almost daily from March to April 2023 with lawyers and the leaders of the company's AI development business, insisting that Meta could not catch up with OpenAI without more data.

Among the proposals considered were a flat $10 licensing fee for each new book, or the acquisition of Simon & Schuster, a major publisher that handles works by popular authors such as J.K. Rowling and Stephen King.

There was also talk of hiring companies in Africa to summarize copyrighted works from the internet and books without permission, along with a call to 'siphon off even more works, even if it means litigation.'

In response, one lawyer raised ethical concerns, asking whether it was appropriate to take artists' intellectual property from them, but according to the recording the question was met with a heavy silence.

'The only obstacle to building something as great as ChatGPT is literally the amount of data,' Nick Grudin, Meta's vice president of global partnerships and content, said in one meeting.

Apr 08, 2024 16:00:00 in Software, Posted by log1l_ks