Researchers warn that the rapid increase in AI-generated content is creating a loop in which 'AI learns from AI-generated content,' leading to 'model collapse'



Generative AI continues to spread throughout society: Adobe has announced 'Firefly,' an image-generating AI trained only on copyright-cleared data; Microsoft has built ChatGPT-based conversational AI into its Bing search engine; and a global consulting firm has revealed that 50% of its employees use generative AI in their work. However, as more people create and publish content with AI, a group of researchers warns of a new problem: AI-generated content is proliferating on the internet, and when AI models learn from it, serious flaws emerge.

[2305.17493] The Curse of Recursion: Training on Generated Data Makes Models Forget
https://doi.org/10.48550/arXiv.2305.17493



The AI feedback loop: Researchers warn of 'model collapse' as AI trains on AI-generated content | VentureBeat
https://venturebeat.com/ai/the-ai-feedback-loop-researchers-warn-of-model-collapse-as-ai-trains-on-ai-generated-content/



Large language models (LLMs) have played a key role in the rapid spread of generative AI such as Stable Diffusion, which generates highly accurate images from a single sentence (prompt), and ChatGPT, which produces highly accurate text in a conversational format. By combining flexible, adaptable models with large amounts of training data, AI learns the structure of images and sentences.

Why has 'generative AI' that generates images and text suddenly developed? - GIGAZINE



Originally, the data used to train LLMs and other models, such as books, internet articles, photographs, and illustrations, was created by humans without the aid of artificial intelligence. However, with the development of generative AI, more and more people are creating content with AI and publishing it online, raising concerns that this is contaminating the very training data from which models learn. At the end of May 2023, a group of researchers from the UK and Canada published a paper titled 'The Curse of Recursion' on the preprint server arXiv, warning that 'worrying facts have emerged for the future of generative AI technology.'

Ilya Shumailov, one of the lead authors of the paper, explained: 'Focusing on the probability distributions of text-to-text and image-to-image generative AI models, we found that even under nearly ideal conditions for long-term learning, the process of 'data dispersion' is inevitable. Over time, errors in the generated data accumulate, and eventually, learning from generated data leads the AI to misperceive reality even further.' According to Shumailov, a learning model can quickly forget the original data it was initially trained on: 'We were surprised to observe how quickly the model collapsed.'
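
To get an intuition for how such error accumulation plays out, here is a minimal sketch in Python (my own construction, not the paper's experiments): the 'model' is just a Gaussian fitted to its training data, and each generation trains only on samples drawn from the previous generation's fit. Small estimation errors compound, and the spread of the original distribution is the first thing to be forgotten.

import numpy as np

# Toy model collapse: fit a Gaussian, then let the next generation
# train only on samples from that fit. Repeated long enough, the
# fitted standard deviation drifts toward zero and the tails of the
# original distribution vanish.
rng = np.random.default_rng(42)
n = 100
data = rng.normal(0.0, 1.0, size=n)  # generation 0: "human" data

for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()   # "train" on the current data
    data = rng.normal(mu, sigma, size=n)  # next generation sees only model output
    if gen % 100 == 0:
        print(f"generation {gen:3d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

In this toy run, sigma decays toward zero over the generations: the model forgets the spread of the original data long before it forgets the mean.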



Ross Anderson, professor of security engineering at the University of Cambridge and the University of Edinburgh and another author of the paper, wrote about the research on his blog: 'Just as we've littered the oceans with plastic and filled the atmosphere with carbon dioxide, we're filling the internet with useless information. LLMs are like fire: a useful tool, but one that pollutes the environment.'

Anderson also points out that as AI-generated content swells the volume of material on the internet, it becomes harder to scrape the web for data to train new models, which will ultimately favor companies that already hold sufficient training data or can manage large amounts of human-generated content. Indeed, a blog post from the Internet Archive revealed that AI startups have been making large-scale access requests to the Internet Archive in search of clean training data.



Shumailov attributes the mechanism by which AI-generated content corrupts training data to 'data bias.' According to Shumailov, whereas original human-created data represents the world relatively fairly, generative AI models tend to over-prioritize popular data and often misunderstand or misrepresent less popular data.

For example, suppose a model is trained on 100 photos of cats, 90 with yellow fur and 10 with blue fur. The model learns that yellow cats are more common, but it also tends to render blue cats as more yellowish than they really are, and when generating new data it may even output green cats. As further training cycles learn from the 'yellowish blue cats' and 'green cats' the model itself generated, the blue gradually fades until eventually every cat is yellow. The researchers call this process, in which minority data is distorted and its characteristics are ultimately lost, 'model collapse.'
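
The cat example can be caricatured with a tiny simulation (a hypothetical illustration, not code from the paper): treat training as estimating the share of blue cats from the current photos, and generation as sampling the next set of photos from that estimate. The rare color's share then performs a random walk with absorbing ends, and starting from 10% it almost always drifts to zero, after which blue cats can never reappear.

import numpy as np

# Toy tail loss: the estimated share of blue cats is re-estimated from
# freshly generated "photos" each generation. Once a generation happens
# to contain no blue cats, the estimate hits zero and is stuck there.
rng = np.random.default_rng(7)
n = 100        # photos per training set
p_blue = 0.10  # generation 0: 90 yellow cats, 10 blue cats

for gen in range(1, 1001):
    blue = rng.binomial(n, p_blue)  # generate the next set of cat photos
    p_blue = blue / n               # retrain: new estimated share of blue
    if p_blue == 0.0:
        print(f"generation {gen}: blue cats have vanished for good")
        break
    if p_blue == 1.0:
        print(f"generation {gen}: (a rare run) yellow cats vanished instead")
        break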



Furthermore, even when the model is trained in a way that limits how many times the training cycle is repeated, model collapse has been found to occur anyway, because the model begins to fabricate incorrect responses in order to avoid repeating data too often. As solutions to model collapse, the paper proposes 'avoiding contamination by AI-generated data by maintaining an exclusive, nominally human-created, high-quality copy of the original dataset and periodically retraining on it or refreshing it completely,' and 'introducing new, clean, human-generated datasets into training.'

Shumailov points out that preventing model collapse requires sufficient training, accurate depiction of features, and fair representation of underrepresented data in the dataset. 'If you include 10% human-created data in the training, even with recursive AI content, the model collapse won't occur as quickly,' Shumailov told VentureBeat. 'But it's still going to collapse, just not as quickly.'
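
The 10% figure can be dropped into the same toy Gaussian setup as above (again my own sketch; only the 10% fraction comes from the quote, everything else is assumed): each generation's training set keeps a fixed slice of fresh human-created data alongside the model's own output.

import numpy as np

# Toy mitigation: 90% of each generation's training set is model output,
# 10% is fresh human data drawn from the original distribution. The human
# anchor keeps the fitted parameters from drifting all the way to collapse.
rng = np.random.default_rng(42)
n, human_frac = 100, 0.10
data = rng.normal(0.0, 1.0, size=n)

for gen in range(1, 501):
    mu, sigma = data.mean(), data.std()
    synthetic = rng.normal(mu, sigma, size=int(n * (1 - human_frac)))
    human = rng.normal(0.0, 1.0, size=int(n * human_frac))  # fresh real data
    data = np.concatenate([synthetic, human])
    if gen % 100 == 0:
        print(f"generation {gen:3d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")

In this toy the human anchor actually pins the distribution in place; real models are far messier, which is why Shumailov expects collapse to be slowed rather than prevented.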

in AI, Software, Web Service, Posted by log1e_dh