What is the 'data shortage problem' that could exhaust the data used for AI training by 2026?



The Internet holds a huge amount of data, and AI systems trained on it are appearing one after another. Even as the use of AI continues to grow explosively, researchers are concerned that the training data that powers these systems may run out.

Researchers warn we could run out of data to train AI by 2026. What then?

https://theconversation.com/researchers-warn-we-could-run-out-of-data-to-train-ai-by-2026-what-then-216741

Training accurate and powerful AI requires vast amounts of data. According to Rita Matulionyte, a senior lecturer at the Faculty of Law at Macquarie University in Australia and an expert on the legal regulation of technology in the creative industries, ChatGPT was trained on 570GB of text data, or approximately 300 billion words.

Similarly, the Stable Diffusion algorithm that powers image generation AI such as DALL-E, Lensa, and Midjourney was trained on the "LAION-5B" dataset, which consists of 5.8 billion image-text pairs. If these algorithms are not given enough training data, their output will be inaccurate or of low quality.



Not only the quantity but also the quality of training data is important. For example, low-quality data such as social media posts or blurry photos is easy to obtain, but it is not suitable for training high-performance AI models.

A more serious problem is that text taken from social media may be full of bias and discrimination, as well as false information and illegal content.

For example, when Microsoft tried to train an AI using content from X (then Twitter), the AI began generating misogynistic and racist comments.

Microsoft's artificial intelligence comes under fire after making a series of problematic statements such as "Fucking feminists should burn in hell" and "Hitler was right" and ceases operations - GIGAZINE



This precedent has led AI developers to seek out high-quality data, such as books, scientific papers, Wikipedia, online articles, and text filtered for specific content. For example, Google has used 11,000 romance novels from the self-publishing site Smashwords to make Google Assistant more conversational.

High-performance models such as ChatGPT and DALL-E 3 were created by training on such large, high-quality datasets, but the limits of this approach are beginning to appear. A paper published on the preprint server arXiv in 2022 predicts that, if AI training continues at the current rate, high-quality text data will run out by 2026 and low-quality text data will be exhausted between 2030 and 2050. It also predicts that low-quality image data will be exhausted between 2030 and 2060.

According to consulting firm PwC, AI could contribute up to $15.7 trillion (approximately 2,364 trillion yen) to the global economy by 2030. However, if usable training data runs out by then, the development of AI could slow down.



However, Matulionyte says, "The situation may not be as bad as it seems." This is because there are many unknowns about how AI models will develop in the future.

AI developers are also exploring ways to address the risk of data scarcity. One is to improve algorithms so that they use existing data more efficiently. If models can learn from less data, it will become possible to train more powerful AI systems with less computing power, which would also reduce the carbon dioxide emitted during AI development.

Another method is to use AI to synthesize training data. This allows AI developers to generate exactly the data they need for a specific AI model. Several projects are already using synthetic content obtained from MOSTLY AI, a company that creates synthetic data for AI models, and Matulionyte believes this method will become more common in the future.

AI developers are also looking beyond the free internet to sources such as content owned by major publishers and offline repositories. News Corp, one of the world's largest owners of news content, announced in September 2023 that it was negotiating content deals with AI developers. In this way, AI development, which has traditionally used freely available content without permission, is shifting toward paying for content.

Regarding this trend, Matulionyte said, "Creators are protesting the unauthorized use of their content to train AI models, and some are suing AI companies such as Microsoft, OpenAI, and Stability AI. Being able to get paid for their work may also help correct the power imbalance that exists between creators and AI companies."

in Software, Posted by log1l_ks