AI models are getting smarter so quickly that testing methods can't keep up



In recent years, AI systems applicable to fields such as medicine and science have multiplied, and many of them have demonstrated abilities that surpass humans. The capabilities of such AI are measured with evaluation tests that quantify performance, but TIME magazine reports that the pace at which new evaluation tests are created is not keeping up with the progress of AI.

AI Models Are Getting Smarter. New Tests Are Racing to Catch Up | TIME

https://time.com/7203729/ai-evaluations-safety/

In the early days of modern AI, capabilities were measured by assessing a system's performance on specific tasks, such as classifying images or playing games, and it commonly took years for AI to solve a new evaluation test after it was introduced. For example, it took five years for AI to surpass humans on the ImageNet Large Scale Visual Recognition Challenge, which was introduced in 2010.

However, the gap between an evaluation test's introduction and AI systems passing it has been narrowing significantly year by year.

The GLUE benchmark, which debuted in 2018, tested an AI's ability to understand natural language through tasks such as determining what a pronoun refers to from context, but it was solved about a year after its launch. A more difficult version, SuperGLUE, was created in 2019, but within two years AI models had matched human performance on it.
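To illustrate the kind of pronoun task involved (a standard Winograd-schema example, given here only as an illustration rather than quoted from the benchmark itself): in the sentence 'The trophy doesn't fit in the suitcase because it is too big,' the system must work out that 'it' refers to the trophy, whereas if the sentence ends 'because it is too small,' the same pronoun refers to the suitcase.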

Scores on broader benchmarks are also strikingly high. On the 'Measuring Massive Multitask Language Understanding (MMLU)' evaluation test, which consists of approximately 16,000 multiple-choice questions spanning a wide range of fields including philosophy, medicine, and law, OpenAI's 'GPT-4o' model, released in May 2024, achieved an accuracy of 88%, while the company's more recent o1 model recorded 92.3%.
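As a rough illustration of what such an accuracy figure means, the sketch below scores a multiple-choice benchmark in the simplest way: count how many of the model's chosen options match the answer key and divide by the number of questions. This is only a minimal sketch under assumed inputs; the sample questions and the model_answer stub are hypothetical placeholders, not real MMLU items or any lab's actual evaluation harness.

```python
# Minimal sketch of scoring a multiple-choice benchmark such as MMLU:
# accuracy = (number of correct answers) / (total number of questions).
# The questions below are hypothetical placeholders, not real MMLU items.

questions = [
    {"prompt": "Which organ produces insulin?",
     "choices": {"A": "Liver", "B": "Pancreas", "C": "Kidney", "D": "Spleen"},
     "answer": "B"},
    {"prompt": "Who wrote 'Critique of Pure Reason'?",
     "choices": {"A": "Hume", "B": "Hegel", "C": "Kant", "D": "Locke"},
     "answer": "C"},
]

def model_answer(prompt: str, choices: dict[str, str]) -> str:
    """Stand-in for querying the model being evaluated; returns a choice key."""
    return "B"  # a real harness would call the model here

correct = sum(
    1 for q in questions
    if model_answer(q["prompt"], q["choices"]) == q["answer"]
)
accuracy = correct / len(questions)
# With a real model, this printed value is the benchmark score
# (for example, 88% reported for GPT-4o on MMLU).
print(f"accuracy: {accuracy:.1%}")
```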



This creates a big challenge: with modern AI models regularly scoring at the top of existing evaluation tests, it is hard to gauge how quickly the systems are actually improving. In addition, because evaluation tests measure only an AI's basic capabilities, there are doubts about whether a model will perform as well in realistic scenarios as its scores suggest. Marius Hobbhahn, an AI safety researcher, points out that creating such evaluation tests is 'surprisingly difficult.'

To meet these challenges, new and more sophisticated evaluation tests are being developed.

The FrontierMath evaluation test, designed by the research institute Epoch AI, is made up of about 300 math problems devised by leading mathematicians, with difficulty ranging from International Mathematical Olympiad-style problems that 'very talented high school students can theoretically solve' up to research-level problems. It is known to be far more difficult than existing math benchmarks, yet OpenAI's o3 model has already scored 25.2% on the test, which has surprised mathematicians.
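For a sense of scale (a rough back-of-the-envelope calculation, since 'about 300' is only an approximate problem count): a 25.2% score corresponds to solving roughly 0.252 × 300 ≈ 75 of the problems.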

Mathematicians talk about the shock of OpenAI's o3 model scoring 25.2% on the ultra-difficult math dataset 'FrontierMath' - GIGAZINE



Scale AI is also developing an evaluation with the ominous name 'Humanity's Last Exam,' which aims to include 20 to 50 times as many questions as FrontierMath while also covering fields such as physics, biology, and electrical engineering; it is expected to be available by early 2025.



However, even when such an evaluation test is designed, it is only a matter of time before an AI achieves a high score on it, and developing the tests themselves is costly. 'AI evaluation is by no means cheap, and the cost of development far exceeds the cost of conducting the evaluation,' said Tamay Besiroglu of Epoch AI.

'As AI models advance rapidly, evaluations are racing to keep up, but effective tests remain difficult to design, costly and underfunded compared to their importance for spotting dangerous capabilities early,' TIME said. 'With leading labs releasing high-performance models every few months, the need for new tests to evaluate the models' capabilities has never been greater.'

in Software, Posted by log1p_kr