The most difficult AI test ever, 'Humanity's Last Exam,' has been released: 3,000 multiple-choice and short-answer questions

AI company Scale AI and the AI research organization Center for AI Safety (CAIS) have jointly released 'Humanity's Last Exam,' a benchmark designed to test the limits of AI knowledge. None of the major existing models achieved an accuracy of more than 10%.
Scale AI and CAIS Unveil Results of Humanity's Last Exam
https://scale.com/blog/humanitys-last-exam-results

Humanity's Last Exam - Publication Ready (PDF file)
A Test So Hard No AI System Can Pass It — Yet - The New York Times
https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html
'Humanity's Last Exam' is a benchmark packed with problems from a wide range of fields, including mathematics, the humanities, and the natural sciences. The problems were contributed by university professors and well-known mathematicians, and while every question has a definite answer, all are extremely difficult to solve. Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, who contributed problems, said, 'All of the problems used were within the scope of what would be asked on graduate school exams.'
In the field of ecology, for example, one question reads: 'Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.'
The test consists of 3,000 questions, mostly multiple-choice and short-answer. Scale AI and CAIS ran the benchmark against several AI models, including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro, and found that no model achieved an accuracy of more than 10%; the highest score was 8.3%, achieved by OpenAI's o1, a model with strong reasoning capabilities.
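The accuracy figures above boil down to simple arithmetic: the fraction of a model's answers that match the reference answers. A minimal sketch of such a scorer, assuming a naive exact-match grader over hypothetical data (the actual exam's grading pipeline is not described in this article):

```python
# Hypothetical sketch of benchmark accuracy scoring via exact match.
# The sample data below is invented for illustration and is NOT from
# the actual Humanity's Last Exam dataset.

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer,
    ignoring case and surrounding whitespace."""
    assert len(predictions) == len(references)
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["2", "Paris", "4 pairs"]
refs  = ["2", "London", "4 pairs"]
print(f"accuracy = {accuracy(preds, refs):.1%}")  # 2 of 3 match
```

Real evaluations of short-answer questions typically need more lenient matching (e.g. normalizing numbers or accepting paraphrases), but the headline percentage is computed the same way: correct answers divided by total questions.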

CAIS co-founder and executive director Dan Hendrycks said, 'We can't predict how quickly models will improve,' commenting on the fact that models that score highly on existing benchmarks failed this one. He predicted that a model with an accuracy of over 50% will emerge within the next year.
The benchmark was created because AI is progressing so quickly that existing benchmarks can no longer measure its capability. For example, on the widely used MATH benchmark that Hendrycks proposed in 2021, no model exceeded 10% at the time of its announcement, but three years later models reaching 90% had appeared.

'We plan to open the dataset to the research community so we can continue probing the limitations of existing models and evaluating new AI models,' said Summer Yue, research director at Scale AI. 'Humanity's Last Exam has been meticulously designed to be the ultimate test, challenging the world's most advanced models.'
Continued: In 'Humanity's Last Exam,' where the highest accuracy had been about 9%, OpenAI's Deep Research recorded more than 26% - GIGAZINE
