Apr 01, 2025 21:30:00

Historian points out that AI intelligence tests lack any measure of the 'ability to ask questions,' a skill that matters to humans

In recent years, companies and research institutes around the world have built increasingly capable AI, and many benchmarks and tests have been created to measure its performance.

Dan Cohen, a professor of history at Northeastern University in the United States, took one of these AI performance tests himself. Based on that experience, he points out that tests for AI overlook an important human ability: the ability to ask questions.

Asking Good Questions Is Harder Than Giving Great Answers
https://newsletter.dancohen.org/archive/asking-good-questions-is-harder-than-giving-great-answers/

Recently, Cohen tried to solve the 'history' section of Humanity's Last Exam, a test for AI, on his own. According to the researchers who developed Humanity's Last Exam, an AI that gets an 'A' on this test can be judged to have the ability to replace humans.

Unfortunately, Cohen got an 'F' in the history section of Humanity's Last Exam. Cohen said he only got one question right in the history section, which he admitted was pretty embarrassing for someone with a PhD in history.

However, Cohen says that actually working through Humanity's Last Exam made several problems apparent. First of all, Humanity's Last Exam has more than 3,000 questions, of which more than 1,200 are about mathematics, while only 16 are about history. In addition, four of the 16 history questions were about 'past naval battles,' which appears to be one of the reasons why Cohen, who has little knowledge of warships, struggled.

Of the other questions, he said, 'It's a long, winding, narrative journey that's clearly intended to confuse the AI. These questions certainly succeeded in confusing me.'

This tendency in the Humanity's Last Exam questions implicitly equates 'intelligence' with 'the ability to provide correct answers to complex questions.' AI development companies use such performance tests to claim that 'the performance of the new large language model has improved by xx% over the previous model,' or that 'the new AI has achieved a high accuracy rate on a doctoral-level test.'

Cohen acknowledges that AI has performed very well in a variety of tasks and tests to date, and that this has led to applications in real-world work. In fact, he says that the digital team at the library he directs has created 'abstracted interfaces to all the major multimodal AI services that are much more capable than I am' and is delivering great results.

Cohen's fellow historian, Benjamin Breen, also reported that the latest AI has shown capabilities comparable to doctoral students in some areas, and outperforms many doctoral students in tasks such as translation and transcription. In particular, the ability of AI to recognize handwritten characters in historical documents could have a major impact on historical research.

While Cohen acknowledges the capabilities of AI, he argues that doctoral-level work requires more than just getting the right answers; it also requires asking unique new questions.

'We may ultimately want answers, but we must start with new lines of inquiry, new areas of interest,' Cohen said. 'On the path to a better understanding of the past and present, good questions in history may ultimately require accurate translations of inscriptions or knowledge of the locations of naval battles. But before that, we must imagine why someone today would be interested in such documents and events in the first place, and how they have shaped our world. That's a much bigger challenge.'

For example, the book 'Listening in Paris,' which Cohen recently read, starts from a simple question: 'Why have orchestra audiences become quieter?' Today, audiences stay quiet during orchestra concerts, but this was not always the case, and in the past there were times when audiences were noisy during performances. Cohen argued that asking such simple questions is itself an important part of research.

Apr 01, 2025 21:30:00 in AI, Note, Posted by log1h_ik