Research shows that AI tricks humans, especially with the Gemini 1.5 Pro

A research team at Stanford University announced that tests on major large-scale language models such as
[2502.08177] SycEval: Evaluating LLM Sycophancy
https://arxiv.org/abs/2502.08177
SycEval: Evaluating LLM Sycophancy
https://arxiv.org/html/2502.08177

Large Language Models Show Concerning Tendency to Flatter Users, Stanford Study Reveals
The research team developed a comprehensive evaluation framework for large-scale language models and tested their models on mathematics tasks using the AMPS dataset and medical advice using the MedQuAD dataset .
In the experiment, a total of 3,000 prepared questions were posed to each large-scale language model, and additional scripts were presented to generate counterarguments of increasing rhetorical strength, and the generated answers were compared to the real answers, totaling 24,000.

The research team also used a generative AI-based automated evaluation technique called 'LLM-As-A-Judge' to classify the generated answers into three types: 'correct,' 'incorrect,' and 'inappropriate.' In this case, 'incorrect' means that the generated answer differs from the true answer, and 'inappropriate' means that the model refrains from answering the query or provides a response that is unrelated to the content contained in the query.

The results showed that 58.19% of all samples were in agreement with the opinions of human users, with Gemini 1.5 Pro showing the highest agreement rate at 62.47%, followed by Claude 3.5 Sonnet at 57.44% and GPT-4o at 56.71%.
The research team also classified the generated results into 'progressive resonance,' in which the AI generates answers toward the correct answer, and 'regressive resonance,' in which the AI presents incorrect answers to match the user's opinion. As a result, it was revealed that of the 58.19% of all samples, 43.52% were progressive resonance and 14.66% were regressive resonance. By model, Gemini 1.5 Pro had 53.22% progressive resonance and 9.25% regressive resonance. Claude 3.5 Sonnet had 39.13% progressive resonance and 18.31% regressive resonance. And GPT-4o has 42.32% progressive resonance and 14.40% regressive resonance.
In light of the results of this experiment, the research team raises concerns about the reliability of AI systems in educational environments, medical diagnosis and advice, professional consulting, and solving technical problems. 'If an AI model prioritizes user opinion over independent reasoning, it could undermine its ability to provide accurate and useful information, which could be a major problem in situations where correct information is critical for decision-making or safety,' the team said.
According to the researchers, this behavior of the AI prioritizing user opinions may reflect training the model to maximize positive feedback, as AI companies have taught their AI models that users tend to respond better when they agree with them.
Therefore, the research team emphasized the need to 'improve training methods that balance cooperation and accuracy,' 'develop better evaluation frameworks to detect behavior that aligns with user opinions,' 'develop AI systems that can maintain their usefulness while remaining autonomous,' and 'implement safety measures in important applications.' The research team also said, 'The results obtained from this study lay the foundation for developing reliable AI systems for high-stakes applications that must prioritize accuracy over alignment to impress users.'

Related Posts: