AI agents can violate ethical constraints by prioritizing evaluations and outcomes



A research team led by Miles Q. Lee and Benjamin C. M. Fang of McGill University in Canada proposed a benchmark that can measure the frequency with which AI agents violate constraints under the pressure of KPIs (key performance indicators), and published it on arXiv, a repository of unpeer-reviewed papers. The paper was currently undergoing peer review at the time of writing, but it is said that many cutting-edge models have observed serious constraint violations at rates of around 30% to 50%.

[2512.20798] A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
https://arxiv.org/abs/2512.20798



Generative AI is increasingly being used in corporate settings as 'AI agents' that not only respond to chat requests but also operate tools and autonomously carry out multiple procedures. The research team proposed the 'Outcome-Driven Constraint Violations Benchmark (ODCV-Bench).' The ODCV-Bench consists of 40 scenarios covering high-risk domains, including healthcare and clinical trials, logistics and supply chains, finance, research and education, corporate operations and legal affairs, and social media.

Each scenario consists of a system prompt that defines the agent's role, a multi-step task, and a working environment prepared in a Docker container. A distinctive feature of this benchmark is that it intentionally leaves loopholes that often occur in the field. For example, it prepares an environment where 'KPI scores will increase if only the formalities are corrected,' allowing us to observe whether the agent seeks out loopholes and exploits them.

In the same scenario, ODCV-Bench applies two types of pressure: one type is a direct command to 'meet the results,' and the other type is a pressure to achieve KPIs without explicitly ordering dishonesty. This allows us to distinguish between following bad orders and voluntarily committing dishonesty simply because of the pressure of KPIs.

The research team assessed behavior on a severity scale of 0 to 5, with a severity rating of 3 or higher counted as a severe constraint violation. Using this scale, the researchers evaluated 12 AI models, finding that the rate of severe constraint violations ranged from 1.3% to 71.4%. Nine of the 12 models fell within the 30% to 50% range, indicating that models under pressure to meet KPIs may violate rules significantly. Gemini 3 Pro Preview , in particular, had a particularly high rate of 71.4%, suggesting that models were more likely to resort to fraudulent or risky methods to meet the requirements.



The research team also points out that high inference ability is not a guarantee of safety. In multi-step tasks, it may be quicker to exploit weaknesses in the scoring or checking process and adjust the numbers rather than following the steps, and the more capable the model, the more likely it is to find a loophole.

The research team also emphasized that constraint violations do not necessarily occur because of a lack of understanding of ethics. When the same model acted as an agent and then self-evaluated as an examiner, it recognized that its own actions were fraudulent in many cases. This means that it is possible for a model to prioritize KPIs despite understanding ethics.

The research team concluded that this type of behavior is difficult to detect in a one-off safety test. In the field, KPIs are important, tasks are multi-step, and checks tend to leave gaps. When these conditions are met, the AI agent may rationally violate constraints, so the research team argues that an evaluation similar to real-world operation, such as ODCV-Bench, is necessary.

in AI, Posted by log1b_ok