Anthropic reports that agent coding benchmark scores can vary by several percentage points depending on hardware configuration, meaning that the gap between high-performing models on a leaderboard may partly reflect the benefit of more powerful hardware rather than model capability.

Anthropic, the developer of Claude, reports that 'infrastructure configuration can cause agent coding benchmarks to vary by several percentage points,' which can affect the rankings of popular AI models on benchmark leaderboards.
Quantifying infrastructure noise in agentic coding evals \ Anthropic

Agent coding benchmark tests such as SWE-bench and Terminal-Bench are widely used to compare the software engineering capabilities of state-of-the-art AI models.
The top entries on these leaderboards are often separated by only a few percentage points, and because benchmark scores are treated as an accurate measure of the relative capabilities of AI models, they serve as important inputs when deciding which models to deploy.
However, Anthropic reports that infrastructure configuration alone can move scores by more than that margin: in internal testing, the gap between its most and least generously resourced configurations on Terminal-Bench 2.0 was 6 percentage points.
Static benchmarks directly evaluate a model's output, so the execution environment does not affect the results. Agentic coding evaluations are different: the model is given a complete environment in which it writes code, runs tests, installs dependencies, and iterates over multiple turns. The runtime is no longer a passive container but an integral part of the problem-solving process, and two agents given different resource budgets and time constraints are not really taking the same test.
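As a rough illustration (this is not Anthropic's harness, and every name in it is hypothetical), an agentic evaluation loop might look like the following Python sketch, in which the sandbox's CPU, memory, and wall-clock budget constrain every turn the agent takes:

```python
# Hypothetical sketch of an agentic coding eval loop (not Anthropic's harness).
# The point: the sandbox's resource and time budget is woven into every turn,
# so the environment itself shapes what the agent can accomplish.
import time

MAX_TURNS = 50           # turn budget (assumed value)
WALL_CLOCK_BUDGET = 600  # seconds per task (assumed value)

def run_task(model, sandbox, task):
    """Let the model iterate inside a resource-limited sandbox until tests pass."""
    deadline = time.monotonic() + WALL_CLOCK_BUDGET
    transcript = [task.description]
    for _ in range(MAX_TURNS):
        if time.monotonic() > deadline:
            return False                               # ran out of wall-clock time
        command = model.next_command(transcript)       # e.g. "pip install ...", "pytest"
        result = sandbox.execute(command)              # subject to CPU/RAM limits
        transcript.append(result.output)
        if result.oom_killed:
            transcript.append("process killed: memory limit exceeded")
        if sandbox.tests_pass():
            return True
    return False
```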
Evaluation developers have begun to take this into account: Terminal-Bench, for example, specifies recommended CPU and RAM for each task in its 2.0 release. However, specifying resources and consistently enforcing them are not the same thing, and the method of enforcement can change what the benchmark actually measures.

Anthropic noticed this when it ran Terminal-Bench 2.0 on Google Kubernetes Engine and saw scores that did not line up with the official leaderboard. The discrepancy turned out to be a matter of enforcement: Google Kubernetes Engine treated each task's resource specification as both a minimum and a maximum, so every container was guaranteed the specified resources but was killed the moment it exceeded them. In Kubernetes terms, the same value was doing double duty as two different parameters: the request, a guaranteed allocation reserved in advance, and the limit, a hard cap at which a container is killed.
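To make the distinction concrete, here is a minimal sketch of the two enforcement styles as Kubernetes container resource blocks, written as Python dicts; the values are illustrative and not taken from the Terminal-Bench task specifications:

```python
# Two ways to enforce the same per-task resource specification in Kubernetes.
# Values are illustrative, not the actual Terminal-Bench task specs.

# Strict enforcement: requests == limits ("Guaranteed" QoS class).
# The container is always reserved 2 CPUs / 4 GiB, but it is OOM-killed
# the moment it uses more memory than the limit.
strict_resources = {
    "requests": {"cpu": "2", "memory": "4Gi"},
    "limits":   {"cpu": "2", "memory": "4Gi"},
}

# Flexible enforcement: requests still reserve the specified resources,
# but the limits are set higher (or omitted), so the container can
# temporarily burst above its allocation without being killed.
flexible_resources = {
    "requests": {"cpu": "2", "memory": "4Gi"},
    "limits":   {"cpu": "4", "memory": "8Gi"},
}
```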
The official Terminal-Bench leaderboard, by contrast, uses a sandbox provider with a more flexible implementation that, in order to prioritize infrastructure stability, allows containers to temporarily exceed their allocation without being killed.
This raised the question of how much resource configuration actually affects evaluation scores. To quantify the impact, Anthropic ran Terminal-Bench 2.0 under six different resource configurations and found that the task success rate improved as resource limits increased.
Anthropic also found that once the resource allocation was raised to three times the Terminal-Bench specification or more, the extra resources actively helped the agent solve tasks it could not solve under tighter limits, which in turn shifted its benchmark score.
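A resource sweep of this kind can be scripted in a few lines. The following Python sketch is purely illustrative and is not Anthropic's harness; the base specs, multipliers, and task images are all assumptions, and Docker's `--cpus` and `--memory` flags stand in for whatever the real sandbox uses to enforce caps:

```python
# Hypothetical sketch of sweeping resource configurations for a benchmark run
# (not Anthropic's actual harness). Each multiplier scales an assumed per-task
# CPU/RAM spec; docker's --cpus and --memory flags enforce the caps.
import subprocess

BASE_CPU = 2                            # assumed per-task spec (CPUs)
BASE_MEM_GB = 4                         # assumed per-task spec (GiB)
MULTIPLIERS = [0.5, 1, 1.5, 2, 3, 4]    # six configurations, values illustrative

def run_config(task_image: str, multiplier: float) -> bool:
    """Run one task container under a scaled resource cap; True if it passes."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            f"--cpus={BASE_CPU * multiplier}",
            f"--memory={int(BASE_MEM_GB * multiplier)}g",
            task_image,                  # hypothetical task image
        ],
        capture_output=True,
    )
    return result.returncode == 0        # the task's own test suite decides pass/fail

# Per-configuration success rates would then be the mean of run_config()
# over all task images for each multiplier in MULTIPLIERS.
```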
An agent that quickly writes lean, efficient code can perform well under strict constraints, while an agent that brute-forces solutions with heavy tooling can perform well under more lenient ones. 'Both are valid approaches, but combining them into a single score without specifying the resource configuration makes it difficult to interpret how the results generalize to the real world,' Anthropic noted.
Different AI models have different default approaches, and resource configurations determine which approach is more successful. The direction of the effect was consistent, but the magnitude seemed to vary. 'The same trend appears to hold for AI models other than Claude, but we have not rigorously tested it,' Anthropic explains.

According to Anthropic, it's not just resource allocation that affects benchmark scores; time limits also seem to affect scores in certain settings.
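Time limits interact with hardware in a straightforward way: the same agent behavior can pass on a fast machine and time out on a slow one. The sketch below is purely illustrative of per-task timeout enforcement; the actual harness and its timeout values are not specified in Anthropic's report:

```python
# Minimal sketch of how a per-task wall-clock limit can decide pass vs. fail
# (illustrative only; not the actual benchmark harness).
import subprocess

def run_with_time_limit(command: list[str], limit_seconds: int) -> bool:
    """Run a task command; a slower environment may hit the timeout and fail."""
    try:
        result = subprocess.run(command, timeout=limit_seconds, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False    # identical agent behavior, but scored as a failure
```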
Anthropic therefore states that the ideal for agent coding benchmarks is to run every evaluation under exactly the same hardware conditions, ensuring full reproducibility, while acknowledging that this is not always realistic.
'Benchmark scores are increasingly being used to inform decision-making, but this increased attention has not necessarily been accompanied by rigor in how they are performed and reported. As things stand, a 2 percentage point lead on a leaderboard could reflect a true difference in ability, or it could simply reflect running the benchmark on more powerful hardware, or at a more favorable time of day, or both,' Anthropic said.
Benchmark maintainers would be well served by publishing recommended resource specifications and clarifying how they are enforced, which would help close gaps like the one Anthropic identified. For anyone consuming benchmark results, the key takeaway is that small score differences in agentic evaluations carry more uncertainty than the precision of the reported numbers suggests.