OpenAI releases AI benchmark 'SWE-Lancer' to measure whether a machine can perform tasks that would pay a freelance engineer $1 million

On February 18, 2025, OpenAI released SWE-Lancer, an open-source benchmark for evaluating the coding performance of AI models.

[2502.12115] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
https://arxiv.org/abs/2502.12115

Introducing the SWE-Lancer benchmark | OpenAI
https://openai.com/index/swe-lancer/

SWE-Lancer is a benchmark that measures whether an AI model can complete tasks that would pay freelance software engineers a total of about $1 million (about 150 million yen). According to the paper, it comprises more than 1,400 real freelance tasks drawn from the Upwork platform, and it tests both individual contributor engineering tasks, ranging from a $50 (about 7,500 yen) bug fix to a $32,000 (about 4.8 million yen) feature implementation, and managerial tasks in which the model must choose among competing technical implementation proposals.

Task prices in SWE-Lancer reflect the tasks' actual market value: the more difficult the task, the higher the price.
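
Scoring is dollar-weighted: a model earns a task's payout only if its solution is judged correct, so harder, better-paid tasks contribute more to the total. As a rough illustration, here is a minimal Python sketch; the Task structure, field names, and all-or-nothing grading are assumptions made for this example, not the actual SWE-Lancer code.

```python
from dataclasses import dataclass

# Illustrative sketch only: the schema and grading below are assumed,
# not taken from the actual SWE-Lancer repository.
@dataclass
class Task:
    task_id: str
    payout_usd: float  # the task's real freelance price
    passed: bool       # did the model's solution pass the task's tests?

def dollars_earned(tasks: list[Task]) -> float:
    """Each task pays out all-or-nothing: the model earns the full
    price only when its solution is judged correct."""
    return sum(t.payout_usd for t in tasks if t.passed)

# Toy run: $33,050 of work available, of which $32,050 is earned.
results = [
    Task("fix-crash", 50.0, True),        # small bug fix
    Task("payment-bug", 1_000.0, False),  # medium bug fix, failed
    Task("new-feature", 32_000.0, True),  # large feature implementation
]
total = sum(t.payout_usd for t in results)
earned = dollars_earned(results)
print(f"Earned ${earned:,.0f} of ${total:,.0f} ({earned / total:.0%})")
```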

OpenAI reports, 'When we measured the performance of AI models using SWE-Lancer, we found that current AI models are still unable to solve the majority of tasks.' In fact, the paper shows that out of the roughly $1 million in total available payouts, GPT-4o, o1, and Claude 3.5 Sonnet each completed tasks worth only about $300,000 (about 45 million yen) to $400,000 (about 60 million yen), or roughly 30 to 40 percent of the total value.

'By mapping model performance to monetary value, we hope that SWE-Lancer will enable more research into the economic impact of AI model development,' OpenAI said.

OpenAI has also open-sourced SWE-Lancer to facilitate future research; the source code is available on GitHub.

GitHub - openai/SWELancer-Benchmark: This repo contains the dataset and code for the paper 'SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?'
https://github.com/openai/SWELancer-Benchmark

in Software, Posted by log1r_ut