February 19, 2025, 11:44 | Software

OpenAI releases the "SWE-Lancer" AI benchmark, which measures whether models can complete freelance software engineering tasks worth a total of $1 million

On February 18, 2025, OpenAI released SWE-Lancer, an open-source benchmark for evaluating the coding performance of AI models.

[2502.12115] SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
https://arxiv.org/abs/2502.12115

Introducing the SWE-Lancer benchmark | OpenAI
https://openai.com/index/swe-lancer/

Today we’re launching SWE-Lancer—a new, more realistic benchmark to evaluate the coding performance of AI models. SWE-Lancer includes over 1,400 freelance software engineering tasks from Upwork, valued at $1 million USD total in real-world payouts. https://t.co/c3pFcL41uK
— OpenAI (@OpenAI) February 18, 2025

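SWE-Lancer is a benchmark that measures whether AI can carry out tasks for which freelance software engineers were paid about $1 million (about 150 million yen) in total. The tasks range from bug fixes worth $50 (about 7,500 yen) to feature implementations worth $32,000 (about 4.8 million yen). The benchmark tests both independent engineering tasks and management tasks, in which the model chooses among technical implementation proposals.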

SWE-Lancer tasks span the full engineering stack, from UI/UX to systems design, and include a range of task types, from $50 bug fixes to $32,000 feature implementations. SWE-Lancer includes both independent engineering tasks and management tasks, where models choose between… pic.twitter.com/3Dg8bjHOSk
— OpenAI (@OpenAI) February 18, 2025

Task prices in SWE-Lancer reflect real-world market value: the harder the task, the higher the price.

SWE-Lancer task prices reflect real-world market value. Harder tasks demand higher payments. pic.twitter.com/0FGWm88RE8
— OpenAI (@OpenAI) February 18, 2025

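OpenAI reports that "when we measured the performance of AI models using SWE-Lancer, current AI models were still unable to solve the majority of the tasks." The paper published by OpenAI shows that, out of $1 million worth of tasks, GPT-4o, o1, and Claude 3.5 Sonnet each completed tasks worth roughly $300,000 (about 45 million yen) to $400,000 (about 60 million yen).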

Current frontier models are unable to solve the majority of tasks. pic.twitter.com/GP3C3UR3cB
— OpenAI (@OpenAI) February 18, 2025

OpenAI says, "By mapping model performance to monetary value, we hope SWE-Lancer enables more research into the economic impact of AI model development."
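
As a rough illustration of what mapping performance to monetary value can mean, the sketch below adds up the payouts of the tasks a model solves and compares that figure with the plain pass rate. It is a minimal sketch, not the benchmark's actual scoring code or data format: the `Task` class, the `earnings` function, and the task list are hypothetical, and only the $50 and $32,000 payouts correspond to the price range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one freelance task (not the real dataset schema)."""
    task_id: str
    payout_usd: float  # what the task paid in the real world
    solved: bool       # whether the model's solution passed


def earnings(tasks):
    """Return (earned_usd, total_usd, pass_rate) for a list of tasks."""
    total = sum(t.payout_usd for t in tasks)
    earned = sum(t.payout_usd for t in tasks if t.solved)
    pass_rate = sum(1 for t in tasks if t.solved) / len(tasks)
    return earned, total, pass_rate


# Made-up tasks: the model solves the two cheap ones and fails the two expensive ones.
tasks = [
    Task("bug-fix", 50.0, True),
    Task("ui-tweak", 250.0, True),
    Task("api-change", 1000.0, False),
    Task("feature", 32000.0, False),
]

earned, total, pass_rate = earnings(tasks)
print(f"pass rate: {pass_rate:.0%}")
print(f"earned: ${earned:,.0f} of ${total:,.0f} ({earned / total:.1%})")
```

With these made-up numbers the model passes half of the tasks but earns under 1% of the available money, because harder tasks carry higher prices. A dollar-weighted score and a plain pass rate can therefore tell quite different stories.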

OpenAI has also open-sourced SWE-Lancer to support future research. The SWE-Lancer source code is available on GitHub.

GitHub - openai/SWELancer-Benchmark: This repo contains the dataset and code for the paper "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?"
https://github.com/openai/SWELancer-Benchmark