OpenAI announces 'Operator,' an AI agent that can be asked to perform tasks on any website
OpenAI has announced a research preview of 'Operator,' an AI agent that automatically operates a browser according to user instructions, and has also released details on 'Computer-Using Agent (CUA),' the model that powers Operator.
Introducing Operator research preview | OpenAI
https://openai.com/index/introducing-operator/
Computer-Using Agent | OpenAI
https://openai.com/index/computer-using-agent/
A research preview of Operator, an agent that can use its own browser to perform tasks for you. pic.twitter.com/wkBBDIlVqj
— OpenAI (@OpenAI) January 23, 2025
CUA, the model behind Operator, builds on the capabilities of GPT-4o and has been additionally trained to see the screen as a human does and to operate buttons, menus, text fields, and so on. Because it uses the same interface as a human, it has the advantage of being able to carry out tasks flexibly without relying on service-specific or application-specific APIs.
The CUA used in this research preview of Operator sets a new state of the art in both computer operation and browser operation. The comparison with the previous state-of-the-art models in each field is as follows. For PC operation (OSWorld), it scored only 38.1%, about 34 points below the human score of 72.4%, but for browser operation (WebArena) it scored 58.1%, narrowing the gap with the human score of 78.2% to about 20 points.
Benchmark type | Benchmark | OpenAI CUA (computer use, universal interface) | Previous SOTA (computer use, universal interface) | Previous SOTA (web browsing agents) | Human |
---|---|---|---|---|---|
Computer operation | OSWorld | 38.1% | 22.0% | - | 72.4% |
Browser operation | WebArena | 58.1% | 36.2% | 57.1% | 78.2% |
Browser operation | WebVoyager | 87.0% | 56.0% | 87.0% | - |
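The gaps quoted above can be checked with simple arithmetic. The following is a minimal sketch that uses only the OSWorld and WebArena scores from the table; the dictionary layout and the function names are illustrative choices, not anything published by OpenAI, and "previous SOTA" here means the universal-interface column.

```python
# Scores in percent, copied from the table above.
# "previous_sota_universal" is the previous state of the art among
# universal-interface (computer use) models.
SCORES = {
    "OSWorld": {"cua": 38.1, "previous_sota_universal": 22.0, "human": 72.4},
    "WebArena": {"cua": 58.1, "previous_sota_universal": 36.2, "human": 78.2},
}


def gap_to_human(benchmark: str) -> float:
    """Human score minus CUA score, in percentage points."""
    row = SCORES[benchmark]
    return round(row["human"] - row["cua"], 1)


def gain_over_previous(benchmark: str) -> float:
    """CUA score minus the previous universal-interface SOTA, in points."""
    row = SCORES[benchmark]
    return round(row["cua"] - row["previous_sota_universal"], 1)


for name in SCORES:
    print(
        f"{name}: {gap_to_human(name)} points behind humans, "
        f"{gain_over_previous(name)} points ahead of the previous SOTA"
    )
```

The OSWorld gap works out to roughly 34 points and the WebArena gap to roughly 20 points, matching the figures in the text.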
The model works as follows. CUA repeats three steps until the task is complete: 'recognize the screen state,' 'think about the next operation,' and 'execute it.' The screen state is added to the context as a screenshot.
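The three-step loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `FakeBrowser`, `Action`, `choose_action`, and `run_agent` are hypothetical names invented for this example, screenshots are stood in for by plain strings, and the "thinking" step is a trivial rule rather than a call to the actual model.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # "click" or "done" in this toy example
    target: str = ""


@dataclass
class FakeBrowser:
    """Stand-in for a real browser; each 'screenshot' is just a string."""
    pages: list[str]
    index: int = 0

    def screenshot(self) -> str:
        return self.pages[self.index]

    def execute(self, action: Action) -> None:
        # Any click simply advances to the next page, if there is one.
        if action.kind == "click":
            self.index = min(self.index + 1, len(self.pages) - 1)


def choose_action(context: list[str]) -> Action:
    """Hypothetical policy; a real agent would query the model here."""
    latest = context[-1]
    if "score" in latest:
        return Action("done")
    return Action("click", target=latest)


def run_agent(browser: FakeBrowser, max_steps: int = 200) -> list[Action]:
    context: list[str] = []   # screenshots accumulate in the context
    history: list[Action] = []
    for _ in range(max_steps):
        context.append(browser.screenshot())   # 1. recognize the screen state
        action = choose_action(context)        # 2. think about the next operation
        history.append(action)
        if action.kind == "done":
            break
        browser.execute(action)                # 3. execute it
    return history


browser = FakeBrowser(["home", "plus section", "grammar quiz", "score page"])
steps = run_agent(browser)
print([a.kind for a in steps])
```

The `max_steps` cap is a design choice for the sketch so that the loop always terminates even if the task is never recognized as finished.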
The OpenAI blog provides detailed examples that walk through every step of a browser task. One of them uses the prompt 'Visit the Plus section of the Cambridge Dictionary, take the recommended grammar quiz without logging in, and let us know your score.'
The AI opened the site.
It found the word 'Plus' and clicked on it.
An advertisement appeared, but the AI correctly clicked 'Close.'
It scrolled to find 'Suggested Grammar Quizzes.'
It found 'Grammar Quiz' and clicked on it.
After repeating steps like these 152 times in total, the AI reported the score and completed the task.
On the other hand, while CUA fell well short of humans on PC operation tasks, it significantly outperformed the previous state-of-the-art model, Claude 3.5 Sonnet.
Operator, the AI agent for automatic browser operation released by OpenAI, combines a browser with CUA and operates the browser automatically from prompt instructions alone. The user can also take over at any point while the AI is operating. The AI is trained to ask the user for help when a task requires it, such as logging in, making a payment, or solving a CAPTCHA.
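The handoff behavior described above can be pictured as a simple control decision at each step. The sketch below is a rough illustration only: Operator is trained to ask for help, whereas this example substitutes a plain keyword check, and `SENSITIVE_KEYWORDS`, `needs_user`, and `next_controller` are hypothetical names that do not come from OpenAI.

```python
# Keyword list is an assumption for illustration; the real agent relies on
# training, not on string matching.
SENSITIVE_KEYWORDS = ("log in", "login", "payment", "captcha")


def needs_user(step_description: str) -> bool:
    """Return True when the step looks like one the user should handle."""
    text = step_description.lower()
    return any(keyword in text for keyword in SENSITIVE_KEYWORDS)


def next_controller(step_description: str) -> str:
    """Decide who controls the browser for this step: 'user' or 'agent'."""
    return "user" if needs_user(step_description) else "agent"


for step in ("Scroll to the quiz list", "Solve the CAPTCHA", "Enter payment details"):
    print(f"{step}: {next_controller(step)}")
```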
At the time of writing, the research preview of Operator is available only to Pro plan subscribers living in the United States.
in Software, Web Service, Posted by log1d_ts