Jan 24, 2025 12:40:00

OpenAI announces 'Operator,' an AI agent you can ask to perform tasks on any website

OpenAI has announced a research preview of 'Operator,' an AI agent that automatically operates a browser according to user instructions, and has also released details on 'Computer-Using Agent (CUA),' the model that powers Operator.

Introducing Operator research preview | OpenAI

https://openai.com/index/introducing-operator/

Computer-Using Agent | OpenAI
https://openai.com/index/computer-using-agent/

A research preview of Operator, an agent that can use its own browser to perform tasks for you. pic.twitter.com/wkBBDIlVqj
— OpenAI (@OpenAI) January 23, 2025

Operator's underlying model, 'Computer-Using Agent (CUA),' has the reasoning ability of GPT-4o, but has been additionally trained so that it can see the screen as a human does and operate buttons, menus, text fields, and so on. Because it uses the same interface as a human, it can flexibly carry out tasks without relying on APIs specific to each service or application.

The CUA used in this research preview of Operator has set a new state of the art in both computer operation and browser operation. For computer operation (OSWorld), it scored only 38.1%, about 34 points below the human score of 72.4%, but for browser operation (WebArena), it scored 58.1%, narrowing the gap with humans to about 20 points. The comparison with the previous state-of-the-art models in each field is as follows.

Benchmark type	Benchmark	OpenAI CUA	Previous SOTA (computer use, universal interface)	Previous SOTA (web browsing agents)	Human
Computer operation	OSWorld	38.1%	22.0%	-	72.4%
Browser operation	WebArena	58.1%	36.2%	57.1%	78.2%
Browser operation	WebVoyager	87.0%	56.0%	87.0%	-
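
The gaps quoted above follow directly from the benchmark figures. As a quick check, the short Python sketch below recomputes them. The dictionary layout and the function names `gap_to_human` and `gain_over_previous` are illustrative assumptions, not anything published by OpenAI.

```python
# Scores in percent, copied from the benchmark figures in this article.
# The data layout and function names are illustrative assumptions.
SCORES = {
    "OSWorld": {"cua": 38.1, "previous_sota_computer_use": 22.0, "human": 72.4},
    "WebArena": {"cua": 58.1, "previous_sota_computer_use": 36.2, "human": 78.2},
}


def gap_to_human(benchmark: str) -> float:
    """Return how many percentage points CUA trails the human score."""
    row = SCORES[benchmark]
    return round(row["human"] - row["cua"], 1)


def gain_over_previous(benchmark: str) -> float:
    """Return how many points CUA leads the previous computer-use SOTA."""
    row = SCORES[benchmark]
    return round(row["cua"] - row["previous_sota_computer_use"], 1)


for name in SCORES:
    print(
        f"{name}: {gap_to_human(name)} points behind humans, "
        f"{gain_over_previous(name)} points ahead of the previous SOTA"
    )
```

WebVoyager is left out because no human score is listed for it.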

The mechanism of the model is as follows: The CUA repeats three steps until the task is completed: 'recognize the screen state,' 'think about the next operation,' and 'execute it.' The screen state is added to the context as a screenshot.
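
The three-step cycle described above can be sketched roughly as the loop below. This is a minimal illustration, not OpenAI's implementation: the `Action` and `AgentLoop` classes, the callback names, and the toy "browser" are all hypothetical stand-ins so that the example runs without a real model or browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    kind: str          # hypothetical action types: "click", "scroll", "done", ...
    target: str = ""


@dataclass
class AgentLoop:
    take_screenshot: Callable[[], str]             # stand-in for capturing the screen
    choose_action: Callable[[List[str]], Action]   # stand-in for the model's reasoning
    execute: Callable[[Action], None]              # stand-in for mouse/keyboard input
    context: List[str] = field(default_factory=list)

    def run(self, task: str, max_steps: int = 200) -> int:
        """Repeat recognize -> think -> execute; return the number of cycles used."""
        self.context.append(f"task: {task}")
        for step in range(1, max_steps + 1):
            self.context.append(self.take_screenshot())  # 1. recognize the screen state
            action = self.choose_action(self.context)    # 2. think about the next operation
            if action.kind == "done":
                return step
            self.execute(action)                         # 3. execute it
        raise RuntimeError("step limit reached before the task finished")


# Toy stand-ins: three "screens" reached by clicking through in order.
screens = ["home page", "Plus section", "grammar quiz"]
position = {"index": 0}


def take_screenshot() -> str:
    return f"screenshot: {screens[position['index']]}"


def choose_action(context: List[str]) -> Action:
    if context[-1].endswith("grammar quiz"):
        return Action("done")
    return Action("click", "next link")


def execute(action: Action) -> None:
    position["index"] += 1


loop = AgentLoop(take_screenshot, choose_action, execute)
steps_used = loop.run("open the grammar quiz")
print(steps_used, len(loop.context))
```

Each pass appends one screenshot to the context, which mirrors the article's point that the screen state reaches the model as a screenshot rather than through a site-specific API.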

The OpenAI blog provides detailed examples that walk through every step taken to complete a browser task. One example uses the prompt 'Visit the Plus section of the Cambridge Dictionary, take the recommended grammar quiz without logging in, and let us know your score.'

The CUA opened the site.

It found the word 'Plus' and clicked on it.

An advertisement appeared, and it correctly clicked 'Close.'

It scrolled to find 'Suggested Grammar Quizzes.'

It found 'Grammar Quiz' and clicked on it.

After repeating steps like these a total of 152 times, it reported the score and completed the task.

On PC operation tasks, meanwhile, it scored well below humans but significantly outperformed the previous state-of-the-art model, Claude 3.5 Sonnet.

Operator, the AI agent for automatic browser operation released by OpenAI, combines a browser with the CUA and operates the browser automatically from prompt instructions alone. The user can also take over partway through the AI's operation. When a step requires the user's help, such as logging in, making a payment, or solving a CAPTCHA, the AI is trained to ask the user for assistance.
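
The hand-off behavior can be pictured with a small sketch. In Operator this behavior is learned by the model, according to the article; the rule-based check below is only a simplified stand-in, and the `Controller` enum, the `HANDOFF_STEPS` set, and the `next_controller` function are hypothetical names introduced for illustration.

```python
from enum import Enum


class Controller(Enum):
    AGENT = "agent"
    USER = "user"


# Step types the article lists as requiring user assistance.
# Representing them as a fixed set is an assumption made for this sketch.
HANDOFF_STEPS = {"login", "payment", "captcha"}


def next_controller(step_kind: str, user_requested_takeover: bool = False) -> Controller:
    """Decide who drives the browser for the next step."""
    if user_requested_takeover or step_kind.lower() in HANDOFF_STEPS:
        return Controller.USER
    return Controller.AGENT


for kind in ["scroll", "login", "click", "payment", "captcha"]:
    print(kind, "->", next_controller(kind).value)
```

The `user_requested_takeover` flag models the other path the article mentions, where the user chooses to take over in the middle of the AI's operation.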

At the time of writing, the research preview version of Operator is available only to users living in the United States who have a Pro plan.

Jan 24, 2025 12:40:00 in Software, Web Service, Posted by log1d_ts