Ai2 has released 'MolmoWeb,' a visual web AI agent that operates using browser screenshots rather than parsing HTML.

The Allen Institute for Artificial Intelligence (Ai2), a US-based AI research institute, has announced MolmoWeb, an AI agent designed to operate and control web browsers.
MolmoWeb: An open agent for automating web tasks | Ai2
Ai2 Unveils MolmoWeb, an Open-Source Web Agent
https://theaieconomy.substack.com/p/ai2-molmoweb-molmowebmix-model-web-agent
MolmoWeb uses the multimodal capabilities of large language models to read images, reason about them, and execute tasks. Ai2 describes it as 'interpreting the same interface that humans see, predicting the next step, and performing browser operations such as clicking, typing, and scrolling.'
A demo video is shown below.
MolmoWeb in Action - YouTube
The first task given was to 'search for Ai2 on Wikipedia and summarize the history of the PRIOR team.' MolmoWeb opened Wikipedia, typed the query into the search box, and ran the search. It then found the section mentioning PRIOR and compiled the information.

A distinctive feature is that every step the AI takes is clearly recorded. For this particular task, the log entry read, 'The goal is to search for Ai2 and obtain information about the PRIOR team. Access Wikipedia and left-click on x=564.5, y=596.2.'
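
Because each recorded step names the screen coordinates it acted on, a log like this is easy to post-process. The following is a minimal sketch, assuming the coordinates always appear in the "x=..., y=..." form seen in the quoted entry; the function name `extract_click_point` is hypothetical and is not part of MolmoWeb.

```python
import re
from typing import Optional, Tuple

# Assumption: coordinates are written as "x=<number>, y=<number>",
# matching the entry quoted in the article. The real log format may differ.
_COORD_RE = re.compile(r"x=(-?\d+(?:\.\d+)?),\s*y=(-?\d+(?:\.\d+)?)")


def extract_click_point(entry: str) -> Optional[Tuple[float, float]]:
    """Return the (x, y) pair mentioned in a step entry, or None if absent."""
    match = _COORD_RE.search(entry)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


entry = (
    "The goal is to search for Ai2 and obtain information about the PRIOR team. "
    "Access Wikipedia and left-click on x=564.5, y=596.2."
)
print(extract_click_point(entry))  # (564.5, 596.2)
```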

I tried using MolmoWeb to search for popular articles on the technology media outlet 'TechCrunch.'
Molmo Web
https://molmoweb.allen.ai/shared/994f1407-918e-4444-b986-33ed8d3e9453

Here is another task from the demo video: 'Find a vacation rental in San Francisco that can accommodate two adults and one child from May 10th to May 15th.' MolmoWeb first opens Airbnb and enters San Francisco as the destination. It then opens the date picker and clicks the 'Next Month' icon to display the dates in May.

It then sets the number of guests by operating the site's controls directly.

Once the search results were displayed, MolmoWeb reported the task as 'completed.' From here, the user can assign follow-up tasks, such as 'Tell me the prices of the top two items.'

MolmoWeb supports operations such as navigating to URLs, clicking using screen coordinates, entering text into input fields, scrolling pages, switching browser tabs, and sending messages to users. Note that the demo version can only access websites on the whitelist.
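
The operations listed above amount to a small, fixed action space. As an illustration only, here is a sketch of how such actions and a whitelist check might be modeled; the class names, fields, and the `ALLOWED_HOSTS` entries are all assumptions made for this example and do not reflect MolmoWeb's actual code or the demo's real whitelist.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union
from urllib.parse import urlparse


# Hypothetical action types mirroring the operations named in the article.
@dataclass(frozen=True)
class Goto:
    url: str


@dataclass(frozen=True)
class Click:
    x: float  # screen coordinates, in pixels
    y: float


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class Scroll:
    delta_y: float  # positive scrolls down


@dataclass(frozen=True)
class SwitchTab:
    index: int


@dataclass(frozen=True)
class SendMessage:
    text: str


Action = Union[Goto, Click, TypeText, Scroll, SwitchTab, SendMessage]

# Illustrative whitelist; the demo's actual list is not given in the article.
ALLOWED_HOSTS: FrozenSet[str] = frozenset({"en.wikipedia.org", "techcrunch.com"})


def is_allowed(action: Action, allowed_hosts: FrozenSet[str] = ALLOWED_HOSTS) -> bool:
    """Only navigation is restricted; other actions stay on the current page."""
    if isinstance(action, Goto):
        host = urlparse(action.url).hostname or ""
        return host in allowed_hosts
    return True


print(is_allowed(Goto("https://en.wikipedia.org/wiki/Main_Page")))  # True
print(is_allowed(Goto("https://example.com/")))                     # False
```

Keeping each action as a small immutable record makes a recorded run easy to replay or inspect step by step.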

MolmoWeb is built on the Molmo 2 multimodal model family (4B and 8B parameters) and is 'open' in the sense that its weights, training data, and code are all published. Alongside the models, Ai2 has also released 'MolmoWebMix,' a large dataset that can be used to train web agents.
GitHub - allenai/molmoweb · GitHub
https://github.com/allenai/molmoweb
MolmoWeb - a allenai Collection
https://huggingface.co/collections/allenai/molmoweb
Ai2 explains, 'Because it is designed to interpret visual information, it can interact with websites like a human, without relying on HTML or accessibility trees. A single screenshot is far more compact than source code and may consume fewer tokens during processing. The visual interface also remains stable even if the underlying page structure changes, and the agent's behavior is easy to interpret and debug because the model reasons over the same interface the user sees.'
A typical use case would be automating routine browser workflows, such as retrieving information from a website at a fixed time each week.
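As a rough illustration of the "fixed time each week" part of such a workflow, the sketch below computes when the next run should happen. It deliberately leaves out how the agent itself would be invoked; `next_weekly_run` is a hypothetical helper, not something MolmoWeb provides.

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, weekday: int, hour: int, minute: int = 0) -> datetime:
    """Return the next datetime strictly after `now` that falls on `weekday`
    (Monday=0 ... Sunday=6) at the given hour and minute."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # This week's slot has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate


# Example: from Wednesday 2024-01-03 12:00, the next Monday 09:00 slot.
print(next_weekly_run(datetime(2024, 1, 3, 12, 0), weekday=0, hour=9))
# 2024-01-08 09:00:00
```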
Several other demo videos have also been released. The first shows a library that lets you inspect the inference process.
MolmoWeb Inference Library - YouTube
The next shows an automated browser workflow.
Automatic web workflows with MolmoWeb - YouTube
This one shows MolmoWeb being used as a Claude Code skill.
Using MolmoWeb as a Claude Code Skill - YouTube
Here, MolmoWeb adapts to an unfamiliar task.
Adaptability of MolmoWeb - YouTube
The last one shows synthetic data generation.
MolmoWeb: Generating Synthetic Data - YouTube
in AI, Posted by log1p_kr








