Ai2 has released 'MolmoWeb,' a visual web AI agent that operates using browser screenshots rather than parsing HTML.



The Allen Institute for Artificial Intelligence (Ai2), a US-based AI research institute, has announced MolmoWeb, an AI agent designed to operate and control web browsers.

MolmoWeb: An open agent for automating web tasks | Ai2

https://allenai.org/blog/molmoweb

Ai2 Unveils MolmoWeb, an Open-Source Web Agent
https://theaieconomy.substack.com/p/ai2-molmoweb-molmowebmix-model-web-agent

MolmoWeb is a tool that uses the multimodal capabilities of large language models to read images, reason about them, and execute tasks. Ai2 describes it as 'interpreting the same interface that humans see, predicting the next step, and performing browser operations such as clicking, typing, and scrolling.'
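The observe, predict, act cycle described above can be sketched roughly as follows. This is an illustration only, not MolmoWeb's actual code: `Action`, `FakeBrowser`, `run_agent`, and `scripted_policy` are hypothetical names, and a scripted function stands in for the model so the loop runs without a browser.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Action:
    """One browser operation (hypothetical schema, not MolmoWeb's)."""
    kind: str                       # e.g. "goto", "click", "type", "scroll", "done"
    args: dict = field(default_factory=dict)


class FakeBrowser:
    """Stand-in for a real browser; it only records what it was asked to do."""

    def __init__(self) -> None:
        self.log: List[Action] = []

    def screenshot(self) -> bytes:
        # A real browser would return image bytes of the current viewport.
        return f"frame-{len(self.log)}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


def run_agent(
    task: str,
    browser: FakeBrowser,
    predict: Callable[[str, bytes, List[Action]], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot -> predict next action -> execute, until done or out of steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        frame = browser.screenshot()
        action = predict(task, frame, history)
        history.append(action)
        if action.kind == "done":
            break
        browser.execute(action)
    return history


def scripted_policy(task: str, frame: bytes, history: List[Action]) -> Action:
    """Placeholder for the model: replays a fixed script instead of reading the frame."""
    script = [
        Action("goto", {"url": "https://en.wikipedia.org"}),
        Action("click", {"x": 564.5, "y": 596.2}),
        Action("done"),
    ]
    return script[len(history)]


browser = FakeBrowser()
steps = run_agent("Search for Ai2 on Wikipedia", browser, scripted_policy)
print([s.kind for s in steps])
```

In a real agent, `predict` would call the vision-language model with the screenshot and the task text; the loop structure stays the same.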

A demo video is shown below.

MolmoWeb in Action - YouTube


The first task given was to 'search for Ai2 on Wikipedia and summarize the history of the PRIOR team.' MolmoWeb opened Wikipedia, typed the query into the search box, and ran the search. It then found the section on PRIOR and compiled the information.



A distinctive feature is that every step the agent takes is clearly recorded. For this task, the log entry read, 'The goal is to search for Ai2 and obtain information about the PRIOR team. Access Wikipedia and left-click on x=564.5, y=596.2.'
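A step log of this kind, pairing the agent's reasoning with the concrete action and its screen coordinates, could be represented along these lines. The `Step` fields and the output format are assumptions made for illustration; the article does not describe MolmoWeb's actual log schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded step: what the agent thought and what it did (hypothetical schema)."""
    thought: str
    action: str
    x: Optional[float] = None   # screen coordinates, only for pointer actions
    y: Optional[float] = None


def format_step(index: int, step: Step) -> str:
    """Render a step as a single human-readable log line."""
    target = f" at x={step.x}, y={step.y}" if step.x is not None else ""
    return f"[{index}] {step.thought} -> {step.action}{target}"


step = Step(
    thought="Search for Ai2 and obtain information about the PRIOR team",
    action="left_click",
    x=564.5,
    y=596.2,
)
print(format_step(1, step))
print(json.dumps(asdict(step)))  # machine-readable form of the same record
```

Keeping both a readable line and a structured record makes it easy to replay or audit a run afterwards.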



I tried searching for popular articles on the technology media outlet 'TechCrunch' using the demo version of MolmoWeb. The 'Action Description' shows 'Access https://techcrunch.com', and the 'Thinking Process' shows 'The user wants to find the top 3 articles from TechCrunch's latest rankings. Let's start by accessing TechCrunch'.

Molmo Web
https://molmoweb.allen.ai/shared/994f1407-918e-4444-b986-33ed8d3e9453



Here's another task from the demo video: 'Find a vacation rental in San Francisco that can accommodate two adults and one child from May 10th to May 15th.' MolmoWeb first opens Airbnb and enters San Francisco as the destination. It then opens the date picker, clicks the 'Next Month' icon, and brings up the May dates.



It then selects the correct dates.



It sets the number of guests by operating the website's own controls.



The search results were displayed and the task was reported as 'completed.' From here, the user can assign follow-up tasks, such as 'Tell me the prices of the top two items.'



MolmoWeb supports operations such as navigating to URLs, clicking using screen coordinates, entering text into input fields, scrolling pages, switching browser tabs, and sending messages to users. Note that the demo version can only access websites on the whitelist.
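The operation set and the demo's whitelist restriction can be illustrated with a small sketch. The enum values, the `ALLOWED_HOSTS` entries, and the `is_allowed` helper are all hypothetical; the article does not say how the demo implements its whitelist or names its actions.

```python
from enum import Enum
from urllib.parse import urlparse


class ActionKind(Enum):
    """The six operations listed above (names are illustrative, not MolmoWeb's)."""
    GOTO = "goto"                  # navigate to a URL
    CLICK = "click"                # click at screen coordinates
    TYPE = "type"                  # enter text into an input field
    SCROLL = "scroll"              # scroll the page
    SWITCH_TAB = "switch_tab"      # switch browser tabs
    SEND_MESSAGE = "send_message"  # send a message to the user


# Hypothetical allowlist; the demo's real list is not given in the article.
ALLOWED_HOSTS = {"en.wikipedia.org", "techcrunch.com"}


def is_allowed(url: str, allowed: set = ALLOWED_HOSTS) -> bool:
    """True if the URL's host is an allowed host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in allowed)


def check_navigation(kind: ActionKind, url: str) -> bool:
    """Only GOTO actions are subject to the allowlist check in this sketch."""
    return kind is not ActionKind.GOTO or is_allowed(url)


print(is_allowed("https://techcrunch.com/latest"))
print(is_allowed("https://evil-techcrunch.com"))
```

Matching on the parsed hostname rather than on a substring of the URL avoids accepting look-alike domains that merely contain an allowed name.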



MolmoWeb is built on the Molmo 2 multimodal model family (4B and 8B parameters) and is notable for being 'open': the weights, training data, and code are all publicly available. Along with the models, Ai2 has also released 'MolmoWebMix,' a large dataset for training web agents.

GitHub - allenai/molmoweb · GitHub
https://github.com/allenai/molmoweb

MolmoWeb - a allenai Collection
https://huggingface.co/collections/allenai/molmoweb

Ai2 explains, 'Because it is designed to interpret visual information, it can interact with websites like a human, without relying on HTML or accessibility trees. A single screenshot is far more compact than source code and may consume fewer tokens during processing. Also, the visual interface remains stable even if the underlying page structure changes, and its behavior is easy to interpret and debug because the model reasons about the same interface the user sees.'

A typical use case would be automating routine browser workflows, such as retrieving information from a website at a fixed time each week.

Several other demo videos have also been released. The first shows a library that lets you inspect the inference process.

MolmoWeb Inference Library - YouTube


This one shows an automated browser workflow.

Automatic web workflows with MolmoWeb - YouTube


This one shows MolmoWeb being used as a skill from Claude Code.

Using MolmoWeb as a Claude Code Skill - YouTube


This one shows MolmoWeb adapting to an unfamiliar task.

Adaptability of MolmoWeb - YouTube


This one shows synthetic data generation.

MolmoWeb: Generating Synthetic Data - YouTube


in AI, Posted by log1p_kr