
MolmoWeb

MolmoWeb is an open visual web agent that completes browser tasks from screenshots alone, released with MolmoWebMix and training/evaluation tools.


What is MolmoWeb?

MolmoWeb is an open visual web agent that automates browser tasks by interpreting the live webpage through screenshots. Given a task instruction, a Molmo model observes the current screen, decides on the next step, and executes browser actions such as clicking, typing, or scrolling.

It is designed to be self-hosted, either locally or on cloud services, and is released alongside model weights, the MolmoWebMix dataset for training web agents, and the evaluation harness and tooling needed to reproduce, fine-tune, and assess web-agent behavior.

Key Features

  • Open visual web agent built on the Molmo 2 multimodal model family (available in 4B and 8B sizes), with weights and training-related assets for experimentation.
  • Screenshot-based browser control loop: the agent receives a task instruction, a screenshot of the current browser view, and recent action history, then outputs a next browser action.
  • Browser actions matched to visual interfaces: supports navigating to URLs, clicking at screen coordinates, typing into fields, scrolling, opening/switching tabs, and sending messages back to the user.
  • Open training and evaluation tooling released in the MolmoWeb repository, including:
    • Training code for customizing MolmoWeb to specific applications.
    • An annotation tool to record human task demonstrations and fine-tune on that data.
    • An evaluation harness for navigation benchmarks (WebVoyager, Online-Mind2Web, WebTailBench, Deepshop).
  • Data and dataset release support:
    • MolmoWebMix dataset for training web agents.
    • A synthetic data generation pipeline within the tooling that can generate web browsing data using LLM-/VLM-powered agents with AxTree/screenshot input.
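As a rough illustration, the action space listed above (navigate, click at coordinates, type, scroll, tab operations, message to user) could be modeled as a small set of typed actions. The class names and fields below are hypothetical, not MolmoWeb's actual action schema:

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical action types mirroring the browser actions listed above;
# the real MolmoWeb action space may use different names and fields.

@dataclass
class Navigate:
    url: str

@dataclass
class Click:
    x: int  # pixel coordinates on the current screenshot
    y: int

@dataclass
class TypeText:
    text: str

@dataclass
class Scroll:
    dy: int  # positive scrolls down

@dataclass
class SwitchTab:
    index: int

@dataclass
class Message:
    text: str  # message sent back to the user

Action = Union[Navigate, Click, TypeText, Scroll, SwitchTab, Message]

def describe(action: Action) -> str:
    """Render an action as a short log line for trajectory inspection."""
    name = type(action).__name__
    fields = ", ".join(f"{k}={v!r}" for k, v in vars(action).items())
    return f"{name}({fields})"
```

Typed actions like these make trajectories easy to log, replay, and validate before they are executed against a live browser.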

How to Use MolmoWeb

  1. Start from the MolmoWeb GitHub repository to obtain the released assets and tools, including the training code, evaluation harness, and the other released components.
  2. Use the annotation collection tool (if you want domain-specific behavior) to record human task demonstrations, then fine-tune MolmoWeb using the provided training code.
  3. Evaluate your agent runs with the included evaluation harness against the supported navigation benchmarks.
  4. For interactive inspection, use the client-side code for the MolmoWeb demo to enter a task and observe the agent navigating websites in real time.
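The observe-decide-act cycle underlying these steps can be sketched as a minimal loop. The browser stub and the `agent_step` policy below are illustrative stand-ins, not the actual MolmoWeb model or browser driver:

```python
# Minimal sketch of the screenshot -> action control loop, assuming a
# hypothetical browser interface; a real agent would query the Molmo
# model inside agent_step and drive an actual browser.

class StubBrowser:
    def __init__(self):
        self.log = []

    def screenshot(self) -> bytes:
        return b"<png bytes>"  # placeholder for a real screen capture

    def execute(self, action) -> None:
        self.log.append(action)

def agent_step(task, screenshot, history):
    """Stand-in policy: always navigates first, then reports back."""
    if not history:
        return ("navigate", "https://example.com")
    return ("message", "done")

def run(task, browser, max_steps=10):
    """Loop: observe the screen, pick an action, execute, repeat."""
    history = []
    for _ in range(max_steps):
        action = agent_step(task, browser.screenshot(), history)
        if action[0] == "message":  # agent reports back and stops
            return action[1]
        browser.execute(action)
        history.append(action)
    return None  # step budget exhausted without a final answer
```

The `max_steps` budget is a common safeguard in agent loops: it bounds cost and prevents an agent from cycling indefinitely on a page it cannot make progress on.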

Use Cases

  • Reproducing and evaluating web-agent performance: run MolmoWeb with the evaluation harness on common navigation benchmarks such as WebVoyager, Online-Mind2Web, WebTailBench, or Deepshop.
  • Fine-tuning for a new domain with human demonstrations: use the annotation tool to record task demonstrations relevant to your website or workflow, then fine-tune MolmoWeb on that collected data.
  • Building a custom web-agent UI: take the released client-side demo code as a starting point to create your own interface for sending tasks to an agent and viewing browser navigation.
  • Generating training data for web browsing: use the included synthetic data generation pipeline to produce browsing trajectories, leveraging LLM- and VLM-powered agents with AxTree/screenshot input.
  • Researching open web-agent pipelines end-to-end: use the combination of dataset (MolmoWebMix), training code, and evaluation tooling to inspect and improve multiple parts of the stack (data collection, training, and benchmarking).

FAQ

Was the initial training dataset released on Hugging Face updated? Yes. The page notes that if you previously downloaded the training data from Hugging Face, you should redownload because the datasets were updated since the initial release.

What kinds of actions can MolmoWeb perform in the browser? The source describes support for navigating to URLs, clicking at screen coordinates, typing text, scrolling, opening or switching browser tabs, and sending a message back to the user.

How does MolmoWeb decide what to do next? At each step, it uses the task instruction, a screenshot of the current browser view, and recent action history to produce a next browser action.
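The per-step input described here (task instruction, current screenshot, recent action history) can be sketched as a small assembly function. The field names and the history window size are assumptions for illustration, not MolmoWeb's actual prompt format:

```python
import base64

def build_step_input(instruction, screenshot_png, history, max_history=5):
    """Assemble one step's observation for the model: the task
    instruction, the current screenshot (base64-encoded), and a
    truncated window of recent actions. Illustrative only."""
    return {
        "instruction": instruction,
        "screenshot_b64": base64.b64encode(screenshot_png).decode("ascii"),
        "history": history[-max_history:],  # keep only recent actions
    }
```

Truncating the history keeps the model's context focused on recent state while the screenshot supplies the full current view of the page.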

What is MolmoWebMix? MolmoWebMix is described as a large and diverse dataset for training web agents, released alongside a complete training and evaluation pipeline.

What does the evaluation harness include? The evaluation harness is described as tooling to evaluate web agents like MolmoWeb on navigation benchmarks including WebVoyager, Online-Mind2Web, WebTailBench, and Deepshop.

Alternatives

  • Proprietary web-agent platforms: these may offer turnkey automation, but they typically rely on undisclosed training data and methods, unlike MolmoWeb’s open model/data/code approach.
  • Screenshot-based browser automation agents built from other multimodal models: these can also use visual inputs to drive browser actions, but may differ in available weights, datasets, and evaluation tooling.
  • General-purpose browser automation frameworks (rule-based or script-driven): these can automate specific workflows without learning from demonstrations or benchmarks, but generally require more predefined logic.
  • Custom agent pipelines focused on structured page representations (HTML/accessibility trees): instead of screenshots, they use structured representations, changing how perception and action are connected.