MolmoAct 2

What is MolmoAct 2?

MolmoAct 2 is a fully open robotics foundation model designed to support robot action reasoning in real-world environments. It focuses on tasks that require the system to reason about an environment in 3D before acting, aiming to reduce the need for per-task fine-tuning in common manipulation settings.

In addition to the model, the release includes the MolmoAct 2-Bimanual YAM dataset and an updated vision-language-action (VLA) pipeline with a new adapter architecture. Together, these are intended for researchers who want to study, reproduce, and build on action reasoning for manipulation and other embodied-reasoning benchmarks.

Key Features

Action Reasoning Model (ARM) that reasons in 3D before acting: MolmoAct 2 reasons about its environment in 3D before taking action, targeting improved performance on embodied-reasoning evaluation tasks.
Designed for real-world deployment scenarios: The model is presented as being built for real-world environments, not only for benchmark validation.
Upgraded open reasoning backbone (Molmo 2-ER): MolmoAct 2 is based on Molmo 2-ER, a specialized embodied-reasoning variant of Molmo 2 further trained on additional embodied-reasoning examples (including image- and video-based spatial question answering).
Faster inference than the predecessor: The release reports MolmoAct 2 runs up to 37× faster than its predecessor.
Open research package: The release makes available the model weights, the datasets, and an adaptive reasoning approach intended to deepen the model's 3D reasoning and improve interpretability.
Large bimanual dataset for manipulation research: The MolmoAct 2-Bimanual YAM dataset is reported as the largest open-source bimanual tabletop manipulation dataset, with over 720 hours of training demonstrations.
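
To make the "up to 37× faster" figure concrete, the following minimal Python sketch converts a baseline per-action latency into the best-case latency and control rate that such a speedup would imply. The 3700 ms baseline is a made-up illustrative number, not a measurement from the release, and the function names are hypothetical. Because the claim is "up to", the result is a best case, not an expected value.

```python
def best_case_latency_ms(baseline_latency_ms: float, speedup: float = 37.0) -> float:
    """Lowest latency implied by an 'up to N times faster' claim."""
    if baseline_latency_ms <= 0 or speedup <= 0:
        raise ValueError("latency and speedup must be positive")
    return baseline_latency_ms / speedup


def control_rate_hz(latency_ms: float) -> float:
    """Actions per second if each action takes latency_ms to compute."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# Hypothetical predecessor latency (illustrative only, not from the release).
baseline_ms = 3700.0
best_ms = best_case_latency_ms(baseline_ms)
print(f"best case: {best_ms:.0f} ms per action, {control_rate_hz(best_ms):.0f} Hz")
# prints "best case: 100 ms per action, 10 Hz"
```

Actual speedups depend on hardware, batch size, and how much reasoning output the model generates per action, so measured numbers should replace the placeholder baseline.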

How to Use MolmoAct 2

Get the open release artifacts: Download the MolmoAct 2 model weights and related assets provided in the release for researchers.
Use the updated VLA pipeline: Start from the updated pipeline, which is built around the release's new adapter architecture.
Train/evaluate using the provided dataset(s): For bimanual tabletop manipulation experiments, use MolmoAct 2-Bimanual YAM; for other embodied-reasoning experiments, follow the release’s research-focused guidance around the adaptive reasoning approach.
Apply adaptive 3D reasoning: Use the adaptive reasoning method described with the release to encourage deeper 3D reasoning where it improves performance.
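
One way to check "where it improves performance" in the last step is a simple per-task ablation: run the same tasks with and without deeper reasoning and keep it only where the success rate rises. The sketch below is a generic evaluation helper under that assumption, not the release's adaptive reasoning method. The task names, episode outcomes, and the `min_gain` threshold are all hypothetical.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of successful episodes."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


def tasks_where_reasoning_helps(
    with_reasoning: Dict[str, List[bool]],
    without_reasoning: Dict[str, List[bool]],
    min_gain: float = 0.05,
) -> List[str]:
    """Return tasks whose success rate improves by at least min_gain."""
    helped = []
    # Only compare tasks that were evaluated in both conditions.
    for task in sorted(with_reasoning.keys() & without_reasoning.keys()):
        gain = success_rate(with_reasoning[task]) - success_rate(without_reasoning[task])
        if gain >= min_gain:
            helped.append(task)
    return helped


# Made-up episode outcomes for two hypothetical tasks.
with_r = {
    "fold_towel": [True, True, True, False],
    "stack_cups": [True, False, True, False],
}
without_r = {
    "fold_towel": [True, False, False, False],
    "stack_cups": [True, False, True, False],
}
print(tasks_where_reasoning_helps(with_r, without_r))  # prints ['fold_towel']
```

With only a handful of episodes per task the difference is noisy, so a real comparison would use more episodes and report confidence intervals alongside the raw rates.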

Use Cases

Studying action reasoning for manipulation: Researchers can investigate how 3D action reasoning affects performance on tasks that involve contacting, grasping, and manipulating objects in tabletop setups.
Benchmark reproduction across embodied-reasoning tasks: The release reports evaluation across 13 embodied-reasoning benchmarks (e.g., pointing, multi-image reasoning, ego-exo correspondence, video spatial reasoning), enabling comparative study.
Bimanual tabletop research: Teams working on two-arm manipulation can use the MolmoAct 2-Bimanual YAM dataset (over 720 hours of demonstrations) to train and evaluate bimanual policies.
Research on open model architectures: The open foundation-model setting allows researchers to examine and modify model components (e.g., reasoning backbone and adapter architecture) rather than relying on closed systems.
Developing systems that reduce per-task fine-tuning: Because MolmoAct 2 is described as handling various real-world tasks out of the box, it can be used as a starting point for work aimed at lowering customization costs.

FAQ

Is MolmoAct 2 intended for research or production deployments? The release is explicitly positioned as available for researchers to study and build on, while also describing MolmoAct 2 as built to deploy in real-world environments.
What dataset is included for bimanual manipulation? The release includes MolmoAct 2-Bimanual YAM, described as the largest open-source bimanual tabletop manipulation dataset, with over 720 hours of training demonstrations.
What makes MolmoAct 2 different from the earlier MolmoAct? The update includes a stronger reasoning backbone (Molmo 2-ER), and the release reports MolmoAct 2 runs up to 37× faster than its predecessor.
Does the model require per-task fine-tuning? The release states that MolmoAct 2 can handle various real-world tasks out of the box without per-task fine-tuning.
What is the adaptive reasoning approach mentioned in the release? The release describes an adaptive reasoning approach intended to help MolmoAct 2 reason more deeply in 3D, with the goal of improving both performance and interpretability.

Alternatives

Closed or partially open robotics foundation models: Some teams release weights, but fewer release their training data; such alternatives may limit how researchers can study the data, reproduce results, or modify components.
General-purpose vision-language models paired with separate control tooling: Instead of a dedicated action-reasoning foundation model, some teams combine general-purpose vision-language models with downstream robotic control stacks; the workflow differs because reasoning and action may be handled by separate components.
Other open robotics datasets for manipulation: If the primary need is data rather than a particular model architecture, researchers can use open manipulation datasets and train policies using their own model/backbone choices.
Embodied reasoning benchmarks and training pipelines: Another approach is to focus on benchmark-driven training/evaluation pipelines for embodied-reasoning tasks; this differs by emphasizing evaluation methodology and experimentation setup over a specific open foundation model release.
