MolmoAct 2

MolmoAct 2 is a fully open robotics foundation model for 3D action reasoning, released with the MolmoAct 2-Bimanual YAM dataset for reproducible research.

What is MolmoAct 2?

MolmoAct 2 is a fully open robotics foundation model designed to support robot action reasoning in real-world environments. It focuses on tasks that require the system to reason about an environment in 3D before acting, aiming to reduce the need for per-task fine-tuning in common manipulation settings.

In addition to the model, the release includes the MolmoAct 2-Bimanual YAM dataset and an updated vision-language-action (VLA) pipeline with a novel adapter architecture. Together, these are intended for researchers who want to study, reproduce, and build on action reasoning for manipulation tasks and embodied-reasoning benchmarks.

Key Features

  • Action Reasoning Model (ARM) that reasons in 3D before acting: MolmoAct 2 reasons about its environment in 3D prior to taking action, targeting improved performance on embodied-reasoning evaluation tasks.
  • Designed for real-world deployment scenarios: The model is presented as being built for real-world environments, not only for benchmark validation.
  • Upgraded open reasoning backbone (Molmo 2-ER): MolmoAct 2 is based on Molmo 2-ER, a specialized embodied-reasoning variant of Molmo 2 further trained on additional embodied-reasoning examples (including image- and video-based spatial question answering).
  • Faster inference than the predecessor: The release reports MolmoAct 2 runs up to 37× faster than its predecessor.
  • Open research package: The release makes available the model weights, the datasets, and an adaptive reasoning approach intended to increase reasoning depth and interpretability.
  • Large bimanual dataset for manipulation research: The MolmoAct 2-Bimanual YAM dataset is reported as the largest open-source bimanual tabletop manipulation dataset, with over 720 hours of training demonstrations.
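
The two headline numbers above ("up to 37×" and "over 720 hours") are easier to reason about once converted into concrete quantities. The following is a minimal sketch of that arithmetic, not anything shipped with the release: the function names are made up for illustration, and the baseline latency and frame rate are assumptions, since the page reports neither. Because the speedup is stated as "up to", the first function returns a best-case latency rather than an expected one.

```python
import math


def best_case_latency(baseline_seconds: float, max_speedup: float = 37.0) -> float:
    """Smallest per-step latency consistent with an 'up to max_speedup x' claim."""
    if baseline_seconds < 0 or max_speedup <= 0:
        raise ValueError("baseline must be >= 0 and speedup must be > 0")
    return baseline_seconds / max_speedup


def frames_for_duration(hours: float, frames_per_second: float) -> int:
    """Number of frames in `hours` of recording at a fixed frame rate."""
    if hours < 0 or frames_per_second <= 0:
        raise ValueError("hours must be >= 0 and frame rate must be > 0")
    return math.floor(hours * 3600 * frames_per_second)


if __name__ == "__main__":
    # Both inputs below are assumptions chosen for illustration only.
    assumed_baseline_s = 3.7  # hypothetical predecessor latency per action
    assumed_fps = 30          # hypothetical camera frame rate
    print(f"best-case latency: {best_case_latency(assumed_baseline_s):.3f} s")
    print(f"frames in 720 h: {frames_for_duration(720, assumed_fps):,}")
```

At an assumed 30 frames per second, 720 hours corresponds to 77,760,000 frames per camera stream. Since the dataset is described as "over 720 hours", that figure is a lower bound under the assumed frame rate.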

How to Use MolmoAct 2

  1. Get the open release artifacts: Download the MolmoAct 2 model weights and related assets provided in the release for researchers.
  2. Use the updated VLA pipeline: Start from the updated pipeline, which the release describes as using a novel adapter architecture.
  3. Train/evaluate using the provided dataset(s): For bimanual tabletop manipulation experiments, use MolmoAct 2-Bimanual YAM; for other embodied-reasoning experiments, follow the guidance published with the release on the adaptive reasoning approach.
  4. Apply adaptive 3D reasoning: Use the adaptive reasoning method described with the release to encourage deeper 3D reasoning where it improves performance.
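
The page does not document an API for these steps, so the sketch below only illustrates the general shape of a gated reason-then-act loop: run the expensive 3D reasoning step when a gate asks for it, otherwise reuse the latest plan. This is one possible reading of "adaptive" reasoning, not the release's method. Every name here (`Observation`, `StubPolicy`, `reason_3d`, `act`, `should_reason`) is hypothetical, and the stub policy returns dummy values so the loop runs without model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Observation:
    """Hypothetical single timestep of robot input."""
    frame_id: int
    instruction: str
    scene_changed: bool  # assumed signal that the earlier 3D plan is stale


@dataclass
class StepRecord:
    frame_id: int
    reasoned: bool  # True if 3D reasoning ran on this step
    plan: str
    action: List[float]


class StubPolicy:
    """Stand-in for a real model; not the release's interface."""

    def reason_3d(self, obs: Observation) -> str:
        # A real model would produce a 3D-grounded plan here.
        return f"plan@frame{obs.frame_id}"

    def act(self, obs: Observation, plan: str) -> List[float]:
        # Dummy 7-value action vector; the size is an arbitrary choice.
        return [0.0] * 7


def run_episode(
    policy: StubPolicy,
    observations: List[Observation],
    should_reason: Callable[[Observation], bool],
) -> List[StepRecord]:
    """Reason in 3D only when the gate asks for it, then act on the latest plan."""
    records: List[StepRecord] = []
    plan: Optional[str] = None
    for obs in observations:
        # Always reason on the first step, because no plan exists yet.
        reasoned = plan is None or should_reason(obs)
        if reasoned:
            plan = policy.reason_3d(obs)
        assert plan is not None
        records.append(StepRecord(obs.frame_id, reasoned, plan, policy.act(obs, plan)))
    return records


if __name__ == "__main__":
    episode = [
        Observation(0, "stack the cups", scene_changed=False),
        Observation(1, "stack the cups", scene_changed=False),
        Observation(2, "stack the cups", scene_changed=True),
    ]
    for rec in run_episode(StubPolicy(), episode, lambda o: o.scene_changed):
        print(rec.frame_id, rec.reasoned, rec.plan)
```

Keeping the gate as a plain callable makes it easy to compare "always reason", "never re-reason", and any learned or heuristic gate on the same episodes, and the `reasoned` flag in each record gives a simple trace of when reasoning ran.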

Use Cases

  • Studying action reasoning for manipulation: Researchers can investigate how 3D action reasoning affects performance on tasks that involve contacting, grasping, and manipulating objects in tabletop setups.
  • Benchmark reproduction across embodied-reasoning tasks: The release reports evaluation across 13 embodied-reasoning benchmarks (e.g., pointing, multi-image reasoning, ego-exo correspondence, video spatial reasoning), enabling comparative study.
  • Bimanual tabletop research: Teams working on two-arm manipulation can use the MolmoAct 2-Bimanual YAM dataset (over 720 hours of demonstrations) to train and evaluate bimanual policies.
  • Research on open model architectures: The open foundation-model setting allows researchers to examine and modify model components (e.g., reasoning backbone and adapter architecture) rather than relying on closed systems.
  • Developing systems that reduce per-task fine-tuning: Because MolmoAct 2 is described as handling various real-world tasks out of the box, it can be used as a starting point for work aimed at lowering customization costs.
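
For the benchmark-reproduction use case, comparing two models across several benchmarks usually comes down to per-benchmark differences plus an unweighted (macro) average. The sketch below shows that bookkeeping under stated assumptions: the function names are made up for illustration, the benchmark keys are shorthand for the categories listed above, and the scores are placeholders, not results from the release.

```python
from typing import Dict


def macro_average(scores: Dict[str, float]) -> float:
    """Unweighted mean over benchmarks, so each benchmark counts equally."""
    if not scores:
        raise ValueError("at least one benchmark score is required")
    return sum(scores.values()) / len(scores)


def per_benchmark_delta(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Candidate minus baseline for each benchmark; positive means improvement."""
    if baseline.keys() != candidate.keys():
        raise ValueError("both models must be scored on the same benchmarks")
    return {name: candidate[name] - baseline[name] for name in baseline}


if __name__ == "__main__":
    # Placeholder scores, NOT reported results; replace with your own measurements.
    baseline = {"pointing": 0.5, "multi_image": 0.25, "ego_exo": 0.5, "video_spatial": 0.75}
    candidate = {"pointing": 0.75, "multi_image": 0.5, "ego_exo": 0.5, "video_spatial": 1.0}
    for name, delta in per_benchmark_delta(baseline, candidate).items():
        print(f"{name}: {delta:+.2f}")
    print(f"macro average: {macro_average(baseline):.2f} -> {macro_average(candidate):.2f}")
```

Requiring identical benchmark sets before computing deltas avoids a common reproduction pitfall, where an average looks better only because a hard benchmark was silently dropped.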

FAQ

  • Is MolmoAct 2 intended for research or production deployments? The release is aimed at researchers who want to study and build on it, and it also describes MolmoAct 2 as built for deployment in real-world environments.

  • What dataset is included for bimanual manipulation? The release includes MolmoAct 2-Bimanual YAM, described as the largest open-source bimanual tabletop manipulation dataset, with over 720 hours of training demonstrations.

  • What makes MolmoAct 2 different from the earlier MolmoAct? The update includes a stronger reasoning backbone (Molmo 2-ER), and the release reports MolmoAct 2 runs up to 37× faster than its predecessor.

  • Does the model require per-task fine-tuning? The release states that MolmoAct 2 can handle various real-world tasks out of the box without per-task fine-tuning.

  • What is the adaptive reasoning approach mentioned in the release? The page states that the release includes an adaptive reasoning approach intended to help MolmoAct 2 reason more deeply in 3D to boost performance and interpretability.

Alternatives

  • Closed or partially open robotics foundation models: Some teams release model weights, but fewer also release training data; such alternatives may limit how researchers can study the data, reproduce results, or modify components.
  • General-purpose vision-language models paired with separate control stacks: Instead of a dedicated action-reasoning foundation model, some teams combine general-purpose vision-language models with downstream robotic control stacks; the workflow differs because reasoning and action may be handled by separate components.
  • Other open robotics datasets for manipulation: If the primary need is data rather than a particular model architecture, researchers can use open manipulation datasets and train policies using their own model/backbone choices.
  • Embodied reasoning benchmarks and training pipelines: Another approach is to focus on benchmark-driven training/evaluation pipelines for embodied-reasoning tasks; this differs by emphasizing evaluation methodology and experimentation setup over a specific open foundation model release.