On June 17, 2026, the Allen Institute for AI (Ai2) released MolmoMotion, an open-source, language-guided 3D motion forecasting model that predicts object-attached point trajectories from a single RGB frame and a text instruction. The release includes two model variants, the 1.16-million-video MolmoMotion-1M dataset, and the 2,700-clip PointMotionBench benchmark, all under the permissive Apache-2.0 license (source: MolmoMotion announcement).
What is MolmoMotion and how does it work?
MolmoMotion uses Molmo 2 as its vision-language backbone to ground language to objects and query points, then forecasts future 3D trajectories in world coordinates. The autoregressive variant (MolmoMotion-AR) emits quantized coordinate tokens step by step for smooth, deterministic rollouts at 30 Hz inference speed on a single A100 GPU.
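To make the idea of quantized coordinate tokens concrete, here is a minimal sketch of how a metric 3D coordinate could be mapped to a discrete token and back. The bin count, the coordinate range, and the `<cN>` token format are assumptions chosen for illustration; the article does not describe MolmoMotion's actual tokenizer.

```python
# Hypothetical settings: the article does not state the real bin count or range.
NUM_BINS = 1024
COORD_MIN, COORD_MAX = -2.0, 2.0  # metres, assumed workspace bounds


def quantize(value: float) -> int:
    """Map a metric coordinate to an integer bin index in [0, NUM_BINS - 1]."""
    clamped = min(max(value, COORD_MIN), COORD_MAX)
    fraction = (clamped - COORD_MIN) / (COORD_MAX - COORD_MIN)
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def dequantize(index: int) -> float:
    """Return the centre of a bin, in metres."""
    width = (COORD_MAX - COORD_MIN) / NUM_BINS
    return COORD_MIN + (index + 0.5) * width


def point_to_tokens(point):
    """Encode one (x, y, z) point as three token strings (illustrative format)."""
    return [f"<c{quantize(v)}>" for v in point]


def tokens_to_point(tokens):
    """Decode three coordinate tokens back to an approximate (x, y, z) point."""
    return tuple(dequantize(int(t[2:-1])) for t in tokens)


tokens = point_to_tokens((0.3, -1.1, 1.7))
approx = tokens_to_point(tokens)  # equals the input up to half a bin width
```

Under these assumed settings each bin spans about 3.9 mm, so the round trip loses at most about 2 mm per axis. An autoregressive decoder would emit such tokens one at a time, each conditioned on the tokens before it.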
The flow-matching variant (MolmoMotion-FM) operates in continuous 3D space to capture multimodal uncertainty when an instruction admits multiple plausible futures, such as “pick up the cup,” where grasp approaches vary. Checkpoints at both sizes (7B and 72B parameters) and the full training pipeline are available on Hugging Face and GitHub.
Representation built for downstream use
Rather than predicting pixels or dense depth maps, MolmoMotion represents motion as sparse, object-attached 3D points in a shared world frame.
The team chose this representation because it satisfies three constraints simultaneously: it is class-agnostic (no templates for hands, rigid bodies, or deformable categories), view-stable (trajectories remain consistent across camera motion), and directly consumable by robot policies or trajectory-conditioned video generators without an intermediate perception step.
A sparse set of 64 query points can describe rigid, articulated, and limited deformable motion while staying compact enough for real-time planning loops at 10 ms per inference step (source: technical report).
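The sparse representation described above can be pictured as a small container: a fixed set of query points, each with an (x, y, z) position at every future step. The sketch below is illustrative only. The class name, field layout, and the 60-step horizon are assumptions, not the format used in the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in metres, shared world frame


@dataclass
class PointTrajectory:
    """Future positions of a fixed set of object-attached query points."""

    steps: List[List[Point3D]]  # steps[t][k] is query point k at future step t

    def __post_init__(self):
        counts = {len(step) for step in self.steps}
        if len(counts) > 1:
            raise ValueError("every step must contain the same query points")

    @property
    def num_steps(self) -> int:
        return len(self.steps)

    @property
    def num_points(self) -> int:
        return len(self.steps[0]) if self.steps else 0

    def num_floats(self) -> int:
        """Total number of scalar values in the forecast."""
        return self.num_steps * self.num_points * 3

    def displacement(self, k: int) -> Point3D:
        """Net motion of query point k between the first and last step."""
        first, last = self.steps[0][k], self.steps[-1][k]
        return tuple(b - a for a, b in zip(first, last))


# 64 query points sliding 1 cm per step along x, over an assumed 60 steps.
traj = PointTrajectory(
    [[(0.01 * t, 0.0, float(k)) for k in range(64)] for t in range(60)]
)
print(traj.num_floats())  # 64 points * 60 steps * 3 coordinates = 11520
```

At 4 bytes per value, a forecast of that assumed size is roughly 46 KB, which illustrates why a sparse point set is far lighter to pass to a planner than per-pixel video or dense depth.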
MolmoMotion-1M and PointMotionBench fill a data gap
Training a forecaster of this generality required a corpus of object-anchored 3D trajectories linked to natural-language action descriptions — data that did not exist at scale. Ai2 built an automatic annotation pipeline that extracts object-grounded 3D trajectories in metric world coordinates from 1.16 million unconstrained internet videos, producing the largest such collection to date.
The pipeline uses DINOv2 features for object tracking, COLMAP for camera pose estimation, and a language model to generate action descriptions from visual changes. The companion PointMotionBench benchmark comprises 2,700 human-validated clips designed to measure object-centric 3D forecasting accuracy across 47 object categories, 12 motion types, and 8 language templates (source: dataset page).
Two forecasting heads for different uncertainty profiles
MolmoMotion-AR treats future coordinates as structured text tokens, following the coordinate-style prediction pattern established by vision-language models. Conditioning each new coordinate on the trajectory so far encourages smooth rollouts and yields the strongest accuracy when the future path is well-defined: the model achieves an average displacement error (ADE) of 12.3 cm at a 2-second horizon on PointMotionBench.
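For readers unfamiliar with the metric, average displacement error is the mean Euclidean distance between predicted and ground-truth positions, taken over every point and every future step. The sketch below implements that standard definition. The exact PointMotionBench protocol (which points are scored, any alignment step) is not described here, so treat the function as a generic reference rather than the benchmark's evaluation code.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance over all steps and points.

    Both arguments are nested sequences indexed as [step][point] -> (x, y, z),
    in the same units (metres here).
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of steps")
    total, count = 0.0, 0
    for pred_step, true_step in zip(predicted, ground_truth):
        if len(pred_step) != len(true_step):
            raise ValueError("each step must have the same number of points")
        for p, g in zip(pred_step, true_step):
            total += math.dist(p, g)
            count += 1
    if count == 0:
        raise ValueError("trajectories are empty")
    return total / count


# Two steps, two points; every prediction is off by 3 cm in x and 4 cm in y.
truth = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]]
pred = [[(x + 0.03, y + 0.04, z) for x, y, z in step] for step in truth]
ade_m = average_displacement_error(pred, truth)  # about 0.05 m, i.e. 5 cm
```

On this scale, the reported 12.3 cm means predicted points sit on average 12.3 cm from their true positions across the 2-second forecast.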
MolmoMotion-FM instead uses flow matching with 50 sampling steps to transform noise into continuous 3D trajectories, making it better suited for instructions like “pick up the cup,” where multiple valid grasp approaches exist. It reduces mode collapse by 34% compared to deterministic baselines. Researchers can select the variant that matches their downstream uncertainty requirements (source: project page).
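Flow-matching samplers generally work by starting from Gaussian noise and integrating a learned velocity field from t = 0 to t = 1. The sketch below shows that loop with plain Euler steps. The velocity function is a toy stand-in (the exact velocity of a straight-line path to a fixed target), not MolmoMotion's network, and the function names and the flattened-list layout are assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network: straight-line velocity toward target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_trajectory(target, num_steps=50, seed=0):
    """Euler-integrate dx/dt = v(x, t) from noise at t=0 to a sample at t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial Gaussian noise
    dt = 1.0 / num_steps
    for n in range(num_steps):
        t = n * dt  # t never reaches 1.0, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# A flattened two-point "trajectory": (x, y, z) for each point.
target = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
sample = sample_trajectory(target, num_steps=50, seed=1)
```

In this toy every noise seed lands on the same target, so it does not show multimodality. With a learned velocity field conditioned on the image and instruction, different seeds can settle on different plausible trajectories, which is the property the FM variant is meant to provide.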
Implications for robotics and generative video
Because the output is explicit 3D point trajectories in world space, MolmoMotion plugs directly into model-based robot planners that need to anticipate object motion before contact — a longstanding gap between perception and control.
The same trajectories can condition video diffusion models like SVD-XT to generate physically plausible frames that respect the forecasted motion, offering a new control signal for video generation.
Early experiments cited in the technical report demonstrate both use cases: a Franka Emika Panda arm achieved a 78% success rate on language-conditioned pick-and-place using MolmoMotion-AR forecasts, and conditioning on MolmoMotion-FM trajectories improved FVD by 22% on EPIC-Kitchens video generation compared with text-only prompts.
FAQ: MolmoMotion common questions
What license covers MolmoMotion models and data?
All model weights, training code, the MolmoMotion-1M dataset, and PointMotionBench are released under Apache-2.0, permitting commercial use, modification, and redistribution.
How does MolmoMotion differ from prior 3D forecasting work?
Prior methods like EgoMotion or HMR focus on human body motion or require dense depth input. MolmoMotion is the first to forecast sparse object-attached 3D points from a single RGB frame plus language, enabling class-agnostic, view-stable predictions for arbitrary objects.
Can I fine-tune MolmoMotion on my own robot data?
Yes. The GitHub repository includes LoRA fine-tuning scripts configured for 7B and 72B checkpoints. Fine-tuning on 500 domain-specific trajectories takes approximately 4 hours on 8×A100 GPUs.
What hardware is required for inference?
MolmoMotion-AR 7B runs at 30 Hz on a single 24 GB GPU (RTX 3090 or A10G). The 72B variant requires 80 GB VRAM (A100 80GB or H100) for real-time performance.
Related zbrandco coverage
For more on vision-language models in robotics, see our analysis of Google Robotics Transformer 2 (RT-2) and the shift to web-scale VLAs. For dataset-scale trends, read How synthetic data is reshaping embodied AI benchmarks. For video generation control, compare with Runway Gen-3 Alpha and the rise of trajectory-conditioned diffusion.
Bottom line: MolmoMotion gives builders an open, language-steerable 3D forecaster with two uncertainty-aware variants (AR at 12.3 cm ADE, FM with 34% less mode collapse), a 1.16-million-video Apache-2.0 training set, and a 2,700-clip validated benchmark — lowering the barrier to adding anticipatory motion reasoning to robot policies and video generators today.
