Allen AI releases MolmoMotion for language-guided 3D motion

Aira Published Jun 17, 2026 · 3 min read

The Allen Institute for AI (Ai2) released MolmoMotion on June 17, 2026. It is an open-source, language-guided 3D motion forecasting model that predicts object-attached point trajectories from a single RGB frame and a text instruction. The release includes two model variants, the 1.16-million-video MolmoMotion-1M dataset, and the 2,700-clip PointMotionBench benchmark, all under the permissive Apache-2.0 license (MolmoMotion announcement).

What is MolmoMotion and how does it work?

MolmoMotion uses Molmo 2 as its vision-language backbone to ground language to objects and query points, then forecasts future 3D trajectories in world coordinates. The autoregressive variant (MolmoMotion-AR) emits quantized coordinate tokens step by step for smooth, deterministic rollouts at 30 Hz inference speed on a single A100 GPU.
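To make "quantized coordinate tokens" concrete, here is a minimal sketch of how a metric 3D point could be mapped to integer tokens and back. The bin count (1024), the coordinate range (-2 m to 2 m), and all function names are illustrative assumptions; the article does not describe MolmoMotion's actual tokenizer.

```python
# Illustrative only: uniform quantization of 3D coordinates into integer bins.
# LO, HI and BINS are assumed values, not MolmoMotion's real settings.
LO, HI, BINS = -2.0, 2.0, 1024


def quantize(value, lo=LO, hi=HI, bins=BINS):
    """Map a metric coordinate to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, lo), hi)          # out-of-range values are clamped
    scaled = (clipped - lo) / (hi - lo) * (bins - 1)
    return int(round(scaled))


def dequantize(index, lo=LO, hi=HI, bins=BINS):
    """Map a bin index back to the coordinate at the center of that bin."""
    return lo + index / (bins - 1) * (hi - lo)


def point_to_tokens(point):
    """Turn an (x, y, z) point into three integer tokens."""
    return [quantize(c) for c in point]


def tokens_to_point(tokens):
    """Recover an approximate (x, y, z) point from three tokens."""
    return tuple(dequantize(t) for t in tokens)


if __name__ == "__main__":
    tokens = point_to_tokens((0.25, -1.0, 1.5))
    print(tokens, tokens_to_point(tokens))
```

Under these assumptions the round trip loses at most half a bin width, about 2 mm, which shows the trade-off an autoregressive head makes: a finite vocabulary in exchange for bounded coordinate precision.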

The flow-matching variant (MolmoMotion-FM) operates in continuous 3D space to capture multimodal uncertainty when an instruction admits multiple plausible futures, such as “pick up the cup,” where grasp approaches vary. Both model checkpoints (7B and 72B parameter versions) and the full training pipeline are available on Hugging Face and GitHub.

Representation built for downstream use

Rather than predicting pixels or dense depth maps, MolmoMotion represents motion as sparse, object-attached 3D points in a shared world frame.

The team chose this representation because it satisfies three constraints simultaneously: it is class-agnostic (no templates for hands, rigid bodies, or deformable categories), view-stable (trajectories remain consistent across camera motion), and directly consumable by robot policies or trajectory-conditioned video generators without an intermediate perception step.

A sparse point set of 64 query points can describe rigid, articulated, and limited deformable motion while staying compact enough for real-time planning loops at 10 ms per inference step (technical report).
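The "view-stable" property can be illustrated with a small sketch: a point expressed in a shared world frame keeps the same coordinates no matter which camera pose observed it. The pose convention (p_world = R · p_cam + t) and the helper names below are assumptions made for illustration, not details taken from the release.

```python
import math


def yaw(theta):
    """3x3 rotation matrix for a rotation of theta radians about the y-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def cam_to_world(point_cam, rotation, translation):
    """Apply p_world = R @ p_cam + t (assumed camera-to-world convention)."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def world_to_cam(point_world, rotation, translation):
    """Inverse mapping: p_cam = R^T @ (p_world - t)."""
    d = [point_world[i] - translation[i] for i in range(3)]
    return tuple(sum(rotation[j][i] * d[j] for j in range(3)) for i in range(3))


if __name__ == "__main__":
    point = (1.0, 2.0, 3.0)  # an object-attached point in world coordinates
    poses = [(yaw(0.0), (0.0, 0.0, 0.0)), (yaw(0.7), (0.5, 0.0, -1.0))]
    for rotation, translation in poses:
        seen = world_to_cam(point, rotation, translation)     # differs per camera
        restored = cam_to_world(seen, rotation, translation)  # same for every camera
        print(seen, restored)
```

The camera-frame coordinates change with every pose, while the world-frame coordinates do not, which is why a downstream planner can consume the trajectory without tracking the camera.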

MolmoMotion-1M and PointMotionBench fill a data gap

Training a forecaster of this generality required a corpus of object-anchored 3D trajectories linked to natural-language action descriptions — data that did not exist at scale. Ai2 built an automatic annotation pipeline that extracts object-grounded 3D trajectories in metric world coordinates from 1.16 million unconstrained internet videos, producing the largest such collection to date.

The pipeline uses DINOv2 features for object tracking, COLMAP for camera pose estimation, and a language model to generate action descriptions from visual changes. The companion PointMotionBench benchmark comprises 2,700 human-validated clips designed to measure object-centric 3D forecasting accuracy across 47 object categories, 12 motion types, and 8 language templates (dataset page).
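One step any such pipeline needs is lifting a tracked 2D pixel into 3D before applying the camera pose. The sketch below uses the standard pinhole camera model. The article does not say where the pipeline gets metric depth or intrinsics, so the `depth` input, the intrinsic values, and the function name are all hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model).

    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    How the real pipeline obtains depth is not stated in the article.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


if __name__ == "__main__":
    # Assumed intrinsics for a 640x480 image; purely illustrative numbers.
    fx = fy = 500.0
    cx, cy = 320.0, 240.0
    print(unproject(420.0, 240.0, 2.0, fx, fy, cx, cy))
```

The resulting camera-frame point would then be moved into the shared world frame using the per-frame camera pose, which in the described pipeline comes from COLMAP.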

Two forecasting heads for different uncertainty profiles

MolmoMotion-AR treats future coordinates as structured text tokens, following the coordinate-style prediction pattern established by vision-language models. Conditioning each new coordinate on the trajectory so far encourages smooth rollouts and yields the strongest accuracy when the future path is well-defined: the model achieves 12.3 cm average displacement error (ADE) at a 2-second horizon on PointMotionBench.
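For readers unfamiliar with the metric, ADE is commonly computed as the mean Euclidean distance between predicted and ground-truth points, averaged over all points and time steps. The sketch below follows that common definition; PointMotionBench's exact averaging and horizon sampling are not given in the article, so treat this as an assumption.

```python
import math


def average_displacement_error(pred, target):
    """Mean Euclidean distance over every (time step, point) pair.

    pred and target are lists over time steps, each a list of (x, y, z)
    points in the same units (meters here).
    """
    total, count = 0.0, 0
    for pred_step, target_step in zip(pred, target):
        for p, q in zip(pred_step, target_step):
            total += math.dist(p, q)
            count += 1
    return total / count


if __name__ == "__main__":
    # Two time steps, two points each; the prediction is off by 0.1 m along x.
    target = [[(0.0, 0.0, 1.0), (0.2, 0.0, 1.0)],
              [(0.0, 0.1, 1.0), (0.2, 0.1, 1.0)]]
    pred = [[(x + 0.1, y, z) for (x, y, z) in step] for step in target]
    print(average_displacement_error(pred, target))
```

A constant 0.1 m offset on every point gives an ADE of about 0.1 m (10 cm), which offers a feel for what the reported 12.3 cm figure means in physical terms.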

MolmoMotion-FM instead uses flow matching with 50 sampling steps to transform noise into continuous 3D trajectories, making it better suited to instructions like “pick up the cup,” where multiple valid grasp approaches exist. The report credits it with reducing mode collapse by 34% compared to deterministic baselines. Researchers can select the variant that matches their downstream uncertainty requirements (project page).
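Flow-matching samplers typically start from Gaussian noise and integrate a velocity field from t = 0 to t = 1. The sketch below does this with plain Euler steps. The `oracle_velocity` function is a hypothetical stand-in for the learned network: it pulls the sample toward a single known target along a straight-line path, which a real model would replace with a prediction conditioned on the image and instruction. The Euler integrator is also an assumption; the article does not name the solver.

```python
import random


def oracle_velocity(x, t, target):
    """Toy stand-in for a learned velocity network (hypothetical).

    For the straight-line path x_t = (1 - t) * x_0 + t * target,
    the velocity at x_t is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_trajectory(target, steps=50, seed=0):
    """Euler-integrate from Gaussian noise (t = 0) to a trajectory (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]   # flattened noisy trajectory
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt                               # t stays strictly below 1
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # A flattened two-waypoint trajectory: (x, y, z) at two future time steps.
    target = [0.10, 0.00, 0.50, 0.20, 0.05, 0.55]
    print(sample_trajectory(target))
```

With this toy field every noise draw ends at the same target. In a trained model, different noise draws can settle on different plausible trajectories, which is the source of the multimodal behavior described above.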

Implications for robotics and generative video

Because the output is explicit 3D point trajectories in world space, MolmoMotion plugs directly into model-based robot planners that need to anticipate object motion before contact — a longstanding gap between perception and control.

The same trajectories can condition video diffusion models like SVD-XT to generate physically plausible frames that respect the forecasted motion, offering a new control knob for controllable video generation.

Early experiments cited in the technical report demonstrate both use cases: a Franka Emika Panda arm achieved a 78% success rate on language-conditioned pick-and-place using MolmoMotion-AR forecasts, and conditioning EPIC-Kitchens video generation on MolmoMotion-FM trajectories improved FVD by 22% over text-only prompts.

For more on vision-language models in robotics, see our analysis of Google Robotics Transformer 2 (RT-2) and the shift to web-scale VLAs. For dataset-scale trends, read How synthetic data is reshaping embodied AI benchmarks. For video generation control, compare with Runway Gen-3 Alpha and the rise of trajectory-conditioned diffusion.

Bottom line: MolmoMotion gives builders an open, language-steerable 3D forecaster with two uncertainty-aware variants (AR at 12.3 cm ADE, FM with 34% less mode collapse), a 1.16-million-video Apache-2.0 training set, and a 2,700-clip validated benchmark — lowering the barrier to adding anticipatory motion reasoning to robot policies and video generators today.

Editorially independent: we accept no payment for coverage and currently use no affiliate links. Read our Editorial Standards and Corrections Policy. Published: Jun 17, 2026.
