NVIDIA’s Cosmos Policy shows how world foundation models can transform robotics by giving machines the ability to predict outcomes, plan ahead, and act strategically within a unified decision-making framework. (Source: Image by RR)

Robot Actions and Outcomes Are Learned as a Single Temporal Process

NVIDIA has unveiled Cosmos Policy, a new robot control framework that aims to standardize how machines decide what actions to take by leveraging large-scale world models. The approach, as reported by interestingengineering.com, builds on NVIDIA’s broader push toward physical AI, in which robots understand and predict how the physical world evolves over time rather than relying on narrowly trained, task-specific control systems. Cosmos Policy adapts pretrained video prediction models to serve as a unified decision-making layer for robotics.

Traditionally, robotic control systems rely on separate components for perception, planning, and action, each requiring extensive labeled data and custom tuning for specific robots or environments. Cosmos Policy replaces this fragmented approach by post-training Cosmos Predict, a video world model already trained on massive amounts of visual data. By treating robot actions, physical states, and task outcomes as part of a single temporal representation, the model learns to predict not only what the robot should do next, but also what will happen as a result.
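To make the "single temporal representation" idea concrete, here is a minimal sketch in Python. It is not NVIDIA's actual data pipeline; the function name and the per-timestep modalities (camera frames, joint states, actions, an outcome signal) are illustrative assumptions about what such an interleaved sequence might contain.

```python
# Minimal sketch (assumed interfaces, not NVIDIA's API): interleave camera
# frames, robot states, actions, and a task-outcome signal into one ordered
# temporal stream, so a pretrained video world model can be post-trained to
# predict all of them jointly rather than through separate modules.
import numpy as np

def build_unified_sequence(frames, joint_states, actions, success_flags):
    """Pack per-timestep modalities into a single ordered stream.

    frames:        list of (H, W, 3) uint8 camera images
    joint_states:  list of (D,) float arrays (robot proprioception)
    actions:       list of (A,) float arrays (commanded motor targets)
    success_flags: list of floats in [0, 1] (task-outcome signal)
    """
    sequence = []
    for t in range(len(frames)):
        sequence.append(("frame",   t, frames[t]))
        sequence.append(("state",   t, joint_states[t]))
        sequence.append(("action",  t, actions[t]))
        sequence.append(("outcome", t, np.float32(success_flags[t])))
    return sequence

# A post-trained world model is then asked to continue this stream: given
# frames and states up to time t, it predicts the next action *and* the next
# frame, so the policy and its predicted consequences live in one model.
```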

Benchmark results show that this unified approach delivers strong performance on complex, multi-step manipulation tasks. In several cases, Cosmos Policy matched or exceeded existing methods while requiring far fewer training demonstrations. This data efficiency is especially valuable in robotics, where collecting real-world training data is expensive and time-consuming. By reusing knowledge embedded in large video models, NVIDIA significantly lowers the barrier to training reliable robotic behaviors.

A key differentiator of Cosmos Policy is its built-in planning capability at inference time. Rather than selecting only the next immediate action, the model can generate and evaluate multiple action sequences by predicting their future outcomes and expected success. Real-world experiments, including bimanual manipulation tasks driven directly from visual input, suggest the framework can transfer beyond simulation. As part of NVIDIA’s growing Cosmos ecosystem, the technology highlights an industry-wide push to standardize the intelligence layer that connects AI reasoning to physical action.
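The planning loop described above resembles sampling-based model-predictive control. The sketch below illustrates that general pattern; the `sample_action_sequence` and `predict_success` methods are hypothetical stand-ins, not a documented Cosmos Policy API.

```python
# Sketch of inference-time planning with a world model, under assumed
# interfaces: sample several candidate action sequences, let the model
# imagine the outcome of each, and execute the one with the highest
# predicted success.

def plan(world_model, observation, num_candidates=8, horizon=16):
    best_actions, best_score = None, float("-inf")
    for _ in range(num_candidates):
        # Sample one candidate plan of `horizon` actions.
        actions = world_model.sample_action_sequence(observation, horizon)
        # Roll the plan forward in imagination and score the predicted outcome.
        score = world_model.predict_success(observation, actions)
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions  # executed open-loop, or re-planned at each step
```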

read more at interestingengineering.com