RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space
RAYNOVA is a 4D world foundation model that unifies space and time in ray space, enabling multiview, long-horizon video generation across diverse camera setups without explicit 3D reconstruction.
May 29, 2026 • 4 min read
World models are not just about generating videos. They are about understanding and simulating how the world evolves. The real world is inherently 4D: multiple cameras observe a shared 3D world that changes continuously over time. Vehicles move, the ego vehicle rotates and translates, and camera setups differ across fleets. A practical world model must reason across all of this without being fragile to specific sensor layouts or geometric assumptions.
In this post, we introduce RAYNOVA, a 4D world foundation model that unifies space and time in a single representation using a pure autoregressive framework. Instead of separating spatial reasoning from temporal modeling, it learns to autoregress over both dimensions simultaneously. The result is a scalable and flexible model that can generate multiview, long-horizon videos under diverse camera setups without relying on explicit 3D scene reconstruction.
Why World Models Matter
Traditional video generation models focus on producing visually realistic sequences. They often assume a fixed camera setup, a single viewpoint with only slight motion, and strong temporal continuity per camera. While these approaches produce impressive videos, they do not necessarily learn a generalizable model of the world. World foundation models aim for something more ambitious: simulating physically plausible multiview scenes under diverse camera motions, changing perspectives, and multiple input conditions. This is fundamentally different from text-to-video generation.
To achieve this, separating spatial and temporal modeling is no longer sufficient. The world does not evolve in “space first, time later” but in an inseparable continuous 4D space. This motivates a unified spatio-temporal representation where space and time are treated jointly, rather than as separate modules.
Rethinking Geometry: Beyond Explicit 3D Priors
Many existing world models enforce spatio-temporal consistency by constructing explicit 3D representations such as point clouds, occupancy grids, latent volumes, or 3D Gaussians. These approaches introduce strong geometric inductive biases. While effective within constrained domains, they often depend on specific camera overlaps, require auxiliary supervision (depth, flow, lidar), and generalize poorly beyond the training distribution.
RAYNOVA takes a different path. Instead of forcing a particular 3D structure, it represents token positions in camera ray space, a representation that naturally connects scales, views, and frames without explicitly constructing a 3D scene graph. This allows the model to:
Generalize to unseen camera configurations
Handle arbitrary camera rotations and shifts
Support heterogeneous training data
Remain data-driven rather than geometry-constrained
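To make the idea of a token position in ray space concrete, here is a minimal sketch of how one might assign a viewing ray to each image token. It is an illustration only, not the RAYNOVA implementation: the pinhole camera model, the choice of the patch center as the token's pixel, and the function name `token_ray_directions` are all assumptions made for this example.

```python
import math

def token_ray_directions(fx, fy, cx, cy, rotation, image_w, image_h, grid_w, grid_h):
    """Unit ray direction for each token on a grid_h x grid_w token grid.

    Assumptions (not from the post): a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), and a 3x3 camera-to-shared-frame
    rotation given as nested lists.
    """
    rays = []
    for row in range(grid_h):
        for col in range(grid_w):
            # Pixel at the center of the image patch covered by this token.
            u = (col + 0.5) * image_w / grid_w
            v = (row + 0.5) * image_h / grid_h
            # Back-project through the pinhole model: d_cam = K^-1 [u, v, 1].
            d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
            # Rotate the direction into the frame shared by all cameras.
            d = [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
            norm = math.sqrt(sum(x * x for x in d))
            rays.append(tuple(x / norm for x in d))
    return rays

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# The same function serves every scale: a 1x1 coarse grid and a 2x2 finer grid.
coarse = token_ray_directions(2.0, 2.0, 2.0, 2.0, identity, 4, 4, 1, 1)
fine = token_ray_directions(2.0, 2.0, 2.0, 2.0, identity, 4, 4, 2, 2)
```

Because a ray depends only on the camera parameters and the token's location in the image, tokens from different cameras, frames, and scales can all be described in the same terms, with no reconstructed scene in between.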
Method: Ray-Level 4D Relative Position Embedding
To reason across views and time, the model must know how tokens relate to each other spatially and temporally. Rather than relying on a global coordinate frame, RAYNOVA uses relative positions in camera ray space. The embedding encodes how tokens relate to each other across views, frames, and scales with minimal handcrafted geometric bias. Because it is relative rather than absolute, the model does not memorize a specific world layout. As a result, RAYNOVA is a scalable, data-driven framework that supports heterogeneous training data, unseen camera configurations, and extrapolation beyond the training range.
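The post does not spell out the form of the embedding, so the following is only a toy illustration of why a relative encoding avoids memorizing a world layout. It assumes, hypothetically, that the relation between two tokens is summarized by the angle between their rays plus their frame and scale offsets; the names `relative_ray_features` and `rotate_about_z` are invented for this sketch.

```python
import math

def relative_ray_features(ray_a, ray_b, frame_a, frame_b, scale_a, scale_b):
    """Hypothetical relative descriptor between two tokens.

    Returns (angle between the unit rays in radians, frame offset, scale offset).
    """
    dot = sum(a * b for a, b in zip(ray_a, ray_b))
    dot = max(-1.0, min(1.0, dot))  # guard acos against rounding error
    return (math.acos(dot), frame_b - frame_a, scale_b - scale_a)

def rotate_about_z(ray, yaw):
    """Rotate a 3D direction about the z axis by `yaw` radians."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = ray
    return (c * x - s * y, s * x + c * y, z)

ray_a = (1.0, 0.0, 0.0)
ray_b = (0.0, 1.0, 0.0)
before = relative_ray_features(ray_a, ray_b, 0, 1, 0, 2)
# Re-express both rays in a frame rotated by 0.7 rad: the descriptor is unchanged.
after = relative_ray_features(
    rotate_about_z(ray_a, 0.7), rotate_about_z(ray_b, 0.7), 0, 1, 0, 2
)
```

The descriptor is the same before and after the rotation, which is the property the relative formulation is after: the model sees how tokens relate to one another, not where they sit in any particular global frame.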
Method: Dual-Causal Autoregression
RAYNOVA uses a pure autoregressive architecture over discrete tokens, with no diffusion modules. It autoregresses along two dimensions simultaneously: scale and time.
Instead of generating each image token by token, RAYNOVA generates images scale by scale: first coarse structure, then fine-grained details. This hierarchical “next-scale prediction” strategy allows the model to capture global structure first and add details progressively. At the same time, the model autoregresses across frames. Importantly, it neither assumes each camera evolves independently nor enforces adjacency constraints between cameras. Instead, the current frame is conditioned on all views from previous frames. This creates a unified temporal reasoning process across multiview inputs.
By combining both, RAYNOVA forms a topological order over the 4D world, generating coarse-to-fine structure frame by frame with multiview consistency within a single transformer architecture. This design enables efficient long-horizon video generation with flexible frame rates.
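The ordering described above can be sketched as a block-level attention mask. This is a simplified reading, not the actual model code: it assumes each block holds the tokens of all views at one (frame, scale) pair, that a block may attend to every scale of every earlier frame, and that within a frame it may attend to the same and coarser scales. The function names are hypothetical.

```python
def generation_order(num_frames, num_scales):
    """Blocks in generation order: frame by frame, coarse to fine within a frame."""
    return [(t, s) for t in range(num_frames) for s in range(num_scales)]

def may_attend(query, key):
    """Dual-causal rule between two (frame, scale) blocks (assumed for this sketch)."""
    q_frame, q_scale = query
    k_frame, k_scale = key
    if k_frame < q_frame:
        return True  # earlier frame: all views and all scales are visible
    # Same frame: only the same or coarser scales are visible.
    return k_frame == q_frame and k_scale <= q_scale

def block_mask(num_frames, num_scales):
    """Boolean mask where mask[i][j] says whether block i may attend to block j."""
    order = generation_order(num_frames, num_scales)
    return [[may_attend(q, k) for k in order] for q in order]

order = generation_order(2, 3)
mask = block_mask(2, 3)
```

Under these assumptions the mask is lower triangular in generation order, so causality over scale and causality over time collapse into one sequence that a single transformer can decode. Since all views of a frame share a block, no camera is treated as evolving on its own.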
Results
RAYNOVA delivers state-of-the-art performance in multiview video generation while significantly reducing latency compared with existing world models in autonomous driving scenarios. Generation can be conditioned on multiple control signals, such as text, objects, and maps, producing diverse synthetic multiview images and videos with high fidelity to those conditions.
Conditional Scene Generation
Multiview Video Generation
Thanks to the flexible ray-level relative position embedding, RAYNOVA can synthesize novel views with different camera shifts, rotations, and fields of view. It can even support scenes with novel camera configurations that are unseen in the training data.
Camera rotation
Camera shift
More details can be found on the project website and in the paper, as well as in our upcoming presentation at CVPR 2026.