S2GO: Streaming Sparse Gaussian Occupancy — ICLR 2026 Publication

February 19, 2026

Modern vision systems don’t just need to detect objects—they need to reason about the geometry of the world around them: what areas are free to drive, what’s occupied, and how the scene evolves over time.

The challenge is that most 3D occupancy methods are heavy. They rely on dense voxel grids or huge collections of 3D Gaussians, which quickly become slow and hard to scale in continuous operation. That makes it difficult to maintain situational awareness over time without expensive computation.

S2GO (Streaming Sparse Gaussian Occupancy) takes a different approach. Instead of rebuilding the world from scratch at every timestep, it maintains a small number of evolving 3D queries that summarize the scene and stream forward as the vehicle moves. With this compact streaming representation, S2GO can deliver dense, high‑quality 3D occupancy predictions up to 5.9× faster than prior methods—making real‑time performance possible even on a single GPU.

In this post, we’ll explain the motivation behind S2GO, how the streaming query framework works, and why this approach points to a more scalable path for real‑time, camera‑only 3D perception.

Why Occupancy Prediction Matters

For autonomous vehicles, understanding where it’s safe to drive is just as important as recognizing what is in view. Real-world driving scenes contain complex geometry: parked cars, uneven curbs, construction barriers, and objects that may be partially hidden. A good perception system needs to model not just discrete detections with known categories, but the full 3D layout of the world, including uncertain or unseen regions and objects outside any fixed vocabulary.

Occupancy prediction provides that dense spatial understanding. Instead of limiting perception to boxed detections or semantic maps, it answers a fundamental question: what space is free, and what is not? This representation captures fine-grained geometry and can naturally handle unknown objects, motion blur, and occlusions—making it a valuable complement to traditional vision outputs.

The difficulty is efficiency. Dense 3D representations come with steep computational costs, particularly when maintaining stable predictions over time. Voxel grids spend compute everywhere, even in empty space, while dense Gaussian methods concentrate on occupied regions but still require tens of thousands of primitives. These designs make it hard to achieve both long temporal horizons and real‑time speed.

S2GO is motivated by a simple question: can we represent the 3D world just as effectively, but with far less compute? By rethinking what information should persist as the vehicle moves, S2GO bridges the gap between accuracy and efficiency—delivering consistent, long‑term scene understanding without the usual cost of dense 3D storage.

What is Streaming Occupancy?

Instead of treating each frame as a separate 3D prediction problem, streaming occupancy treats perception as a continuous process unfolding over time. Rather than rebuilding a full occupancy grid from scratch at every timestep, the model carries a compact state forward and incrementally updates it as new images arrive. This better matches how driving actually works: the ego vehicle moves, other actors move, and occlusions constantly change what is visible from the cameras.

Maintaining a streaming state is not just about saving computation; it is about temporal consistency. When the model maintains continuity across frames, it can stabilize geometry, preserve structure that becomes temporarily occluded, and reduce the frame‑to‑frame flicker that appears when each scene is estimated independently. This continuity makes it easier to keep objects separated, avoid drift, and maintain a coherent view of the world over longer horizons, properties that downstream planning and control modules rely on when reasoning about free space and obstacles.

S2GO implements streaming occupancy by summarizing the scene into a small set of 3D queries that persist and evolve over time. These queries act as a lightweight world state: they encode accumulated context from previous frames, are refined with new camera inputs, and are decoded into dense semantic occupancy at each timestep. This query-based streaming design allows S2GO to capture long-term temporal information while staying efficient enough for real-time deployment on a single GPU.
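To make the loop concrete, here is a minimal sketch of the streaming pattern described above, in PyTorch-style Python. The names (`QueryState`, `refine`, `decode_occupancy`), sizes, and shapes are hypothetical illustrations of the idea, not the actual S2GO code.

```python
import torch

class QueryState:
    """Compact world state: a fixed-size set of persistent 3D queries."""
    def __init__(self, num_queries: int = 1024, feat_dim: int = 256):
        self.positions = torch.zeros(num_queries, 3)        # query centers in 3D
        self.features = torch.zeros(num_queries, feat_dim)  # accumulated context

def run_streaming(frames, state, refine, decode_occupancy):
    """Carry the small query state forward instead of rebuilding each frame."""
    occupancies = []
    for images, ego_pose in frames:
        # Update the persistent queries with the newest multi-camera images.
        state = refine(state, images, ego_pose)
        # Decode the compact state into a dense semantic occupancy grid.
        occupancies.append(decode_occupancy(state))
    return occupancies, state
```

The key property is that the state passed between iterations stays the same small size no matter how long the sequence runs.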

Method: Sparse Queries as a Streaming World State

S2GO represents the 3D scene with a compact set of sparse 3D queries that act as a persistent world state, rather than storing a dense voxel grid or a massive collection of Gaussians over time. At any given timestep, the model maintains roughly a thousand queries that summarize the current scene, each with a 3D position and associated features encoding local context. Because this state is small and fixed in size, it can be propagated through long temporal horizons without scaling compute with the volume of space being observed.
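One concrete piece of carrying this state forward is transforming the query positions with the ego motion. The sketch below assumes queries store positions in the ego frame and that an SE(3) pose transform between consecutive frames is available; it is an illustration, not the paper's exact code.

```python
import torch

def propagate_queries(positions: torch.Tensor, prev_to_curr: torch.Tensor) -> torch.Tensor:
    """Move last frame's query centers into the current ego frame.

    positions:    (N, 3) query centers in the previous ego frame.
    prev_to_curr: (4, 4) SE(3) transform from the previous to the current frame.
    """
    homogeneous = torch.cat([positions, torch.ones(len(positions), 1)], dim=-1)  # (N, 4)
    return (homogeneous @ prev_to_curr.T)[:, :3]  # (N, 3), now in the current frame
```

Because N is on the order of a thousand, this propagation is trivially cheap and independent of how much space the vehicle has observed.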

At each frame, S2GO takes the latest multi-camera images together with a queue of queries from previous timesteps and refines the current queries using a temporal transformer. This module lets queries interact with both historical state and new visual evidence, so they can move, update their features, and track how the scene is evolving. After refinement, each query is decoded into a small set of Gaussians, forming a hierarchical representation: the query anchors a region, and its Gaussians capture fine-grained local structure within it.
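A rough sketch of one refinement step, built from standard PyTorch attention blocks. The module name, layer layout, and the number of Gaussians per query are illustrative assumptions; the paper's transformer differs in its details.

```python
import torch
import torch.nn as nn

class TemporalRefiner(nn.Module):
    """One refinement step: attend to historical queries, then to image features."""
    def __init__(self, dim: int = 256, heads: int = 8,
                 gaussians_per_query: int = 8, num_classes: int = 17):
        super().__init__()
        self.history_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Each query decodes into K Gaussians: mean (3), scale (3),
        # rotation quaternion (4), and per-class semantic logits.
        self.gaussian_head = nn.Linear(dim, gaussians_per_query * (3 + 3 + 4 + num_classes))

    def forward(self, queries, history, image_feats):
        # queries: (B, N, dim); history: (B, T*N, dim); image_feats: (B, HW, dim)
        q, _ = self.history_attn(queries, history, history)  # read the streamed past
        q, _ = self.image_attn(q, image_feats, image_feats)  # read new visual evidence
        q = q + self.ffn(q)
        return q, self.gaussian_head(q)  # refined queries + decoded Gaussian params
```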

These semantic Gaussians are then converted into a dense occupancy grid using an efficient Gaussian-to-voxel splatting operation. The result is a full 3D semantic occupancy prediction that matches standard benchmarks, while the underlying streaming state remains lightweight and query-based.
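The splatting step can be pictured with the naive dense computation below: every Gaussian contributes semantic evidence to voxel centers according to its density there. This brute-force version is for clarity only; the efficient operation the method actually uses restricts each Gaussian to the voxels near its center. Shapes and the soft-evidence accumulation are assumptions.

```python
import torch

def splat_gaussians(means, inv_covs, logits, grid_min, voxel_size, grid_shape):
    """Naive Gaussian-to-voxel splatting: evaluate every Gaussian at every voxel.

    means:    (G, 3) Gaussian centers.
    inv_covs: (G, 3, 3) inverse covariance matrices.
    logits:   (G, C) per-Gaussian semantic logits.
    Returns an (X, Y, Z, C) grid of accumulated semantic evidence.
    """
    X, Y, Z = grid_shape
    idx = torch.stack(torch.meshgrid(torch.arange(X), torch.arange(Y),
                                     torch.arange(Z), indexing="ij"), dim=-1)
    centers = grid_min + (idx.float() + 0.5) * voxel_size        # (X, Y, Z, 3)
    d = centers.reshape(-1, 1, 3) - means.reshape(1, -1, 3)      # (V, G, 3)
    maha = torch.einsum("vgi,gij,vgj->vg", d, inv_covs, d)       # squared Mahalanobis
    weights = torch.exp(-0.5 * maha)                             # Gaussian falloff
    return (weights @ logits).reshape(X, Y, Z, -1)               # soft semantics
```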

This design creates a practical balance: the sparse world state enables rich temporal reasoning and long history at low cost, and the Gaussian output preserves the geometric detail and irregular shapes needed for high-fidelity 3D understanding.

Geometry-First Pretraining

Using sparse queries for dense 3D occupancy is powerful but also challenging: unlike voxels, queries are not tied to fixed coordinates, so the model must learn where they should move in 3D space before they can capture useful structure. If this alignment is learned only from occupancy labels, queries can easily drift, cluster in the wrong places, or fail to track meaningful geometry over time.

S2GO addresses this with a geometry-first pretraining stage that teaches queries and their Gaussians to organize around real 3D structure before semantic training begins. In this stage, query locations are initialized near LiDAR points with added noise, and the model is trained to denoise them back onto the underlying surfaces, explicitly learning how to move from empty space onto occupied geometry. At the same time, the decoded Gaussians are supervised through a rendering objective: they are used to render depth and RGB views that are compared against LiDAR-derived depth and camera images, providing dense geometric feedback that encourages each query’s Gaussians to capture fine local shape.
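A minimal sketch of the denoising half of this objective, assuming LiDAR points serve as surface samples and the model predicts a positional correction per noised query. The noise scale and L1 loss are illustrative choices, not the paper's exact recipe, and the rendering half is omitted.

```python
import torch
import torch.nn.functional as F

def query_denoising_loss(model, lidar_points, images, noise_std=0.5):
    """Teach queries to move from free space back onto occupied surfaces.

    lidar_points: (N, 3) surface samples used to initialize query positions.
    model:        maps (noised positions, images) -> (N, 3) predicted offsets.
    """
    noised = lidar_points + noise_std * torch.randn_like(lidar_points)
    offsets = model(noised, images)  # predicted corrections toward the surface
    return F.l1_loss(noised + offsets, lidar_points)
```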

This two-part pretraining—denoising plus rendering—gives S2GO a strong geometric prior before it ever sees semantic occupancy labels. When the model transitions to the main semantic occupancy stage, it no longer relies on LiDAR and instead initializes queries from learnable positions and image features, but those queries now know how to reposition themselves in 3D and decode into Gaussians that align with real scene structure.

Results

S2GO delivers state-of-the-art 3D semantic occupancy performance while significantly improving inference speed over prior streaming Gaussian-based methods. On nuScenes benchmarks such as SurroundOcc and Occ3D, S2GO variants outperform strong Gaussian baselines like GaussianWorld by up to 1.5 IoU while running roughly 5–6× faster, enabling real-time operation on a single GPU.

The method also generalizes across datasets and settings. On KITTI-360, in a monocular 3D semantic occupancy configuration, S2GO achieves leading accuracy, surpassing GaussianFormer-style baselines by several points in both IoU and mIoU. These results highlight that the streaming query-based representation scales beyond a single benchmark or sensor setup.

Qualitatively, S2GO’s sparse, query-level world state leads to more stable temporal behavior over long sequences. Because the model reasons at the level of persistent queries instead of tens of thousands of low-level Gaussians, it better preserves object separation and scene structure, reducing merging and drift artifacts that can occur when streaming is performed directly on dense primitives.

Integration into Autonomous Driving Systems

S2GO is designed to fit naturally into modern vision-centric autonomy stacks as a lightweight, streaming 3D world model. Its output—a dense semantic occupancy grid—provides a direct representation of free space, static structures, and dynamic obstacles that downstream planning and control modules can consume without additional geometric post-processing.

Because S2GO maintains a compact query-based state instead of a dense 3D volume, it can run continuously under tight latency and compute budgets while still leveraging long temporal context. This is especially important in practical deployments, where the system must keep up with camera frame rates, handle frequent occlusions, and preserve stable geometry over time for reliable decision making.

More broadly, S2GO demonstrates that persistent 3D understanding does not require persistent dense 3D storage. By shifting the streaming state from dense grids to sparse queries, it offers a scalable path toward long-horizon, real-time, camera-only 3D scene understanding: an important building block for robust, deployable autonomous driving systems.

More details can be found on the project website and in the preprint, as well as in our upcoming presentation at ICLR 2026.