NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

NoRD eliminates reasoning overhead in autonomous driving, matching top benchmarks with less data by fixing a key flaw in reinforcement learning optimization

Shubh Gupta, Ishaan Rawal • June 3, 2026 • 6 min read

Current Vision-Language-Action (VLA) models for autonomous driving rely on two primary dependencies: massive dataset collection and dense reasoning annotations. While such models achieve high performance, they introduce significant data costs, computational overhead, and high inference latency, making them impractical for real-world deployment. In this work, we challenge both of these dependencies with NoRD: No Reasoning for Driving. We identify that the performance gap in data-efficient, reasoning-free models is due not to a lack of reasoning or a deficiency of data, but rather to difficulty bias. This bias arises during Reinforcement Learning optimization of a weak policy with sparse driving rewards, causing the model to ignore complex, difficult maneuvers in favor of learning from easy samples. By using Dr. GRPO to mitigate the difficulty bias, NoRD achieves competitive performance on Waymo and NAVSIM benchmarks using 3x fewer tokens and 60% less data than reasoning-heavy baselines. NoRD demonstrates that, with simple modifications to the VLA optimization step, we can achieve high performance with a fraction of the training data and no reasoning overhead, enabling more efficient autonomous systems.

Does an Autonomous Vehicle Need to “Think Out Loud” to Drive Well?

Vision-Language-Action (VLA) models represent a promising frontier for autonomous driving, particularly in resolving "long-tail" scenarios. These models typically reason through a scene, generating internal text traces before deciding on an action.

However, this mechanism of “thinking out loud” introduces two significant bottlenecks:

  1. High inference latency: The time required to generate hundreds of reasoning tokens autoregressively is a major hurdle for deployment in autonomous driving systems, where low latency is critical
  2. Costly annotation overhead: Training these models necessitates coupling vast amounts of driving data with corresponding reasoning annotations, which is both expensive and difficult to scale

To address this, we propose NoRD (No Reasoning for Driving). NoRD bypasses intermediate text generation entirely, mapping inputs directly to actions. By leveraging the VLM’s internal spatial-temporal priors without the "autoregressive tax" of reasoning tokens, NoRD maintains high-level understanding while reducing latency to a fraction of a second.

Conventional VLA Pipelines vs. NoRD

To achieve state-of-the-art performance, conventional VLA pipelines rely on a complex, data-heavy architecture. This traditional approach involves several resource-intensive stages:

  • Massive data acquisition: A large driving dataset is required as the foundation
  • Teacher-led reasoning: This data is fed through a high-performance Teacher LLM (like GPT-4o) to generate a corresponding “Reasoning Dataset”
  • Supervised fine-tuning (SFT): The model undergoes SFT using these reasoning annotations to learn how to “think out loud” before it acts
  • Reinforcement learning (RL): Finally, RL fine-tuning is applied to align the policy with driving metrics

While effective, this pipeline is inherently bottlenecked by the need for expensive teacher models and vast amounts of driving data.

In contrast, NoRD follows a simpler and more efficient alternative. By reducing the data requirement and removing the need for intermediate reasoning, we streamline the process significantly:

  • Lower data requirements: NoRD can effectively learn good driving behaviors using only a small driving dataset
  • SFT on trajectory data: Instead of training on reasoning traces, NoRD performs SFT directly on raw trajectory data
  • RL fine-tuning: The resulting model is then fine-tuned with RL, learning to optimize its driving policy while keeping inference fast because no reasoning tokens are generated

By eliminating the reasoning overhead, NoRD creates a recipe for high-performance autonomous driving that is not only cheaper to train but also faster to develop and deploy.

Standard GRPO Fails to Improve Non-Reasoning VLA Base Policy

While Group Relative Policy Optimization (GRPO) is commonly used for RL fine-tuning of existing reasoning-based VLAs, we observed that it produces a negligible 0.67% improvement when applied to a data-efficient, non-reasoning (weak) SFT base policy. This stands in contrast to the 9% boost observed in prior works (e.g., AutoVLA) that use massive datasets and explicit reasoning traces. Rather than attributing this failure to an inherent limitation of weak policies, we explore an alternate hypothesis: that standard RL optimization is fundamentally ill-equipped to train reasoning-free, data-efficient policies. This motivates our investigation into difficulty bias as a primary bottleneck to improvement.

Analysis of Optimization Failure: Difficulty Bias

To validate our hypothesis that the performance bottleneck lies within the optimization process rather than the model’s inherent capacity, we analyze the density of mean PDM (Predictive Driver Model) scores produced by the weak base policy across the NAVSIM dataset. This reveals two distinct regimes:

  • Low-variance “easy” scenarios: These common cases (e.g., driving straight) have high density near the maximum score; because the variance is low, the normalization denominator in the GRPO objective remains small, allowing the learning signal from these simple tasks to remain disproportionately high
  • High-variance “hard” scenarios: Critical long-tail situations exhibit high variance and lower scores; in these complex cases, the high variance increases the normalization denominator, which effectively lowers the learning signal and prevents the model from optimizing on these long-tail tasks

In standard GRPO, gradients are averaged across groups. Hence, the model spends its learning budget on easy scenarios it has already mastered, while the suppressed signals from hard scenarios fail to drive policy improvement. We interpret this failure as a form of difficulty bias in GRPO.
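The effect of the normalization denominator can be illustrated with a small numerical sketch. It assumes the commonly used GRPO advantage, in which each rollout's reward is centered on its group mean and divided by the group standard deviation. The reward values and the helper names `grpo_advantages` and `normalization_gain` are hypothetical and are not taken from the NoRD experiments.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages with standard-deviation normalization."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def normalization_gain(rewards, eps=1e-6):
    """Factor by which a raw reward gap is scaled inside its group."""
    return 1.0 / (np.std(rewards) + eps)


# Hypothetical per-rollout scores in [0, 1] for one "easy" and one "hard" scene.
easy_scores = [0.98, 0.99, 0.97, 0.99]  # low variance, near the maximum
hard_scores = [0.10, 0.90, 0.30, 0.70]  # high variance, lower mean

easy_adv = grpo_advantages(easy_scores)
hard_adv = grpo_advantages(hard_scores)
easy_gain = normalization_gain(easy_scores)
hard_gain = normalization_gain(hard_scores)

print("easy advantages:", np.round(easy_adv, 2))
print("hard advantages:", np.round(hard_adv, 2))
print("gain ratio (easy / hard):", round(easy_gain / hard_gain, 1))
```

In this toy example, a score gap of 0.01 in the easy group is amplified dozens of times more strongly than the same gap in the hard group. The easy group therefore ends up with advantages of about the same magnitude as the hard group, even though its rollouts barely differ in quality.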

Dr. GRPO as a Drop-in Replacement for Mitigating Difficulty Bias

To overcome the limitations of standard RL, we employ Dr. GRPO as a drop-in replacement for fine-tuning our weak base policy. By rebalancing the optimization to be less sensitive to the group variance, Dr. GRPO enables the model to optimize in our complex reward landscape. Notably, it converts our previous 0.67% gain with GRPO into an 11.68% gain, demonstrating a much more robust shift in the optimization curves toward high-performance scores. This allows NoRD to learn complex driving maneuvers that were previously unattainable through standard GRPO.
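As a rough illustration of the difference, the sketch below compares the two advantage computations on the same hypothetical reward groups. It assumes the published Dr. GRPO formulation, which keeps the group-mean baseline but drops the division by the group standard deviation. Dr. GRPO also removes per-response length normalization from the loss, which is not shown here. The function names and reward values are illustrative assumptions, and NoRD's actual training code may differ.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO: center on the group mean, divide by the group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: center on the group mean only."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()


# Hypothetical per-rollout scores; not taken from the NoRD experiments.
easy_scores = [0.98, 0.99, 0.97, 0.99]
hard_scores = [0.10, 0.90, 0.30, 0.70]

for name, scores in [("easy", easy_scores), ("hard", hard_scores)]:
    grpo_mag = np.abs(grpo_advantages(scores)).mean()
    dr_mag = np.abs(dr_grpo_advantages(scores)).mean()
    print(f"{name}: GRPO |A| = {grpo_mag:.3f}, Dr. GRPO |A| = {dr_mag:.4f}")

easy_dr = dr_grpo_advantages(easy_scores)
hard_dr = dr_grpo_advantages(hard_scores)
```

Without the standard-deviation division, the advantage magnitude tracks the actual spread of rewards in each group, so the high-variance hard scenario contributes a far larger learning signal than the nearly saturated easy one.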

Qualitative Results on Long-Tail Scenarios

NoRD demonstrates robust spatial-temporal understanding across diverse long-tail scenarios, executing complex maneuvers.

WaymoE2E

Driving in Rain

Construction zones

Turning in traffic

Traffic light understanding

NAVSIM

Extreme light transitions

Stable following

Making an unprotected turn

Dense traffic

Efficiency results

By bypassing intermediate reasoning tokens, NoRD achieves large efficiency gains, significantly reducing token counts and runtime relative to reasoning VLAs.

Failure modes

While NoRD navigates a wide variety of traffic scenarios effectively, it is not without limitations. Our analysis of the model’s performance reveals specific failure modes, such as taking wider turns than necessary and mishandling multi-agent interactions, that provide avenues for future improvement.

Aggressive maneuvering

Blindness to rear traffic

Conservative stops

BIBTEX
@inproceedings{rawal2026nord,
  title={NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning},
  author={Rawal, Ishaan and Gupta, Shubh and Hu, Yihan and Zhan, Wei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

Shubh Gupta

Research Scientist

Shubh Gupta is a Research Scientist at Applied Intuition focused on generative AI and autonomous driving. He holds a PhD in Electrical Engineering from Stanford University, where his research centered on high-integrity localization for autonomous vehicles. While at Stanford, he conducted postdoctoral research on neural mapping and navigation systems.

Ishaan Rawal

Research Intern

Ishaan Rawal was a Research Intern at Applied Intuition working on vision-language-action model post-training for autonomous driving. He holds a Master's degree in Computer Science from Texas A&M University and a BE in Computer Science from Birla Institute of Technology and Science, Pilani.