ReCogDrive

A Reinforced Cognitive Framework for End-to-End Autonomous Driving

Yongkang Li1,2*, Kaixin Xiong2*, Xiangyu Guo1,2, Fang Li2, Sixu Yan1, Gangwei Xu1,2,
Lijun Zhou2, Long Chen2, Haiyang Sun2†, Bing Wang2, Kun Ma2, Guang Chen2,
Hangjun Ye2, Wenyu Liu1, Xinggang Wang1✉

1Huazhong University of Science and Technology
2Xiaomi EV

(*) Equal contribution. (†) Project leader. (✉) Corresponding author.

arXiv 2025

Paper PDF | Project Page | Hugging Face Collection | Hugging Face Datasets

News

  • Sept. 30th, 2025: We have updated our latest paper with more model details, experiments, and comprehensive visualizations. Meanwhile, we fixed the unintended NumPy issue 🐛 that previously caused inconsistencies in the training metric cache. Now the code ensures reproducible and consistent results. Special thanks to the discussion in issue #10 for bringing this up!
  • Aug. 24th, 2025: We have released all driving pretraining QA, covering 12 driving datasets plus our own annotated NAVSIM data. We have rewritten the scoring, filtering, and evaluation for the open-source data. If it’s helpful to you, feel free to star and cite our work! 🚗💨
  • Aug. 21st, 2025: We released the initial version of the code and weights on NAVSIM, along with documentation and training/evaluation scripts. We will also release the new revision of the paper and the pretraining datasets later this month or next. Please stay tuned! ☕️
  • Jun. 11th, 2025: We released our paper on arXiv. Code and models are coming soon. Please stay tuned! ☕️

Updates

  • Release Bench2Drive, DriveLM, NAVSIM2.0, and DriveBench evaluation frameworks
  • Release Paper
  • Release Full Models and Training/Evaluation Framework
  • Release Full Driving QA Datasets
  • Release Updated Paper

Abstract

Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to reduce collisions and improve driving safety, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage that reinforces the planner for safer, more comfortable trajectories. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. Code and models are available at the ReCogDrive GitHub Repository.
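
For readers new to GRPO-style reinforcement learning, the sketch below illustrates the group-relative advantage computation that DiffGRPO builds on: several candidate trajectories are sampled for the same scene, each is scored by a reward, and rewards are normalized within the group. This is a minimal illustrative sketch, not the repository's implementation; the reward values are invented and the reward function itself (e.g., a PDMS-style driving score) is left abstract.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Center and scale rewards within one sampled group of trajectories.

    rewards: shape (G,), scores for G trajectories sampled from the same
    scene. Returns per-trajectory advantages normalized within the group.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate trajectories sampled for one scene, each scored
# by a driving reward (values are made up for illustration).
rewards = np.array([0.92, 0.85, 0.40, 0.77])
print(group_relative_advantages(rewards))
```

Normalizing within the group removes the need for a learned value baseline, which is the main appeal of group-relative objectives.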

Getting Started

Checkpoints

Results on NAVSIM

| Method | Model Size | Training Stage | PDMS | Weight Download |
|---|---|---|---|---|
| ReCogDrive-Base-VLM | 2B | Stage 1 | 84.1 | Model |
| ReCogDrive-Base-IL | 2B + 35M | Stage 1 & 2 | 86.5 | Model |
| ReCogDrive-Base-RL | 2B + 35M | Stage 1 & 2 & 3 | 90.8 | Model |
| ReCogDrive-Large-VLM | 8B | Stage 1 | 86.4 | Model |
| ReCogDrive-Large-IL | 8B + 35M | Stage 1 & 2 | 86.5 | Model |
| ReCogDrive-Large-RL | 8B + 35M | Stage 1 & 2 & 3 | 90.4 | Model |

Results on Bench2Drive

Closed-loop and multi-ability testing results on the CARLA Bench2Drive leaderboard. Efficiency, Comfort, Success Rate, and DS are closed-loop metrics ↑; the remaining columns are multi-ability results (%) ↑.

| Method | Efficiency | Comfort | Success Rate | DS | Merging | Overtaking | Emerg. Brake | Give Way | Traf. Sign | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| ReCogDrive | 138.18 | 17.45 | 45.45 | 71.36 | 29.73 | 20.00 | 69.09 | 20.00 | 71.34 | 42.03 |

Results on DriveLM and DriveBench

| Method | DriveLM (GPT-Score) | LingoQA (Lingo-Judge) | DriveBench Percep. | DriveBench Predict. | DriveBench Plan. | DriveBench Behav. | DriveBench Avg. |
|---|---|---|---|---|---|---|---|
| ReCogDrive | 67.30 | 67.20 | 64.95 | 49.34 | 70.20 | 42.36 | 56.71 |

Driving Pretraining Datasets

| Datasets | Source | Rewritten and Filtered Annotations (JSONL) |
|---|---|---|
| NAVSIM-Traj | - | JSONL |
| NAVSIM-ReCogDrive | - | JSONL |
| DriveLM | link | JSONL |
| NuInstruct | link | JSONL |
| NuScenes-QA | link | JSONL |
| OmniDrive | link | JSONL |
| Senna | link | JSONL |
| LingoQA | link | JSONL |
| DRAMA | link | JSONL |
| MapLM | link | JSONL |
| Talk2Car | link | JSONL |
| DriveGPT4 | link | JSONL |
| CODA-LM | link | JSONL |
| SUTD | link | JSONL |
| Bench2Drive-Traj | - | JSONL |
| Bench2Drive-QA | link | JSONL |

Our ReCogDrive is pretrained on 12 open-source driving datasets. For most of these datasets, we leveraged Qwen2.5VL-72B to re-annotate the answers, applied standardized scoring, and filtered them to obtain 12 high-quality QA datasets. In addition, we built an automated annotation pipeline on NAVSIM, generating 752k QA pairs. These resources enable VLMs to better adapt to driving scenarios. If you only want to train a VLM for planning on a specific dataset, you can use just that dataset's trajectories and QA (for example, NAVSIM-Traj and NAVSIM-ReCogDrive) to train the VLM and then perform planning; this achieves results close to training on the full dataset. We perform large-scale pretraining to improve the VLM's understanding across diverse driving scenarios.

We open-sourced these high-quality driving QA datasets in the hope of supporting research on Vision-Language-Action (VLA) for driving. If the official maintainers of any dataset prefer that we do not release the JSON annotations, we will remove them immediately. Please note that if you use these datasets, you must comply with the original licenses of the respective datasets. We emphasize that our usage of these datasets is solely for academic research purposes, with no commercial applications involved.

In addition, we provide training data for Bench2Drive: we first fine-tune our models on mixed data and NAVSIM real-world scenarios, then train on Bench2Drive-Traj and Bench2Drive-QA to better adapt to the CARLA driving environment.
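
If you want a quick look at any of the released annotation files, each JSONL file stores one JSON record per line. The loader below is a minimal sketch under that assumption; the file name shown is hypothetical, and the exact field names inside each record may differ across datasets.

```python
import json
from pathlib import Path

def iter_records(jsonl_path: str):
    """Yield one annotation record per non-empty line of a JSONL file."""
    with Path(jsonl_path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Example: peek at the first record of a downloaded annotation file
# (the file name here is hypothetical).
for record in iter_records("navsim_recogdrive.jsonl"):
    print(record)
    break
```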

Qualitative Results on NAVSIM Navtest

We compare ReCogDrive (IL and RL) with Transfuser; the RL variant yields safer and more reliable trajectories in challenging turning scenarios. More visualizations are provided in the supplementary material.

Qualitative Results on Bench2Drive

This visualization demonstrates the driving capabilities of ReCogDrive across diverse scenarios in both real-world settings and the CARLA-simulated Bench2Drive environment. The results show that our model can handle complex maneuvers such as lane following, turning, and interacting with traffic signs, reflecting strong adaptability to various driving contexts.

Contact

If you have any questions, please contact Yongkang Li via email (liyk@hust.edu.cn) or WeChat (liyk_0803).

Acknowledgement

ReCogDrive is greatly inspired by the following outstanding contributions to the open-source community: NAVSIM, DPPO, LightningDiT, DiffusionDrive, Senna, GR00T.

Citation

If you find ReCogDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@article{li2025recogdrive,
  title={ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving},
  author={Li, Yongkang and Xiong, Kaixin and Guo, Xiangyu and Li, Fang and Yan, Sixu and Xu, Gangwei and Zhou, Lijun and Chen, Long and Sun, Haiyang and Wang, Bing and others},
  journal={arXiv preprint arXiv:2506.08052},
  year={2025}
}
```
