Yongkang Li1,2*, Kaixin Xiong2*, Xiangyu Guo1,2, Fang Li2, Sixu Yan1, Gangwei Xu1,2,
Lijun Zhou2, Long Chen2, Haiyang Sun2†, Bing Wang2, Kun Ma2, Guang Chen2,
Hangjun Ye2, Wenyu Liu1, Xinggang Wang1✉
1Huazhong University of Science and Technology
2Xiaomi EV
(*) Equal contribution. (†) Project leader. (✉) Corresponding author.
arXiv 2025
- Sept. 30th, 2025: We have updated our latest paper with more model details, experiments, and comprehensive visualizations. Meanwhile, we fixed the unintended NumPy issue 🐛 that previously caused inconsistencies in the training metric cache. The code now ensures reproducible and consistent results. Special thanks to the discussion in issue #10 for bringing this up!
- Aug. 24th, 2025: We have released all driving pretraining QA, including 12 driving datasets and our own annotated NavSim data. We have rewritten the scoring, filtering, and evaluation for the open-source data. If it's helpful to you, feel free to star and cite our work! 🚗💨
- Aug. 21st, 2025: We released the initial version of the code and weights on NAVSIM, along with documentation and training/evaluation scripts. We will also release our revised paper and the pretraining datasets later this month or next month. Please stay tuned! ☕️
- Jun. 11th, 2025: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
- Release Bench2Drive, DriveLM, NAVSIM 2.0, and DriveBench evaluation frameworks
- Release Paper
- Release Full Models and Training/Evaluation Framework
- Release Full Driving QA Datasets
- Release Updated Paper
- News
- Updates
- Table of Contents
- Abstract
- Getting Started
- Checkpoint
- Driving Pretraining Datasets
- Qualitative Results on NAVSIM Navtest
- Qualitative Results on Bench2Drive
- Contact
- Acknowledgement
- Citation
Recent studies have explored leveraging the world knowledge and cognitive capabilities of Vision-Language Models (VLMs) to address the long-tail problem in end-to-end autonomous driving. However, existing methods typically formulate trajectory planning as a language modeling task, where physical actions are output in the language space, potentially leading to issues such as format-violating outputs, infeasible actions, and slow inference speeds. In this paper, we propose ReCogDrive, a novel Reinforced Cognitive framework for end-to-end autonomous Driving, unifying driving understanding and planning by integrating an autoregressive model with a diffusion planner. First, to instill human driving cognition into the VLM, we introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers through three stages: generation, refinement, and quality control. Building on this cognitive foundation, we then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner to efficiently generate continuous and stable trajectories. Furthermore, to enhance driving safety and reduce collisions, we introduce a Diffusion Group Relative Policy Optimization (DiffGRPO) stage, reinforcing the planner for enhanced safety and comfort. Extensive experiments on the NAVSIM and Bench2Drive benchmarks demonstrate that ReCogDrive achieves state-of-the-art performance. Additionally, qualitative results across diverse driving scenarios and DriveBench highlight the model's scene comprehension. Code and models are available at ReCogDrive GitHub Repository.
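For readers unfamiliar with group-relative policy optimization, the sketch below illustrates the generic GRPO-style advantage computation: a group of trajectories is sampled for the same scene, each sample is scored (e.g., with a PDMS-like reward), and each advantage is the sample's reward normalized by the group's mean and standard deviation. This is a minimal illustration of the general technique with our own placeholder naming, not the paper's exact DiffGRPO objective or reward; please refer to the paper for the actual formulation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each sampled trajectory's reward
    against the statistics of its own group (same scene / same prompt)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical usage: score G trajectories sampled by the diffusion planner
# for one scene. The reward values below are placeholders, not real PDMS numbers.
if __name__ == "__main__":
    rewards = np.array([0.81, 0.92, 0.67, 0.88])
    print(group_relative_advantages(rewards))  # higher-reward samples receive positive advantages
```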
- Download NAVSIM datasets following the official instructions
- Preparation of ReCogDrive environment
- ReCogDrive Training and Evaluation
Results on NAVSIM
| Method | Model Size | Training Stage | PDMS | Weight Download |
|---|---|---|---|---|
| ReCogDrive-Base-VLM | 2B | Stage 1 | 84.1 | Model |
| ReCogDrive-Base-IL | 2B + 35M | Stage 1&2 | 86.5 | Model |
| ReCogDrive-Base-RL | 2B + 35M | Stage 1&2&3 | 90.8 | Model |
| ReCogDrive-Large-VLM | 8B | Stage 1 | 86.4 | Model |
| ReCogDrive-Large-IL | 8B + 35M | Stage 1&2 | 86.5 | Model |
| ReCogDrive-Large-RL | 8B + 35M | Stage 1&2&3 | 90.4 | Model |
Results on Bench2Drive
| Method | Efficiency ↑ | Comfort ↑ | Success Rate (%) ↑ | Driving Score ↑ | Merging (%) ↑ | Overtaking (%) ↑ | Emerg. Brake (%) ↑ | Give Way (%) ↑ | Traf. Sign (%) ↑ | Mean (%) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| ReCogDrive | 138.18 | 17.45 | 45.45 | 71.36 | 29.73 | 20.00 | 69.09 | 20.00 | 71.34 | 42.03 |

The first four columns are closed-loop metrics; the remaining six are multi-ability test results (%).
Results on DriveLM and DriveBench
| Method | DriveLM (GPT-Score) | LingoQA (Lingo-Judge) | DriveBench Percep. | DriveBench Predict. | DriveBench Plan. | DriveBench Behav. | DriveBench Avg. |
|---|---|---|---|---|---|---|---|
| ReCogDrive | 67.30 | 67.20 | 64.95 | 49.34 | 70.20 | 42.36 | 56.71 |
| Datasets | Source | Rewritten and Filtered Annotations (JSONL) |
|---|---|---|
| NAVSIM-Traj | - | JSONL |
| NAVSIM-ReCogDrive | - | JSONL |
| DriveLM | link | JSONL |
| NuInstruct | link | JSONL |
| NuScenes-QA | link | JSONL |
| OmniDrive | link | JSONL |
| Senna | link | JSONL |
| LingoQA | link | JSONL |
| Drama | link | JSONL |
| MapLM | link | JSONL |
| Talk2Car | link | JSONL |
| DriveGPT4 | link | JSONL |
| CODA-LM | link | JSONL |
| SUTD | link | JSONL |
| Bench2Drive-Traj | - | JSONL |
| Bench2Drive-QA | link | JSONL |
ReCogDrive is pretrained on 12 open-source driving datasets. For most of them, we leveraged Qwen2.5-VL-72B to re-annotate the answers, applied standardized scoring, and filtered the results to obtain 12 high-quality QA datasets. In addition, we built an automated annotation pipeline on NAVSIM, generating 752k QA pairs. These resources enable VLMs to better adapt to driving scenarios. If you only want to train a VLM for planning on a specific dataset, you can use just that dataset's trajectories and QA (for example, NAVSIM-Traj and NAVSIM-ReCogDrive) to train the VLM and then perform planning; this achieves results close to training on the full dataset. We perform large-scale pretraining to improve the VLM's understanding across diverse driving scenarios.
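As a rough illustration of how the released JSONL annotations could be consumed, here is a minimal sketch that streams one file and keeps entries above a quality-score threshold. The file name and the field names ("question", "answer", "score") are assumptions for illustration only; please check the released files for the actual schema.

```python
import json

def load_filtered_qa(jsonl_path: str, min_score: float = 0.0) -> list[dict]:
    """Read driving QA pairs from a JSONL file, keeping entries whose
    quality score is at least `min_score`. Field names are assumed."""
    samples = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            if item.get("score", 1.0) >= min_score:
                samples.append(item)
    return samples

# Example (hypothetical file name):
# qa = load_filtered_qa("navsim_recogdrive_qa.jsonl", min_score=0.5)
```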
We open-sourced these high-quality driving QA datasets in the hope of supporting research on Vision-Language-Action (VLA) for driving. If the official maintainers of any dataset prefer that we do not release the JSON annotations, we will remove them immediately. Please note that if you use these datasets, you must comply with the original licenses of the respective datasets. We emphasize that our usage of these datasets is solely for academic research purposes, with no commercial applications involved.
In addition, we provide training data for Bench2Drive. We first fine-tune our models on a mixture of driving QA and NAVSIM real-world scenarios, then continue training on Bench2Drive-Traj and Bench2Drive-QA to better adapt to the CARLA driving environment.
We compare ReCogDrive (IL and RL) with TransFuser; the RL-tuned planner yields safer and more reliable trajectories in challenging turning scenarios. More visualizations are in the supplementary material.
This visualization demonstrates the driving capabilities of ReCogDrive across diverse scenarios in both real-world settings and the CARLA-simulated Bench2Drive environment. The results show that our model can handle complex maneuvers such as lane following, turning, and interacting with traffic signs, reflecting strong adaptability to various driving contexts.
If you have any questions, please contact Yongkang Li via email (liyk@hust.edu.cn) or WeChat (liyk_0803).
ReCogDrive is greatly inspired by the following outstanding contributions to the open-source community: NAVSIM, DPPO, LightningDiT, DiffusionDrive, Senna, GR00T.
If you find ReCogDrive useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@article{li2025recogdrive,
title={ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving},
author={Li, Yongkang and Xiong, Kaixin and Guo, Xiangyu and Li, Fang and Yan, Sixu and Xu, Gangwei and Zhou, Lijun and Chen, Long and Sun, Haiyang and Wang, Bing and others},
journal={arXiv preprint arXiv:2506.08052},
year={2025}
}


