# Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning
Official implementation for continual reinforcement learning (CRL) with Vision-Language-Action (VLA) models. Built on top of RLinf.
- Overview
- Installation
- Downloading Models
- Quick Start
- CRL Experiment Scripts
- Configuration
- Evaluation
- Precomputing Base Logits
- Architecture & Code Structure
- Citation
## Overview

We study continual reinforcement learning for large Vision-Language-Action models and find that simple sequential fine-tuning with LoRA consistently avoids catastrophic forgetting, maintains plasticity, and preserves zero-shot generalization, often matching or surpassing dedicated continual learning methods.
This codebase provides:
- Sequential fine-tuning (Seq. FT)
- CRL baselines — EWC, Experience Replay, Dark Experience Replay, Weight Merge, SLCA
- Multitask oracle — joint training upper bound
- Non-VLA baseline — Simple CNN policy
- Evaluation tools — per-task success, LoRA scaling analysis
Supported models: OpenVLA, OpenVLA-OFT
Supported simulators: LIBERO (Spatial, Object, Goal, Long suites)
Supported algorithms: PPO, GRPO
## Installation

Requirements:

- Linux (tested on Ubuntu 22.04/24.04)
- NVIDIA GPU(s) with CUDA 12.x
- Conda
```shell
# 1. Clone the repository (includes bundled dependencies)
git clone git@github.com:UT-Austin-RobIn/continual-vla-rl.git
cd continual-vla-rl

# 2. Create conda environment
conda create -n vlacrl python=3.11.10 -y
conda activate vlacrl

# 3. Install core dependencies
pip install -r requirements.txt

# 4. Install bundled packages (included in this repo)
cd transformers-openvla-oft && pip install -e . && cd ..
cd openvla-oft && pip install -e . && cd ..
cd LIBERO && pip install -e . && cd ..
```

Optional: for Flash Attention support (recommended for speed), install it separately:

```shell
pip install flash-attn --no-build-isolation
```

Note: the repository includes `transformers-openvla-oft`, `openvla-oft`, and `LIBERO` as bundled directories. Step 4 installs them as editable packages, so no separate cloning is needed.
## Downloading Models

Pre-trained SFT checkpoints are available on Hugging Face. Download them into the `model/` directory at the repository root:
| Model | Suite | Command |
|---|---|---|
| OpenVLA-OFT SFT (Spatial, 1 traj) | LIBERO-Spatial | `hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir ./model/Openvla-oft-SFT-libero-spatial-traj1` |
| OpenVLA-OFT SFT (Object, 1 traj) | LIBERO-Object | `hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir ./model/Openvla-oft-SFT-libero-object-traj1` |
| OpenVLA-OFT SFT (10-task, all traj) | LIBERO-10 | `hf download Haozhan72/Openvla-oft-SFT-libero10-trajall --local-dir ./model/Openvla-oft-SFT-libero10-trajall` |
The default configs expect models at `./model/<hf-repo-name>`. To use a custom path, override these keys in the YAML config:

```yaml
rollout:
  model_dir: /your/custom/path
actor:
  checkpoint_load_path: /your/custom/path
tokenizer:
  tokenizer_model: /your/custom/path
```

The LIBERO environment path defaults to `./LIBERO` (the bundled copy). To use a different installation:

```shell
export LIBERO_REPO_PATH=/path/to/your/LIBERO
```

## Quick Start

After installation and model download, run a single-task sequential fine-tuning experiment:
```shell
# Train on task 0 of LIBERO-Spatial
bash examples/crl_experiment/run_embodiment_sequential.sh 0
```

Or train sequentially on tasks 0 through 4:

```shell
bash examples/crl_experiment/run_embodiment_sequential.sh "0,4"
```

## CRL Experiment Scripts

All scripts are in `examples/crl_experiment/` and source `common_functions.sh` for shared utilities.
| Script | Method | Example |
|---|---|---|
| `run_embodiment_sequential.sh` | Sequential Fine-Tuning (Seq. FT) | `bash ... 0` or `bash ... "0,4"` |
| `run_embodiment_ewc.sh` | EWC (Elastic Weight Consolidation) | `bash ... "0,4"` |
| `run_embodiment_er.sh` | Experience Replay | `bash ... 0 0.03` |
| `run_embodiment_der.sh` | Dark Experience Replay | `bash ... 0 0.03` |
| `run_embodiment_weight_merge.sh` | Weight Merge | `bash ... "0,3" 0.8` |
| `run_embodiment_slca.sh` | SLCA (learning-rate schedules) | `bash ... "1,4" "2e-6,2e-6,1e-5"` |
| `run_embodiment_multitask.sh` | Multitask (joint training) | `bash ... "0,2,4"` |
| `run_embodiment_simple_cnn.sh` | Simple CNN baseline | `bash ... 0` |
| `run_embodiment_sequential_reorder.sh` | Seq. FT with custom task order | `bash ... "0,4"` |
Each script has its own argument pattern. Below are the signatures for each:
Sequential Fine-Tuning:

```shell
bash run_embodiment_sequential.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4"
# Example: bash ... 0 "" 15 "" 42
```

EWC:

```shell
bash run_embodiment_ewc.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4"
```

Experience Replay:

```shell
bash run_embodiment_er.sh TASK_ID_OR_RANGE [ER_COEFF=0.03] [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.03
```

Dark Experience Replay:

```shell
bash run_embodiment_der.sh TASK_ID_OR_RANGE [DER_COEFF=0.03] [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.03
```

Weight Merge:

```shell
bash run_embodiment_weight_merge.sh TASK_ID_OR_RANGE MERGE_COEFF [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.8
```

SLCA:

```shell
bash run_embodiment_slca.sh TASK_ID_OR_RANGE LR_STRING [CONFIG_NAME] [SEED]
# Example: bash ... "1,4" "2e-6,2e-6,1e-5"
```

Multitask:

```shell
bash run_embodiment_multitask.sh TASK_IDS [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,2,4"
```

Sequential (custom order):

```shell
bash run_embodiment_sequential_reorder.sh TASK_IDS RUN_ID [MAX_EPOCH] [CONFIG_NAME] [SEED] [INIT_CHECKPOINT]
# Example: bash ... "4,3,2,1,0" reorder_v1
```

Simple CNN:

```shell
bash run_embodiment_simple_cnn.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... 0
```

LoRA Scale Evaluation:

```shell
bash eval_embodiment_lora_scale.sh CHECKPOINT_LOCATION CURRENT_LORA_SCALE [PREVIOUS_LORA_COEFF] [STEP_NUMBER] [CONFIG_NAME]
# Example: bash ... logs/sequential/task_0 0.5
```
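For intuition, the Weight Merge baseline's merge coefficient can be pictured as a simple per-parameter interpolation between the previous checkpoint and the newly fine-tuned one. This is an illustrative sketch with hypothetical dict-of-lists "checkpoints", not the repo's actual merging code:

```python
# Hypothetical illustration of merging with a coefficient; the real
# run_embodiment_weight_merge.sh may implement this differently.
def merge_weights(prev: dict, new: dict, coeff: float) -> dict:
    """Per-parameter interpolation: merged = coeff * new + (1 - coeff) * prev."""
    assert prev.keys() == new.keys()
    return {
        name: [coeff * n + (1.0 - coeff) * p for p, n in zip(prev[name], new[name])]
        for name in prev
    }

# Toy "checkpoints" (names are made up for the example):
prev_ckpt = {"lora_A": [0.0, 1.0], "lora_B": [2.0, 2.0]}
new_ckpt = {"lora_A": [1.0, 1.0], "lora_B": [0.0, 4.0]}
merged = merge_weights(prev_ckpt, new_ckpt, coeff=0.8)
print(merged["lora_A"])  # [0.8, 1.0]
```

With `coeff=0.8` the merged weights stay close to the new task's solution while retaining a fraction of the previous checkpoint, which is the trade-off the script argument controls.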
Common defaults across scripts:

- `SEED`: `1234`
- `CONFIG_NAME`: `crl_experiment/libero_spatial_grpo_openvlaoft_spatial` (varies for Simple CNN)
- `TASK_ID_OR_RANGE`: a single task (`0`) or a comma-separated range (`"0,4"` trains tasks 0 through 4 sequentially)
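The range syntax can be pictured with a small shell sketch. This is illustrative only; the actual parsing lives in `common_functions.sh` and may differ:

```shell
# Illustrative only: expand a TASK_ID_OR_RANGE like "0,4" into individual
# task IDs, which the sequential scripts then train on one at a time.
expand_tasks() {
  local spec="$1"
  if [[ "$spec" == *,* ]]; then
    # "START,END" -> one task ID per line
    seq "${spec%%,*}" "${spec##*,}"
  else
    # single task ID
    echo "$spec"
  fi
}

expand_tasks "0,4"   # prints 0 1 2 3 4, one per line
expand_tasks 7       # prints 7
```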
```shell
# Evaluate a checkpoint (default: global_step_10)
bash examples/crl_experiment/eval_embodiment.sh logs/sequential/task_0_seed1234

# Evaluate at a specific step
bash examples/crl_experiment/eval_embodiment.sh logs/sequential/task_0_seed1234 20

# Evaluate Simple CNN
bash examples/crl_experiment/eval_embodiment.sh logs/simple_cnn/task_0_seed1234 10 crl_experiment/libero_spatial_grpo_simple_cnn_eval

# LoRA scale evaluation
bash examples/crl_experiment/eval_embodiment_lora_scale.sh logs/sequential/task_0 0.5
```

## Configuration

Configs are YAML files in `examples/embodiment/config/`. For general RLinf configuration options (batch sizes, learning rates, FSDP settings, logging, etc.), see the RLinf documentation.
Below are the CRL-specific parameters used in this work.
Experience Replay / Dark Experience Replay:

```yaml
algorithm:
  use_experience_replay: True    # enable replay buffer
  bc_coeff: 0.03                 # replay loss coefficient
  # DER-specific (logit-based replay):
  use_reference_logits_bc: True
  use_cached_bc_logits: True
```

EWC (Elastic Weight Consolidation):

```yaml
algorithm:
  use_ewc: True
```

Weight Merge:

```shell
# Controlled via a script argument (the merge coefficient):
bash run_embodiment_weight_merge.sh "0,4" 0.8
```

SLCA (per-module learning rates):

```shell
# Controlled via a script argument (comma-separated LRs for vision, LLM, head):
bash run_embodiment_slca.sh "0,4" "2e-6,2e-6,1e-5"
```

Task selection:

```yaml
env:
  fixed_task_ids: [0, 1, 2]   # which task(s) to train on (null means all tasks)
```

## Evaluation

Evaluation runs the trained policy across all tasks (train + held-out) and reports per-task success rates. Results are printed as a dictionary:
```python
{
    'eval/env_info/task_0_success': 1.0,        # success rate for task 0
    'eval/env_info/task_0_success_total': 8.0,  # number of eval episodes
    'eval/env_info/task_1_success': 0.75,
    ...
    'eval/env_info/success_once': 0.775,        # overall success (any point in episode)
    'eval/env_info/success_at_end': 0.625,      # success at final timestep
    'eval/env_info/return': 3.125,              # cumulative return
    'eval/env_info/episode_len': 512.0,         # episode length
}
```

Key metrics:

- `task_X_success`: success rate for task X across evaluation episodes.
- `success_once`: fraction of episodes where the task was completed at least once.
- `success_at_end`: fraction of episodes where the task was completed at the final timestep.
Results are logged to WandB and/or TensorBoard (configurable via `runner.logger.logger_backends`).
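The relationship between the two aggregate metrics can be shown with a small self-contained sketch (hypothetical per-episode flags, not the repo's evaluation code):

```python
# Hypothetical per-episode outcomes for illustration; the real evaluator
# aggregates these inside the environment workers.
# Each tuple: (succeeded at any timestep, succeeded at the final timestep)
episodes = [(True, True), (True, False), (False, False), (True, True)]

success_once = sum(any_t for any_t, _ in episodes) / len(episodes)
success_at_end = sum(end_t for _, end_t in episodes) / len(episodes)

print(success_once)    # 0.75
print(success_at_end)  # 0.5
```

By construction `success_at_end` can never exceed `success_once`: succeeding at the final timestep implies succeeding at least once.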
## Precomputing Base Logits

For Dark Experience Replay (DER), base-model logits can be precomputed and cached to disk to avoid recomputation during training.
Download the modified RLDS dataset into your LIBERO datasets directory:

```shell
hf download openvla/modified_libero_rlds --repo-type dataset --local-dir ./LIBERO/libero/datasets/
```

This places the dataset at `./LIBERO/libero/datasets/`.

```shell
bash examples/embodiment/compute_base_logits_embodiment.sh [CONFIG_NAME]
```

This runs `compute_base_logits_embodied_agent.py`, which generates logits for each task's demonstration trajectories and saves them alongside the dataset. These cached logits are loaded automatically when `use_cached_bc_logits: True` is set in the config.
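To make the role of the cached logits concrete, here is a toy sketch of a DER-style replay term (illustrative math only, not the actual code in `rlinf/custom/loss.py`): the current policy's logits on replayed demonstrations are pulled toward the cached base-model logits.

```python
# Toy DER-style regularizer: mean squared error between the policy's current
# logits and precomputed base-model logits on a replay sample, scaled by the
# replay coefficient. Illustrative only.
def der_replay_loss(current_logits, cached_logits, coeff=0.03):
    assert len(current_logits) == len(cached_logits)
    mse = sum((c - b) ** 2 for c, b in zip(current_logits, cached_logits)) / len(current_logits)
    return coeff * mse

# Cached base-model logits vs. the drifting policy's logits (toy numbers):
loss = der_replay_loss([1.0, -2.0, 0.5], [1.0, -1.0, 0.5], coeff=0.03)
print(round(loss, 4))  # 0.01
```

Because the targets come from disk rather than a live forward pass of the base model, the replay term adds almost no compute during training.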
## Architecture & Code Structure

The training loop proceeds as follows:

- Initialization — Cluster setup, create actor (FSDP), rollout worker, and environment workers from config
- Rollout — Environment interaction + action generation (pipelined across workers)
- Advantage computation — GAE (PPO) or group-relative advantages (GRPO)
- Policy update — LoRA parameter updates via the chosen algorithm
- Repeat — Sync weights to rollout worker, loop
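The advantage-computation step above can be sketched for the GRPO case. This is the standard group-relative formula; RLinf's exact normalization (e.g., epsilon or std handling) may differ:

```python
import statistics

# Sketch of group-relative advantages as used by GRPO-style methods: each
# rollout's return is normalized against its group's mean and std.
def group_relative_advantages(returns, eps=1e-8):
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

# Four rollouts of the same task, binary success rewards:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advs])  # [1.0, -1.0, 1.0, -1.0]
```

Successful rollouts in a mostly failing group get large positive advantages, which is what drives learning without a value network.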
Entry point: `examples/embodiment/train_embodied_agent.py`

```
examples/
  embodiment/
    config/                                   # YAML configs
      crl_experiment/                         # CRL-specific configs
    train_embodied_agent.py                   # Training entry point
    eval_embodied_agent.py                    # Evaluation entry point
    compute_base_logits_embodied_agent.py     # Logit precomputation
    run_embodiment.sh                         # Core training launcher
    compute_base_logits_embodiment.sh         # Logit precomputation launcher
  crl_experiment/
    run_embodiment_sequential.sh              # Sequential fine-tuning
    run_embodiment_ewc.sh                     # EWC
    run_embodiment_er.sh                      # Experience Replay
    run_embodiment_der.sh                     # Dark Experience Replay
    run_embodiment_weight_merge.sh            # Weight Merge
    run_embodiment_slca.sh                    # SLCA
    run_embodiment_multitask.sh               # Multitask training
    run_embodiment_simple_cnn.sh              # Simple CNN baseline
    run_embodiment_sequential_reorder.sh      # Custom task order
    eval_embodiment.sh                        # Checkpoint evaluation
    eval_embodiment_lora_scale.sh             # LoRA scale evaluation
    common_functions.sh                       # Shared utilities
rlinf/custom/                                 # Custom modules
  libero_trajectory_dataset.py                # LIBERO dataset loader
  logits_precompute_worker.py                 # Logit caching worker
  loss.py                                     # CRL loss functions (EWC, ER, DER)
  random_action_rollout_worker.py             # Random baseline rollout
  simple_cnn_utils.py                         # CNN policy utilities
```
## Citation

If you use this codebase, please cite our paper:
```bibtex
@article{hu2026vlacrl,
  title={Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning},
  author={Hu, Jiaheng and Shim, Jay and Tang, Chen and Sung, Yoonchang and Liu, Bo and Stone, Peter and Martin-Martin, Roberto},
  journal={arXiv preprint arXiv:2603.11653},
  year={2026},
  url={https://arxiv.org/abs/2603.11653}
}
```

Since this codebase is built on RLinf, we recommend additionally citing:
```bibtex
@article{yu2025rlinf,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
  author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and Huang, Zixiao and Wei, Mingjie and Xie, Yuqing and Yang, Ke and Dai, Bo and Xu, Zhexuan and Wang, Xiangyuan and Fu, Xu and Liu, Zhihao and Chen, Kang and Liu, Weilin and Liu, Gang and Li, Boxun and Yang, Jianlei and Yang, Zhi and Dai, Guohao and Wang, Yu},
  journal={arXiv preprint arXiv:2509.15965},
  year={2025}
}
```

This project is licensed under the Apache License 2.0. See `LICENSE` for details.