# Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning
Official implementation for continual reinforcement learning (CRL) with Vision-Language-Action (VLA) models. Built on top of RLinf.
- Overview
- Installation
- Downloading Models
- Quick Start
- CRL Experiment Scripts
- Configuration
- Evaluation
- Precomputing Base Logits
- Architecture & Code Structure
- Citation
## Overview

We study continual reinforcement learning for large Vision-Language-Action models and find that simple sequential fine-tuning with LoRA consistently avoids catastrophic forgetting, maintains plasticity, and preserves zero-shot generalization, often matching or surpassing dedicated continual learning methods.
This codebase provides:
- Sequential fine-tuning (Seq. FT)
- CRL baselines — EWC, Experience Replay, Dark Experience Replay, Weight Merge, SLCA
- Multitask oracle — joint training upper bound
- Non-VLA baseline — Simple CNN policy
- Evaluation tools — per-task success, LoRA scaling analysis
Supported models: OpenVLA, OpenVLA-OFT
Supported simulators: LIBERO (Spatial, Object, Goal, Long suites)
Supported algorithms: PPO, GRPO
## Installation

Requirements:

- Linux (tested on Ubuntu 22.04/24.04)
- NVIDIA GPU(s) with CUDA 12.x
- Conda
```shell
# 1. Clone the repository (includes bundled dependencies)
git clone git@github.com:UT-Austin-RobIn/continual-vla-rl.git
cd continual-vla-rl

# 2. Create conda environment
conda create -n vlacrl python=3.11.10 -y
conda activate vlacrl

# 3. Install core dependencies
pip install -r requirements.txt

# 4. Install bundled packages (included in this repo)
cd transformers-openvla-oft && pip install -e . && cd ..
cd openvla-oft && pip install -e . && cd ..
cd LIBERO && pip install -e . && cd ..
```

Optional: for Flash Attention support (recommended for speed), install it separately:

```shell
pip install flash-attn --no-build-isolation
```

Note: the repository includes `transformers-openvla-oft`, `openvla-oft`, and `LIBERO` as bundled directories. Step 4 installs them as editable packages, so no separate cloning is needed.
## Downloading Models

Pre-trained SFT checkpoints are available on Hugging Face. Download them into the `model/` directory at the repository root:
| Model | Suite | Command |
|---|---|---|
| OpenVLA-OFT SFT (Spatial, 1 traj) | LIBERO-Spatial | `hf download Haozhan72/Openvla-oft-SFT-libero-spatial-traj1 --local-dir ./model/Openvla-oft-SFT-libero-spatial-traj1` |
| OpenVLA-OFT SFT (Object, 1 traj) | LIBERO-Object | `hf download Haozhan72/Openvla-oft-SFT-libero-object-traj1 --local-dir ./model/Openvla-oft-SFT-libero-object-traj1` |
| OpenVLA-OFT SFT (10-task, all traj) | LIBERO-10 | `hf download Haozhan72/Openvla-oft-SFT-libero10-trajall --local-dir ./model/Openvla-oft-SFT-libero10-trajall` |
The default configs expect models at `./model/<hf-repo-name>`. To use a custom path, override these keys in the YAML config:

```yaml
rollout:
  model_dir: /your/custom/path
actor:
  checkpoint_load_path: /your/custom/path
tokenizer:
  tokenizer_model: /your/custom/path
```

The LIBERO environment path defaults to `./LIBERO` (the bundled copy). To use a different installation:

```shell
export LIBERO_REPO_PATH=/path/to/your/LIBERO
```

## Quick Start

After installation and model download, run a single-task sequential fine-tuning experiment:
```shell
# Train on task 0 of LIBERO-Spatial
bash examples/crl_experiment/run_embodiment_sequential.sh 0
```

Or train sequentially on tasks 0 through 4:

```shell
bash examples/crl_experiment/run_embodiment_sequential.sh "0,4"
```

## CRL Experiment Scripts

All scripts are in `examples/crl_experiment/` and source `common_functions.sh` for shared utilities.
| Script | Method | Example |
|---|---|---|
| `run_embodiment_sequential.sh` | Sequential Fine-Tuning (Seq. FT) | `bash ... 0` or `bash ... "0,4"` |
| `run_embodiment_ewc.sh` | EWC (Elastic Weight Consolidation) | `bash ... "0,4"` |
| `run_embodiment_er.sh` | Experience Replay | `bash ... 0 0.03` |
| `run_embodiment_der.sh` | Dark Experience Replay | `bash ... 0 0.03` |
| `run_embodiment_weight_merge.sh` | Weight Merge | `bash ... "0,3" 0.8` |
| `run_embodiment_slca.sh` | SLCA (learning-rate schedules) | `bash ... "1,4" "2e-6,2e-6,1e-5"` |
| `run_embodiment_multitask.sh` | Multitask (joint training) | `bash ... "0,2,4"` |
| `run_embodiment_simple_cnn.sh` | Simple CNN baseline | `bash ... 0` |
| `run_embodiment_sequential_reorder.sh` | Seq. FT with custom task order | `bash ... "0,4"` |
Each script has its own argument pattern. Below are the signatures for each:
Sequential Fine-Tuning:

```shell
bash run_embodiment_sequential.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4"
# Example: bash ... 0 "" 15 "" 42
```

EWC:

```shell
bash run_embodiment_ewc.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4"
```

Experience Replay:

```shell
bash run_embodiment_er.sh TASK_ID_OR_RANGE [ER_COEFF=0.03] [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.03
```

Dark Experience Replay:

```shell
bash run_embodiment_der.sh TASK_ID_OR_RANGE [DER_COEFF=0.03] [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.03
```

Weight Merge:

```shell
bash run_embodiment_weight_merge.sh TASK_ID_OR_RANGE MERGE_COEFF [CONFIG_NAME] [SEED]
# Example: bash ... "0,4" 0.8
```

SLCA:

```shell
bash run_embodiment_slca.sh TASK_ID_OR_RANGE LR_STRING [CONFIG_NAME] [SEED]
# Example: bash ... "1,4" "2e-6,2e-6,1e-5"
```

Multitask:

```shell
bash run_embodiment_multitask.sh TASK_IDS [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... "0,2,4"
```

Sequential (custom order):

```shell
bash run_embodiment_sequential_reorder.sh TASK_IDS RUN_ID [MAX_EPOCH] [CONFIG_NAME] [SEED] [INIT_CHECKPOINT]
# Example: bash ... "4,3,2,1,0" reorder_v1
```

Simple CNN:

```shell
bash run_embodiment_simple_cnn.sh TASK_ID_OR_RANGE [CHECKPOINT_PATH] [MAX_EPOCH] [CONFIG_NAME] [SEED]
# Example: bash ... 0
```

LoRA Scale Evaluation:

```shell
bash eval_embodiment_lora_scale.sh CHECKPOINT_LOCATION CURRENT_LORA_SCALE [PREVIOUS_LORA_COEFF] [STEP_NUMBER] [CONFIG_NAME]
# Example: bash ... logs/sequential/task_0 0.5
```
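For intuition, the Weight Merge baseline's merge coefficient can be pictured as a simple per-parameter interpolation between the previous checkpoint and the newly fine-tuned one. This is an illustrative sketch with hypothetical dict-of-lists "checkpoints", not the repo's actual merging code:

```python
# Hypothetical illustration of merging with a coefficient; the real
# run_embodiment_weight_merge.sh may implement this differently.
def merge_weights(prev: dict, new: dict, coeff: float) -> dict:
    """Per-parameter interpolation: merged = coeff * new + (1 - coeff) * prev."""
    assert prev.keys() == new.keys()
    return {
        name: [coeff * n + (1.0 - coeff) * p for p, n in zip(prev[name], new[name])]
        for name in prev
    }

# Toy "checkpoints" (names are made up for the example):
prev_ckpt = {"lora_A": [0.0, 1.0], "lora_B": [2.0, 2.0]}
new_ckpt = {"lora_A": [1.0, 1.0], "lora_B": [0.0, 4.0]}
merged = merge_weights(prev_ckpt, new_ckpt, coeff=0.8)
print(merged["lora_A"])  # [0.8, 1.0]
```

With `coeff=0.8` the merged weights stay close to the new task's solution while retaining a fraction of the previous checkpoint, which is the trade-off the script argument controls.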
Common defaults across scripts:

- `SEED`: `1234`
- `CONFIG_NAME`: `crl_experiment/libero_spatial_grpo_openvlaoft_spatial` (varies for Simple CNN)
- `TASK_ID_OR_RANGE`: a single task (`0`) or a comma-separated range (`"0,4"` trains tasks 0 through 4 sequentially)
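The range syntax can be pictured with a small shell sketch. This is illustrative only; the actual parsing lives in `common_functions.sh` and may differ:

```shell
# Illustrative only: expand a TASK_ID_OR_RANGE like "0,4" into individual
# task IDs, which the sequential scripts then train on one at a time.
expand_tasks() {
  local spec="$1"
  if [[ "$spec" == *,* ]]; then
    # "START,END" -> one task ID per line
    seq "${spec%%,*}" "${spec##*,}"
  else
    # single task ID
    echo "$spec"
  fi
}

expand_tasks "0,4"   # prints 0 1 2 3 4, one per line
expand_tasks 7       # prints 7
```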
```shell
# Evaluate a checkpoint (default: global_step_10)
bash examples/crl_experiment/eval_embodiment.sh logs/sequential/task_0_seed1234

# Evaluate at a specific step
bash examples/crl_experiment/eval_embodiment.sh logs/sequential/task_0_seed1234 20

# Evaluate Simple CNN
bash examples/crl_experiment/eval_embodiment.sh logs/simple_cnn/task_0_seed1234 10 crl_experiment/libero_spatial_grpo_simple_cnn_eval

# LoRA scale evaluation
bash examples/crl_experiment/eval_embodiment_lora_scale.sh logs/sequential/task_0 0.5
```

## Configuration

Configs are YAML files in `examples/embodiment/config/`. For general RLinf configuration options (batch sizes, learning rates, FSDP settings, logging, etc.), see the RLinf documentation.
Below are the CRL-specific parameters used in this work.
Experience Replay / Dark Experience Replay:

```yaml
algorithm:
  use_experience_replay: True    # enable replay buffer
  bc_coeff: 0.03                 # replay loss coefficient
  # DER-specific (logit-based replay):
  use_reference_logits_bc: True
  use_cached_bc_logits: True
```

EWC (Elastic Weight Consolidation):

```yaml
algorithm:
  use_ewc: True
```

Weight Merge:

```shell
# Controlled via a script argument (the merge coefficient):
bash run_embodiment_weight_merge.sh "0,4" 0.8
```

SLCA (per-module learning rates):

```shell
# Controlled via a script argument (comma-separated LRs for vision, LLM, head):
bash run_embodiment_slca.sh "0,4" "2e-6,2e-6,1e-5"
```

Task selection:

```yaml
env:
  fixed_task_ids: [0, 1, 2]   # which task(s) to train on (null means all tasks)
```

## Evaluation

Evaluation runs the trained policy across all tasks (train + held-out) and reports per-task success rates. Results are printed as a dictionary:
```python
{
    'eval/env_info/task_0_success': 1.0,        # success rate for task 0
    'eval/env_info/task_0_success_total': 8.0,  # number of eval episodes
    'eval/env_info/task_1_success': 0.75,
    ...
    'eval/env_info/success_once': 0.775,        # overall success (any point in episode)
    'eval/env_info/success_at_end': 0.625,      # success at final timestep
    'eval/env_info/return': 3.125,              # cumulative return
    'eval/env_info/episode_len': 512.0,         # episode length
}
```

Key metrics:

- `task_X_success`: success rate for task X across evaluation episodes.
- `success_once`: fraction of episodes where the task was completed at least once.
- `success_at_end`: fraction of episodes where the task was completed at the final timestep.
Results are logged to WandB and/or TensorBoard (configurable via `runner.logger.logger_backends`).
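The relationship between the two aggregate metrics can be shown with a small self-contained sketch (hypothetical per-episode flags, not the repo's evaluation code):

```python
# Hypothetical per-episode outcomes for illustration; the real evaluator
# aggregates these inside the environment workers.
# Each tuple: (succeeded at any timestep, succeeded at the final timestep)
episodes = [(True, True), (True, False), (False, False), (True, True)]

success_once = sum(any_t for any_t, _ in episodes) / len(episodes)
success_at_end = sum(end_t for _, end_t in episodes) / len(episodes)

print(success_once)    # 0.75
print(success_at_end)  # 0.5
```

By construction `success_at_end` can never exceed `success_once`: succeeding at the final timestep implies succeeding at least once.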
## Precomputing Base Logits

For Dark Experience Replay (DER), base-model logits can be precomputed and cached to disk to avoid recomputation during training.
Download the modified RLDS dataset into your LIBERO datasets directory:

```shell
hf download openvla/modified_libero_rlds --repo-type dataset --local-dir ./LIBERO/libero/datasets/
```

This places the dataset at `./LIBERO/libero/datasets/`.

```shell
bash examples/embodiment/compute_base_logits_embodiment.sh [CONFIG_NAME]
```

This runs `compute_base_logits_embodied_agent.py`, which generates logits for each task's demonstration trajectories and saves them alongside the dataset. These cached logits are loaded automatically when `use_cached_bc_logits: True` is set in the config.
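To make the role of the cached logits concrete, here is a toy sketch of a DER-style replay term (illustrative math only, not the actual code in `rlinf/custom/loss.py`): the current policy's logits on replayed demonstrations are pulled toward the cached base-model logits.

```python
# Toy DER-style regularizer: mean squared error between the policy's current
# logits and precomputed base-model logits on a replay sample, scaled by the
# replay coefficient. Illustrative only.
def der_replay_loss(current_logits, cached_logits, coeff=0.03):
    assert len(current_logits) == len(cached_logits)
    mse = sum((c - b) ** 2 for c, b in zip(current_logits, cached_logits)) / len(current_logits)
    return coeff * mse

# Cached base-model logits vs. the drifting policy's logits (toy numbers):
loss = der_replay_loss([1.0, -2.0, 0.5], [1.0, -1.0, 0.5], coeff=0.03)
print(round(loss, 4))  # 0.01
```

Because the targets come from disk rather than a live forward pass of the base model, the replay term adds almost no compute during training.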
## Architecture & Code Structure

The training loop proceeds as follows:

- Initialization — Cluster setup, create actor (FSDP), rollout worker, and environment workers from config
- Rollout — Environment interaction + action generation (pipelined across workers)
- Advantage computation — GAE (PPO) or group-relative advantages (GRPO)
- Policy update — LoRA parameter updates via the chosen algorithm
- Repeat — Sync weights to rollout worker, loop
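The advantage-computation step above can be sketched for the GRPO case. This is the standard group-relative formula; RLinf's exact normalization (e.g., epsilon or std handling) may differ:

```python
import statistics

# Sketch of group-relative advantages as used by GRPO-style methods: each
# rollout's return is normalized against its group's mean and std.
def group_relative_advantages(returns, eps=1e-8):
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(r - mean) / (std + eps) for r in returns]

# Four rollouts of the same task, binary success rewards:
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print([round(a, 3) for a in advs])  # [1.0, -1.0, 1.0, -1.0]
```

Successful rollouts in a mostly failing group get large positive advantages, which is what drives learning without a value network.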
Entry point: `examples/embodiment/train_embodied_agent.py`

```
examples/
  embodiment/
    config/                                   # YAML configs
      crl_experiment/                         # CRL-specific configs
    train_embodied_agent.py                   # Training entry point
    eval_embodied_agent.py                    # Evaluation entry point
    compute_base_logits_embodied_agent.py     # Logit precomputation
    run_embodiment.sh                         # Core training launcher
    compute_base_logits_embodiment.sh         # Logit precomputation launcher
  crl_experiment/
    run_embodiment_sequential.sh              # Sequential fine-tuning
    run_embodiment_ewc.sh                     # EWC
    run_embodiment_er.sh                      # Experience Replay
    run_embodiment_der.sh                     # Dark Experience Replay
    run_embodiment_weight_merge.sh            # Weight Merge
    run_embodiment_slca.sh                    # SLCA
    run_embodiment_multitask.sh               # Multitask training
    run_embodiment_simple_cnn.sh              # Simple CNN baseline
    run_embodiment_sequential_reorder.sh      # Custom task order
    eval_embodiment.sh                        # Checkpoint evaluation
    eval_embodiment_lora_scale.sh             # LoRA scale evaluation
    common_functions.sh                       # Shared utilities
rlinf/custom/                                 # Custom modules
  libero_trajectory_dataset.py                # LIBERO dataset loader
  logits_precompute_worker.py                 # Logit caching worker
  loss.py                                     # CRL loss functions (EWC, ER, DER)
  random_action_rollout_worker.py             # Random baseline rollout
  simple_cnn_utils.py                         # CNN policy utilities
```
## Citation

If you use this codebase, please cite our paper:
```bibtex
@article{hu2026vlacrl,
  title={Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning},
  author={Hu, Jiaheng and Shim, Jay and Tang, Chen and Sung, Yoonchang and Liu, Bo and Stone, Peter and Martin-Martin, Roberto},
  journal={arXiv preprint arXiv:2603.11653},
  year={2026},
  url={https://arxiv.org/abs/2603.11653}
}
```

Since this codebase is built on RLinf, we recommend additionally citing:
```bibtex
@article{yu2025rlinf,
  title={RLinf: Flexible and Efficient Large-scale Reinforcement Learning via Macro-to-Micro Flow Transformation},
  author={Yu, Chao and Wang, Yuanqing and Guo, Zhen and Lin, Hao and Xu, Si and Zang, Hongzhi and Zhang, Quanlu and Wu, Yongji and Zhu, Chunyang and Hu, Junhao and Huang, Zixiao and Wei, Mingjie and Xie, Yuqing and Yang, Ke and Dai, Bo and Xu, Zhexuan and Wang, Xiangyuan and Fu, Xu and Liu, Zhihao and Chen, Kang and Liu, Weilin and Liu, Gang and Li, Boxun and Yang, Jianlei and Yang, Zhi and Dai, Guohao and Wang, Yu},
  journal={arXiv preprint arXiv:2509.15965},
  year={2025}
}
```

This project is licensed under the Apache License 2.0. See `LICENSE` for details.