Description
We aim to reproduce the Baseline-UniVLA score of 2.795 from the AgiBot-World Manipulation Challenge Leaderboard in our local environment.
We are testing in the registry.agibot.com/genie-sim/open_source:latest Docker container (Isaac Sim 4.5, PyTorch 2.5.1+cu118, TensorFlow 2.10, CUDA 11.8), running /root/workspace/main/AgiBot-World/UniVLA/infer.py (entry point autorun(cfg: RunConfig)) via a batch script that calls ./scripts/autorun.sh.
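For context, a minimal sketch of our batch driver; ./scripts/autorun.sh is the wrapper mentioned above, but passing the task name as a positional argument is our own assumption for illustration:

```python
# Simplified sketch of our batch driver. ./scripts/autorun.sh is the wrapper
# shipped with genie_sim; the per-task argument convention here is hypothetical.
import subprocess

TASKS = [f"iros_clear_the_countertop_waste_{i}" for i in range(1, 6)]

for task in TASKS:
    subprocess.run(["./scripts/autorun.sh", task], check=True)
```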
To reproduce this score, we have two questions:
1. Recommended episodes_per_instance
When evaluating tasks (e.g., iros_clear_the_countertop_waste at genie_sim/source/geniesim/benchmark/ader/eval_tasks/task_gen/iros_clear_the_countertop_waste_1.json), scores vary across runs for the same task. Is this variation expected?

What is the recommended episodes_per_instance to stabilize scores and match the leaderboard’s 2.795?
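For reference, this is how we currently quantify the run-to-run variation for a single task instance (the episode scores below are placeholders, not measured results):

```python
# Sketch: quantifying run-to-run variation for one task instance.
# The episode scores are placeholders, not real measurements.
from statistics import mean, stdev

episode_scores = [0.42, 0.55, 0.38, 0.61, 0.47]  # STEP scores from repeated runs

mu = mean(episode_scores)
sigma = stdev(episode_scores)
# The standard error shrinks as 1/sqrt(n), so a larger episodes_per_instance
# should tighten the estimate around the true per-task score.
stderr = sigma / len(episode_scores) ** 0.5
print(f"mean={mu:.3f}  std={sigma:.3f}  stderr={stderr:.3f}")
```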
2. Final Score Calculation
With 10 task categories, each generating 5 task files (e.g., iros_clear_the_countertop_waste_1.json to _5.json), and assuming episodes_per_instance=10, how is the final score calculated?
We tried a weighted average of STEP scores but could not match 2.795 (we get 1.69 locally). Could you provide the exact score calculation formula? Our attempted aggregation is sketched below.
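A minimal sketch of the aggregation we tried: the nesting (episodes → 5 task files → 10 categories) matches our setup, but the final reduction over categories (sum vs. mean, and any weighting) is an assumption, which is exactly the step we would like confirmed:

```python
# Sketch of the aggregation we tried (our assumption, not the official formula).
# Placeholder scores; real runs use episodes_per_instance=10.
from statistics import mean

# scores[category][task_file] -> per-episode STEP scores
scores = {
    "iros_clear_the_countertop_waste": {
        f"iros_clear_the_countertop_waste_{i}.json": [0.4, 0.5, 0.6]
        for i in range(1, 6)
    },
    # ... 9 more categories in the full benchmark
}

# Average over episodes, then over the 5 task files per category.
category_scores = {
    cat: mean(mean(episodes) for episodes in files.values())
    for cat, files in scores.items()
}

# The unclear step: is the final score the sum over the 10 categories,
# a (weighted?) mean, or something else entirely?
final_score = sum(category_scores.values())
print(f"final score (sum over categories): {final_score:.3f}")
```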
Environment:
Docker: registry.agibot.com/genie-sim/open_source:latest
Scripts: /root/workspace/main/AgiBot-World/UniVLA/infer.py, ./scripts/autorun.sh
Python: 3.10
Frameworks: PyTorch 2.5.1+cu118, TensorFlow 2.10
CUDA: 11.8
Driver: 535.183.01
Expected Behavior:
Reproduce Baseline-UniVLA score of 2.795 locally.
Actual Behavior:
Scores vary, and aggregated STEP scores don’t match 2.795.
Thank you for guidance on reproducing the leaderboard score!