Benchmarking#
MIKASA-Robo-VLA contains 90 tasks whose episode horizons range from 25 to 2160 simulation steps. Training a VLA on all 90 simultaneously is hard because each batch mixes episodes of vastly different lengths, and the wide horizon range also makes fair comparisons between models difficult. To address this, we group tasks into three horizon splits and define a canonical per-split protocol.
Horizon Splits#
Split assignment is deterministic from Max Length in mikasa_robo_vla_envs.csv:
Max Length ≤ 200 → Short
201 ≤ Max Length ≤ 601 → Medium
Max Length > 601 → Long
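The thresholds above can be written as a small function. This is only a sketch: the name `assign_split` is hypothetical and not part of the package, and only the cut-off values come from the rule itself.

```python
def assign_split(max_length: int) -> str:
    """Map a task's Max Length (in simulation steps) to its horizon split."""
    if max_length <= 200:
        return "short"
    if max_length <= 601:
        return "medium"
    return "long"


# Boundary values taken from the thresholds above.
boundary_examples = {n: assign_split(n) for n in (25, 200, 201, 601, 602, 2160)}
```

Because the rule depends only on Max Length, the same function should reproduce the split of every row in the CSV.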
| Split | Tasks | Horizon (steps) | What it tests |
|---|---|---|---|
| Short | 38 | 25 – 200 | Rapid cue encoding and short-term recall. |
| Medium | 30 | 201 – 601 | Sustained working memory over moderately long episodes. |
| Long | 22 | 602 – 2160 | Extended memory, multi-phase reasoning, procedural recall. |
Memory-type distribution per split:
Short (38): Spatial 14 · Object 9 · Negative 9 · Temporal 3 · Tracking 2 · Prospective 1
Medium (30): Object 9 · Capacity 6 · Temporal 4 · Sequential 3 · Procedural 3 · Tracking 2 · Prospective 2 · Checklist 1
Long (22): Capacity 6 · Temporal 5 · Checklist 3 · Procedural 3 · Sequential 3 · Prospective 2
The full per-task table is in mikasa_robo_vla_envs.csv.
Canonical Protocol#
The canonical benchmarking procedure for any VLA model:
1. Choose a split: short, medium, or long.
2. Download the datasets for every task in that split from Hugging Face (one repository = one dataset = one task; available in LeRobotDataset v3 and RLDS formats).
3. Train in multi-task mode on the combined data for the split.
4. Evaluate on every task in that split: 50 episodes per task, seeds 4242424242 … 4242424291.
5. Report three numbers: per-split SR (main metric), full per-task SR table, and per-memory-type SR breakdown within the split. Optionally include mean return per task as a debug number.
See Evaluation Protocol for the exact seed convention, metric definitions, wrapper stack, action chunking rules, JSON output schema, and the reproducibility checklist.
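The seed range and the three reported numbers can be sketched as follows. Several things here are assumptions rather than the package's definitions: the record layout (`env_id`, `memory_type`, `successes`) is hypothetical, and the split SR is taken as the unweighted mean of per-task SR. With 50 episodes for every task, that coincides with the mean over all episodes. The Evaluation Protocol remains the authoritative source for the metric definitions.

```python
from collections import defaultdict
from statistics import mean

START_SEED = 4242424242
NUM_EPISODES = 50

# One seed per episode: 4242424242 ... 4242424291.
seeds = [START_SEED + i for i in range(NUM_EPISODES)]


def summarize(task_results: list) -> dict:
    """Aggregate per-episode success flags into the three reported numbers.

    Each record is a hypothetical dict with keys "env_id", "memory_type",
    and "successes" (one bool per evaluated episode).
    """
    per_task = {
        r["env_id"]: mean([float(s) for s in r["successes"]])
        for r in task_results
    }

    # Group per-task SR by memory type for the breakdown.
    by_memory = defaultdict(list)
    for r in task_results:
        by_memory[r["memory_type"]].append(per_task[r["env_id"]])

    return {
        "sr_split": mean(list(per_task.values())),  # assumed: mean over tasks
        "sr_per_task": per_task,
        "sr_per_memory_type": {m: mean(v) for m, v in by_memory.items()},
    }
```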
Running an Evaluation#
The reference CLI is examples/eval_demo.py. It runs a checkpoint-free
DummyChunkPolicy so you can verify the pipeline end-to-end before
attaching your own model. Every run writes results to a timestamped
subdirectory under --output-dir so successive runs never overwrite each
other.
Smoke-test (one task, one episode, CPU backend):
uv run python examples/eval_demo.py \
--num-episodes 1 --sim-backend cpu \
--output-dir eval_results/dummy
Canonical Short-split run:
uv run python examples/eval_demo.py \
--split short \
--output-dir eval_results/dummy
All 90 tasks:
uv run python examples/eval_demo.py \
--split all \
--output-dir eval_results/dummy
Arbitrary subset of tasks:
uv run python examples/eval_demo.py \
--task RememberColor3-VLA-v0 \
--task ShellGameTouch-VLA-v0 \
--task BatteriesCheckerEasy-VLA-v0 \
--output-dir eval_results/dummy
With per-episode rollout videos:
uv run python examples/eval_demo.py \
--split short \
--save-videos \
--output-dir eval_results/dummy
| Flag | Default | Description |
|---|---|---|
| | — | Evaluate every task in the given horizon split. |
| | — | Evaluate one specific env ID. Repeat for a custom multi-task subset. Accepts any registered Gymnasium env, including IDs not in the canonical CSV. Mutually exclusive with |
| | | Episodes per task. |
| | | Seed for episode |
| | | Action chunk size for the built-in dummy policy. |
| | | Base directory. A timestamped subdir is created automatically. |
| | — | Continue an interrupted run. Pass the existing timestamped directory (e.g. |
| | off | Record a video for every episode of every task. Videos go to |
| | auto | Override the ManiSkill simulation backend. |
| | | Passed directly to |
Resuming an Interrupted Run#
A full-split evaluation can take hours. If a run is interrupted (OOM,
preemption, Ctrl-C), pass --resume to continue with the tasks that had
not finished; tasks that already completed are not re-run:
# The split is inferred automatically from the existing result files.
uv run python examples/eval_demo.py \
--resume eval_results/dummy/short/2026-05-21_18-52-16
# Override or extend — e.g. resume files that mixed splits:
uv run python examples/eval_demo.py \
--resume eval_results/dummy/all/2026-05-21_18-52-16 \
--split all
How it works:
1. Every *.json file in RUN_DIR (except summary.json) is loaded as a completed task result.
2. The completed env IDs are subtracted from the full task list, so only the remaining tasks are evaluated.
3. New result files are written into the same RUN_DIR; summary.json is updated to include both old and new tasks after every completed task.
4. The rich terminal display pre-populates the results table with already-completed tasks so you get a complete view at a glance.
Note
Use the same --start-seed and --num-episodes values you used for
the original run, or the resumed results will not be comparable with the
already-saved ones.
Note
Mixed-split resume directories. When the existing *.json files
come from different splits (e.g. you originally ran --split all),
--resume reads the "split" field of each result and:
if every file reports the same split, that split is selected;
if multiple distinct splits are present, "all" is inferred and the tasks still missing from the full set of 90 are scheduled.
Override either case by passing --split (or --task ENV_ID …)
explicitly.
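The inference rule in the note above can be sketched as a small function. The name `infer_split` is hypothetical. Only the "split" field and the one-split / several-splits behaviour come from the note; raising on an empty directory is a choice made for this sketch.

```python
def infer_split(completed_results: list) -> str:
    """Infer the split to resume from the "split" field of saved results."""
    if not completed_results:
        # Not covered by the note above; this sketch refuses to guess.
        raise ValueError("no completed results to infer a split from")
    splits = {r["split"] for r in completed_results}
    if len(splits) == 1:
        # Every file reports the same split: resume that split.
        return next(iter(splits))
    # Several distinct splits: fall back to the full benchmark.
    return "all"
```

An explicit --split or --task on the command line would take precedence over whatever this returns.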
Rich Terminal Display#
When running eval_demo.py (or any script that uses RichBenchmarkUI)
the terminal shows a live panel with three parts that update after every episode:
╭─── MIKASA-Robo-VLA Benchmark 4/38 tasks ───────────────────────────────╮
│ ⠹ Tasks [4/38] ShellGamePick-VLA-v0 ━━━━━━━━━━╸━━ 4/38 10% 0:01:… │
│ ⠹ Episodes ShellGamePick-VLA-v0 ━━━━━━━━━━━━━ 50/50 100% │
│ ────────────────────────────────────────────────────────────────────── │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━┳━━━━━━━━┓ │
│ ┃ Task ┃ Split ┃ Memory ┃ Eps ┃ SR ┃ Return ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━╇━━━━━━━━┩ │
│ │ RememberColor3-VLA-v0 │ Short │ Object │ 50 │ 82%│ 0.6341 │ │
│ │ ShellGameTouch-VLA-v0 │ Short │ Spatial │ 50 │ 56%│ 0.4120 │ │
│ │ ShellGamePick-VLA-v0 … │ Short │ Spatial │ 32/50 │ 44%│ 0.3012 │ │
│ └──────────────────────────┴───────┴───────────┴───────┴────┴────────┘ │
╰─────────────────────────────────────────────────────────────────────────╯
Top bar: overall task progress with elapsed time and ETA.
Middle bar: per-episode progress for the task currently running; resets to zero when a new task begins.
Table: one row per completed task (with final SR coloured green / yellow / red) plus one live row for the task currently in progress (shown with a (running…) label and values updated every episode).
When --resume is used, already-completed tasks appear in the table
immediately from the first frame so you always see the full picture.
To use RichBenchmarkUI from your own script see the
Python API section below.
Output Layout#
Each run creates its own directory under <output-dir>/<split>/:
eval_results/
  my_model/
    short/
      2026-05-21_15-30-00/          ← one directory per run
        summary.json                ← updated after each completed task
        RememberColor3-VLA-v0.json
        ShellGameTouch-VLA-v0.json
        ...
        videos/                     ← only when --save-videos
          RememberColor3-VLA-v0/
            0.mp4                   ← episode 0
            1.mp4                   ← episode 1
            ...
summary.json is rewritten after every task completes, so you can
inspect it at any point during a long run.
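To inspect results while a run is in progress, it can help to locate the newest run directory and list its per-task files. The helpers below are a sketch with hypothetical names (`latest_run_dir`, `task_result_files`). They rely only on the layout described above: timestamped directory names, one <env_id>.json per task, and a separate summary.json.

```python
from pathlib import Path


def latest_run_dir(split_dir: Path) -> Path:
    """Return the most recent run directory under <output-dir>/<split>/.

    Run directories are named YYYY-MM-DD_HH-MM-SS, so lexicographic order
    matches chronological order.
    """
    runs = sorted(p for p in Path(split_dir).iterdir() if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no run directories under {split_dir}")
    return runs[-1]


def task_result_files(run_dir: Path) -> list:
    """List per-task result files in a run directory, skipping summary.json."""
    return sorted(p for p in Path(run_dir).glob("*.json") if p.name != "summary.json")
```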
Python API#
Import the benchmark module directly to integrate evaluation into your own scripts.
Selecting tasks#
from mikasa_robo_suite.vla.benchmarking import select_benchmark_tasks
tasks = select_benchmark_tasks(split="short") # 38 tasks
tasks = select_benchmark_tasks(split="medium") # 30 tasks
tasks = select_benchmark_tasks(split="long") # 22 tasks
tasks = select_benchmark_tasks(split="all") # all 90 tasks
# Explicit list — any subset, order preserved
tasks = select_benchmark_tasks(env_ids=[
    "RememberColor3-VLA-v0",
    "ShellGameTouch-VLA-v0",
])
# Env IDs not in the canonical CSV are accepted (split="custom",
# memory_type="Unknown") so any registered Gymnasium env can be evaluated.
tasks = select_benchmark_tasks(env_ids=["MyCustomTask-VLA-v0"])
Each returned BenchmarkTask is a
small dataclass carrying everything evaluate_benchmark needs to construct
and run a task:
| Field | Type | Meaning |
|---|---|---|
| | | Gymnasium env ID, e.g. |
| | | Horizon split label: |
| | | Memory category from the CSV ( |
| | | Episode horizon ( |
Running a benchmark#
from mikasa_robo_suite.vla.benchmarking import (
    BenchmarkConfig,
    evaluate_benchmark,
    make_run_dir,
    select_benchmark_tasks,
)

tasks = select_benchmark_tasks(split="short")
config = BenchmarkConfig(
    n_episodes=50,
    save_videos=True,  # record per-episode rollout videos
    sim_backend="gpu",
)
policy = MyPolicy(chunk_size=8)
run_dir = make_run_dir("eval_results/my_model/short")

results, summary = evaluate_benchmark(
    tasks,
    policy,
    config,
    output_dir=run_dir,
    model={"name": "my-vla", "config": {"checkpoint": "path/to/ckpt"}},
    progress=print,
)
print(f"SR_split = {summary['sr_split']:.2%}")
evaluate_benchmark writes <env_id>.json and updates summary.json
after every task, so partial runs are always readable on disk.
make_run_dir(base) creates base/YYYY-MM-DD_HH-MM-SS/ and returns the
path. Call it once at the start of a run to avoid overwriting previous
results.
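The timestamped-directory behaviour described above can be approximated in a few lines. This is an illustrative stand-in under a hypothetical name (`make_timestamped_dir`), not the package's implementation of make_run_dir. The optional `now` argument exists only to make the sketch testable.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def make_timestamped_dir(base: str, now: Optional[datetime] = None) -> Path:
    """Create base/YYYY-MM-DD_HH-MM-SS/ and return its path."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = Path(base) / stamp
    # exist_ok=False: fail loudly rather than write into an existing run.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

Calling such a helper once per run, and passing the returned path everywhere, keeps all files of one run in a single directory.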
Plugging in a real policy#
The benchmark runner expects any object with a chunk_size attribute and a
forward(obs) method:
import torch


class MyPolicy:
    def __init__(self, chunk_size: int = 8):
        self.chunk_size = chunk_size  # actions returned per forward pass

    def forward(self, obs: dict) -> torch.Tensor:
        # obs["rgb"]      shape (1, 128, 128, 6), uint8
        # obs["proprio"]  shape (1, 7), float32
        # info["language_instruction"]  str
        #
        # Return shape: (chunk_size, 7), float32, values in [-1, 1]
        ...
For chunk_size=1 this is identical to standard per-step inference: the
internal FIFO action queue is consumed every step, so policy.forward(obs)
is called on every simulator step.
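The FIFO behaviour described above can be illustrated with a toy queue. This is a sketch of the general action-chunking pattern, not the runner's actual code. `ChunkedActionQueue` and `CountingPolicy` are hypothetical names, and plain Python lists stand in for tensors of shape (chunk_size, 7).

```python
from collections import deque


class ChunkedActionQueue:
    """Serve one action per simulator step from a FIFO of chunked outputs."""

    def __init__(self, policy):
        self.policy = policy
        self.queue = deque()

    def reset(self):
        """Drop leftover actions at the start of a new episode."""
        self.queue.clear()

    def next_action(self, obs):
        # Query the policy only when the previous chunk is exhausted.
        if not self.queue:
            self.queue.extend(self.policy.forward(obs))
        return self.queue.popleft()


class CountingPolicy:
    """Toy policy that records how often forward() is called."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.calls = 0

    def forward(self, obs):
        self.calls += 1
        # One 7-dimensional zero action per chunk slot.
        return [[0.0] * 7 for _ in range(self.chunk_size)]
```

With chunk_size=1 the queue is empty at every step, so forward runs once per step. With chunk_size=4 it runs once every four steps.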
Note
success_once is a latch, not a count. The runner OR-accumulates
info["success"] across an episode:
success_once = success_once or bool(info["success"])
so the recorded value flips from False to True the first time
the task succeeds and stays True for the rest of the episode. See
Evaluation Protocol for full metric definitions.
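A short sketch of the latch over one episode, assuming each step yields an info dict with a boolean "success" entry. The function name `episode_success_once` is hypothetical.

```python
def episode_success_once(step_infos: list) -> bool:
    """OR-accumulate info["success"] over the steps of one episode."""
    success_once = False
    for info in step_infos:
        success_once = success_once or bool(info["success"])
    return success_once


# Success on step 2 that is later lost still counts as a successful episode.
lost_success = episode_success_once(
    [{"success": False}, {"success": True}, {"success": False}]
)
# Several successful steps still give one True, not a count.
repeated_success = episode_success_once([{"success": True}, {"success": True}])
```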
Rich terminal UI#
RichBenchmarkUI is a context
manager that renders a live panel (progress bars + results table) in the
terminal. It exposes three callbacks that wire directly into
evaluate_benchmark().
from mikasa_robo_suite.vla.benchmarking import (
    BenchmarkConfig,
    RichBenchmarkUI,
    evaluate_benchmark,
    make_run_dir,
    select_benchmark_tasks,
)

tasks = select_benchmark_tasks(split="short")
config = BenchmarkConfig(n_episodes=50)
policy = MyPolicy(chunk_size=8)
run_dir = make_run_dir("eval_results/my_model/short")

with RichBenchmarkUI(tasks, config.n_episodes) as ui:
    results, summary = evaluate_benchmark(
        tasks,
        policy,
        config,
        output_dir=run_dir,
        task_start_callback=ui.on_task_start,
        episode_callback=ui.on_episode_done,
        task_done_callback=ui.on_task_done,
    )
To pre-populate the table when resuming a partial run, pass
initial_results — a list of already-completed task result dicts loaded
from disk:
import json
from pathlib import Path

resume_dir = Path("eval_results/my_model/short/2026-05-21_18-52-16")

# Load every per-task result file; summary.json is not a task result.
completed = [
    json.loads(p.read_text())
    for p in sorted(resume_dir.glob("*.json"))
    if p.stem != "summary"
]

# Skip tasks that already have a result on disk.
completed_ids = {r["env_id"] for r in completed}
remaining = [t for t in tasks if t.env_id not in completed_ids]

with RichBenchmarkUI(remaining, config.n_episodes, initial_results=completed) as ui:
    results, summary = evaluate_benchmark(
        remaining,
        policy,
        config,
        output_dir=resume_dir,
        task_start_callback=ui.on_task_start,
        episode_callback=ui.on_episode_done,
        task_done_callback=ui.on_task_done,
        initial_results=completed,
    )
RichBenchmarkUI requires the rich package (pip install rich).
It is listed as a dependency of mikasa-robo-suite so it is installed
automatically when you install the package.
Flexible Usage#
The three-split protocol is the canonical procedure, but the 90 individual datasets let you compose any training mixture:
Single-task: download one dataset, train, evaluate on that task alone. Useful for ablations or environment-specific analyses.
Cross-split: train on Short, evaluate on Medium to test generalisation. Must be labelled explicitly; not comparable to canonical results.
Full-benchmark: train on all 90 tasks at once.
In all cases use the same eval_demo.py runner and the same seeding
convention so that individual task results remain comparable to the canonical
baseline.