Evaluation Protocol#
This page is the canonical specification for evaluating a trained VLA model on MIKASA-Robo-VLA. For how to run evaluations in practice — CLI flags, output layout, Python API — see Benchmarking.
Protocol at a Glance#
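| Parameter | Canonical value |
|---|---|
| Split | Chosen explicitly by the user |
| Episodes per task | 50 |
| Seeds | 4242424242 + i for i in 0, 1, …, 49 |
| Parallel envs | num_envs=1 |
| Wrapper stack | apply_mikasa_vla_wrappers(env, include_overlays=False) |
| Main metric | success_once |
| Debug metric | return |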
Observation and Action Format#
After gym.make + apply_mikasa_vla_wrappers(env, include_overlays=False),
obs always contains obs["rgb"] ((B, 128, 128, 6) uint8 — two
cameras concatenated on the channel axis) and obs["proprio"]
((B, 7) float32 — absolute EEF pose + gripper opening). The
Rotate* family additionally exposes obs["task_cue"] (target angle
for the RL oracle); for all other tasks this key is dropped by the
canonical helper. oracle_info is never exposed in the VLA-facing
observation. The action is a (B, 7) float32 tensor in the
normalised pd_ee_delta_pose action space; all seven components are
clamped to [-1, 1].
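The layout described above can be checked with a few assertions before a policy is wired in. The following is a minimal sketch, not part of the benchmark: it uses NumPy arrays as stand-ins for the real observation tensors, the helper names `check_vla_observation` and `clamp_action` are hypothetical, and it assumes each camera occupies three contiguous channels of obs["rgb"].

```python
import numpy as np


def check_vla_observation(obs):
    """Validate the documented VLA-facing layout and split the two cameras."""
    rgb, proprio = obs["rgb"], obs["proprio"]
    if rgb.dtype != np.uint8 or rgb.shape[1:] != (128, 128, 6):
        raise ValueError(f"unexpected rgb layout: {rgb.dtype} {rgb.shape}")
    if proprio.dtype != np.float32 or proprio.shape[1:] != (7,):
        raise ValueError(f"unexpected proprio layout: {proprio.dtype} {proprio.shape}")
    if rgb.shape[0] != proprio.shape[0]:
        raise ValueError("rgb and proprio disagree on the batch size B")
    if "oracle_info" in obs:
        raise ValueError("oracle_info must not reach the policy")
    # Assumption: channels 0-2 belong to one camera and 3-5 to the other.
    return rgb[..., :3], rgb[..., 3:]


def clamp_action(action):
    """Clamp all seven action components to [-1, 1] as float32."""
    return np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)


batch = 2
obs = {
    "rgb": np.zeros((batch, 128, 128, 6), dtype=np.uint8),
    "proprio": np.zeros((batch, 7), dtype=np.float32),
}
first_cam, second_cam = check_vla_observation(obs)
action = clamp_action(np.full((batch, 7), 1.5))
```

Observation and Action Space remains the authoritative reference; this check only mirrors the shapes and dtypes quoted in this section.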
The canonical, field-by-field specification — value ranges, units, the
exact gripper conventions, and the difference between
obs["proprio"][..., 6] and action[6] — lives in
Observation and Action Space. All evaluation pipelines (online runs, dataset
collection, published datasets) use exactly the same layout, so the
Observation and Action Space reference is authoritative for every context.
Important
A canonical leaderboard policy consumes only the camera images,
obs["proprio"], and info["language_instruction"]. oracle_info
is privileged and stripped from the VLA-facing observation. task_cue
exists solely as an RL-training signal for the Rotate* family — the
same target angle is also embedded in the natural-language instruction,
so VLAs should ignore the task_cue key entirely. Reading either
field disqualifies a run from the canonical leaderboard.
Action chunking#
VLA policies typically predict a chunk of K actions per forward pass.
The benchmark runner consumes them with a collections.deque (FIFO):
1. At the start of every episode a fresh, empty deque is created.
2. On each simulator step, if the deque is empty, policy.forward(obs) is called to produce (K, action_dim) actions, which are pushed onto the deque with extend.
3. One action is popped from the front with popleft and passed to env.step(action).
For K=1 this is identical to standard per-step inference: the deque is
refilled on every step. See evaluate_task in
mikasa_robo_suite/vla/benchmarking.py for the exact implementation.
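The queue logic above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in evaluate_task: `run_chunked_episode`, `CountingPolicy`, and `toy_env_step` are hypothetical stand-ins, and the toy step function returns only `(obs, done)` rather than the full environment step tuple.

```python
from collections import deque


def run_chunked_episode(policy_forward, env_step, first_obs, max_steps):
    """Consume K-action chunks through a FIFO queue, one action per step."""
    queue = deque()  # fresh, empty deque at the start of the episode
    obs = first_obs
    executed = []
    for _ in range(max_steps):
        if not queue:  # refill only when the previous chunk is used up
            queue.extend(policy_forward(obs))
        action = queue.popleft()  # oldest predicted action goes first
        obs, done = env_step(action)
        executed.append(action)
        if done:
            break
    return executed


class CountingPolicy:
    """Hypothetical policy that labels actions as (call index, position)."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size
        self.calls = 0

    def forward(self, obs):
        self.calls += 1
        return [(self.calls, j) for j in range(self.chunk_size)]


def toy_env_step(action):
    """Hypothetical environment step: echoes the action, never terminates."""
    return {"last_action": action}, False


policy = CountingPolicy(chunk_size=3)
executed = run_chunked_episode(policy.forward, toy_env_step, first_obs={}, max_steps=7)
```

With K=3 and seven steps the policy is queried on steps 1, 4, and 7, so observations seen on the other steps never reach the model. That is the trade-off of a larger chunk size.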
Metrics#
success_once (main metric)#
success_once is a boolean latch, not a count. At each simulator
step the runner performs:
success_once = success_once or bool(info["success"]) # OR-latch
so the value flips from False to True the first time the task
succeeds and stays True for the remainder of the episode. The
canonical per-episode metric is the final value of this latch:
success_once = any(info["success"] is True over all simulator steps)
This unifies tasks that succeed mid-episode (e.g. ShellGameTouch) and
tasks that only emit success on the final step (e.g. TimedTransfer) under
a single binary outcome. See mikasa_robo_suite/vla/benchmarking.py
(evaluate_task) for the exact code path.
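A small sketch shows why the OR-latch and the "any step succeeded" formulation agree. The per-step success lists are made-up examples and `success_once_latch` is a hypothetical helper, not the benchmark's function.

```python
def success_once_latch(step_successes):
    """Apply the OR-latch over a sequence of per-step success flags."""
    success_once = False
    for step_success in step_successes:
        success_once = success_once or bool(step_success)
    return success_once


# Made-up per-step flags for three kinds of episode.
mid_episode = [False, True, False, False]  # succeeds once, then the flag drops
final_step_only = [False, False, True]     # success reported on the last step
never = [False, False, False]

latched = [success_once_latch(e) for e in (mid_episode, final_step_only, never)]
```

The first case is the reason a latch is needed: reading only the final step's flag would score that episode as a failure.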
return (debug only)#
return = sum(reward) over all steps of the episode
Use this to inspect reward shaping or training quality. Do not aggregate it to split level or use it as a comparison number.
Reported SR levels#
Three numbers are required per evaluation run:
| Metric | Definition |
|---|---|
| Per-task SR | Mean of success_once over the task's episodes |
| Per-split SR (main) | |
| Per-memory-type SR | For each memory type present in the split: |
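One way the three levels could be computed from per-task results is sketched below. The aggregation rules are assumptions, not taken from the benchmark code: per-split SR is treated as the unweighted mean of per-task SRs, and per-memory-type SR as the mean of per-task SRs over tasks of that type. The function name `aggregate` and the example env IDs are hypothetical; the output keys mirror summary.json.

```python
from collections import defaultdict


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def aggregate(task_results):
    """Build per-task, per-split and per-memory-type SR (assumed definitions)."""
    per_task_sr = {r["env_id"]: mean(r["successes"]) for r in task_results}

    by_memory_type = defaultdict(list)
    for r in task_results:
        by_memory_type[r["memory_type"]].append(per_task_sr[r["env_id"]])

    return {
        "per_task_sr": per_task_sr,
        # Assumption: every task counts equally, regardless of episode count.
        "sr_split": mean(per_task_sr.values()),
        "sr_per_memory_type": {
            memory_type: mean(srs) for memory_type, srs in sorted(by_memory_type.items())
        },
    }


# Hypothetical results with two episodes per task, for illustration only.
example = [
    {"env_id": "TaskA", "memory_type": "Object", "successes": [True, False]},
    {"env_id": "TaskB", "memory_type": "Object", "successes": [True, True]},
    {"env_id": "TaskC", "memory_type": "Spatial", "successes": [False, False]},
]
summary = aggregate(example)
```

When every task runs the same number of episodes, as the canonical 50-episode protocol requires, averaging over tasks and averaging over all episodes give the same per-split number. Check evaluate_benchmark for the definition the runner actually uses.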
Seeding#
For every task independently, run 50 episodes with seeds:
env.reset(seed=4242424242 + i) for i in 0, 1, …, 49
The starting seed 4242424242 matches the one used by muVLA when it was
evaluated on MIKASA-Robo, so results are directly comparable to that baseline.
Note
Seeds are the same across tasks — each task independently starts from
4242424242. This gives full determinism, but it means the per-episode
initial states are correlated across tasks. Two tasks evaluated under
this protocol are not statistically independent. Account for this when
computing confidence intervals or aggregate statistics.
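The seed schedule is simple enough to write out. In this sketch the constants are redefined locally with the documented values rather than imported from the benchmark package, and `episode_seeds` is a hypothetical helper.

```python
START_SEED = 4242424242        # documented canonical start seed
NUM_EPISODES_PER_TASK = 50     # documented canonical episode count


def episode_seeds(start_seed=START_SEED, n_episodes=NUM_EPISODES_PER_TASK):
    """Seed for episode i is start_seed + i, identical for every task."""
    return [start_seed + i for i in range(n_episodes)]


seeds = episode_seeds()

# These values exceed the signed 32-bit range but still fit in 32 unsigned bits,
# which matters if a seed is ever stored in a fixed-width integer field.
fits_uint32 = all(0 <= s < 2**32 for s in seeds)
exceeds_int32 = all(s > 2**31 - 1 for s in seeds)
```

Because the list does not depend on the task, episode i of any two tasks shares a seed, which is the correlation the note above warns about.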
num_envs#
The canonical evaluation uses num_envs=1. The episode-to-seed mapping is
then trivial: episode i is the state produced by
env.reset(seed=4242424242 + i).
For wall-clock speedups (especially for the Long split) you may use
num_envs > 1 as an opt-in optimisation, but each parallel env must be
seeded so that the set of 50 (env_id, episode_idx) pairs is identical to
the num_envs=1 run. The reported SR must match.
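For a parallel run, one possible bookkeeping scheme is to cut the canonical seed list into batches of at most num_envs seeds, so that the union of all batches is exactly the num_envs=1 schedule. The sketch below covers only that partitioning. How per-env seeds are handed to the simulator is not specified here, and `seed_batches` is a hypothetical helper.

```python
def seed_batches(start_seed, n_episodes, num_envs):
    """Split the canonical seeds into consecutive batches of <= num_envs."""
    seeds = [start_seed + i for i in range(n_episodes)]
    return [seeds[i:i + num_envs] for i in range(0, n_episodes, num_envs)]


batches = seed_batches(start_seed=4242424242, n_episodes=50, num_envs=8)

# Flattening the batches must reproduce the num_envs=1 schedule exactly.
flattened = [seed for batch in batches for seed in batch]
canonical = [4242424242 + i for i in range(50)]
```

With 50 episodes and 8 envs the last batch holds only two seeds, so the remaining six envs must be ignored for that batch rather than counted as extra episodes.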
Wrapper Stack#
Always construct the env with:
```python
env = gym.make(
    env_id,
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=False)
```
include_overlays=False strips debug/render overlays so they cannot affect
timing or bleed into recorded observations. The benchmark runner
(evaluate_benchmark) does this automatically.
JSON Output Schema#
Per-task file <ENV_ID>.json#
One file per evaluated env ID:
```json
{
  "env_id": "RememberColor3-VLA-v0",
  "split": "Short",
  "memory_type": "Object",
  "start_seed": 4242424242,
  "n_episodes": 50,
  "successes": [true, false, true, "..."],
  "returns": [12.4, 0.0, 15.7, "..."],
  "sr": 0.74,
  "mean_return": 9.8,
  "benchmark_commit": "<git sha>",
  "control_mode": "pd_ee_delta_pose",
  "obs_mode": "rgb",
  "wrapper_chain": "apply_mikasa_vla_wrappers(include_overlays=False)",
  "action_chunk_size": 8,
  "model": {"name": "my-vla", "config": {}},
  "episode_lengths": [30, 28, 30, "..."],
  "episode_seeds": [4242424242, 4242424243, "..."]
}
```
successes and returns have length n_episodes. sr and
mean_return are their means. episode_lengths (the number of
simulator steps in each episode before termination) and episode_seeds
(start_seed + i for i in 0…n_episodes-1) are always written by
evaluate_benchmark and may be safely consumed by downstream tools, but
they are not required for the canonical SR comparison and may be omitted
from manually constructed result files.
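The consistency rules above lend themselves to a quick check on manually constructed files. The following sketch is not part of the benchmark: `find_problems` is a hypothetical helper, and the tolerance is an assumption in case a file stores rounded means.

```python
import math


def find_problems(result, tol=1e-6):
    """Return a list of inconsistencies in one per-task result dict."""
    problems = []
    n = result["n_episodes"]
    for key in ("successes", "returns"):
        if len(result[key]) != n:
            problems.append(f"{key} has length {len(result[key])}, expected {n}")
    if not problems:
        if not math.isclose(result["sr"], sum(result["successes"]) / n, abs_tol=tol):
            problems.append("sr is not the mean of successes")
        if not math.isclose(result["mean_return"], sum(result["returns"]) / n, abs_tol=tol):
            problems.append("mean_return is not the mean of returns")
    # Optional fields: only checked when present.
    seeds = result.get("episode_seeds")
    if seeds is not None and seeds != [result["start_seed"] + i for i in range(n)]:
        problems.append("episode_seeds do not equal start_seed + i")
    lengths = result.get("episode_lengths")
    if lengths is not None and len(lengths) != n:
        problems.append("episode_lengths has the wrong length")
    return problems


# Hypothetical four-episode result, for illustration only.
good = {
    "start_seed": 4242424242,
    "n_episodes": 4,
    "successes": [True, False, True, True],
    "returns": [1.0, 0.0, 2.0, 1.0],
    "sr": 0.75,
    "mean_return": 1.0,
    "episode_seeds": [4242424242, 4242424243, 4242424244, 4242424245],
}
bad = dict(good, sr=0.5)
```

A file that omits episode_lengths and episode_seeds still passes, matching the rule that these two fields are optional in hand-built results.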
Summary file summary.json#
One file per run, updated after every task completes:
```json
{
  "split": "Short",
  "sr_split": 0.61,
  "sr_per_memory_type": {
    "Negative": 0.42,
    "Object": 0.55,
    "Prospective": 0.30,
    "Spatial": 0.72,
    "Temporal": 0.50,
    "Tracking": 0.40
  },
  "tasks": ["RememberColor3-VLA-v0", "..."],
  "per_task_sr": {"RememberColor3-VLA-v0": 0.74, "...": 0.0},
  "per_task_mean_return": {"RememberColor3-VLA-v0": 9.8, "...": 0.0}
}
```
summary.json is rewritten on disk after every completed task, so a run
that is interrupted mid-way leaves a valid partial summary.
Reference Evaluation Script#
examples/eval_demo.py is a working reference built around
mikasa_robo_suite.vla.benchmarking. It uses DummyChunkPolicy
(random actions) so you can run it end-to-end before plugging in a real model:
```python
"""Run a MIKASA-Robo-VLA benchmark evaluation.

Replace ``DummyChunkPolicy`` with your own policy adapter. The benchmark
runner handles CSV split loading, canonical per-task seeds, action chunk
queues, ``success_once`` latching, JSON files, and split aggregation.

Results are saved to a timestamped subdirectory under ``--output-dir`` so
successive runs never overwrite each other.

Examples (run from the repository root)
----------------------------------------
Smoke-test one task, one episode::

    uv run python examples/eval_demo.py \
        --num-episodes 1 --sim-backend cpu \
        --output-dir eval_results/dummy

Full canonical Short-split run, 50 episodes per task::

    uv run python examples/eval_demo.py \
        --split short \
        --output-dir results/my_model

Specific tasks::

    uv run python examples/eval_demo.py \
        --task RememberColor3-VLA-v0 \
        --task ShellGameTouch-VLA-v0 \
        --output-dir results/my_model

All 90 benchmark tasks::

    uv run python examples/eval_demo.py \
        --split all \
        --output-dir results/my_model

With per-episode rollout videos::

    uv run python examples/eval_demo.py \
        --split short \
        --save-videos \
        --output-dir results/my_model
"""

from __future__ import annotations

import warnings

warnings.filterwarnings("ignore")

import argparse
import json
from pathlib import Path
from typing import List, Mapping, Optional, Sequence, Tuple

import torch

from mikasa_robo_suite.vla.benchmarking import (
    NUM_EPISODES_PER_TASK,
    START_SEED,
    BenchmarkConfig,
    JsonDict,
    RichBenchmarkUI,
    evaluate_benchmark,
    make_run_dir,
    select_benchmark_tasks,
)


class DummyChunkPolicy:
    """Return random action chunks in the canonical 7D EE-delta action space."""

    def __init__(self, chunk_size: int = 8, action_dim: int = 7):
        if chunk_size <= 0:
            raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
        self.chunk_size = int(chunk_size)
        self.action_dim = int(action_dim)

    @torch.no_grad()
    def forward(self, obs: Mapping[str, object]) -> torch.Tensor:
        proprio = obs.get("proprio")
        device = proprio.device if torch.is_tensor(proprio) else torch.device("cpu")
        return torch.empty(
            (self.chunk_size, self.action_dim),
            device=device,
            dtype=torch.float32,
        ).uniform_(-1.0, 1.0)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    selection = parser.add_mutually_exclusive_group()
    selection.add_argument(
        "--split",
        choices=("short", "medium", "long", "all"),
        help=(
            "Evaluate every task in the given horizon split, "
            "or 'all' for all 90 benchmark tasks."
        ),
    )
    selection.add_argument(
        "--task",
        action="append",
        dest="tasks",
        metavar="ENV_ID",
        help="Evaluate one env ID. Repeat to build an arbitrary subset.",
    )
    parser.add_argument("--num-episodes", type=int, default=NUM_EPISODES_PER_TASK)
    parser.add_argument("--start-seed", type=int, default=START_SEED)
    parser.add_argument("--chunk-size", type=int, default=8)
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("eval_results") / "dummy",
        help="Base directory. Results go into a timestamped subdirectory.",
    )
    parser.add_argument(
        "--resume",
        type=Path,
        default=None,
        metavar="RUN_DIR",
        help=(
            "Resume an interrupted run. Pass the timestamped run directory "
            "(e.g. eval_results/dummy/short/2026-05-21_18-52-16). "
            "Tasks that already have a result JSON there are skipped. "
            "The split is inferred from existing results; override with --split."
        ),
    )
    parser.add_argument(
        "--save-videos",
        action="store_true",
        help=(
            "Save a rollout video for every episode of every task. "
            "Videos are written to <output-dir>/<timestamp>/videos/<env_id>/."
        ),
    )
    parser.add_argument(
        "--sim-backend",
        default="gpu",
        help="ManiSkill sim backend ('cpu' or 'gpu'). Default: gpu.",
    )
    parser.add_argument("--render-mode", default="all")
    return parser.parse_args()


def _split_label(tasks: Sequence) -> str:
    splits = {task.split for task in tasks}
    return next(iter(splits)) if len(splits) == 1 else "custom"


def _load_resume_state(resume_dir: Path) -> Tuple[List[JsonDict], Optional[str]]:
    """Load completed task results from *resume_dir* and infer the benchmark split.

    Returns ``(completed_results, inferred_split)``. *inferred_split* is one of
    ``"short"``, ``"medium"``, ``"long"``, ``"all"``, or ``None`` if it cannot
    be determined (user must pass ``--split`` explicitly in that case).
    """
    if not resume_dir.is_dir():
        raise FileNotFoundError(f"--resume directory not found: {resume_dir}")

    completed: List[JsonDict] = []
    for p in sorted(resume_dir.glob("*.json")):
        if p.stem == "summary":
            continue
        with p.open(encoding="utf-8") as f:
            completed.append(json.load(f))

    # Infer split from completed results: if all tasks share one split → that
    # split; if multiple splits are present → "all" (full benchmark run).
    splits = {r.get("split", "").lower() for r in completed}
    splits.discard("")
    if not splits:
        inferred_split: Optional[str] = None
    elif len(splits) == 1:
        inferred_split = next(iter(splits))
    else:
        inferred_split = "all"
    return completed, inferred_split


def main() -> None:
    args = parse_args()
    completed_results: List[JsonDict] = []

    if args.resume is not None:
        # ------------------------------------------------------------------ #
        # Resume mode: load finished tasks, determine remaining work          #
        # ------------------------------------------------------------------ #
        completed_results, inferred_split = _load_resume_state(args.resume)
        completed_ids = {r["env_id"] for r in completed_results}

        # Task selection: explicit flag wins; fall back to inferred split.
        if args.split:
            all_tasks = select_benchmark_tasks(split=args.split)
        elif args.tasks:
            all_tasks = select_benchmark_tasks(env_ids=args.tasks)
        elif inferred_split:
            all_tasks = select_benchmark_tasks(split=inferred_split)
        else:
            raise SystemExit(
                "Cannot infer the task split from the resume directory. "
                "Pass --split or --task explicitly."
            )
        tasks = [t for t in all_tasks if t.env_id not in completed_ids]
        run_dir = args.resume  # write results back into the same directory
    else:
        # ------------------------------------------------------------------ #
        # Fresh run                                                           #
        # ------------------------------------------------------------------ #
        if args.split:
            tasks = select_benchmark_tasks(split=args.split)
        elif args.tasks:
            tasks = select_benchmark_tasks(env_ids=args.tasks)
        else:
            tasks = select_benchmark_tasks(env_ids=["RememberColor3-VLA-v0"])
        run_dir = make_run_dir(args.output_dir / _split_label(tasks))

    config = BenchmarkConfig(
        start_seed=args.start_seed,
        n_episodes=args.num_episodes,
        sim_backend=args.sim_backend,
        save_videos=args.save_videos,
    )
    policy = DummyChunkPolicy(chunk_size=args.chunk_size)

    with RichBenchmarkUI(tasks, config.n_episodes, initial_results=completed_results or None) as ui:
        _, summary = evaluate_benchmark(
            tasks,
            policy,
            config,
            output_dir=run_dir,
            model={"name": "dummy-random-chunk-policy", "config": {"chunk_size": policy.chunk_size}},
            task_start_callback=ui.on_task_start,
            episode_callback=ui.on_episode_done,
            task_done_callback=ui.on_task_done,
            initial_results=completed_results if completed_results else None,
        )

    from rich.console import Console
    from rich.table import Table

    console = Console()
    console.print()
    summary_table = Table(
        title="[bold]Benchmark Summary[/bold]",
        show_header=True,
        header_style="bold cyan",
        border_style="blue",
    )
    summary_table.add_column("Memory Type", style="white")
    summary_table.add_column("SR", justify="right")
    for memory_type, sr in summary["sr_per_memory_type"].items():
        color = "bright_green" if sr >= 0.7 else "yellow" if sr >= 0.4 else "red"
        summary_table.add_row(memory_type, f"[{color}]{sr:.2%}[/{color}]")
    summary_table.add_section()
    sr_split = summary["sr_split"]
    split_color = "bright_green" if sr_split >= 0.7 else "yellow" if sr_split >= 0.4 else "red"
    summary_table.add_row(
        "[bold]Overall SR[/bold]",
        f"[bold {split_color}]{sr_split:.2%}[/bold {split_color}]",
    )
    console.print(summary_table)
    console.print(f"[dim]JSON results → {run_dir}[/dim]")
    if args.save_videos:
        console.print(f"[dim]Videos → {run_dir / 'videos'}[/dim]")


if __name__ == "__main__":
    main()
```
Run from the repository root:
```bash
# One-task smoke-test, one episode, CPU backend
uv run python examples/eval_demo.py \
    --num-episodes 1 --sim-backend cpu \
    --output-dir eval_results/dummy

# Canonical Short-split run
uv run python examples/eval_demo.py \
    --split short \
    --output-dir results/my_model

# With per-episode rollout videos
uv run python examples/eval_demo.py \
    --split short \
    --save-videos \
    --output-dir results/my_model
```
See Benchmarking for the full CLI reference and Python API.
What to Report in Papers#
Reproducibility checklist:
[ ] Benchmark git commit.
[ ] Split evaluated (Short / Medium / Long, or explicit subset + label).
[ ] Start seed and episodes per task (default: 4242424242 × 50).
[ ] GPU model and count.
[ ] Training dataset (Hugging Face repo).
[ ] Training compute (GPU hours or gradient steps) and config (LoRA rank, etc.).
[ ] Action chunk size K.
[ ] Required: per-split SR, full per-task SR table, per-memory-type SR.
[ ] Mean return per task (optional debug metric).
[ ] Whether oracle_info or task_cue were passed to the policy. Both are RL-side fields (task_cue only exists for Rotate* and is already encoded in the language instruction); reading either disqualifies the run from the canonical leaderboard.
Cross-Split and Partial Evaluation#
Evaluating on a different split than the training split, or on a custom subset, is allowed but must be reported separately with an explicit label such as “trained on Short, evaluated on Medium”. Do not mix such results into the same SR table as canonical runs.