Benchmarking ============ MIKASA-Robo-VLA contains 90 tasks whose episode horizons range from **25 to 2160** simulation steps. Training a VLA on all 90 simultaneously is hard because the batch mixes episodes with vastly different lengths; it also makes fair comparisons between models difficult. To solve this we group tasks into three **horizon splits** and define a canonical per-split protocol. .. contents:: On this page :depth: 2 :local: .. _horizon-splits: Horizon Splits -------------- Split assignment is deterministic from ``Max Length`` in ``mikasa_robo_vla_envs.csv``: .. code-block:: text Max Length ≤ 200 → Short 201 ≤ Max Length ≤ 601 → Medium Max Length > 601 → Long .. list-table:: :header-rows: 1 :widths: 14 12 24 50 * - Split - Tasks - Horizon (steps) - What it tests * - **Short** - 38 - 25 – 200 - Rapid cue encoding and short-term recall. * - **Medium** - 30 - 201 – 601 - Sustained working memory over moderately long episodes. * - **Long** - 22 - 602 – 2160 - Extended memory, multi-phase reasoning, procedural recall. Memory-type distribution per split: .. code-block:: text Short (38): Spatial 14 · Object 9 · Negative 9 · Temporal 3 · Tracking 2 · Prospective 1 Medium (30): Object 9 · Capacity 6 · Temporal 4 · Sequential 3 · Procedural 3 · Tracking 2 · Prospective 2 · Checklist 1 Long (22): Capacity 6 · Temporal 5 · Checklist 3 · Procedural 3 · Sequential 3 · Prospective 2 The full per-task table is in `mikasa_robo_vla_envs.csv `_. .. _canonical-protocol: Canonical Protocol ------------------ The canonical benchmarking procedure for any VLA model: 1. **Choose a split** — ``short``, ``medium``, or ``long``. 2. **Download the datasets** for every task in that split from Hugging Face (one repository = one dataset = one task; available in `LeRobotDataset v3 `_ and `RLDS `_ formats). 3. **Train in multi-task mode** on the combined data for the split. 4. **Evaluate on every task in that split** — 50 episodes per task, seeds ``4242424242 … 4242424291``. 5. **Report three numbers**: per-split SR (main metric), full per-task SR table, and per-memory-type SR breakdown within the split. Optionally include mean return per task as a debug number. See :doc:`evaluation_protocol` for the exact seed convention, metric definitions, wrapper stack, action chunking rules, JSON output schema, and the reproducibility checklist. .. _running-eval: Running an Evaluation --------------------- The reference CLI is ``examples/eval_demo.py``. It runs a checkpoint-free ``DummyChunkPolicy`` so you can verify the pipeline end-to-end before attaching your own model. Every run writes results to a **timestamped subdirectory** under ``--output-dir`` so successive runs never overwrite each other. Smoke-test (one task, one episode, CPU backend):: uv run python examples/eval_demo.py \ --num-episodes 1 --sim-backend cpu \ --output-dir eval_results/dummy Canonical Short-split run:: uv run python examples/eval_demo.py \ --split short \ --output-dir eval_results/dummy All 90 tasks:: uv run python examples/eval_demo.py \ --split all \ --output-dir eval_results/dummy Arbitrary subset of tasks:: uv run python examples/eval_demo.py \ --task RememberColor3-VLA-v0 \ --task ShellGameTouch-VLA-v0 \ --task BatteriesCheckerEasy-VLA-v0 \ --output-dir eval_results/dummy With per-episode rollout videos:: uv run python examples/eval_demo.py \ --split short \ --save-videos \ --output-dir eval_results/dummy .. list-table:: ``eval_demo.py`` CLI flags :header-rows: 1 :widths: 32 14 54 * - Flag - Default - Description * - ``--split short|medium|long|all`` - — - Evaluate every task in the given horizon split. ``all`` runs all 90 canonical tasks. Mutually exclusive with ``--task``. * - ``--task ENV_ID`` - — - Evaluate one specific env ID. Repeat for a custom multi-task subset. Accepts any registered Gymnasium env, including IDs not in the canonical CSV. Mutually exclusive with ``--split``. * - ``--num-episodes N`` - ``50`` - Episodes per task. * - ``--start-seed N`` - ``4242424242`` - Seed for episode ``i`` is ``start_seed + i``. * - ``--chunk-size K`` - ``8`` - Action chunk size for the built-in dummy policy. * - ``--output-dir PATH`` - ``eval_results/dummy`` - Base directory. A timestamped subdir is created automatically. * - ``--resume RUN_DIR`` - — - Continue an interrupted run. Pass the existing timestamped directory (e.g. ``eval_results/dummy/short/2026-05-21_18-52-16``). Tasks that already have a result ``.json`` there are skipped; results are written back to the same directory. The split is inferred automatically from the existing files; override with ``--split`` if needed. * - ``--save-videos`` - off - Record a video for every episode of every task. Videos go to ``//videos//``. * - ``--sim-backend cpu|gpu`` - auto - Override the ManiSkill simulation backend. * - ``--render-mode`` - ``all`` - Passed directly to ``gym.make``. .. _resume: Resuming an Interrupted Run ---------------------------- A full-split evaluation can take hours. If a run is interrupted (OOM, preemption, Ctrl-C), pass ``--resume`` to continue from exactly where it stopped: .. code-block:: bash # The split is inferred automatically from the existing result files. uv run python examples/eval_demo.py \ --resume eval_results/dummy/short/2026-05-21_18-52-16 # Override or extend — e.g. resume files that mixed splits: uv run python examples/eval_demo.py \ --resume eval_results/dummy/all/2026-05-21_18-52-16 \ --split all How it works: 1. Every ``*.json`` file in ``RUN_DIR`` (except ``summary.json``) is loaded as a completed task result. 2. The completed env IDs are subtracted from the full task list — only the remaining tasks are evaluated. 3. New result files are written into the same ``RUN_DIR``; ``summary.json`` is updated to include both old and new tasks after every completed task. 4. The :ref:`rich terminal display ` pre-populates the results table with already-completed tasks so you get a complete view at a glance. .. note:: Use the same ``--start-seed`` and ``--num-episodes`` values you used for the original run, or the resumed results will not be comparable with the already-saved ones. .. note:: **Mixed-split resume directories.** When the existing ``*.json`` files come from different splits (e.g. you originally ran ``--split all``), ``--resume`` reads the ``"split"`` field of each result and: - if every file reports the same split, that split is selected; - if multiple distinct splits are present, ``"all"`` is inferred and the remaining 90 tasks are scheduled. Override either case by passing ``--split`` (or ``--task ENV_ID …``) explicitly. .. _rich-ui: Rich Terminal Display --------------------- When running ``eval_demo.py`` (or any script that uses :class:`RichBenchmarkUI`) the terminal shows a live panel with three parts that update after every episode: .. code-block:: text ╭─── MIKASA-Robo-VLA Benchmark 4/38 tasks ───────────────────────────────╮ │ ⠹ Tasks [4/38] ShellGamePick-VLA-v0 ━━━━━━━━━━╸━━ 4/38 10% 0:01:… │ │ ⠹ Episodes ShellGamePick-VLA-v0 ━━━━━━━━━━━━━ 50/50 100% │ │ ────────────────────────────────────────────────────────────────────── │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┳━━━━┳━━━━━━━━┓ │ │ ┃ Task ┃ Split ┃ Memory ┃ Eps ┃ SR ┃ Return ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━╇━━━━╇━━━━━━━━┩ │ │ │ RememberColor3-VLA-v0 │ Short │ Object │ 50 │ 82%│ 0.6341 │ │ │ │ ShellGameTouch-VLA-v0 │ Short │ Spatial │ 50 │ 56%│ 0.4120 │ │ │ │ ShellGamePick-VLA-v0 … │ Short │ Spatial │ 32/50 │ 44%│ 0.3012 │ │ │ └──────────────────────────┴───────┴───────────┴───────┴────┴────────┘ │ ╰─────────────────────────────────────────────────────────────────────────╯ - **Top bar** — overall task progress with elapsed time and ETA. - **Middle bar** — per-episode progress for the task currently running; resets to zero when a new task begins. - **Table** — one row per completed task (with final SR coloured green / yellow / red) plus one live row for the task currently in progress (shown with ``(running…)`` label and values updated every episode). When ``--resume`` is used, already-completed tasks appear in the table immediately from the first frame so you always see the full picture. To use :class:`RichBenchmarkUI` from your own script see the :ref:`Python API ` section below. .. _output-layout: Output Layout ------------- Each run creates its own directory under ``//``: .. code-block:: text eval_results/ my_model/ short/ 2026-05-21_15-30-00/ ← one directory per run summary.json ← updated after each completed task RememberColor3-VLA-v0.json ShellGameTouch-VLA-v0.json ... videos/ ← only when --save-videos RememberColor3-VLA-v0/ 0.mp4 ← episode 0 1.mp4 ← episode 1 ... ``summary.json`` is rewritten after every task completes, so you can inspect it at any point during a long run. .. _python-api: Python API ---------- Import the benchmark module directly to integrate evaluation into your own scripts. Selecting tasks ~~~~~~~~~~~~~~~ .. code-block:: python from mikasa_robo_suite.vla.benchmarking import select_benchmark_tasks tasks = select_benchmark_tasks(split="short") # 38 tasks tasks = select_benchmark_tasks(split="medium") # 30 tasks tasks = select_benchmark_tasks(split="long") # 22 tasks tasks = select_benchmark_tasks(split="all") # all 90 tasks # Explicit list — any subset, order preserved tasks = select_benchmark_tasks(env_ids=[ "RememberColor3-VLA-v0", "ShellGameTouch-VLA-v0", ]) # Env IDs not in the canonical CSV are accepted (split="custom", # memory_type="Unknown") so any registered Gymnasium env can be evaluated. tasks = select_benchmark_tasks(env_ids=["MyCustomTask-VLA-v0"]) Each returned :class:`~mikasa_robo_suite.vla.benchmarking.BenchmarkTask` is a small dataclass carrying everything ``evaluate_benchmark`` needs to construct and run a task: .. list-table:: ``BenchmarkTask`` fields :header-rows: 1 :widths: 28 18 54 * - Field - Type - Meaning * - ``env_id`` - ``str`` - Gymnasium env ID, e.g. ``"RememberColor3-VLA-v0"``. * - ``split`` - ``str`` - Horizon split label: ``"Short"`` / ``"Medium"`` / ``"Long"``, or ``"custom"`` for env IDs not in ``mikasa_robo_vla_envs.csv``. * - ``memory_type`` - ``str`` - Memory category from the CSV (``"Object"``, ``"Spatial"``, ``"Capacity"``, …), or ``"Unknown"`` for custom env IDs. * - ``max_episode_steps`` - ``int`` - Episode horizon (``Max Length`` column in the CSV). The runner uses this to size the per-episode progress bar and bound the rollout loop. Running a benchmark ~~~~~~~~~~~~~~~~~~~ .. code-block:: python from mikasa_robo_suite.vla.benchmarking import ( BenchmarkConfig, evaluate_benchmark, make_run_dir, select_benchmark_tasks, ) tasks = select_benchmark_tasks(split="short") config = BenchmarkConfig( n_episodes=50, save_videos=True, # record per-episode rollout videos sim_backend="gpu", ) policy = MyPolicy(chunk_size=8) run_dir = make_run_dir("eval_results/my_model/short") results, summary = evaluate_benchmark( tasks, policy, config, output_dir=run_dir, model={"name": "my-vla", "config": {"checkpoint": "path/to/ckpt"}}, progress=print, ) print(f"SR_split = {summary['sr_split']:.2%}") ``evaluate_benchmark`` writes ``.json`` and updates ``summary.json`` after every task, so partial runs are always readable on disk. ``make_run_dir(base)`` creates ``base/YYYY-MM-DD_HH-MM-SS/`` and returns the path. Call it once at the start of a run to avoid overwriting previous results. Plugging in a real policy ~~~~~~~~~~~~~~~~~~~~~~~~~ The benchmark runner expects any object with a ``chunk_size`` attribute and a ``forward(obs)`` method: .. code-block:: python class MyPolicy: chunk_size = 8 # actions returned per forward pass def forward(self, obs: dict) -> torch.Tensor: # obs["rgb"] shape (1, 128, 128, 6), uint8 # obs["proprio"] shape (1, 7), float32 # info["language_instruction"] str # # Return shape: (chunk_size, 7), float32, values in [-1, 1] ... For ``chunk_size=1`` this is identical to standard per-step inference: the internal FIFO action queue is consumed every step, so ``policy.forward(obs)`` is called on every simulator step. .. note:: **``success_once`` is a latch, not a count.** The runner OR-accumulates ``info["success"]`` across an episode .. code-block:: python success_once = success_once or bool(info["success"]) so the recorded value flips from ``False`` to ``True`` the *first* time the task succeeds and stays ``True`` for the rest of the episode. See :doc:`evaluation_protocol` for full metric definitions. .. _rich-ui-api: Rich terminal UI ~~~~~~~~~~~~~~~~ :class:`~mikasa_robo_suite.vla.benchmarking.RichBenchmarkUI` is a context manager that renders a live panel (progress bars + results table) in the terminal. It exposes three callbacks that wire directly into :func:`~mikasa_robo_suite.vla.benchmarking.evaluate_benchmark`. .. code-block:: python from mikasa_robo_suite.vla.benchmarking import ( BenchmarkConfig, RichBenchmarkUI, evaluate_benchmark, make_run_dir, select_benchmark_tasks, ) tasks = select_benchmark_tasks(split="short") config = BenchmarkConfig(n_episodes=50) policy = MyPolicy(chunk_size=8) run_dir = make_run_dir("eval_results/my_model/short") with RichBenchmarkUI(tasks, config.n_episodes) as ui: results, summary = evaluate_benchmark( tasks, policy, config, output_dir=run_dir, task_start_callback=ui.on_task_start, episode_callback=ui.on_episode_done, task_done_callback=ui.on_task_done, ) To pre-populate the table when resuming a partial run, pass ``initial_results`` — a list of already-completed task result dicts loaded from disk: .. code-block:: python import json from pathlib import Path resume_dir = Path("eval_results/my_model/short/2026-05-21_18-52-16") completed = [ json.loads(p.read_text()) for p in sorted(resume_dir.glob("*.json")) if p.stem != "summary" ] remaining = [t for t in tasks if t.env_id not in {r["env_id"] for r in completed}] with RichBenchmarkUI(remaining, config.n_episodes, initial_results=completed) as ui: results, summary = evaluate_benchmark( remaining, policy, config, output_dir=resume_dir, task_start_callback=ui.on_task_start, episode_callback=ui.on_episode_done, task_done_callback=ui.on_task_done, initial_results=completed, ) :class:`RichBenchmarkUI` requires the ``rich`` package (``pip install rich``). It is listed as a dependency of ``mikasa-robo-suite`` so it is installed automatically when you install the package. .. _flexible-usage: Flexible Usage -------------- The three-split protocol is the canonical procedure, but the 90 individual datasets let you compose any training mixture: - **Single-task**: download one dataset, train, evaluate on that task alone. Useful for ablations or environment-specific analyses. - **Cross-split**: train on Short, evaluate on Medium to test generalisation. Must be labelled explicitly; not comparable to canonical results. - **Full-benchmark**: train on all 90 tasks at once. In all cases use the same ``eval_demo.py`` runner and the same seeding convention so that individual task results remain comparable to the canonical baseline.