Evaluation Protocol#

This page is the canonical specification for evaluating a trained VLA model on MIKASA-Robo-VLA. For how to run evaluations in practice — CLI flags, output layout, Python API — see Benchmarking.

Protocol at a Glance#

| Parameter | Canonical value |
| --- | --- |
| Split | Chosen explicitly by the user (short / medium / long). No auto-detection from a checkpoint. |
| Episodes per task | 50 |
| Seeds | 4242424242, 4242424243, …, 4242424291 (same stream for every task) |
| Parallel envs | num_envs=1 (default); num_envs > 1 allowed as opt-in speedup |
| obs_mode | "rgb" |
| control_mode | "pd_ee_delta_pose" |
| reward_mode | "normalized_dense" |
| Wrapper stack | apply_mikasa_vla_wrappers(env, include_overlays=False) |
| Main metric | success_once per episode |
| Debug metric | mean_return per task (not aggregated to split level) |

Observation and Action Format#

After gym.make + apply_mikasa_vla_wrappers(env, include_overlays=False), obs always contains:

  • obs["rgb"]: (B, 128, 128, 6) uint8, two cameras concatenated on the channel axis.

  • obs["proprio"]: (B, 7) float32, absolute EEF pose + gripper opening.

The Rotate* family additionally exposes obs["task_cue"] (target angle for the RL oracle); for all other tasks this key is dropped by the canonical helper. oracle_info is never exposed in the VLA-facing observation.

The action is a (B, 7) float32 tensor in the normalised pd_ee_delta_pose action space; all seven components are clamped to [-1, 1].

The canonical, field-by-field specification — value ranges, units, the exact gripper conventions, and the difference between obs["proprio"][..., 6] and action[6] — lives in Observation and Action Space. All evaluation pipelines (online runs, dataset collection, published datasets) use exactly the same layout, so the Observation and Action Space reference is authoritative for every context.

Important

A canonical leaderboard policy consumes only the camera images, obs["proprio"], and info["language_instruction"]. oracle_info is privileged and stripped from the VLA-facing observation. task_cue exists solely as an RL-training signal for the Rotate* family — the same target angle is also embedded in the natural-language instruction, so VLAs should ignore the task_cue key entirely. Reading either field disqualifies a run from the canonical leaderboard.
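One way to make this rule hard to violate is to filter the inputs before they reach the policy. The following is a minimal sketch, not part of the benchmark API: the helper name vla_policy_inputs and the constant ALLOWED_OBS_KEYS are hypothetical, and the sketch assumes obs and info are plain mappings carrying the keys described on this page.

```python
from typing import Any, Dict, Mapping

# Observation keys a canonical leaderboard policy may read (per this page).
ALLOWED_OBS_KEYS = ("rgb", "proprio")


def vla_policy_inputs(obs: Mapping[str, Any], info: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: keep only the inputs a canonical policy may consume."""
    inputs = {key: obs[key] for key in ALLOWED_OBS_KEYS}
    # The instruction is delivered through info, not through obs.
    inputs["language_instruction"] = info["language_instruction"]
    return inputs


# Toy example: placeholder strings stand in for the real tensors and text.
obs = {"rgb": "<images>", "proprio": "<pose>", "task_cue": "<angle>"}
info = {"language_instruction": "<instruction text>", "success": False}
filtered = vla_policy_inputs(obs, info)
```

Building the policy input from an allow-list, rather than deleting known privileged keys, means any field added to obs later is excluded by default.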

Action chunking#

VLA policies typically predict a chunk of K actions per forward pass. The benchmark runner consumes them with a collections.deque (FIFO):

  1. At the start of every episode a fresh, empty deque is created.

  2. On each simulator step, if the deque is empty, policy.forward(obs) is called to produce (K, action_dim) actions which are pushed onto the deque with extend.

  3. One action is popped from the front with popleft and passed to env.step(action).

For K=1 this is identical to standard per-step inference: the deque is refilled on every step. See evaluate_task in mikasa_robo_suite/vla/benchmarking.py for the exact implementation.
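The three steps above can be sketched with the standard library alone. This is an illustration of the queueing logic, not the code in evaluate_task: run_chunked_episode, toy_policy, and toy_step are hypothetical names, the step callable stands in for env.step and returns only the next observation, and termination handling is left out.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Action = Sequence[float]


def run_chunked_episode(
    policy_forward: Callable[[object], List[Action]],
    env_step: Callable[[Action], object],
    first_obs: object,
    max_steps: int,
) -> int:
    """Consume action chunks FIFO; return how many times the policy was called."""
    queue: Deque[Action] = deque()  # step 1: fresh, empty deque per episode
    obs = first_obs
    policy_calls = 0
    for _ in range(max_steps):
        if not queue:  # step 2: refill only when the deque is empty
            queue.extend(policy_forward(obs))
            policy_calls += 1
        action = queue.popleft()  # step 3: pop one action from the front
        obs = env_step(action)
    return policy_calls


# Toy stand-ins: a policy emitting K=4 zero actions and a step function
# that records each executed action and returns a step counter as "obs".
K = 4
executed: List[Action] = []


def toy_policy(obs: object) -> List[Action]:
    return [[0.0] * 7 for _ in range(K)]


def toy_step(action: Action) -> object:
    executed.append(action)
    return len(executed)


calls = run_chunked_episode(toy_policy, toy_step, first_obs=0, max_steps=10)
```

With K=4 and 10 steps the policy is queried on steps 0, 4, and 8, so the observation it sees is always the one from the step at which the queue ran dry.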

Metrics#

success_once (main metric)#

success_once is a boolean latch, not a count. At each simulator step the runner performs:

success_once = success_once or bool(info["success"])   # OR-latch

so the value flips from False to True the first time the task succeeds and stays True for the remainder of the episode. The canonical per-episode metric is the final value of this latch:

success_once = any(bool(info["success"]) for info in step_infos)   # step_infos: the info dicts from all simulator steps of the episode

This unifies tasks that succeed mid-episode (e.g. ShellGameTouch) and tasks that only emit success on the final step (e.g. TimedTransfer) under a single binary outcome. See mikasa_robo_suite/vla/benchmarking.py (evaluate_task) for the exact code path.
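As a small illustration of why the latch treats both kinds of task alike, the sketch below applies it to three toy episodes. The function name success_once_for_episode and the toy info lists are hypothetical; only the info["success"] key comes from this page.

```python
from typing import Iterable, Mapping


def success_once_for_episode(infos: Iterable[Mapping[str, object]]) -> bool:
    """Apply the OR-latch to the per-step info dicts of one episode."""
    latched = False
    for info in infos:
        latched = latched or bool(info["success"])
    return latched


# Success mid-episode that later reverts, success only on the last step,
# and an episode that never succeeds.
mid_episode = [{"success": False}, {"success": True}, {"success": False}]
final_step = [{"success": False}, {"success": False}, {"success": True}]
never = [{"success": False}, {"success": False}, {"success": False}]
```

The first episode still counts as a success even though info["success"] is False on its final step, which is the behaviour the latch is meant to guarantee.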

return (debug only)#

episode_return = sum(rewards)   # rewards: the per-step rewards of one episode

Use this to inspect reward shaping or training quality. Do not aggregate it to split level or use it as a comparison number.

Reported SR levels#

Three levels of success rate (SR) are required per evaluation run:

| Metric | Definition |
| --- | --- |
| Per-task SR | SR_task = mean(success_once) over the 50 episodes. |
| Per-split SR (main) | SR_split = mean(SR_task for all tasks in the split). |
| Per-memory-type SR | For each memory type present in the split: SR_mem = mean(SR_task for tasks with that memory_type). |
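The three definitions can be sketched in a few lines. This is an illustrative computation under the definitions above, not the runner's aggregation code: aggregate_sr, the task names TaskA/TaskB/TaskC, and the 4-episode success lists are hypothetical (a canonical run has 50 episodes per task).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def aggregate_sr(
    results: Dict[str, Tuple[str, List[bool]]],
) -> Tuple[Dict[str, float], float, Dict[str, float]]:
    """Map {env_id: (memory_type, successes)} to (SR_task, SR_split, SR_mem)."""
    # Per-task SR: mean of the per-episode success_once values.
    sr_task = {
        env_id: mean([float(s) for s in successes])
        for env_id, (_, successes) in results.items()
    }
    # Per-split SR: unweighted mean over tasks.
    sr_split = mean(list(sr_task.values()))
    # Per-memory-type SR: unweighted mean over the tasks of each memory type.
    by_memory: Dict[str, List[float]] = defaultdict(list)
    for env_id, (memory_type, _) in results.items():
        by_memory[memory_type].append(sr_task[env_id])
    sr_mem = {memory_type: mean(values) for memory_type, values in by_memory.items()}
    return sr_task, sr_split, sr_mem


toy_results = {
    "TaskA": ("Object", [True, True, False, False]),
    "TaskB": ("Object", [True, False, False, False]),
    "TaskC": ("Spatial", [True, True, True, True]),
}
sr_task, sr_split, sr_mem = aggregate_sr(toy_results)
```

Both aggregates average per-task SRs rather than pooling episodes, so every task carries equal weight regardless of how many tasks share its memory type.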

Seeding#

For every task independently, run 50 episodes with seeds:

env.reset(seed=4242424242 + i)   for i in 0, 1, …, 49

The starting seed 4242424242 matches the one used by muVLA when it was evaluated on MIKASA-Robo, so results are directly comparable to that baseline.

Note

Seeds are the same across tasks — each task independently starts from 4242424242. This gives full determinism, but it means the per-episode initial states are correlated across tasks. Two tasks evaluated under this protocol are not statistically independent. Account for this when computing confidence intervals or aggregate statistics.
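For reference, the canonical seed list can be generated as follows. The constant names mirror START_SEED and NUM_EPISODES_PER_TASK from mikasa_robo_suite.vla.benchmarking, but they are redefined locally here with the values stated on this page; canonical_seeds is a hypothetical helper.

```python
from typing import List

# Values as stated on this page; redefined locally for the sketch.
START_SEED = 4242424242
NUM_EPISODES_PER_TASK = 50


def canonical_seeds(
    start_seed: int = START_SEED,
    n_episodes: int = NUM_EPISODES_PER_TASK,
) -> List[int]:
    """Return the per-episode seeds: start_seed + i for i in 0..n_episodes-1."""
    return [start_seed + i for i in range(n_episodes)]


seeds = canonical_seeds()
```

The same list is reused for every task, which is why episode i of two different tasks shares a seed.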

num_envs#

The canonical evaluation uses num_envs=1. The episode-to-seed mapping is then trivial: episode i is the state produced by env.reset(seed=4242424242 + i).

For wall-clock speedups (especially for the Long split) you may use num_envs > 1 as an opt-in optimisation, but the parallel envs must be seeded so that the 50 episodes of each task start from the same seeds, and therefore the same initial states, as in the num_envs=1 run. The reported SR must match the num_envs=1 result.
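One way to keep the episode-to-seed mapping intact with parallel envs is to partition the canonical seed list into batches, one seed per env. The sketch below shows only that bookkeeping; seed_batches is a hypothetical helper, and how the seeds are then handed to the vectorised env is not specified on this page.

```python
from typing import List


def seed_batches(seeds: List[int], num_envs: int) -> List[List[int]]:
    """Split the seed list into consecutive batches of at most num_envs seeds."""
    if num_envs <= 0:
        raise ValueError(f"num_envs must be > 0, got {num_envs}")
    return [seeds[i:i + num_envs] for i in range(0, len(seeds), num_envs)]


seeds = [4242424242 + i for i in range(50)]
batches = seed_batches(seeds, num_envs=8)

# Flattening the batches must give back exactly the canonical seed list.
flat = [seed for batch in batches for seed in batch]
```

When 50 is not a multiple of num_envs the last batch is short (2 seeds for num_envs=8), so the surplus envs in that batch must not contribute episodes to the reported SR.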

Wrapper Stack#

Always construct the env with:

env = gym.make(
    env_id,
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=False)

include_overlays=False strips debug/render overlays so they cannot affect timing or bleed into recorded observations. The benchmark runner (evaluate_benchmark) does this automatically.

JSON Output Schema#

Per-task file <ENV_ID>.json#

One file per evaluated env ID:

{
  "env_id": "RememberColor3-VLA-v0",
  "split": "Short",
  "memory_type": "Object",
  "start_seed": 4242424242,
  "n_episodes": 50,
  "successes": [true, false, true, "..."],
  "returns":   [12.4, 0.0, 15.7, "..."],
  "sr": 0.74,
  "mean_return": 9.8,
  "benchmark_commit": "<git sha>",
  "control_mode": "pd_ee_delta_pose",
  "obs_mode": "rgb",
  "wrapper_chain": "apply_mikasa_vla_wrappers(include_overlays=False)",
  "action_chunk_size": 8,
  "model": {"name": "my-vla", "config": {}},
  "episode_lengths": [30, 28, 30, "..."],
  "episode_seeds":   [4242424242, 4242424243, "..."]
}

successes and returns have length n_episodes. sr and mean_return are their means. episode_lengths (the number of simulator steps in each episode before termination) and episode_seeds (start_seed + i for i in 0…n_episodes-1) are always written by evaluate_benchmark and may be safely consumed by downstream tools, but they are not required for the canonical SR comparison and may be omitted from manually constructed result files.
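The invariants in the paragraph above are easy to check mechanically, which is useful for manually constructed result files. The following is a minimal sketch, assuming the field names shown in the schema; check_task_result is a hypothetical helper, and the tolerance of 1e-6 is an assumption that would need loosening if a tool rounds sr or mean_return.

```python
import math
from typing import Any, List, Mapping


def check_task_result(result: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems: List[str] = []
    n = result["n_episodes"]
    successes = result["successes"]
    returns = result["returns"]

    if len(successes) != n:
        problems.append(f"successes has {len(successes)} entries, expected {n}")
    if len(returns) != n:
        problems.append(f"returns has {len(returns)} entries, expected {n}")
    if successes and not math.isclose(
        result["sr"], sum(successes) / len(successes), abs_tol=1e-6
    ):
        problems.append("sr is not the mean of successes")
    if returns and not math.isclose(
        result["mean_return"], sum(returns) / len(returns), abs_tol=1e-6
    ):
        problems.append("mean_return is not the mean of returns")

    # episode_seeds may be omitted from hand-built files; check only if present.
    if "episode_seeds" in result:
        expected = [result["start_seed"] + i for i in range(n)]
        if list(result["episode_seeds"]) != expected:
            problems.append("episode_seeds is not start_seed + i")
    return problems


# Toy result with 4 episodes (a canonical file has 50).
example = {
    "n_episodes": 4,
    "start_seed": 4242424242,
    "successes": [True, False, True, True],
    "returns": [1.0, 0.0, 2.0, 1.0],
    "sr": 0.75,
    "mean_return": 1.0,
    "episode_seeds": [4242424242, 4242424243, 4242424244, 4242424245],
}
problems = check_task_result(example)
```

To check a file on disk, load it with json.load and pass the resulting dict to the same function.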

Summary file summary.json#

One file per run, updated after every task completes:

{
  "split": "Short",
  "sr_split": 0.61,
  "sr_per_memory_type": {
    "Negative": 0.42,
    "Object": 0.55,
    "Prospective": 0.30,
    "Spatial": 0.72,
    "Temporal": 0.50,
    "Tracking": 0.40
  },
  "tasks": ["RememberColor3-VLA-v0", "..."],
  "per_task_sr": {"RememberColor3-VLA-v0": 0.74, "...": 0.0},
  "per_task_mean_return": {"RememberColor3-VLA-v0": 9.8, "...": 0.0}
}

summary.json is rewritten on disk after every completed task, so a run that is interrupted mid-way leaves a valid partial summary.

Reference Evaluation Script#

examples/eval_demo.py is a working reference built around mikasa_robo_suite.vla.benchmarking. It uses DummyChunkPolicy (random actions) so you can run it end-to-end before plugging in a real model:

examples/eval_demo.py#
"""Run a MIKASA-Robo-VLA benchmark evaluation.

Replace ``DummyChunkPolicy`` with your own policy adapter.  The benchmark
runner handles CSV split loading, canonical per-task seeds, action chunk
queues, ``success_once`` latching, JSON files, and split aggregation.

Results are saved to a timestamped subdirectory under ``--output-dir`` so
successive runs never overwrite each other.

Examples (run from the repository root)
----------------------------------------
Smoke-test one task, one episode::

    uv run python examples/eval_demo.py \
        --num-episodes 1 --sim-backend cpu \
        --output-dir eval_results/dummy

Full canonical Short-split run, 50 episodes per task::

    uv run python examples/eval_demo.py \
        --split short \
        --output-dir results/my_model

Specific tasks::

    uv run python examples/eval_demo.py \
        --task RememberColor3-VLA-v0 \
        --task ShellGameTouch-VLA-v0 \
        --output-dir results/my_model

All 90 benchmark tasks::

    uv run python examples/eval_demo.py \
        --split all \
        --output-dir results/my_model

With per-episode rollout videos::

    uv run python examples/eval_demo.py \
        --split short \
        --save-videos \
        --output-dir results/my_model
"""

from __future__ import annotations

import warnings
warnings.filterwarnings("ignore")

import argparse
import json
from pathlib import Path
from typing import List, Mapping, Optional, Sequence, Tuple

import torch

from mikasa_robo_suite.vla.benchmarking import (
    NUM_EPISODES_PER_TASK,
    START_SEED,
    BenchmarkConfig,
    JsonDict,
    RichBenchmarkUI,
    evaluate_benchmark,
    make_run_dir,
    select_benchmark_tasks,
)


class DummyChunkPolicy:
    """Return random action chunks in the canonical 7D EE-delta action space."""

    def __init__(self, chunk_size: int = 8, action_dim: int = 7):
        if chunk_size <= 0:
            raise ValueError(f"chunk_size must be > 0, got {chunk_size}")
        self.chunk_size = int(chunk_size)
        self.action_dim = int(action_dim)

    @torch.no_grad()
    def forward(self, obs: Mapping[str, object]) -> torch.Tensor:
        proprio = obs.get("proprio")
        device = proprio.device if torch.is_tensor(proprio) else torch.device("cpu")
        return torch.empty(
            (self.chunk_size, self.action_dim),
            device=device,
            dtype=torch.float32,
        ).uniform_(-1.0, 1.0)


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )

    selection = parser.add_mutually_exclusive_group()
    selection.add_argument(
        "--split",
        choices=("short", "medium", "long", "all"),
        help=(
            "Evaluate every task in the given horizon split, "
            "or 'all' for all 90 benchmark tasks."
        ),
    )
    selection.add_argument(
        "--task",
        action="append",
        dest="tasks",
        metavar="ENV_ID",
        help="Evaluate one env ID. Repeat to build an arbitrary subset.",
    )

    parser.add_argument("--num-episodes", type=int, default=NUM_EPISODES_PER_TASK)
    parser.add_argument("--start-seed", type=int, default=START_SEED)
    parser.add_argument("--chunk-size", type=int, default=8)
    parser.add_argument(
        "--output-dir",
        type=Path,
        default=Path("eval_results") / "dummy",
        help="Base directory. Results go into a timestamped subdirectory.",
    )
    parser.add_argument(
        "--resume",
        type=Path,
        default=None,
        metavar="RUN_DIR",
        help=(
            "Resume an interrupted run. Pass the timestamped run directory "
            "(e.g. eval_results/dummy/short/2026-05-21_18-52-16). "
            "Tasks that already have a result JSON there are skipped. "
            "The split is inferred from existing results; override with --split."
        ),
    )
    parser.add_argument(
        "--save-videos",
        action="store_true",
        help=(
            "Save a rollout video for every episode of every task. "
            "Videos are written to <output-dir>/<timestamp>/videos/<env_id>/."
        ),
    )
    parser.add_argument(
        "--sim-backend",
        default="gpu",
        help="ManiSkill sim backend ('cpu' or 'gpu'). Default: gpu.",
    )
    parser.add_argument("--render-mode", default="all")
    return parser.parse_args()


def _split_label(tasks: Sequence) -> str:
    splits = {task.split for task in tasks}
    return next(iter(splits)) if len(splits) == 1 else "custom"


def _load_resume_state(resume_dir: Path) -> Tuple[List[JsonDict], Optional[str]]:
    """Load completed task results from *resume_dir* and infer the benchmark split.

    Returns ``(completed_results, inferred_split)``.  *inferred_split* is one of
    ``"short"``, ``"medium"``, ``"long"``, ``"all"``, or ``None`` if it cannot
    be determined (user must pass ``--split`` explicitly in that case).
    """
    if not resume_dir.is_dir():
        raise FileNotFoundError(f"--resume directory not found: {resume_dir}")

    completed: List[JsonDict] = []
    for p in sorted(resume_dir.glob("*.json")):
        if p.stem == "summary":
            continue
        with p.open(encoding="utf-8") as f:
            completed.append(json.load(f))

    # Infer split from completed results: if all tasks share one split → that
    # split; if multiple splits are present → "all" (full benchmark run).
    splits = {r.get("split", "").lower() for r in completed}
    splits.discard("")
    if not splits:
        inferred_split: Optional[str] = None
    elif len(splits) == 1:
        inferred_split = next(iter(splits))
    else:
        inferred_split = "all"

    return completed, inferred_split


def main() -> None:
    args = parse_args()

    completed_results: List[JsonDict] = []

    if args.resume is not None:
        # ------------------------------------------------------------------ #
        # Resume mode: load finished tasks, determine remaining work          #
        # ------------------------------------------------------------------ #
        completed_results, inferred_split = _load_resume_state(args.resume)
        completed_ids = {r["env_id"] for r in completed_results}

        # Task selection: explicit flag wins; fall back to inferred split.
        if args.split:
            all_tasks = select_benchmark_tasks(split=args.split)
        elif args.tasks:
            all_tasks = select_benchmark_tasks(env_ids=args.tasks)
        elif inferred_split:
            all_tasks = select_benchmark_tasks(split=inferred_split)
        else:
            raise SystemExit(
                "Cannot infer the task split from the resume directory. "
                "Pass --split or --task explicitly."
            )

        tasks = [t for t in all_tasks if t.env_id not in completed_ids]
        run_dir = args.resume  # write results back into the same directory
    else:
        # ------------------------------------------------------------------ #
        # Fresh run                                                            #
        # ------------------------------------------------------------------ #
        if args.split:
            tasks = select_benchmark_tasks(split=args.split)
        elif args.tasks:
            tasks = select_benchmark_tasks(env_ids=args.tasks)
        else:
            tasks = select_benchmark_tasks(env_ids=["RememberColor3-VLA-v0"])

        run_dir = make_run_dir(args.output_dir / _split_label(tasks))

    config = BenchmarkConfig(
        start_seed=args.start_seed,
        n_episodes=args.num_episodes,
        sim_backend=args.sim_backend,
        save_videos=args.save_videos,
    )
    policy = DummyChunkPolicy(chunk_size=args.chunk_size)

    with RichBenchmarkUI(tasks, config.n_episodes, initial_results=completed_results or None) as ui:
        _, summary = evaluate_benchmark(
            tasks,
            policy,
            config,
            output_dir=run_dir,
            model={"name": "dummy-random-chunk-policy", "config": {"chunk_size": policy.chunk_size}},
            task_start_callback=ui.on_task_start,
            episode_callback=ui.on_episode_done,
            task_done_callback=ui.on_task_done,
            initial_results=completed_results if completed_results else None,
        )

    from rich.console import Console
    from rich.table import Table

    console = Console()
    console.print()

    summary_table = Table(
        title="[bold]Benchmark Summary[/bold]",
        show_header=True,
        header_style="bold cyan",
        border_style="blue",
    )
    summary_table.add_column("Memory Type", style="white")
    summary_table.add_column("SR", justify="right")

    for memory_type, sr in summary["sr_per_memory_type"].items():
        color = "bright_green" if sr >= 0.7 else "yellow" if sr >= 0.4 else "red"
        summary_table.add_row(memory_type, f"[{color}]{sr:.2%}[/{color}]")

    summary_table.add_section()
    sr_split = summary["sr_split"]
    split_color = "bright_green" if sr_split >= 0.7 else "yellow" if sr_split >= 0.4 else "red"
    summary_table.add_row(
        "[bold]Overall SR[/bold]",
        f"[bold {split_color}]{sr_split:.2%}[/bold {split_color}]",
    )

    console.print(summary_table)
    console.print(f"[dim]JSON results → {run_dir}[/dim]")
    if args.save_videos:
        console.print(f"[dim]Videos      → {run_dir / 'videos'}[/dim]")


if __name__ == "__main__":
    main()

Run from the repository root:

# One-task smoke-test, one episode, CPU backend
uv run python examples/eval_demo.py \
    --num-episodes 1 --sim-backend cpu \
    --output-dir eval_results/dummy

# Canonical Short-split run
uv run python examples/eval_demo.py \
    --split short \
    --output-dir results/my_model

# With per-episode rollout videos
uv run python examples/eval_demo.py \
    --split short \
    --save-videos \
    --output-dir results/my_model

See Benchmarking for the full CLI reference and Python API.

What to Report in Papers#

Reproducibility checklist:

  • [ ] Benchmark git commit.

  • [ ] Split evaluated (Short / Medium / Long, or explicit subset + label).

  • [ ] Start seed and episodes per task (default: start seed 4242424242, 50 episodes per task).

  • [ ] GPU model and count.

  • [ ] Training dataset (Hugging Face repo).

  • [ ] Training compute (GPU hours or gradient steps) and config (LoRA rank, etc.).

  • [ ] Action chunk size K.

  • [ ] Required: per-split SR, full per-task SR table, per-memory-type SR.

  • [ ] Mean return per task (optional debug metric).

  • [ ] Whether oracle_info or task_cue were passed to the policy. Both are RL-side fields (task_cue only exists for Rotate* and is already encoded in the language instruction); reading either disqualifies the run from the canonical leaderboard.

Cross-Split and Partial Evaluation#

Evaluating on a different split than the training split, or on a custom subset, is allowed but must be reported separately with an explicit label such as “trained on Short, evaluated on Medium”. Do not mix such results into the same SR table as canonical runs.