Evaluation Protocol =================== This page is the canonical specification for evaluating a trained VLA model on MIKASA-Robo-VLA. For how to run evaluations in practice — CLI flags, output layout, Python API — see :doc:`benchmarking`. .. contents:: On this page :depth: 2 :local: Protocol at a Glance -------------------- .. list-table:: :header-rows: 1 :widths: 38 62 * - Parameter - Canonical value * - Split - Chosen explicitly by the user (``short`` / ``medium`` / ``long``). No auto-detection from a checkpoint. * - Episodes per task - **50** * - Seeds - ``4242424242, 4242424243, …, 4242424291`` (same stream for every task) * - Parallel envs - ``num_envs=1`` (default); ``num_envs > 1`` allowed as opt-in speedup * - ``obs_mode`` - ``"rgb"`` * - ``control_mode`` - ``"pd_ee_delta_pose"`` * - ``reward_mode`` - ``"normalized_dense"`` * - Wrapper stack - ``apply_mikasa_vla_wrappers(env, include_overlays=False)`` * - Main metric - ``success_once`` per episode * - Debug metric - ``mean_return`` per task (not aggregated to split level) Observation and Action Format ------------------------------ After ``gym.make`` + ``apply_mikasa_vla_wrappers(env, include_overlays=False)``, ``obs`` always contains ``obs["rgb"]`` (``(B, 128, 128, 6)`` uint8 — two cameras concatenated on the channel axis) and ``obs["proprio"]`` (``(B, 7)`` float32 — absolute EEF pose + gripper opening). The *Rotate\** family additionally exposes ``obs["task_cue"]`` (target angle for the RL oracle); for all other tasks this key is dropped by the canonical helper. ``oracle_info`` is never exposed in the VLA-facing observation. The action is a ``(B, 7)`` ``float32`` tensor in the normalised ``pd_ee_delta_pose`` action space; all seven components are clamped to ``[-1, 1]``. The canonical, field-by-field specification — value ranges, units, the exact gripper conventions, and the difference between ``obs["proprio"][..., 6]`` and ``action[6]`` — lives in :doc:`observation_space`. All evaluation pipelines (online runs, dataset collection, published datasets) use exactly the same layout, so the :doc:`observation_space` reference is authoritative for every context. .. important:: A canonical leaderboard policy consumes only the camera images, ``obs["proprio"]``, and ``info["language_instruction"]``. ``oracle_info`` is privileged and stripped from the VLA-facing observation. ``task_cue`` exists solely as an RL-training signal for the *Rotate\** family — the same target angle is also embedded in the natural-language instruction, so VLAs should ignore the ``task_cue`` key entirely. Reading either field disqualifies a run from the canonical leaderboard. Action chunking ~~~~~~~~~~~~~~~ VLA policies typically predict a chunk of ``K`` actions per forward pass. The benchmark runner consumes them with a ``collections.deque`` (FIFO): 1. At the start of every episode a **fresh, empty** ``deque`` is created. 2. On each simulator step, if the deque is empty, ``policy.forward(obs)`` is called to produce ``(K, action_dim)`` actions which are pushed onto the deque with ``extend``. 3. One action is popped from the front with ``popleft`` and passed to ``env.step(action)``. For ``K=1`` this is identical to standard per-step inference: the deque is refilled on every step. See ``evaluate_task`` in ``mikasa_robo_suite/vla/benchmarking.py`` for the exact implementation. Metrics ------- ``success_once`` (main metric) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``success_once`` is a **boolean latch**, not a count. At each simulator step the runner performs:: success_once = success_once or bool(info["success"]) # OR-latch so the value flips from ``False`` to ``True`` the *first* time the task succeeds and stays ``True`` for the remainder of the episode. The canonical per-episode metric is the final value of this latch:: success_once = any(info["success"] is True over all simulator steps) This unifies tasks that succeed mid-episode (e.g. *ShellGameTouch*) and tasks that only emit success on the final step (e.g. *TimedTransfer*) under a single binary outcome. See ``mikasa_robo_suite/vla/benchmarking.py`` (``evaluate_task``) for the exact code path. ``return`` (debug only) ~~~~~~~~~~~~~~~~~~~~~~~~ :: return = sum(reward) over all steps of the episode Use this to inspect reward shaping or training quality. Do not aggregate it to split level or use it as a comparison number. Reported SR levels ~~~~~~~~~~~~~~~~~~ Three numbers are required per evaluation run: .. list-table:: :header-rows: 1 :widths: 32 68 * - Metric - Definition * - **Per-task SR** - ``SR_task = mean(success_once)`` over the 50 episodes. * - **Per-split SR** *(main)* - ``SR_split = mean(SR_task for all tasks in the split)``. * - **Per-memory-type SR** - For each memory type present in the split: ``SR_mem = mean(SR_task for tasks with that memory_type)``. Seeding ------- For **every** task independently, run 50 episodes with seeds:: env.reset(seed=4242424242 + i) for i in 0, 1, …, 49 The starting seed ``4242424242`` matches the one used by muVLA when it was evaluated on MIKASA-Robo, so results are directly comparable to that baseline. .. note:: Seeds are the same across tasks — each task independently starts from ``4242424242``. This gives full determinism, but it means the per-episode initial states are **correlated across tasks**. Two tasks evaluated under this protocol are not statistically independent. Account for this when computing confidence intervals or aggregate statistics. ``num_envs`` ------------ The canonical evaluation uses ``num_envs=1``. The episode-to-seed mapping is then trivial: episode ``i`` is the state produced by ``env.reset(seed=4242424242 + i)``. For wall-clock speedups (especially for the Long split) you may use ``num_envs > 1`` as an opt-in optimisation, but each parallel env must be seeded so that the set of 50 ``(env_id, episode_idx)`` pairs is identical to the ``num_envs=1`` run. The reported SR must match. Wrapper Stack ------------- Always construct the env with:: env = gym.make( env_id, num_envs=1, obs_mode="rgb", control_mode="pd_ee_delta_pose", reward_mode="normalized_dense", render_mode="all", ) env = apply_mikasa_vla_wrappers(env, include_overlays=False) ``include_overlays=False`` strips debug/render overlays so they cannot affect timing or bleed into recorded observations. The benchmark runner (``evaluate_benchmark``) does this automatically. JSON Output Schema ------------------ Per-task file ``.json`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ One file per evaluated env ID: .. code-block:: json { "env_id": "RememberColor3-VLA-v0", "split": "Short", "memory_type": "Object", "start_seed": 4242424242, "n_episodes": 50, "successes": [true, false, true, "..."], "returns": [12.4, 0.0, 15.7, "..."], "sr": 0.74, "mean_return": 9.8, "benchmark_commit": "", "control_mode": "pd_ee_delta_pose", "obs_mode": "rgb", "wrapper_chain": "apply_mikasa_vla_wrappers(include_overlays=False)", "action_chunk_size": 8, "model": {"name": "my-vla", "config": {}}, "episode_lengths": [30, 28, 30, "..."], "episode_seeds": [4242424242, 4242424243, "..."] } ``successes`` and ``returns`` have length ``n_episodes``. ``sr`` and ``mean_return`` are their means. ``episode_lengths`` (the number of simulator steps in each episode before termination) and ``episode_seeds`` (``start_seed + i`` for ``i`` in ``0…n_episodes-1``) are always written by ``evaluate_benchmark`` and may be safely consumed by downstream tools, but they are not required for the canonical SR comparison and may be omitted from manually constructed result files. Summary file ``summary.json`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ One file per run, updated after every task completes: .. code-block:: json { "split": "Short", "sr_split": 0.61, "sr_per_memory_type": { "Negative": 0.42, "Object": 0.55, "Prospective": 0.30, "Spatial": 0.72, "Temporal": 0.50, "Tracking": 0.40 }, "tasks": ["RememberColor3-VLA-v0", "..."], "per_task_sr": {"RememberColor3-VLA-v0": 0.74, "...": 0.0}, "per_task_mean_return": {"RememberColor3-VLA-v0": 9.8, "...": 0.0} } ``summary.json`` is rewritten on disk after every completed task, so a run that is interrupted mid-way leaves a valid partial summary. Reference Evaluation Script ---------------------------- ``examples/eval_demo.py`` is a working reference built around ``mikasa_robo_suite.vla.benchmarking``. It uses ``DummyChunkPolicy`` (random actions) so you can run it end-to-end before plugging in a real model: .. literalinclude:: ../../examples/eval_demo.py :language: python :caption: examples/eval_demo.py Run from the repository root:: # One-task smoke-test, one episode, CPU backend uv run python examples/eval_demo.py \ --num-episodes 1 --sim-backend cpu \ --output-dir eval_results/dummy # Canonical Short-split run uv run python examples/eval_demo.py \ --split short \ --output-dir results/my_model # With per-episode rollout videos uv run python examples/eval_demo.py \ --split short \ --save-videos \ --output-dir results/my_model See :doc:`benchmarking` for the full CLI reference and Python API. What to Report in Papers ------------------------ Reproducibility checklist: - [ ] Benchmark git commit. - [ ] Split evaluated (Short / Medium / Long, or explicit subset + label). - [ ] Start seed and episodes per task (default: ``4242424242`` × 50). - [ ] GPU model and count. - [ ] Training dataset (Hugging Face repo). - [ ] Training compute (GPU hours or gradient steps) and config (LoRA rank, etc.). - [ ] Action chunk size ``K``. - [ ] **Required**: per-split SR, full per-task SR table, per-memory-type SR. - [ ] Mean return per task (optional debug metric). - [ ] Whether ``oracle_info`` or ``task_cue`` were passed to the policy. Both are RL-side fields (``task_cue`` only exists for *Rotate\** and is already encoded in the language instruction); reading either disqualifies the run from the canonical leaderboard. Cross-Split and Partial Evaluation ----------------------------------- Evaluating on a different split than the training split, or on a custom subset, is allowed but must be reported separately with an explicit label such as *"trained on Short, evaluated on Medium"*. Do not mix such results into the same SR table as canonical runs.