Core Concepts#
This page explains the key ideas behind MIKASA-Robo-VLA: what makes the tasks memory-intensive, how episodes are structured, and how the benchmark relates to the original RL release.
What Is MIKASA-Robo-VLA?#
MIKASA-Robo-VLA is a benchmark suite of 90 tabletop manipulation tasks for evaluating Vision-Language-Action (VLA) models under partial observability. In every task, the agent must retain, update, or continuously track information across time in order to act correctly. The required memory may involve the colour of a previously observed object, the order of targets in a sequence, the location of a hidden item after shuffling, the number of events accumulated so far, or other temporally distributed cues that are no longer directly observable at decision time.
The benchmark extends the original MIKASA-Robo RL release (ICLR 2026) in many ways:
Task count grows from 32 to 90 tasks, covering a wider range of memory types (10 vs 4), horizon lengths (25 – 2160 steps), and difficulty levels.
New memory types: Temporal, Prospective, Tracking, Checklist, Negative, and Procedural — not present in the original RL release.
Language instructions: every task ships with a natural-language
LANGUAGE_INSTRUCTION, enabling VLA models to condition on text.Calibrated dense rewards: all environments in
mikasa_robo_suite/vla/have carefully tuned dense and normalised-dense reward functions, making them suitable not only for offline VLA evaluation but also for online RL training and reward-signal research.Published datasets: 22,500 trajectories (>6 M timesteps) in RLDS and LeRobotDataset v3 formats on Hugging Face, ready for imitation learning without any further conversion.
Important
The earlier RL benchmark is available from the
mikasa-robo-rl branch.
Its implementation lives under mikasa_robo_suite/rl/ and is kept for
backwards compatibility. All active development and new environments target
mikasa_robo_suite/vla/.
Episode Structure#
Most tasks in the benchmark follow a three-phase structure:
┌──────────────┬──────────────────────┬──────────────────┐
│ Cue phase │ Memory phase │ Action phase │
│ (observe) │ (retain / track) │ (act on memory) │
└──────────────┴──────────────────────┴──────────────────┘
- Cue phase
The environment presents the information the agent must remember — a coloured light, a sequence of objects, a count of blinks. The agent typically executes a no-op action during this phase.
- Memory phase
The cue disappears or the scene changes. The agent must retain the relevant information internally. This is the phase that separates memory-capable agents from reactive ones.
- Action phase
The agent uses the memorised information to complete the task — touching the correct cup, pressing the right button, returning an object to its original position.
Note
Not all tasks follow this exact three-phase layout. Some tasks require continuous memory or tracking across the entire episode:
Tracking tasks (ShellGameShuffle): the agent must maintain an up-to-date estimate of a hidden object’s location as objects are shuffled throughout the episode.
Procedural tasks (TraceShape, TraceShapeSeq): the agent must recall a shape or sequence and re-execute it step-by-step with fine motor control.
Prospective tasks (GatherAndRecall): the agent observes a future goal, completes an intermediate task, then returns to fulfil the original intention.
Temporal tasks (BlinkCountButtonPress): information accumulates incrementally over time — the agent must integrate observations across steps rather than memorise a single cue.
The length of each phase, and therefore the horizon split of the task, is set by class-level constants in the environment implementation.
Memory Types#
Tasks are grouped by the type of memory they exercise. The full per-task
breakdown is in mikasa_robo_vla_envs.csv.
Memory Type |
# Tasks |
What the agent must remember |
|---|---|---|
Object |
18 |
Identity, colour, or shape of a specific object shown during the cue phase (e.g. RememberColor, FindImposter). |
Spatial |
14 |
Location of an object that is then hidden from view (e.g. ShellGameTouch, ShellGamePush). |
Capacity |
12 |
An unordered set of items or a count (e.g. BunchOfColors, BatteriesChecker). |
Temporal |
12 |
A signal that accumulates over time, such as a blink count or a timed cue (e.g. BlinkCountButtonPress, TimedTransfer). |
Negative |
9 |
What the agent must not do — identify the odd-one-out (e.g. FindImposterColor, FindImposterShape). |
Sequential |
6 |
An ordered sequence of targets (e.g. SeqOfColors, ChainOfColors). |
Procedural |
6 |
A motor procedure that must be recalled and re-executed (e.g. TraceShape, TraceShapeSeq). |
Prospective |
5 |
A future intention — the agent sees a goal early in the episode, completes an intermediate task, then returns to fulfil the goal (e.g. GatherAndRecall). |
Tracking |
4 |
Multiple objects that are continuously shuffled or moved while hidden; the agent must maintain a dynamic estimate of their positions throughout the episode (e.g. ShellGameShuffle). |
Checklist |
4 |
A set of conditions all of which must be satisfied in any order (e.g. BatteriesCheckerHard). |
Horizon Splits#
Training a VLA on all 90 tasks simultaneously is difficult because episode lengths range from 25 to 2160 steps. MIKASA-Robo-VLA therefore defines three horizon splits for reproducible multi-task evaluation:
Split |
Tasks |
Horizon range |
Typical memory demand |
|---|---|---|---|
Short |
38 |
25 – 200 steps |
Rapid cue encoding and short-term recall. |
Medium |
30 |
201 – 601 steps |
Sustained working memory over moderately long episodes. |
Long |
22 |
602 – 2160 steps |
Extended memory, multi-phase reasoning, procedural recall. |
See Benchmarking for the canonical evaluation protocol that uses these splits.
Observation Modes and the task_cue / oracle_info Fields#
VLA environments expose two observation modes:
obs_mode="state"Privileged simulator state as a flat tensor.
Warning
This mode is not used for VLA training or benchmarking. It is intended solely for PPO oracle training, reward calibration, and generating ground-truth labels. Always use
obs_mode="rgb"for VLA evaluation.obs_mode="rgb"RGB images from the top-down and wrist-mounted cameras, plus 7D proprioception
obs["proprio"]. This is the standard mode for VLA training and evaluation. Afterapply_mikasa_vla_wrappers()the proprioception vector is the absolute EEF pose plus gripper opening:obs["proprio"] = [eef_x, eef_y, eef_z, eef_roll, eef_pitch, eef_yaw, gripper_opening]
and the action accepted by
env.step(action)is a relative delta plus gripper position command (all values in[-1, 1]):action = [delta_eef_x, delta_eef_y, delta_eef_z, delta_eef_roll, delta_eef_pitch, delta_eef_yaw, gripper_command]
See Observation and Action Space for the complete field-by-field reference with units and ranges.
Both modes are extended by
StateOnlyTensorToDictWrapper
with two extra fields:
``task_cue`` — a small numerical tensor exposed only by the Rotate* family of environments. A single RGB frame cannot convey the exact target rotation angle, so RL agents trained on those tasks need the angle (in degrees) as an explicit observation channel. The cue is consumed by the PPO oracles during dataset collection; it is not required by VLA policies, because the same information is embedded inside every task’s
language_instruction(e.g. “Rotate the peg by 30 degrees clockwise”). For all non-Rotate tasksStateOnlyTensorToDictWrapperreturns the sentinel value4242424242, and the canonicalapply_mikasa_vla_wrappers()helper drops the key from the VLA-facing observation entirely.``oracle_info`` — additional privileged information for evaluation or debugging (e.g. ground-truth object position). Always
4242424242when not available, and stripped from the VLA-facing observation by the canonical wrapper.
Tip
task_cue vs language_instruction. language_instruction is a
human-readable text string (e.g. “Observe the cube’s colour, wait, then
touch the cube of the same colour”) and is the only task-conditioning
channel a VLA model needs. task_cue is a numeric encoding of the
same information that exists solely for RL baselines on Rotate* —
VLA pipelines should ignore it.
The canonical wrapped observation therefore contains only rgb and
proprio for almost all tasks, plus task_cue for Rotate*.
See Observation and Action Space for the field-by-field layout and
Wrappers Cookbook for how the chain composes.
Dense Reward Functions#
A key property of MIKASA-Robo-VLA is that all 90 environments include calibrated dense reward functions. Unlike sparse rewards that provide a signal only on task completion, the dense rewards break down each task into interpretable sub-goals (approach, grasp, reach, release, …) and assign incremental reward at each sub-step.
This has two practical consequences:
PPO oracle training converges reliably even for tasks with long horizons, making it possible to generate high-quality demonstration datasets.
Researchers who want to train RL agents online — rather than fine-tuning a VLA offline — can do so with the same environments without any reward engineering.
Use reward_mode="normalized_dense" (values in [0, 1]) for RL training, or
reward_mode="sparse" if you only care about task success during evaluation.
See Observation and Action Space for the full reward-mode reference.
Dataset Formats#
Trajectories collected with the PPO oracle or motion-planning scripts are stored in three formats (in order of the collection pipeline):
NPZ (raw episodes)
└─► RLDS / TensorFlow Datasets (episodic, Open-X style)
└─► LeRobotDataset v3 (Parquet + MP4, modern imitation learning)
NPZ is the internal collection format. RLDS and LeRobotDataset v3 are the formats published on Hugging Face and recommended for VLA training. See Datasets for download links and conversion commands.