Quick Start#

The examples below assume the project is installed with uv sync --frozen from the repository root.

Minimal VLA Run#

Important

Every MIKASA-Robo-VLA environment must be wrapped with apply_mikasa_vla_wrappers() immediately after gym.make and before the first env.reset(). This wrapper is required for correct benchmark behavior: it applies the task-specific VLA logic and makes the environment inputs and outputs match the format used in the released datasets. The helper selects the correct wrapper chain for any of the 90 benchmark tasks, so it should always be used instead of manually composing wrappers.

import gymnasium as gym
import torch

import mikasa_robo_suite.vla.memory_envs  # registers all VLA env IDs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env = gym.make(
    "RememberColor3-VLA-v0",
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
    sim_backend="gpu",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=False)

obs, info = env.reset(seed=42)

for _ in range(env.max_episode_steps):
    action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
    obs, reward, terminated, truncated, info = env.step(action)
    if torch.as_tensor(terminated | truncated).any():
        break

env.close()

include_overlays=False omits debug text overlays from rendered frames (if render).

include_overlays=True when generating human-watchable videos.

Listing available tasks#

The canonical task list is in mikasa_robo_vla_envs.csv:

import csv

with open("mikasa_robo_vla_envs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter=";"):
        print(row["Name"], row["Horizon Split"], row["Max Length"])

gym.make also accepts any env ID registered in mikasa_robo_suite/vla/memory_envs/, including development variants not in the CSV.

What obs contains#

After apply_mikasa_vla_wrappers(env), obs is a Python dict with the following keys. B is num_envs.

obs["rgb"]            # shape (B, 128, 128, 6), uint8
obs["proprio"]        # shape (B, 7),           float32
obs.get("task_cue")   # Rotate* only: target angle for the RL oracle; VLAs ignore

RGB images#

obs["rgb"] concatenates two cameras along the channel axis:

top_rgb   = obs["rgb"][..., :3]   # base_camera,  top-down view
wrist_rgb = obs["rgb"][..., 3:6]  # hand_camera,  wrist-mounted view

Both cameras produce 128 × 128 images by default.

Proprioception#

obs["proprio"] is the 7D absolute end-effector state. It is not joint angles and it is not action deltas:

obs["proprio"] = [
    eef_x,           # [0]  TCP position x, metres, not normalised
    eef_y,           # [1]  TCP position y, metres, not normalised
    eef_z,           # [2]  TCP position z, metres, not normalised
    eef_roll,        # [3]  TCP roll,  radians, [-pi,   pi]
    eef_pitch,       # [4]  TCP pitch, radians, [-pi/2, pi/2]
    eef_yaw,         # [5]  TCP yaw,   radians, [-pi,   pi]
    gripper_opening, # [6]  sum of two Panda finger qpos, metres, [0.0, 0.08]
]

gripper_opening is a single scalar. The Panda gripper has two symmetric fingers; each finger’s qpos range is [0.0, 0.04] m, so the sum is [0.0, 0.08] m. A value of 0.0 means fully closed; 0.08 means fully open. The value is not normalised.

task_cue and language_instruction#

obs["task_cue"] exists only for the Rotate* tasks, where the target rotation angle (in degrees) cannot be inferred from a single RGB frame. It is provided so PPO RL oracles can be trained from images; the canonical VLA helper drops the key for every other task family.

For VLA policies the cue is redundant: the same numeric value is already embedded in info["language_instruction"] (e.g. “Rotate the peg by 30 degrees clockwise”). VLAs should therefore consume only the language instruction and ignore obs["task_cue"] even when it is present:

# VLA-style: ignore task_cue, use language_instruction
text = info["language_instruction"]

# RL-style: read task_cue directly (Rotate* only)
if "task_cue" in obs:
    target_angle = obs["task_cue"]

The canonical helper does not expose oracle_info to the VLA policy observation. That privileged field is only available in lower-level or manual debug wrapper stacks.

Language instruction#

info["language_instruction"] is a Python str describing the task goal, e.g. “Observe the cube’s colour, wait, then touch the matching cube”. It is returned by both env.reset() and env.step().

Action format#

Pass a 7D float32 tensor to env.step(action). Single-env shape: (7,). Batched shape: (B, 7).

All seven values are in the normalised range [-1, 1]:

action = [
    delta_eef_x,      # [0]  EEF translation delta, normalised; Panda maps to [-0.1, 0.1] m
    delta_eef_y,      # [1]
    delta_eef_z,      # [2]
    delta_eef_roll,   # [3]  EEF rotation delta, normalised; magnitude capped at 0.1 rad
    delta_eef_pitch,  # [4]
    delta_eef_yaw,    # [5]
    gripper_command,  # [6]  normalised position target: -1 → close, +1 → open
]

action[0:3] are relative translation commands, not absolute positions. action[3:6] are relative rotation commands, not absolute orientations. action[6] (gripper_command) is a position target sent to the Panda gripper joint controller — it is not a delta of obs["proprio"][..., 6] (gripper_opening). Sending -1 drives the gripper toward the closed target; sending +1 drives it toward the open target.

The dataset action field stores these same values; see Datasets for the full training signal reference.

gym.make parameters#

Parameter

Canonical value

Notes

obs_mode

"rgb"

Use "state" only for PPO oracle training and reward debugging. Never use "state" for VLA training or benchmark evaluation.

control_mode

"pd_ee_delta_pose"

Required for all VLA training, evaluation, and dataset collection. Using a different control mode changes action semantics and breaks dataset compatibility.

reward_mode

"normalized_dense"

Values in [0, 1]. Use for RL training. Use "sparse" (0 or 1) for success-rate evaluation. "dense" is unscaled and varies per task.

num_envs

1 for eval, larger for collection

GPU-parallelised. All returned tensors have a leading batch dim B.

render_mode

"all" for video; omit for headless

Required for RecordEpisode and env.render().

sim_backend

"gpu"

Use "cpu" only when no GPU is available. CPU simulation is significantly slower and is not recommended for benchmark use, as the CPU and GPU physics backends are not perfectly identical. For reproducible evaluation and dataset-consistent behavior, the GPU backend should be used whenever possible.

Video Recording#

Wrap with RecordEpisode after apply_mikasa_vla_wrappers:

import gymnasium as gym
import torch
from mani_skill.utils.wrappers import RecordEpisode

import mikasa_robo_suite.vla.memory_envs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env_name = "RememberColor3-VLA-v0"

env = gym.make(
    env_name,
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
    sim_backend="gpu",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=True)
env = RecordEpisode(
    env,
    output_dir=f"./videos/{env_name}",
    save_trajectory=False,
    max_steps_per_video=env.max_episode_steps,
)

obs, info = env.reset(seed=42)
for _ in range(env.max_episode_steps):
    action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
    obs, reward, terminated, truncated, info = env.step(action)

env.close()
# video written to ./videos/RememberColor3-VLA-v0/0.mp4

include_overlays=True renders task-state overlays (step counter, reward, task-specific debug info) on top of the video frames.

Benchmark Demo GIF/MP4#

uv run python utils/prepare_benchmark_demo_videos.py \
  --tasks RememberColor3-VLA-v0 \
  --output-dir videos/benchmark_demos \
  --max-attempts-per-task 8 \
  --overwrite