Quick Start
The examples below assume the project is installed with uv sync --frozen
from the repository root.
Minimal VLA Run
Important
Every MIKASA-Robo-VLA environment must be wrapped with
apply_mikasa_vla_wrappers()
immediately after gym.make and before the first env.reset(). This wrapper
is required for correct benchmark behavior: it applies the task-specific VLA
logic and makes the environment inputs and outputs match the format used in the
released datasets. The helper selects the correct wrapper chain for any of the
90 benchmark tasks, so it should always be used instead of manually composing
wrappers.
import gymnasium as gym
import torch
import mikasa_robo_suite.vla.memory_envs # registers all VLA env IDs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env = gym.make(
    "RememberColor3-VLA-v0",
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
    sim_backend="gpu",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=False)
obs, info = env.reset(seed=42)
for _ in range(env.max_episode_steps):
    action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
    obs, reward, terminated, truncated, info = env.step(action)
    if torch.as_tensor(terminated | truncated).any():
        break
env.close()
include_overlays=False omits debug text overlays from rendered frames (when rendering is enabled).
Set include_overlays=True when generating human-watchable videos.
Listing available tasks
The canonical task list is in mikasa_robo_vla_envs.csv:
import csv
with open("mikasa_robo_vla_envs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter=";"):
        print(row["Name"], row["Horizon Split"], row["Max Length"])
gym.make also accepts any env ID registered in
mikasa_robo_suite/vla/memory_envs/, including development variants not
in the CSV.
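To select tasks by horizon split, the same CSV can be grouped in memory. The following is a minimal sketch: the helper name group_by_split, the task names, and the split labels in SAMPLE are all made up for illustration, and it assumes the Max Length column holds integers. Only the column names and the ";" delimiter come from the snippet above.

```python
import csv
import io
from collections import defaultdict


def group_by_split(csv_file):
    """Group (task name, max length) pairs by their 'Horizon Split' value.

    Hypothetical helper; assumes 'Max Length' parses as an integer.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file, delimiter=";"):
        groups[row["Horizon Split"]].append((row["Name"], int(row["Max Length"])))
    return dict(groups)


# Made-up rows standing in for mikasa_robo_vla_envs.csv.
SAMPLE = """Name;Horizon Split;Max Length
TaskA-VLA-v0;short;60
TaskB-VLA-v0;long;300
TaskC-VLA-v0;short;90
"""

groups = group_by_split(io.StringIO(SAMPLE))
for split, tasks in groups.items():
    print(split, [name for name, _ in tasks])
```

With the real file, pass the handle returned by open("mikasa_robo_vla_envs.csv", newline="", encoding="utf-8") in place of the io.StringIO object.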
What obs contains
After apply_mikasa_vla_wrappers(env), obs is a Python dict with the
following keys. B is num_envs.
obs["rgb"] # shape (B, 128, 128, 6), uint8
obs["proprio"] # shape (B, 7), float32
obs.get("task_cue") # Rotate* only: target angle for the RL oracle; VLAs ignore
RGB images
obs["rgb"] concatenates two cameras along the channel axis:
top_rgb = obs["rgb"][..., :3] # base_camera, top-down view
wrist_rgb = obs["rgb"][..., 3:6] # hand_camera, wrist-mounted view
Both cameras produce 128 × 128 images by default.
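Many image encoders expect channel-first float inputs rather than a stacked uint8 array. The sketch below shows one possible way to split the two cameras and convert each one. The helper name split_cameras is hypothetical, the scaling to [0, 1] is an assumption about what a downstream model wants, and the dummy array only stands in for obs["rgb"]. With the GPU backend the observation is likely a torch tensor, in which case it would first need something like obs["rgb"].cpu().numpy().

```python
import numpy as np


def split_cameras(rgb):
    """Split a (B, H, W, 6) uint8 array into two (B, 3, H, W) float32 arrays in [0, 1].

    Hypothetical helper: channels 0-2 are the top-down camera,
    channels 3-5 are the wrist camera, as described above.
    """
    rgb = np.asarray(rgb)
    if rgb.ndim != 4 or rgb.shape[-1] != 6:
        raise ValueError(f"expected shape (B, H, W, 6), got {rgb.shape}")

    def to_chw(x):
        # Move channels in front of height/width and rescale 0..255 to 0..1.
        return np.transpose(x, (0, 3, 1, 2)).astype(np.float32) / 255.0

    return to_chw(rgb[..., :3]), to_chw(rgb[..., 3:6])


# Dummy stand-in for obs["rgb"]: black top view, white wrist view.
dummy_rgb = np.zeros((2, 128, 128, 6), dtype=np.uint8)
dummy_rgb[..., 3:6] = 255

top, wrist = split_cameras(dummy_rgb)
print(top.shape, wrist.shape)
```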
Proprioception
obs["proprio"] is the 7D absolute end-effector state.
It is not joint angles and it is not action deltas:
obs["proprio"] = [
    eef_x,            # [0] TCP position x, metres, not normalised
    eef_y,            # [1] TCP position y, metres, not normalised
    eef_z,            # [2] TCP position z, metres, not normalised
    eef_roll,         # [3] TCP roll, radians, [-pi, pi]
    eef_pitch,        # [4] TCP pitch, radians, [-pi/2, pi/2]
    eef_yaw,          # [5] TCP yaw, radians, [-pi, pi]
    gripper_opening,  # [6] sum of two Panda finger qpos, metres, [0.0, 0.08]
]
gripper_opening is a single scalar. The Panda gripper has two symmetric
fingers; each finger’s qpos range is [0.0, 0.04] m, so the sum is
[0.0, 0.08] m. A value of 0.0 means fully closed; 0.08 means
fully open. The value is not normalised.
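Because the vector is indexed by position, it can help to unpack it into named fields before logging or feature engineering. The following is a small sketch, not part of the library: EefState, parse_proprio, and gripper_fraction are hypothetical names, and the 0.08 m constant is taken from the range stated above.

```python
from typing import NamedTuple, Sequence

GRIPPER_MAX_OPENING_M = 0.08  # sum of two finger ranges, from the text above


class EefState(NamedTuple):
    """Named view of one 7D proprio vector (hypothetical helper type)."""

    x: float
    y: float
    z: float
    roll: float
    pitch: float
    yaw: float
    gripper_opening: float

    @property
    def gripper_fraction(self) -> float:
        """Opening rescaled to [0, 1]: 0 is fully closed, 1 is fully open."""
        fraction = self.gripper_opening / GRIPPER_MAX_OPENING_M
        return min(max(fraction, 0.0), 1.0)


def parse_proprio(vec: Sequence[float]) -> EefState:
    """Convert a single (7,) proprio vector into an EefState."""
    values = [float(v) for v in vec]
    if len(values) != 7:
        raise ValueError(f"expected 7 values, got {len(values)}")
    return EefState(*values)


# Made-up values for illustration only.
state = parse_proprio([0.1, -0.2, 0.3, 0.0, 0.5, -1.0, 0.04])
print(state.z, state.gripper_fraction)
```

For a batched observation, one row would be passed at a time, for example obs["proprio"][0].tolist().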
task_cue and language_instruction
obs["task_cue"] exists only for the Rotate* tasks, where the
target rotation angle (in degrees) cannot be inferred from a single RGB
frame. It is provided so PPO RL oracles can be trained from images; the
canonical VLA helper drops the key for every other task family.
For VLA policies the cue is redundant: the same numeric value is already
embedded in info["language_instruction"] (e.g. “Rotate the peg by
30 degrees clockwise”). VLAs should therefore consume only the language
instruction and ignore obs["task_cue"] even when it is present:
# VLA-style: ignore task_cue, use language_instruction
text = info["language_instruction"]
# RL-style: read task_cue directly (Rotate* only)
if "task_cue" in obs:
    target_angle = obs["task_cue"]
The canonical helper does not expose oracle_info to the VLA policy
observation. That privileged field is only available in lower-level or manual
debug wrapper stacks.
Language instruction
info["language_instruction"] is a Python str describing the task
goal, e.g. “Observe the cube’s colour, wait, then touch the matching cube”.
It is returned by both env.reset() and env.step().
Action format
Pass a 7D float32 tensor to env.step(action).
Single-env shape: (7,). Batched shape: (B, 7).
All seven values are in the normalised range [-1, 1]:
action = [
    delta_eef_x,      # [0] EEF translation delta, normalised; Panda maps to [-0.1, 0.1] m
    delta_eef_y,      # [1]
    delta_eef_z,      # [2]
    delta_eef_roll,   # [3] EEF rotation delta, normalised; magnitude capped at 0.1 rad
    delta_eef_pitch,  # [4]
    delta_eef_yaw,    # [5]
    gripper_command,  # [6] normalised position target: -1 → close, +1 → open
]
action[0:3] are relative translation commands, not absolute positions.
action[3:6] are relative rotation commands, not absolute orientations.
action[6] (gripper_command) is a position target sent to the Panda
gripper joint controller; it is not a delta of obs["proprio"][..., 6]
(gripper_opening). Sending -1 drives the gripper toward the closed
target; sending +1 drives it toward the open target.
The dataset action field stores these same values; see Datasets
for the full training signal reference.
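The layout above can be sketched as a small builder that clips each component to [-1, 1] and sets the gripper target explicitly. This is an illustration only: make_action and translation_in_metres are hypothetical helpers, and the conversion to metres assumes the normalised translation maps linearly onto the [-0.1, 0.1] m range quoted above.

```python
from typing import List, Sequence

TRANSLATION_SCALE_M = 0.1  # assumed linear scale: normalised 1.0 -> 0.1 m


def _clip(value: float) -> float:
    """Clamp a value to the normalised action range [-1, 1]."""
    return max(-1.0, min(1.0, float(value)))


def make_action(
    delta_xyz: Sequence[float],
    delta_rpy: Sequence[float],
    gripper_open: bool,
) -> List[float]:
    """Build a 7D normalised action: 3 translation, 3 rotation, 1 gripper."""
    if len(delta_xyz) != 3 or len(delta_rpy) != 3:
        raise ValueError("delta_xyz and delta_rpy must each have 3 values")
    gripper_command = 1.0 if gripper_open else -1.0  # position target, not a delta
    return [_clip(v) for v in delta_xyz] + [_clip(v) for v in delta_rpy] + [gripper_command]


def translation_in_metres(action: Sequence[float]) -> List[float]:
    """Approximate physical translation under the assumed linear mapping."""
    return [a * TRANSLATION_SCALE_M for a in action[:3]]


# The out-of-range z value (-2.0) is clipped to -1.0.
action = make_action([0.5, 0.0, -2.0], [0.0, 0.0, 0.25], gripper_open=False)
print(action)
print(translation_in_metres(action))
```

The resulting list can then be converted with torch.tensor(action, dtype=torch.float32, device=env.unwrapped.device) before calling env.step.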
gym.make parameters
| Parameter | Canonical value | Notes |
|---|---|---|
| | | Use |
| | | Required for all VLA training, evaluation, and dataset collection. Using a different control mode changes action semantics and breaks dataset compatibility. |
| | | Values in |
| | | GPU-parallelised. All returned tensors have a leading batch dim |
| | | Required for |
| | | Use |
Video Recording
Wrap with RecordEpisode after apply_mikasa_vla_wrappers:
import gymnasium as gym
import torch
from mani_skill.utils.wrappers import RecordEpisode
import mikasa_robo_suite.vla.memory_envs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env_name = "RememberColor3-VLA-v0"
env = gym.make(
    env_name,
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    reward_mode="normalized_dense",
    render_mode="all",
    sim_backend="gpu",
)
env = apply_mikasa_vla_wrappers(env, include_overlays=True)
env = RecordEpisode(
    env,
    output_dir=f"./videos/{env_name}",
    save_trajectory=False,
    max_steps_per_video=env.max_episode_steps,
)
obs, info = env.reset(seed=42)
for _ in range(env.max_episode_steps):
    action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
# video written to ./videos/RememberColor3-VLA-v0/0.mp4
include_overlays=True renders task-state overlays (step counter, reward,
task-specific debug info) on top of the video frames.
Benchmark Demo GIF/MP4
uv run python utils/prepare_benchmark_demo_videos.py \
    --tasks RememberColor3-VLA-v0 \
    --output-dir videos/benchmark_demos \
    --max-attempts-per-task 8 \
    --overwrite