Quick Start
===========

The examples below assume the project is installed with ``uv sync --frozen``
from the repository root.

Minimal VLA Run
---------------

.. important::

   Every MIKASA-Robo-VLA environment must be wrapped with
   :func:`~mikasa_robo_suite.vla.utils.apply_wrappers.apply_mikasa_vla_wrappers`
   immediately after ``gym.make`` and before the first ``env.reset()``.

   **This wrapper is required for correct benchmark behavior:** it applies the
   task-specific VLA logic and makes the environment inputs and outputs match
   the format used in the released datasets.

   The helper selects the correct wrapper chain for any of the 90 benchmark
   tasks, so it should always be used instead of manually composing wrappers.

.. code-block:: python

   import gymnasium as gym
   import torch

   import mikasa_robo_suite.vla.memory_envs  # registers all VLA env IDs
   from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

   env = gym.make(
       "RememberColor3-VLA-v0",
       num_envs=1,
       obs_mode="rgb",
       control_mode="pd_ee_delta_pose",
       reward_mode="normalized_dense",
       render_mode="all",
       sim_backend="gpu",
   )
   env = apply_mikasa_vla_wrappers(env, include_overlays=False)

   obs, info = env.reset(seed=42)
   for _ in range(env.max_episode_steps):
       action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
       obs, reward, terminated, truncated, info = env.step(action)
       if torch.as_tensor(terminated | truncated).any():
           break
   env.close()

``include_overlays=False`` omits debug text overlays from rendered frames
(when rendering is enabled). Use ``include_overlays=True`` when generating
human-watchable videos.

Listing available tasks
-----------------------

The canonical task list is in ``mikasa_robo_vla_envs.csv``:

.. code-block:: python

   import csv

   with open("mikasa_robo_vla_envs.csv", newline="", encoding="utf-8") as f:
       for row in csv.DictReader(f, delimiter=";"):
           print(row["Name"], row["Horizon Split"], row["Max Length"])

``gym.make`` also accepts any env ID registered in
``mikasa_robo_suite/vla/memory_envs/``, including development variants not in
the CSV.

What ``obs`` contains
---------------------

After ``apply_mikasa_vla_wrappers(env)``, ``obs`` is a Python dict with the
following keys. ``B`` is ``num_envs``.

.. code-block:: python

   obs["rgb"]            # shape (B, 128, 128, 6), uint8
   obs["proprio"]        # shape (B, 7), float32
   obs.get("task_cue")   # Rotate* only: target angle for the RL oracle; VLAs ignore

RGB images
~~~~~~~~~~

``obs["rgb"]`` concatenates two cameras along the channel axis::

   top_rgb   = obs["rgb"][..., :3]   # base_camera, top-down view
   wrist_rgb = obs["rgb"][..., 3:6]  # hand_camera, wrist-mounted view

Both cameras produce 128 × 128 images by default.

Proprioception
~~~~~~~~~~~~~~

``obs["proprio"]`` is the 7D absolute end-effector state. It is **not** joint
angles and it is **not** action deltas::

   obs["proprio"] = [
       eef_x,            # [0] TCP position x, metres, not normalised
       eef_y,            # [1] TCP position y, metres, not normalised
       eef_z,            # [2] TCP position z, metres, not normalised
       eef_roll,         # [3] TCP roll, radians, [-pi, pi]
       eef_pitch,        # [4] TCP pitch, radians, [-pi/2, pi/2]
       eef_yaw,          # [5] TCP yaw, radians, [-pi, pi]
       gripper_opening,  # [6] sum of two Panda finger qpos, metres, [0.0, 0.08]
   ]

``gripper_opening`` is a single scalar. The Panda gripper has two symmetric
fingers; each finger's qpos range is ``[0.0, 0.04]`` m, so the sum is
``[0.0, 0.08]`` m. A value of ``0.0`` means fully closed; ``0.08`` means fully
open. The value is **not** normalised.

task_cue and language_instruction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``obs["task_cue"]`` exists **only for the Rotate\* tasks**, where the target
rotation angle (in degrees) cannot be inferred from a single RGB frame.
It is provided so PPO RL oracles can be trained from images; the canonical VLA
helper drops the key for every other task family.

For VLA policies the cue is redundant: the same numeric value is already
embedded in ``info["language_instruction"]`` (e.g. *"Rotate the peg by 30
degrees clockwise"*). VLAs should therefore consume only the language
instruction and ignore ``obs["task_cue"]`` even when it is present::

   # VLA-style: ignore task_cue, use language_instruction
   text = info["language_instruction"]

   # RL-style: read task_cue directly (Rotate* only)
   if "task_cue" in obs:
       target_angle = obs["task_cue"]

The canonical helper does not expose ``oracle_info`` to the VLA policy
observation. That privileged field is only available in lower-level or manual
debug wrapper stacks.

Language instruction
~~~~~~~~~~~~~~~~~~~~

``info["language_instruction"]`` is a Python ``str`` describing the task goal,
e.g. *"Observe the cube's colour, wait, then touch the matching cube"*. It is
returned by both ``env.reset()`` and ``env.step()``.

Action format
-------------

Pass a 7D ``float32`` tensor to ``env.step(action)``. Single-env shape:
``(7,)``. Batched shape: ``(B, 7)``. All seven values are in the normalised
range ``[-1, 1]``::

   action = [
       delta_eef_x,      # [0] EEF translation delta, normalised; Panda maps to [-0.1, 0.1] m
       delta_eef_y,      # [1]
       delta_eef_z,      # [2]
       delta_eef_roll,   # [3] EEF rotation delta, normalised; magnitude capped at 0.1 rad
       delta_eef_pitch,  # [4]
       delta_eef_yaw,    # [5]
       gripper_command,  # [6] normalised position target: -1 → close, +1 → open
   ]

``action[0:3]`` are *relative* translation commands, not absolute positions.
``action[3:6]`` are *relative* rotation commands, not absolute orientations.
``action[6]`` (``gripper_command``) is a *position target* sent to the Panda
gripper joint controller; it is **not** a delta of ``obs["proprio"][..., 6]``
(``gripper_opening``).
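
As a rough illustration of the layout above, the following sketch assembles a
7D action from a desired metric translation, a normalised rotation, and an
open/close choice. The helper names (``clamp``, ``build_action``) and the
constant ``TRANSLATION_SCALE_M`` are hypothetical and not part of the package,
and the linear metres-to-normalised scaling is an assumption based on the
stated ``[-0.1, 0.1]`` m mapping.

```python
# Assumption: normalised +/-1 corresponds linearly to +/-0.1 m of translation.
TRANSLATION_SCALE_M = 0.1


def clamp(value, low=-1.0, high=1.0):
    """Limit a scalar to the normalised action range."""
    return max(low, min(high, value))


def build_action(delta_xyz_m, delta_rpy_norm, open_gripper):
    """Return a 7-element list: 3 translation, 3 rotation, 1 gripper target."""
    # Relative translation: convert metres to normalised units, then clamp.
    translation = [clamp(d / TRANSLATION_SCALE_M) for d in delta_xyz_m]
    # Relative rotation: already normalised here, only clamped.
    rotation = [clamp(r) for r in delta_rpy_norm]
    # Gripper is a position target, not a delta: -1 closes, +1 opens.
    gripper = 1.0 if open_gripper else -1.0
    return translation + rotation + [gripper]


action = build_action((0.05, 0.0, -0.2), (0.0, 0.0, 0.5), open_gripper=False)
```

The resulting list can be converted with
``torch.as_tensor(action, dtype=torch.float32)`` before it is passed to
``env.step``.
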
Sending ``-1`` drives the gripper toward the closed target; sending ``+1``
drives it toward the open target.

The dataset ``action`` field stores these same values; see :doc:`datasets` for
the full training signal reference.

gym.make parameters
-------------------

.. list-table::
   :header-rows: 1
   :widths: 28 20 52

   * - Parameter
     - Canonical value
     - Notes
   * - ``obs_mode``
     - ``"rgb"``
     - Use ``"state"`` only for PPO oracle training and reward debugging.
       Never use ``"state"`` for VLA training or benchmark evaluation.
   * - ``control_mode``
     - ``"pd_ee_delta_pose"``
     - Required for all VLA training, evaluation, and dataset collection.
       Using a different control mode changes action semantics and breaks
       dataset compatibility.
   * - ``reward_mode``
     - ``"normalized_dense"``
     - Values in ``[0, 1]``. Use for RL training. Use ``"sparse"`` (0 or 1)
       for success-rate evaluation. ``"dense"`` is unscaled and varies per
       task.
   * - ``num_envs``
     - ``1`` for eval, larger for collection
     - GPU-parallelised. All returned tensors have a leading batch dim ``B``.
   * - ``render_mode``
     - ``"all"`` for video; omit for headless
     - Required for ``RecordEpisode`` and ``env.render()``.
   * - ``sim_backend``
     - ``"gpu"``
     - Use ``"cpu"`` only when no GPU is available. CPU simulation is
       significantly slower and is not recommended for benchmark use, as the
       CPU and GPU physics backends are not perfectly identical. For
       reproducible evaluation and dataset-consistent behavior, the GPU
       backend should be used whenever possible.

Video Recording
---------------

Wrap with ``RecordEpisode`` *after* ``apply_mikasa_vla_wrappers``:

.. code-block:: python

   import gymnasium as gym
   import torch
   from mani_skill.utils.wrappers import RecordEpisode

   import mikasa_robo_suite.vla.memory_envs
   from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

   env_name = "RememberColor3-VLA-v0"
   env = gym.make(
       env_name,
       num_envs=1,
       obs_mode="rgb",
       control_mode="pd_ee_delta_pose",
       reward_mode="normalized_dense",
       render_mode="all",
       sim_backend="gpu",
   )
   env = apply_mikasa_vla_wrappers(env, include_overlays=True)
   env = RecordEpisode(
       env,
       output_dir=f"./videos/{env_name}",
       save_trajectory=False,
       max_steps_per_video=env.max_episode_steps,
   )

   obs, info = env.reset(seed=42)
   for _ in range(env.max_episode_steps):
       action = torch.as_tensor(env.action_space.sample(), device=env.unwrapped.device)
       obs, reward, terminated, truncated, info = env.step(action)
   env.close()
   # video written to ./videos/RememberColor3-VLA-v0/0.mp4

``include_overlays=True`` renders task-state overlays (step counter, reward,
task-specific debug info) on top of the video frames.

Benchmark Demo GIF/MP4
-----------------------

.. code-block:: bash

   uv run python utils/prepare_benchmark_demo_videos.py \
       --tasks RememberColor3-VLA-v0 \
       --output-dir videos/benchmark_demos \
       --max-attempts-per-task 8 \
       --overwrite
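
To prepare demos for several tasks, the command above could be generated once
per row of the task CSV. The sketch below only builds the argument lists and
does not run anything. It assumes the semicolon-delimited layout with a
``Name`` column described under "Listing available tasks" and reuses exactly
the flags shown above. The helper ``build_demo_commands`` is hypothetical, and
the sample row uses placeholder values rather than data from the real CSV.

```python
import csv
import io


def build_demo_commands(csv_text, output_dir="videos/benchmark_demos", max_attempts=8):
    """Return one argv list per task named in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    commands = []
    for row in reader:
        commands.append([
            "uv", "run", "python", "utils/prepare_benchmark_demo_videos.py",
            "--tasks", row["Name"],
            "--output-dir", output_dir,
            "--max-attempts-per-task", str(max_attempts),
            "--overwrite",
        ])
    return commands


# Placeholder row for illustration; the split and length values are made up.
sample = "Name;Horizon Split;Max Length\nRememberColor3-VLA-v0;placeholder;0\n"
for argv in build_demo_commands(sample):
    print(" ".join(argv))
```

Each returned list can be passed to ``subprocess.run(argv, check=True)`` from
the repository root to execute the corresponding command.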