Wrappers Cookbook

MIKASA-Robo-VLA ships 19 Gymnasium wrappers in mikasa_robo_suite/vla/utils/wrappers.py plus a one-call helper apply_mikasa_vla_wrappers() that picks the correct per-task chain automatically. This page groups the individual wrappers by purpose and shows when and how to compose them manually if you ever need to.

For the full API reference, see Wrappers API.

The Default: `apply_mikasa_vla_wrappers`

For any of the 90 MIKASA-Robo-VLA tasks, the canonical wrapper stack — state→dict, curriculum noop (where needed), task-specific overlays, RGB-flatten, and EEF proprioception — is applied by a single function:

import gymnasium as gym
import mikasa_robo_suite.vla.memory_envs  # registers VLA env IDs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env = gym.make(
    "RememberColor3-VLA-v0",
    num_envs=1,
    obs_mode="rgb",
    control_mode="pd_ee_delta_pose",
    render_mode="all",
)
env = apply_mikasa_vla_wrappers(env)  # default for every task

The helper is the recommended entry point. It guarantees:

- the same core obs format (obs["rgb"] and obs["proprio"]) that the published datasets use, plus obs["task_cue"] only for the Rotate* family (the angle cue used by RL oracles; VLA policies should ignore it because the same value is already in info["language_instruction"]);
- the correct cue-phase noop wrapper for VLA tasks that need one (CurriculumPhaseNoopActionWrapper in the canonical pd_ee_delta_pose action space);
- task-specific render overlays during env.render() for human-watchable videos.

For headless, metric-only evaluations, pass include_overlays=False. Only the functional wrappers are kept (state→dict, curriculum noop where needed, RGB-flatten, and EEF proprioception) and no text is drawn on rendered frames, while observations and rewards stay byte-identical to the default:

env = apply_mikasa_vla_wrappers(env, include_overlays=False)

See mikasa_robo_suite.vla.utils.apply_wrappers.apply_mikasa_vla_wrappers() for the full per-env mapping.

Composition Order (manual)

You only need this section if you are intentionally composing wrappers by hand — for example, to reproduce the dataset collector or to experiment with a non-standard chain. Otherwise use apply_mikasa_vla_wrappers.

Always apply wrappers in this order:

gym.make(...)
  └─ StateOnlyTensorToDictWrapper          ← must be first
      └─ CurriculumPhaseNoopActionWrapper   (cue-phase VLA tasks)
          └─ <task-specific info wrappers>  (overlays)
              └─ <render / debug wrappers>  (overlays, dev only)
                  └─ FlattenRGBDObservationWrapper(rgb=True, joints=True)
                      └─ ConvertJointsToEEFXyzRpyGripperWrapper
                          └─ RecordEpisode  (outermost)
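
When a chain is assembled by hand, a small check can catch ordering mistakes before any environment is built. The sketch below is a hypothetical helper, not part of mikasa_robo_suite: it ranks wrapper class names by the stages in the diagram above and, as a simplifying assumption, treats every name it does not recognise as an overlay (info or render) wrapper.

```python
# Stage ranks follow the diagram above, innermost (applied first) = 0.
# STAGES, OVERLAY_STAGE and first_out_of_order are hypothetical names.
STAGES = {
    "StateOnlyTensorToDictWrapper": 0,
    "CurriculumPhaseNoopActionWrapper": 1,
    "FlattenRGBDObservationWrapper": 3,
    "ConvertJointsToEEFXyzRpyGripperWrapper": 4,
    "RecordEpisode": 5,
}
OVERLAY_STAGE = 2  # assumption: any unlisted name is an info/render overlay


def first_out_of_order(chain):
    """Return the first wrapper name that breaks the order, or None if valid.

    `chain` lists wrapper class names in the order they are applied.
    """
    previous_stage = -1
    for name in chain:
        stage = STAGES.get(name, OVERLAY_STAGE)
        if stage < previous_stage:
            return name
        previous_stage = stage
    return None


good_chain = [
    "StateOnlyTensorToDictWrapper",
    "CurriculumPhaseNoopActionWrapper",
    "RememberColorInfoWrapper",
    "RenderStepInfoWrapper",
    "FlattenRGBDObservationWrapper",
    "ConvertJointsToEEFXyzRpyGripperWrapper",
    "RecordEpisode",
]
bad_chain = ["FlattenRGBDObservationWrapper", "StateOnlyTensorToDictWrapper"]

print(first_out_of_order(good_chain))
print(first_out_of_order(bad_chain))
```

The check works on names only, so it can run against a list such as the class names returned by env_info before any wrapper is instantiated.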

The env_info helper inside the PPO collector returns the same task-specific overlay chain that apply_mikasa_vla_wrappers builds internally, so you can inspect or replicate it manually:

# env_info lives in the PPO collector module; the dataset_collectors
# package is included with mikasa-robo-suite and is importable at runtime.
from mikasa_robo_suite.vla.dataset_collectors.get_mikasa_robo_datasets import env_info

wrappers_list, episode_timeout = env_info("RememberColor9-VLA-v0")
for wrapper_class, wrapper_kwargs in wrappers_list:
    env = wrapper_class(env, **wrapper_kwargs)

Core Wrappers

These two wrappers are part of the standard pipeline and are almost always required.

StateOnlyTensorToDictWrapper

Converts the raw tensor observation into a dict and injects the task_cue (filled with sentinel 4242424242 for tasks that do not expose a numeric cue) and oracle_info fields:

from mikasa_robo_suite.vla.utils.wrappers import StateOnlyTensorToDictWrapper

env = gym.make("RememberColor3-VLA-v0", num_envs=1, obs_mode="rgb",
               control_mode="pd_ee_delta_pose", render_mode="all")
env = StateOnlyTensorToDictWrapper(env)
obs, info = env.reset(seed=0)
# Expected keys: either 'state' or 'sensor_data', plus 'task_cue' and 'oracle_info'.
print(obs.keys())

When to use: always — it is the first wrapper in every recommended stack. Downstream wrappers in apply_mikasa_vla_wrappers then drop oracle_info and drop task_cue for every task except the Rotate* family (where PPO oracles need the target angle, and where VLA policies should still ignore it because the angle is already in info["language_instruction"]).
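
If you consume obs["task_cue"] directly after this wrapper, the sentinel needs to be told apart from a real cue. The sketch below is a hypothetical helper that assumes a scalar cue (index into the batch first when num_envs > 1). The tolerance is there because a 32-bit float cannot store 4242424242 exactly (the nearest float32 value is 4242424320); whether the cue is actually stored as float32 is an assumption.

```python
TASK_CUE_SENTINEL = 4242424242  # filler value quoted in the text above
SENTINEL_TOLERANCE = 1024.0     # float32 values near 2**32 are spaced 256 apart


def numeric_cue_or_none(task_cue):
    """Return the cue as a float, or None when it is the sentinel filler.

    Hypothetical helper; `task_cue` is assumed to be a scalar (or a
    one-element tensor/array that float() can convert).
    """
    value = float(task_cue)
    if abs(value - TASK_CUE_SENTINEL) <= SENTINEL_TOLERANCE:
        return None
    return value


print(numeric_cue_or_none(4242424242))    # exact sentinel
print(numeric_cue_or_none(4242424320.0))  # sentinel after a float32 round-trip
print(numeric_cue_or_none(90.0))          # a real numeric cue
```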

ConvertJointsToEEFXyzRpyGripperWrapper

Converts the flattened raw joint-state input obs["joints"] into the public VLA proprioception key obs["proprio"] with the 7D end-effector representation xyz(3) + rpy(3) + gripper(1). The raw obs["joints"] key is removed from the output dict.

from mikasa_robo_suite.vla.utils.wrappers import (
    StateOnlyTensorToDictWrapper,
    ConvertJointsToEEFXyzRpyGripperWrapper,
)

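# Excerpt only: the Composition Order section places
# FlattenRGBDObservationWrapper(rgb=True, joints=True) between these two wrappers.
env = StateOnlyTensorToDictWrapper(env)
env = ConvertJointsToEEFXyzRpyGripperWrapper(env)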

After this wrapper, obs["proprio"] is the canonical 7D proprioception vector used by all VLA datasets and evaluation scripts. See Observation and Action Space for the field-by-field reference (units, ranges, how gripper_opening differs from the gripper_command action).

When to use: for dataset collection and any downstream pipeline that expects the 7D proprio vector (the format used in all published MIKASA-Robo-VLA datasets).
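
To make the 7D layout concrete, the sketch below splits one proprio vector into named fields using the xyz(3) + rpy(3) + gripper(1) ordering stated above. `Proprio` and `split_proprio` are hypothetical names, and the input is assumed to be a single unbatched vector (select one environment first when num_envs > 1).

```python
from typing import NamedTuple, Sequence, Tuple


class Proprio(NamedTuple):
    """Named view of the 7D vector (hypothetical helper type)."""
    xyz: Tuple[float, float, float]
    rpy: Tuple[float, float, float]
    gripper: float


def split_proprio(vec: Sequence[float]) -> Proprio:
    """Split a 7-element proprio vector into xyz, rpy and gripper parts."""
    if len(vec) != 7:
        raise ValueError(f"expected 7 values, got {len(vec)}")
    values = [float(v) for v in vec]
    return Proprio(
        xyz=(values[0], values[1], values[2]),
        rpy=(values[3], values[4], values[5]),
        gripper=values[6],
    )


# Made-up numbers, used only to show the slicing.
example = split_proprio([0.1, 0.2, 0.3, 0.0, 1.5, -1.5, 0.04])
print(example.xyz, example.rpy, example.gripper)
```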

Action-Shaping Wrappers

InitialZeroActionWrapper

Executes a fixed number of zero-action steps at the start of each episode. Useful for tasks that require the robot to settle before the cue phase begins.
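
The idea can be sketched without any simulator. The classes below are hypothetical stand-ins, not the library's implementation: `CountingEnv` is a toy environment that records the actions it receives, and `InitialZeroSteps` issues `n_steps` zero actions inside reset(), assuming the Gymnasium five-tuple step convention.

```python
class CountingEnv:
    """Toy stand-in for an environment; records every action it receives."""
    action_dim = 7

    def __init__(self):
        self.received = []

    def reset(self, seed=None):
        self.received.clear()
        return {"t": 0}, {}

    def step(self, action):
        self.received.append(list(action))
        obs = {"t": len(self.received)}
        return obs, 0.0, False, False, {}


class InitialZeroSteps:
    """Hypothetical sketch: run n_steps zero actions right after reset."""

    def __init__(self, env, n_steps):
        self.env = env
        self.n_steps = n_steps

    def reset(self, seed=None):
        obs, info = self.env.reset(seed=seed)
        zero_action = [0.0] * self.env.action_dim
        for _ in range(self.n_steps):
            obs, _reward, terminated, truncated, info = self.env.step(zero_action)
            if terminated or truncated:
                break  # do not step past the end of an episode
        return obs, info

    def step(self, action):
        return self.env.step(action)


env = InitialZeroSteps(CountingEnv(), n_steps=3)
obs, info = env.reset(seed=0)
print(obs, len(env.env.received))
```

The policy only sees the observation returned after the settling steps, so its first real action is taken from the settled state.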

CurriculumPhaseNoopActionWrapper

Replaces the agent’s action with a no-op during the cue phase (and the optional empty phase). This keeps the robot still so that the cue stays fully visible, which mirrors the behaviour of the PPO oracle.

When to use: include it when reproducing the train-data rollout setup for any task whose PPO env_info stack contains it. The apply_mikasa_vla_wrappers() helper adds it for those PPO-collected cue-phase tasks. Motion-planning data is exported after replay through a plain pd_ee_delta_pose rollout, so the VLA helper does not add a curriculum action filter for MP-only tasks.
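
The gating logic can be sketched as a pure function. This is an illustration under stated assumptions, not the wrapper's code: the cue phase is modelled as the first `cue_steps` steps of an episode (a hypothetical parameter; the real wrapper may read the phase from the environment), and an all-zero vector is used as the no-op, which is a stand-still command in a delta-pose action space.

```python
def gate_action(action, step_index, cue_steps):
    """Return a zero action during the cue phase, else the action unchanged.

    Hypothetical helper: `cue_steps` is the assumed length of the cue
    (plus optional empty) phase, counted in environment steps.
    """
    if step_index < cue_steps:
        return [0.0] * len(action)
    return list(action)


policy_action = [0.5, 0.0, -0.2, 0.0, 0.0, 0.1, 1.0]
executed = [gate_action(policy_action, t, cue_steps=2) for t in range(4)]
for t, act in enumerate(executed):
    print(t, act)
```

In this toy rollout the first two executed actions are all zeros and the remaining ones pass through untouched.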

CurriculumPhaseNoopActionWrapperPdJointPos

pd_joint_pos-aware subclass of CurriculumPhaseNoopActionWrapper. Plain zeros aren’t a “stand still” command in pd_joint_pos (they would drive the robot toward qpos = [0, …, 0]), so during the cue phase this wrapper substitutes the robot’s current qpos plus a normalized gripper command — i.e. hold the current pose.

When to use: in the pd_joint_pos motion-planning oracle scripts for BlinkCountButtonPress*. That hold wrapper sits upstream of replay: the published VLA train-data rollout and the VLA helper both expose the canonical, unfiltered pd_ee_delta_pose validation path, so neither applies it.
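
The difference between a zero target and a hold target is easy to see with a toy controller. The sketch below is not the wrapper's implementation: `track` is a made-up proportional position controller, `hold_action` is a hypothetical helper, and "current qpos plus a normalized gripper command" is read as concatenation, which is an assumption.

```python
def hold_action(current_qpos, gripper_command):
    """Build a joint-position target that keeps the arm where it is.

    Hypothetical helper: arm joint targets equal the current joint
    positions, followed by one gripper command value.
    """
    return [float(q) for q in current_qpos] + [float(gripper_command)]


def track(qpos, target, gain=0.5):
    """One step of a toy proportional position controller."""
    return [q + gain * (t - q) for q, t in zip(qpos, target)]


qpos = [0.3, -0.8, 0.1]                      # made-up arm joint positions
zero_target = [0.0] * len(qpos)
hold_target = hold_action(qpos, 1.0)[:-1]    # drop the gripper entry

after_zeros = track(qpos, zero_target)       # joints move toward 0
after_hold = track(qpos, hold_target)        # joints stay put

print(after_zeros)
print(after_hold)
```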

CameraShutdownWrapper

Disables the cameras during the memory phase of tasks where the cameras are explicitly turned off as part of the task design (e.g. BatteriesChecker). This ensures that the agent cannot cheat by looking at occluded objects.

Render / Debug Wrappers

These wrappers overlay task-specific information on top of the rendered video frame. They are intended for local debugging and video generation; do not use them during benchmark evaluation.

| Wrapper | What it overlays |
| --- | --- |
| `RenderStepInfoWrapper` | Current step count. |
| `RenderRewardInfoWrapper` | Per-step reward value. |
| `RenderPressProgressInfoWrapper` | Button press progress bar (BlinkCountButtonPress tasks). |
| `RenderWorkingBatteriesInfoWrapper` | Ground-truth working battery positions (BatteriesChecker tasks). |
| `ShellGameRenderCupInfoWrapper` | Which cup hides the ball (ShellGame tasks). |
| `RotateRenderAngleInfoWrapper` | Target and current rotation angle (Rotate tasks). |
| `RenderTraceShapeDebugWrapper` | Target path overlay (TraceShape tasks). |
| `RenderTimedTransferInfoWrapper` | Timer countdown (TimedTransfer tasks). |
| `DebugRewardWrapper` | Breakdown of reward sub-terms for reward engineering. |

Task-Specific Info Wrappers

These wrappers inject task-specific fields into the info dict returned by env.step(). They are used during collection and can be useful for evaluation scripts that need access to ground-truth labels.

| Wrapper | Added `info` fields |
| --- | --- |
| `RememberColorInfoWrapper` | Ground-truth target colour index. |
| `RememberShapeInfoWrapper` | Ground-truth target shape index. |
| `RememberShapeAndColorInfoWrapper` | Ground-truth target shape and colour pair. |
| `MemoryCapacityInfoWrapper` | All memorised items for capacity-memory tasks. |

Minimal Stacks for Common Scenarios

The recommended stacks below all wrap through apply_mikasa_vla_wrappers(). For a manual chain see Wrappers API and the per-env “Recommended Wrappers” sections in Environments & Tasks.

Dataset collection / VLA training data (any task):

from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env = gym.make(env_id, num_envs=N, obs_mode="rgb",
               control_mode="pd_ee_delta_pose", render_mode="all")
env = apply_mikasa_vla_wrappers(env)

VLA evaluation (clean observations, no rendered overlays):

from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env = gym.make(env_id, num_envs=N, obs_mode="rgb",
               control_mode="pd_ee_delta_pose")
env = apply_mikasa_vla_wrappers(env, include_overlays=False)

Local debugging with video (overlays kept, RecordEpisode outermost):

from mani_skill.utils.wrappers import RecordEpisode
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers

env = gym.make(env_id, num_envs=N, obs_mode="rgb",
               control_mode="pd_ee_delta_pose", render_mode="all")
env = apply_mikasa_vla_wrappers(env)
# env_id, N, and max_steps are placeholders that you supply.
env = RecordEpisode(env, f"./videos/{env_id}", max_steps_per_video=max_steps)