Wrappers Cookbook#
MIKASA-Robo-VLA ships 19 Gymnasium wrappers in
mikasa_robo_suite/vla/utils/wrappers.py plus a one-call helper
apply_mikasa_vla_wrappers()
that picks the correct per-task chain automatically. This page groups the
individual wrappers by purpose and shows when and how to compose them
manually if you ever need to.
For the full API reference, see Wrappers API.
The Default: apply_mikasa_vla_wrappers#
For any of the 90 MIKASA-Robo-VLA tasks, the canonical wrapper stack — state→dict, curriculum noop (where needed), task-specific overlays, RGB-flatten, and EEF proprioception — is applied by a single function:
import gymnasium as gym
import mikasa_robo_suite.vla.memory_envs # registers VLA env IDs
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env = gym.make(
"RememberColor3-VLA-v0",
num_envs=1,
obs_mode="rgb",
control_mode="pd_ee_delta_pose",
render_mode="all",
)
env = apply_mikasa_vla_wrappers(env) # default for every task
The helper is the recommended entry point. It guarantees:
- the same core obs format (obs["rgb"] and obs["proprio"]) that the published datasets use, plus obs["task_cue"] only for the Rotate* family (the angle cue used by RL oracles; VLA policies should ignore it because the same value is already in info["language_instruction"]);
- the correct cue-phase noop wrapper for VLA tasks that need one (CurriculumPhaseNoopActionWrapper in the canonical pd_ee_delta_pose action space);
- task-specific render overlays during env.render() for human-watchable videos.
For headless, metric-only evaluations, pass include_overlays=False.
Only the four functional wrappers are kept and no text is drawn on rendered frames,
but observations and rewards are byte-identical to the default:
env = apply_mikasa_vla_wrappers(env, include_overlays=False)
See mikasa_robo_suite.vla.utils.apply_wrappers.apply_mikasa_vla_wrappers()
for the full per-env mapping.
Composition Order (manual)#
You only need this section if you are intentionally composing wrappers by
hand — for example, to reproduce the dataset collector or to experiment
with a non-standard chain. Otherwise use apply_mikasa_vla_wrappers.
Always apply wrappers in this order:
gym.make(...)
└─ StateOnlyTensorToDictWrapper ← must be first
└─ CurriculumPhaseNoopActionWrapper (cue-phase VLA tasks)
└─ <task-specific info wrappers> (overlays)
└─ <render / debug wrappers> (overlays, dev only)
└─ FlattenRGBDObservationWrapper(rgb=True, joints=True)
└─ ConvertJointsToEEFXyzRpyGripperWrapper
└─ RecordEpisode (outermost)
The env_info helper inside the PPO collector returns the same
task-specific overlay chain that apply_mikasa_vla_wrappers builds
internally, so you can inspect or replicate it manually:
# env_info lives in the PPO collector module; the dataset_collectors
# package is included with mikasa-robo-suite and is importable at runtime.
from mikasa_robo_suite.vla.dataset_collectors.get_mikasa_robo_datasets import env_info
wrappers_list, episode_timeout = env_info("RememberColor9-VLA-v0")
for wrapper_class, wrapper_kwargs in wrappers_list:
env = wrapper_class(env, **wrapper_kwargs)
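The loop above applies the pairs in list order, so the first entry ends up innermost and the last entry outermost. A minimal sketch of that behaviour, using stand-in classes instead of real environments (Base, Tag, apply_chain, and chain_labels are hypothetical names, not part of mikasa_robo_suite):

```python
# Hypothetical stand-ins; nothing here is imported from mikasa_robo_suite.
class Base:
    """Stand-in for the object returned by gym.make(...)."""


class Tag:
    """Stand-in wrapper that records a label and the env it wraps."""

    def __init__(self, env, label):
        self.env = env
        self.label = label


def apply_chain(env, wrappers_list):
    """Apply (wrapper_class, kwargs) pairs in order; the first pair is innermost."""
    for wrapper_class, wrapper_kwargs in wrappers_list:
        env = wrapper_class(env, **wrapper_kwargs)
    return env


def chain_labels(env):
    """Walk from the outermost wrapper inward and collect the labels."""
    labels = []
    while isinstance(env, Tag):
        labels.append(env.label)
        env = env.env
    return labels


wrapped = apply_chain(Base(), [(Tag, {"label": "first"}), (Tag, {"label": "second"})])
print(chain_labels(wrapped))  # ['second', 'first']
```

This is why the order of the list matters: anything that must see the raw environment output has to come first in the list.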
Core Wrappers#
These two wrappers are part of the standard pipeline and are almost always required.
StateOnlyTensorToDictWrapper#
Converts the raw tensor observation into a dict and injects the
task_cue (filled with sentinel 4242424242 for tasks that do not
expose a numeric cue) and oracle_info fields:
from mikasa_robo_suite.vla.utils.wrappers import StateOnlyTensorToDictWrapper
env = gym.make("RememberColor3-VLA-v0", num_envs=1, obs_mode="rgb",
control_mode="pd_ee_delta_pose", render_mode="all")
env = StateOnlyTensorToDictWrapper(env)
obs, info = env.reset(seed=0)
print(obs.keys()) # dict_keys(['state'/'sensor_data', 'task_cue', 'oracle_info'])
When to use: always — it is the first wrapper in every recommended
stack. Downstream wrappers in apply_mikasa_vla_wrappers then drop
oracle_info and drop task_cue for every task except the
Rotate* family (where PPO oracles need the target angle, and where VLA
policies should still ignore it because the angle is already in
info["language_instruction"]).
ConvertJointsToEEFXyzRpyGripperWrapper#
Converts the flattened raw joint-state input obs["joints"] into the
public VLA proprioception key obs["proprio"] with the 7D
end-effector representation xyz(3) + rpy(3) + gripper(1).
The raw obs["joints"] key is removed from the output dict.
from mikasa_robo_suite.vla.utils.wrappers import (
StateOnlyTensorToDictWrapper,
ConvertJointsToEEFXyzRpyGripperWrapper,
)
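env = StateOnlyTensorToDictWrapper(env)
# In the full chain, FlattenRGBDObservationWrapper(rgb=True, joints=True) sits here
# and is expected to provide obs["joints"] (see Composition Order above).
env = ConvertJointsToEEFXyzRpyGripperWrapper(env)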
After this wrapper, obs["proprio"] is the canonical 7D proprioception
vector used by all VLA datasets and evaluation scripts. See
Observation and Action Space for the field-by-field reference (units, ranges,
how gripper_opening differs from the gripper_command action).
When to use: for dataset collection and any downstream pipeline that
expects the 7D proprio vector (the format used in all published
MIKASA-Robo-VLA datasets).
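The 7D layout xyz(3) + rpy(3) + gripper(1) can be illustrated with a small standalone sketch. It assumes the end-effector orientation is available as a unit quaternion in (w, x, y, z) order and uses the common ZYX (yaw-pitch-roll) Euler convention. The wrapper's actual conventions are not stated on this page, and quat_wxyz_to_rpy and make_proprio are hypothetical helpers, so treat this as an illustration of the vector layout only:

```python
import math


def quat_wxyz_to_rpy(w, x, y, z):
    """Convert a unit quaternion to roll, pitch, yaw in radians (ZYX convention).

    Assumed convention; the real wrapper may differ.
    """
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))  # clamp against rounding
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw


def make_proprio(xyz, quat_wxyz, gripper):
    """Pack position (3), rpy (3), and gripper (1) into one 7-element list."""
    return [*xyz, *quat_wxyz_to_rpy(*quat_wxyz), gripper]


half = math.sqrt(0.5)  # quaternion components for a 90-degree rotation about z
proprio = make_proprio((0.1, 0.2, 0.3), (half, 0.0, 0.0, half), 0.04)
print(len(proprio))  # 7
```

Indices 0 to 2 hold the position, 3 to 5 the roll, pitch, and yaw angles, and index 6 the gripper value.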
Action-Shaping Wrappers#
InitialZeroActionWrapper#
Executes a fixed number of zero-action steps at the start of each episode. Useful for tasks that require the robot to settle before the cue phase begins.
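The idea can be sketched without a simulator. The classes below only mimic the Gymnasium reset/step signatures; CountingEnv, InitialZeroActions, and the n_initial_steps parameter are hypothetical names and do not describe the real wrapper's interface:

```python
# Hypothetical stand-ins; the real wrapper's constructor arguments may differ.
class CountingEnv:
    """Toy env whose observation is the number of steps taken so far."""

    action_dim = 7

    def reset(self, seed=None, options=None):
        self.steps = 0
        return self.steps, {}

    def step(self, action):
        self.steps += 1
        # Returns obs, reward, terminated, truncated, info.
        return self.steps, 0.0, False, False, {"last_action": list(action)}


class InitialZeroActions:
    """Run a fixed number of zero-action steps right after every reset."""

    def __init__(self, env, n_initial_steps):
        self.env = env
        self.n_initial_steps = n_initial_steps

    def reset(self, seed=None, options=None):
        obs, info = self.env.reset(seed=seed, options=options)
        zero_action = [0.0] * self.env.action_dim
        for _ in range(self.n_initial_steps):
            obs, _, _, _, info = self.env.step(zero_action)
        return obs, info

    def step(self, action):
        return self.env.step(action)


env = InitialZeroActions(CountingEnv(), n_initial_steps=5)
obs, info = env.reset(seed=0)
print(obs)  # 5
```

The agent's first real action therefore arrives only after the settling steps have already been executed.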
CurriculumPhaseNoopActionWrapper#
Replaces the agent's action with a no-op during the cue phase and, where present, the empty phase. This keeps the robot still so that the cue stays fully visible, mirroring the behaviour of the PPO oracle.
When to use: include it when reproducing the train-data rollout setup for
any task whose PPO env_info stack contains it. The
apply_mikasa_vla_wrappers()
helper adds it for those PPO-collected cue-phase tasks. Motion-planning data is
exported after replay through a plain pd_ee_delta_pose rollout, so the VLA
helper does not add a curriculum action filter for MP-only tasks.
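A minimal sketch of the substitution, assuming the cue phase is simply the first cue_steps steps of an episode. How the real wrapper detects the phase is not described on this page, and RecordingEnv, CuePhaseNoop, and cue_steps are hypothetical names. Zeros serve as the no-op here because the canonical action space is a delta space:

```python
# Hypothetical stand-ins; phase detection by step count is an assumption.
class RecordingEnv:
    """Toy env that remembers every action it actually receives."""

    def reset(self, seed=None, options=None):
        self.received = []
        return 0, {}

    def step(self, action):
        self.received.append(list(action))
        return len(self.received), 0.0, False, False, {}


class CuePhaseNoop:
    """Replace the agent's action with zeros for the first cue_steps steps."""

    def __init__(self, env, cue_steps):
        self.env = env
        self.cue_steps = cue_steps
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return self.env.reset(seed=seed, options=options)

    def step(self, action):
        if self.t < self.cue_steps:
            action = [0.0] * len(action)  # no-op while the cue is shown
        self.t += 1
        return self.env.step(action)


inner = RecordingEnv()
env = CuePhaseNoop(inner, cue_steps=2)
env.reset()
for _ in range(3):
    env.step([0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(inner.received[0], inner.received[2])
```

The first two recorded actions are all zeros, and the third is the agent's own action passed through unchanged.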
CurriculumPhaseNoopActionWrapperPdJointPos#
pd_joint_pos-aware subclass of CurriculumPhaseNoopActionWrapper.
Plain zeros aren’t a “stand still” command in pd_joint_pos (they
would drive the robot toward qpos = [0, …, 0]), so during the cue
phase this wrapper substitutes the robot’s current qpos plus a
normalized gripper command — i.e. hold the current pose.
When to use: in the pd_joint_pos motion-planning oracle scripts for
BlinkCountButtonPress*. The hold wrapper acts only upstream of replay; the
published VLA train-data rollout and the VLA helper both expose the canonical,
unfiltered pd_ee_delta_pose validation path.
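The difference between a zero action and a hold action in an absolute joint-position space can be seen in a toy calculation. The sketch assumes an idealised controller that moves each joint a fixed fraction of the way to its target. track_target and hold_action are hypothetical helpers, and the gripper value is passed through unchanged here rather than normalized as the real wrapper does:

```python
# Hypothetical helpers; the controller model is an idealisation, not ManiSkill's.
def track_target(qpos, target, gain=0.5):
    """One step of an idealised position controller: move part-way to the target."""
    return [q + gain * (t - q) for q, t in zip(qpos, target)]


def hold_action(qpos, gripper_command):
    """Hold the current pose: command the current joint positions plus a gripper value."""
    return list(qpos) + [gripper_command]


qpos = [0.4, -0.8, 1.2]

# Zeros as an absolute target pull every joint toward 0.
after_zero = track_target(qpos, [0.0, 0.0, 0.0])
# Using the current qpos as the target leaves the arm where it is.
after_hold = track_target(qpos, hold_action(qpos, gripper_command=1.0)[:3])

print(after_zero)  # [0.2, -0.4, 0.6]
print(after_hold)  # [0.4, -0.8, 1.2]
```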
CameraShutdownWrapper#
Disables the cameras during the memory phase of tasks where the cameras are explicitly turned off as part of the task design (e.g. BatteriesChecker). This ensures that the agent cannot cheat by looking at occluded objects.
Render / Debug Wrappers#
These wrappers overlay task-specific information on top of the rendered video frame. They are intended for local debugging and video generation; do not use them during benchmark evaluation.
| Wrapper | What it overlays |
|---|---|
| | Current step count. |
| | Per-step reward value. |
| | Button press progress bar (BlinkCountButtonPress tasks). |
| | Ground-truth working battery positions (BatteriesChecker tasks). |
| | Which cup hides the ball (ShellGame tasks). |
| | Target and current rotation angle (Rotate tasks). |
| | Target path overlay (TraceShape tasks). |
| | Timer countdown (TimedTransfer tasks). |
| | Breakdown of reward sub-terms for reward engineering. |
Task-Specific Info Wrappers#
These wrappers inject task-specific fields into the info dict returned
by env.step(). They are used during collection and can be useful for
evaluation scripts that need access to ground-truth labels.
| Wrapper | Added to info |
|---|---|
| | Ground-truth target colour index. |
| | Ground-truth target shape index. |
| | Ground-truth target shape and colour pair. |
| | All memorised items for capacity-memory tasks. |
Minimal Stacks for Common Scenarios#
The recommended stacks below all wrap through
apply_mikasa_vla_wrappers().
For a manual chain see Wrappers API and the per-env “Recommended
Wrappers” sections in Environments & Tasks.
Dataset collection / VLA training data (any task):
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env = gym.make(env_id, num_envs=N, obs_mode="rgb",
control_mode="pd_ee_delta_pose", render_mode="all")
env = apply_mikasa_vla_wrappers(env)
VLA evaluation (clean observations, no rendered overlays):
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env = gym.make(env_id, num_envs=N, obs_mode="rgb",
control_mode="pd_ee_delta_pose")
env = apply_mikasa_vla_wrappers(env, include_overlays=False)
Local debugging with video (overlays kept, RecordEpisode outermost):
from mani_skill.utils.wrappers import RecordEpisode
from mikasa_robo_suite.vla.utils.apply_wrappers import apply_mikasa_vla_wrappers
env = gym.make(env_id, num_envs=N, obs_mode="rgb",
control_mode="pd_ee_delta_pose", render_mode="all")
env = apply_mikasa_vla_wrappers(env)
env = RecordEpisode(env, f"./videos/{env_id}", max_steps_per_video=max_steps)