Observation and Action Space
MIKASA-Robo-VLA exposes two observation modes that can be selected when constructing any environment. The same action and reward interface applies regardless of which mode you choose.
Observation Modes
- obs_mode="state": Privileged simulator state, an efficient flat tensor that contains the full physical state of the scene. Use it for PPO oracle training, fast debugging, and reward sanity checks. In this mode the raw ManiSkill observation is a single state tensor and there are no camera images.
- obs_mode="rgb": RGB image observations from two cameras (top-down and wrist-mounted), plus proprioception. This is the standard mode for VLA training and evaluation. In this mode the raw ManiSkill observation contains sensor_data (per-camera RGB) and agent / extra joint state, with no flat state tensor.
The two modes are mutually exclusive: pick one when calling gym.make.
Raw vs Wrapped Observations
The shapes you actually feed to a VLA model depend on which wrappers are
applied. The canonical chain (used by
apply_mikasa_vla_wrappers()
and by every published dataset collector) is:
gym.make(obs_mode="rgb", control_mode="pd_ee_delta_pose")
└─ StateOnlyTensorToDictWrapper    # adds task_cue + oracle_info
   └─ <task-specific info / overlay wrappers>
      └─ FlattenRGBDObservationWrapper(rgb=True, joints=True)
         # collapses sensor_data → obs["rgb"] (concat 2 cams)
         # exposes obs["joints"] from agent state
         └─ ConvertJointsToEEFXyzRpyGripperWrapper
            # rewrites obs["joints"] → obs["proprio"] (7D EEF)
After the canonical chain the observation is a flat Python dict. These are
the keys VLA training and evaluation code should consume (B = num_envs):
| Key | Shape / dtype | Description |
|---|---|---|
| rgb | (B, 128, 128, 6) uint8 | Top-down and wrist cameras concatenated along the channel axis. |
| proprio | (B, 7) float32 | Absolute end-effector pose + gripper opening; see the next section. |
| task_cue | | RL-only. Numeric target-angle cue exposed exclusively by the Rotate* family (PPO oracles need it because the angle is not inferable from a single RGB frame). The canonical VLA helper drops the key for all other tasks; for Rotate* tasks VLA policies may ignore it because the same information is already in info["language_instruction"]. |
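As a rough sanity check on the layout in the table above, the following sketch validates a wrapped observation by key, shape, and dtype. The helper name `check_canonical_obs` and the `SimpleNamespace` stand-ins are hypothetical and exist only so the snippet runs without a simulator. The rgb shape `(B, 128, 128, 6)` is an assumption derived from two `(B, 128, 128, 3)` cameras concatenated on the channel axis.

```python
from types import SimpleNamespace

# Expected trailing shape and dtype name per key (the batch dim B comes first).
# Assumption: rgb is two 128x128 RGB cameras concatenated on the channel axis.
EXPECTED = {
    "rgb": ((128, 128, 6), "uint8"),
    "proprio": ((7,), "float32"),
}


def check_canonical_obs(obs, num_envs):
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for key, (tail, dtype) in EXPECTED.items():
        if key not in obs:
            problems.append(f"missing key {key!r}")
            continue
        expected_shape = (num_envs, *tail)
        shape = tuple(obs[key].shape)
        if shape != expected_shape:
            problems.append(f"{key}: shape {shape}, expected {expected_shape}")
        # str(dtype) is "torch.uint8" for torch tensors and "uint8" for NumPy.
        if not str(obs[key].dtype).endswith(dtype):
            problems.append(f"{key}: dtype {obs[key].dtype}, expected {dtype}")
    return problems


# Stand-in objects with .shape and .dtype so the sketch runs without a simulator.
fake_obs = {
    "rgb": SimpleNamespace(shape=(4, 128, 128, 6), dtype="uint8"),
    "proprio": SimpleNamespace(shape=(4, 7), dtype="float32"),
}
print(check_canonical_obs(fake_obs, num_envs=4))  # []
```

Because the check only reads `.shape` and `.dtype`, the same function should accept the tensors returned by a wrapped environment.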
Note
If you skip the canonical chain and use only
StateOnlyTensorToDictWrapper,
the raw ManiSkill keys sensor_data (per camera) and agent /
extra are preserved instead of the flattened obs["rgb"] /
obs["proprio"]. In that case the per-camera RGB is reachable at
obs["sensor_data"]["base_camera"]["rgb"] and
obs["sensor_data"]["hand_camera"]["rgb"], each (B, 128, 128, 3)
uint8. All published datasets and the benchmark runner expect the
canonical layout above — prefer it for any VLA pipeline.
The sentinel value 4242424242 is documented in its own section below.
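To recover the individual camera views from a channel-concatenated image, the channel axis can be split back into two 3-channel images. The sketch below uses nested lists so it runs without any array library; `split_cameras` is a hypothetical helper, and the channel order (top-down first, wrist second) is an assumption that should be verified against real data.

```python
def split_cameras(rgb, channels_per_camera=3):
    """Split one H x W x 6 image (nested lists) into two H x W x 3 images.

    Assumes the first three channels belong to the top-down camera and the
    last three to the wrist camera; verify this order against your data.
    """
    top = [[px[:channels_per_camera] for px in row] for row in rgb]
    wrist = [[px[channels_per_camera:] for px in row] for row in rgb]
    return top, wrist


# Tiny 1 x 2 "image" with 6 channels per pixel, standing in for (128, 128, 6).
frame = [[[10, 20, 30, 40, 50, 60], [11, 21, 31, 41, 51, 61]]]
top, wrist = split_cameras(frame)
print(top)    # [[[10, 20, 30], [11, 21, 31]]]
print(wrist)  # [[[40, 50, 60], [41, 51, 61]]]
```

With tensors or arrays the equivalent operation is a slice on the last axis, for example `obs["rgb"][..., :3]` and `obs["rgb"][..., 3:]`.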
Canonical VLA Proprioception and Action
The online wrapped environment and the published VLA datasets use the same
7D proprioception and action format. After
apply_mikasa_vla_wrappers(),
use obs["proprio"] as the VLA proprioception input. It is not a vector
of arm joint angles and it is not an action delta. For one environment the
semantic layout is:
obs["proprio"] =
[
eef_x,
eef_y,
eef_z,
eef_roll,
eef_pitch,
eef_yaw,
gripper_opening,
]
The batched online tensor has shape (B, 7) and dtype float32; the
dataset signal has shape [T, 7] and dtype float32. The values are
constructed from the absolute Panda TCP pose and finger qpos values by
ConvertJointsToEEFXyzRpyGripperWrapper.
| Fields | Range / units | Meaning |
|---|---|---|
| eef_x, eef_y, eef_z | Absolute xyz in metres; not normalized; no finite wrapper clamp. | Absolute end-effector TCP position in the ManiSkill scene frame. |
| eef_roll, eef_yaw | Radians, [-π, π] | Absolute TCP roll and yaw. |
| eef_pitch | Radians, [-π/2, π/2] | Absolute TCP pitch. |
| gripper_opening | Metres, Panda physical range [0.0, 0.08] | Current gripper opening. |
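Since the seven proprioception values are positional, a named view can make downstream code harder to misread. The following is a minimal sketch; `Proprio` and `unpack_proprio` are hypothetical names, not part of the library, and the sample values are illustrative only.

```python
from typing import NamedTuple, Sequence


class Proprio(NamedTuple):
    """Named view of one 7D obs["proprio"] row (absolute pose, not a delta)."""
    eef_x: float
    eef_y: float
    eef_z: float
    eef_roll: float
    eef_pitch: float
    eef_yaw: float
    gripper_opening: float


def unpack_proprio(row: Sequence[float]) -> Proprio:
    """Convert one 7-element row into named fields, rejecting wrong lengths."""
    if len(row) != 7:
        raise ValueError(f"expected 7 values, got {len(row)}")
    return Proprio(*(float(v) for v in row))


# Illustrative values only; they are not taken from a real rollout.
p = unpack_proprio([0.1, -0.2, 0.3, 3.0, 0.1, -1.5, 0.08])
print(p.eef_z, p.gripper_opening)  # 0.3 0.08
```

For a batched tensor, one row can be passed in as a plain list, for example via `obs["proprio"][i].tolist()`.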
The action sent to env.step(action) is also 7D, but it has different
semantics. For one environment it is:
action =
[
delta_eef_x,
delta_eef_y,
delta_eef_z,
delta_eef_roll,
delta_eef_pitch,
delta_eef_yaw,
gripper_command,
]
Use dtype float32. A single action has shape (7,) and a batched action
has shape (B, 7). The action values stored in VLA datasets are in this
normalized pd_ee_delta_pose environment action space.
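| Fields | User-facing range | Meaning |
|---|---|---|
| delta_eef_x, delta_eef_y, delta_eef_z | Each value in [-1, 1] | Relative end-effector translation command, not an absolute xyz pose. |
| delta_eef_roll, delta_eef_pitch, delta_eef_yaw | Each value is supplied in [-1, 1] | Relative end-effector orientation command, not an absolute rpy pose. |
| gripper_command | [-1, 1] | Normalized gripper position command. It is not the gripper opening in metres reported in obs["proprio"][..., 6]. |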
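One way to assemble an action in the 7D layout above is sketched below. `make_action` is a hypothetical helper; it assumes the normalized range is [-1, 1] for all seven components (the text states this explicitly only for `action[6]`) and clips out-of-range values rather than raising.

```python
def make_action(delta_xyz, delta_rpy, gripper):
    """Assemble one 7D action and clip every component to [-1, 1]."""
    if len(delta_xyz) != 3 or len(delta_rpy) != 3:
        raise ValueError("delta_xyz and delta_rpy must each have 3 values")
    values = [*delta_xyz, *delta_rpy, gripper]
    # Assumption: all seven components share the normalized range [-1, 1].
    return [max(-1.0, min(1.0, float(v))) for v in values]


action = make_action([0.5, 0.0, -2.0], [0.0, 0.0, 0.25], gripper=1.0)
print(action)  # [0.5, 0.0, -1.0, 0.0, 0.0, 0.25, 1.0]
```

The result is a plain list; converting it to a float32 tensor of shape `(7,)` or `(B, 7)` before calling `env.step(action)` is left to the caller.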
The Panda gripper has two symmetric fingers. Each finger’s qpos range is
[0.0, 0.04] m, so obs["proprio"][..., 6] (the sum of both) is in
[0.0, 0.08] m. The gripper controller target is sent per-finger in
[-0.01, 0.04] m; the slightly negative lower bound helps apply closing
force against a grasped object. This internal controller detail does not
affect the normalised action interface — action[6] is always in [-1, 1].
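The relationship between the normalized gripper command, the per-finger controller target, and the gripper opening in the proprioception vector can be illustrated with a small worked example. The linear rescaling below is an assumption about how a normalized command maps onto the [-0.01, 0.04] m target range; both function names are hypothetical.

```python
FINGER_TARGET_LOW = -0.01   # metres, per-finger controller lower bound
FINGER_TARGET_HIGH = 0.04   # metres, per-finger controller upper bound


def gripper_command_to_finger_target(command):
    """Map a normalized command in [-1, 1] to a per-finger target in metres.

    Assumes a plain linear rescaling of the clipped command.
    """
    command = max(-1.0, min(1.0, command))
    half_span = 0.5 * (FINGER_TARGET_HIGH - FINGER_TARGET_LOW)
    midpoint = 0.5 * (FINGER_TARGET_HIGH + FINGER_TARGET_LOW)
    return midpoint + half_span * command


def gripper_opening(left_qpos, right_qpos):
    """Gripper opening as in obs["proprio"][..., 6]: the sum of both fingers."""
    return left_qpos + right_qpos


print(round(gripper_command_to_finger_target(-1.0), 4))  # -0.01
print(round(gripper_command_to_finger_target(1.0), 4))   # 0.04
print(round(gripper_opening(0.04, 0.04), 4))             # 0.08
```

Under this assumption a command of -1 requests the slightly negative closing target and +1 requests a fully open finger, while the observed opening stays within [0.0, 0.08] m.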
Dataset Signals
In published NPZ, RLDS, and LeRobot VLA datasets, each timestep stores the same signal semantics:
| Signal | Shape/dtype | Description |
|---|---|---|
| RGB | | Top and wrist RGB images pre-concatenated on the channel axis. |
| Proprioception | [T, 7] float32 | Same absolute vector as online obs["proprio"]. |
| Action | | Same normalized pd_ee_delta_pose action that is passed to env.step(action). |
| Language instruction | string | Natural-language task instruction. |
| Reward | | Per-step reward in the chosen reward mode. |
| Success | | Whether the episode success condition was met at this step. |
Always pass control_mode="pd_ee_delta_pose" when constructing
environments for dataset collection or VLA evaluation.
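When loading an episode, it can help to confirm that every per-timestep signal has the same length T and to locate the first successful step. The sketch below uses a toy episode held in a plain dict; the key names are hypothetical placeholders and may not match the names used in the published NPZ, RLDS, or LeRobot files.

```python
def check_episode_lengths(episode):
    """Return T if every signal in the episode has the same length, else raise."""
    lengths = {name: len(values) for name, values in episode.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"inconsistent signal lengths: {lengths}")
    return next(iter(lengths.values()))


def first_success_step(success_flags):
    """Index of the first timestep whose success flag is set, or None."""
    for t, flag in enumerate(success_flags):
        if flag:
            return t
    return None


# Hypothetical key names and toy values; real datasets may name signals differently.
episode = {
    "proprio": [[0.0] * 7, [0.0] * 7, [0.0] * 7],
    "action": [[0.0] * 7, [0.0] * 7, [0.0] * 7],
    "reward": [0.0, 0.0, 1.0],
    "success": [False, False, True],
}
print(check_episode_lengths(episode))          # 3
print(first_success_step(episode["success"]))  # 2
```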
Reward Modes
Pass reward_mode to gym.make (or the PPO script) to select the
reward signal:
| Mode | Description |
|---|---|
| sparse | Binary reward: 1.0 on task success, 0.0 otherwise. Standard for imitation learning; not used during PPO training. |
| dense | Shaped reward summing task-specific sub-goal bonuses. Values vary per environment. |
| normalized_dense | Dense reward normalised to [0, 1]. |
Sentinel Values
StateOnlyTensorToDictWrapper uses the sentinel value 4242424242 for
task_cue and oracle_info when the underlying environment does not
expose them. Later wrappers in the canonical chain then drop the keys
that are not meaningful for VLA policies:
- oracle_info is removed for every task: it is privileged and never visible to a VLA policy in canonical evaluation.
- task_cue is kept only for the Rotate* tasks (the rotation angle in degrees), since PPO oracles trained from RGB cannot recover the target angle from images alone. For all other tasks the key is dropped after the canonical chain.
Check the final wrapped observation by key presence. VLA pipelines that want strictly images plus proprioception can simply ignore task_cue:
task_cue = obs.get("task_cue") # None for non-Rotate tasks
# VLA policies may ignore task_cue: the same information is already
# encoded in info["language_instruction"].
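If only StateOnlyTensorToDictWrapper is applied, the sentinel placeholders are still present and may need to be filtered by hand. The following sketch shows one way to do that for plain scalar values; `drop_sentinel_entries` is a hypothetical helper, and tensor-valued entries would additionally need an element-wise reduction and attention to dtype.

```python
SENTINEL = 4242424242  # placeholder value used for unavailable task_cue / oracle_info


def drop_sentinel_entries(obs, keys=("task_cue", "oracle_info")):
    """Return a shallow copy of obs without listed keys that hold the sentinel."""
    cleaned = dict(obs)
    for key in keys:
        # Only scalar values are compared here; tensors need a reduction first.
        if key in cleaned and cleaned[key] == SENTINEL:
            del cleaned[key]
    return cleaned


# Toy observation: task_cue carries a real angle, oracle_info is a placeholder.
raw = {"proprio": [0.0] * 7, "task_cue": 90, "oracle_info": SENTINEL}
print(sorted(drop_sentinel_entries(raw)))  # ['proprio', 'task_cue']
```

The input dict is left untouched, so the raw observation remains available for debugging.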