Observation and Action Space#

MIKASA-Robo-VLA exposes two observation modes that can be selected when constructing any environment. The same action and reward interface applies regardless of which mode you choose.

Observation Modes#

obs_mode="state"

Privileged simulator state — an efficient flat tensor that contains the full physical state of the scene. Use it for PPO oracle training, fast debugging, and reward sanity checks. In this mode the raw ManiSkill observation is a single state tensor and there are no camera images.

obs_mode="rgb"

RGB image observations from two cameras (top-down and wrist-mounted), plus proprioception. This is the standard mode for VLA training and evaluation. In this mode the raw ManiSkill observation contains sensor_data (per camera RGB) and agent / extra joint state — no flat state tensor.

The two modes are mutually exclusive: pick one when calling gym.make.

Raw vs Wrapped Observations#

The shapes you actually feed to a VLA model depend on which wrappers are applied. The canonical chain (used by apply_mikasa_vla_wrappers() and by every published dataset collector) is:

gym.make(obs_mode="rgb", control_mode="pd_ee_delta_pose")
  └─ StateOnlyTensorToDictWrapper          # adds task_cue + oracle_info
      └─ <task-specific info / overlay wrappers>
          └─ FlattenRGBDObservationWrapper(rgb=True, joints=True)
               # collapses sensor_data → obs["rgb"] (concat 2 cams)
               # exposes obs["joints"] from agent state
              └─ ConvertJointsToEEFXyzRpyGripperWrapper
                   # rewrites obs["joints"] → obs["proprio"]  (7D EEF)

After the canonical chain the observation is a flat Python dict. These are the keys VLA training and evaluation code should consume (B = num_envs):

Canonical wrapped observation (apply_mikasa_vla_wrappers)#

Key

Shape / dtype

Description

obs["rgb"]

(B, 128, 128, 6) uint8

Top-down and wrist cameras concatenated along the channel axis. obs["rgb"][..., :3] is base_camera (top-down); obs["rgb"][..., 3:6] is hand_camera (wrist).

obs["proprio"]

(B, 7) float32

Absolute end-effector pose + gripper opening — see the next section.

obs["task_cue"]

(B, P) float32

RL-only. Numeric target-angle cue exposed exclusively by the Rotate* family (PPO oracles need it because the angle is not inferable from a single RGB frame). The canonical VLA helper drops the key for all other tasks; for Rotate* tasks VLA policies may ignore it because the same information is already in info["language_instruction"].

Note

If you skip the canonical chain and use only StateOnlyTensorToDictWrapper, the raw ManiSkill keys sensor_data (per camera) and agent / extra are preserved instead of the flattened obs["rgb"] / obs["proprio"]. In that case the per-camera RGB is reachable at obs["sensor_data"]["base_camera"]["rgb"] and obs["sensor_data"]["hand_camera"]["rgb"], each (B, 128, 128, 3) uint8. All published datasets and the benchmark runner expect the canonical layout above — prefer it for any VLA pipeline.

The sentinel value 4242424242 is documented in its own section below.

Canonical VLA Proprioception and Action#

The online wrapped environment and the published VLA datasets use the same 7D proprioception and action format. After apply_mikasa_vla_wrappers(), use obs["proprio"] as the VLA proprioception input. It is not a vector of arm joint angles and it is not an action delta. For one environment the semantic layout is:

obs["proprio"] =
[
    eef_x,
    eef_y,
    eef_z,
    eef_roll,
    eef_pitch,
    eef_yaw,
    gripper_opening,
]

The batched online tensor has shape (B, 7) and dtype float32; the dataset signal has shape [T, 7] and dtype float32. The values are constructed from the absolute Panda TCP pose and finger qpos values by ConvertJointsToEEFXyzRpyGripperWrapper.

obs["proprio"] fields#

Fields

Range / units

Meaning

[0:3]

Absolute xyz in metres; not normalized; no finite wrapper clamp.

Absolute end-effector TCP position in the ManiSkill scene frame.

[3] and [5]

Radians, [-pi, pi] from quaternion-to-Euler conversion.

Absolute TCP roll and yaw.

[4]

Radians, [-pi/2, pi/2] from quaternion-to-Euler conversion.

Absolute TCP pitch.

[6]

Metres, Panda physical range [0.0, 0.08] after summing the two finger qpos values; not normalized.

Current gripper opening. 0 is closed and about 0.08 is open.

The action sent to env.step(action) is also 7D, but it has different semantics. For one environment it is:

action =
[
    delta_eef_x,
    delta_eef_y,
    delta_eef_z,
    delta_eef_roll,
    delta_eef_pitch,
    delta_eef_yaw,
    gripper_command,
]

Use dtype float32. A single action has shape (7,) and a batched action has shape (B, 7). The action values stored in VLA datasets are in this normalized pd_ee_delta_pose environment action space.

action fields for the Panda pd_ee_delta_pose controller#

Fields

User-facing range

Meaning

[0:3]

Each value in [-1, 1]. Panda maps this normalized range to xyz deltas in [-0.1, 0.1] metres per control action.

Relative end-effector translation command, not an absolute xyz pose.

[3:6]

Each value is supplied in [-1, 1]. The Panda EEF-pose controller clips the rotation vector norm and applies its 0.1 radian rotation limit per control action.

Relative end-effector orientation command, not an absolute rpy pose.

[6]

[-1, 1]. Panda maps -1 toward the closed gripper target and +1 toward the open gripper target.

Normalized gripper position command. It is not delta_gripper_opening and is not the same quantity as obs["proprio"][..., 6].

The Panda gripper has two symmetric fingers. Each finger’s qpos range is [0.0, 0.04] m, so obs["proprio"][..., 6] (the sum of both) is in [0.0, 0.08] m. The gripper controller target is sent per-finger in [-0.01, 0.04] m; the slightly negative lower bound helps apply closing force against a grasped object. This internal controller detail does not affect the normalised action interface — action[6] is always in [-1, 1].

Dataset Signals#

In published NPZ, RLDS, and LeRobot VLA datasets, each timestep stores the same signal semantics:

Signal

Shape/dtype

Description

rgb

[T, 128, 128, 6] uint8

Top and wrist RGB images pre-concatenated on the channel axis ([..., :3] = base_camera, [..., 3:6] = hand_camera); this matches the online wrapped obs["rgb"] byte-for-byte, so a model trained on the dataset can be evaluated without any reshape.

proprio

[T, 7] float32

Same absolute vector as online obs["proprio"].

action

[T, 7] float32

Same normalized pd_ee_delta_pose action vector accepted by env.step(action).

language_instruction

string

Natural-language task instruction.

reward

[T] float32

Per-step reward in the chosen reward mode.

success

[T] bool

Whether the episode success condition was met at this step.

Always pass control_mode="pd_ee_delta_pose" when constructing environments for dataset collection or VLA evaluation.

Reward Modes#

Pass reward_mode to gym.make (or the PPO script) to select the reward signal:

Mode

Description

sparse

Binary reward: 1.0 on task success, 0.0 otherwise. Standard for imitation learning; not used during PPO training.

dense

Shaped reward summing task-specific sub-goal bonuses. Values vary per environment.

normalized_dense

Dense reward normalised to [0, 1] per environment. Recommended for PPO training as it keeps learning rates comparable across environments.

Sentinel Values#

StateOnlyTensorToDictWrapper uses the sentinel value 4242424242 for task_cue and oracle_info when the underlying environment does not expose them. Later wrappers in the canonical chain then drop the keys that are not meaningful for VLA policies:

  • oracle_info is removed for every task — it is privileged and never visible to a VLA policy in canonical evaluation.

  • task_cue is kept only for the Rotate* tasks (the rotation angle in degrees), since PPO oracles trained from RGB cannot recover the target angle from images alone. For all other tasks the key is dropped after the canonical chain.

Check the final wrapped observation by key presence — VLA pipelines that want strictly images + proprioception can simply not read it:

task_cue = obs.get("task_cue")   # None for non-Rotate tasks
# VLA policies may ignore task_cue: the same information is already
# encoded in info["language_instruction"].