Observation and Action Space
============================

MIKASA-Robo-VLA exposes two observation modes that can be selected when
constructing any environment.  The same action and reward interface applies
regardless of which mode you choose.

Observation Modes
-----------------

``obs_mode="state"``
   Privileged simulator state — an efficient flat tensor that contains the
   full physical state of the scene.  Use it for PPO oracle training, fast
   debugging, and reward sanity checks.  In this mode the raw ManiSkill
   observation is a single ``state`` tensor and there are **no camera images**.

``obs_mode="rgb"``
   RGB image observations from two cameras (top-down and wrist-mounted), plus
   proprioception.  This is the standard mode for VLA training and evaluation.
   In this mode the raw ManiSkill observation contains ``sensor_data``
   (per-camera RGB) and the ``agent`` / ``extra`` joint state; there is no
   flat ``state`` tensor.

The two modes are mutually exclusive: pick one when calling ``gym.make``.

Raw vs Wrapped Observations
---------------------------

The shapes you actually feed to a VLA model depend on which wrappers are
applied.  The canonical chain (used by
:func:`~mikasa_robo_suite.vla.utils.apply_wrappers.apply_mikasa_vla_wrappers`
and by every published dataset collector) is:

.. code-block:: text

   gym.make(obs_mode="rgb", control_mode="pd_ee_delta_pose")
     └─ StateOnlyTensorToDictWrapper          # adds task_cue + oracle_info
         └─ <task-specific info / overlay wrappers>
             └─ FlattenRGBDObservationWrapper(rgb=True, joints=True)
                  # collapses sensor_data → obs["rgb"] (concat 2 cams)
                  # exposes obs["joints"] from agent state
                 └─ ConvertJointsToEEFXyzRpyGripperWrapper
                      # rewrites obs["joints"] → obs["proprio"]  (7D EEF)

After the canonical chain the observation is a flat Python dict.  These are
the keys VLA training and evaluation code should consume (``B = num_envs``):

.. list-table:: Canonical wrapped observation (``apply_mikasa_vla_wrappers``)
   :header-rows: 1
   :widths: 22 22 56

   * - Key
     - Shape / dtype
     - Description
   * - ``obs["rgb"]``
     - ``(B, 128, 128, 6)``  uint8
     - Top-down and wrist cameras concatenated along the channel axis.
       ``obs["rgb"][..., :3]`` is ``base_camera`` (top-down);
       ``obs["rgb"][..., 3:6]`` is ``hand_camera`` (wrist).
   * - ``obs["proprio"]``
     - ``(B, 7)``  float32
     - Absolute end-effector pose + gripper opening — see the next section.
   * - ``obs["task_cue"]``
     - ``(B, P)``  float32
     - **RL-only.**  Numeric target-angle cue exposed exclusively by the
       *Rotate\** family (PPO oracles need it because the angle is not
       inferable from a single RGB frame).  The canonical VLA helper
       *drops* the key for all other tasks; for *Rotate\** tasks VLA
       policies may ignore it because the same information is already in
       ``info["language_instruction"]``.
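
As a minimal sketch of the channel layout in the table above, the following
splits a stacked frame back into its two camera views.  It runs on a synthetic
NumPy array rather than a real observation, and ``split_cameras`` is a
hypothetical helper, not part of the suite.  If the environment returns torch
tensors, the same slicing expressions apply.

.. code-block:: python

   import numpy as np

   def split_cameras(rgb):
       """Split a (..., H, W, 6) stacked frame into (base, wrist) RGB views."""
       if rgb.shape[-1] != 6:
           raise ValueError(f"expected 6 channels, got {rgb.shape[-1]}")
       base = rgb[..., :3]    # base_camera (top-down)
       wrist = rgb[..., 3:6]  # hand_camera (wrist)
       return base, wrist

   # Synthetic stand-in for obs["rgb"] with B = 2 environments.
   fake_rgb = np.zeros((2, 128, 128, 6), dtype=np.uint8)
   fake_rgb[..., 3:6] = 255  # mark the wrist channels so the split is visible

   base, wrist = split_cameras(fake_rgb)
   print(base.shape, wrist.shape)

Both returned arrays are views on the input, so no image data is copied.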

.. note::

   If you skip the canonical chain and use only
   :class:`~mikasa_robo_suite.vla.utils.wrappers.StateOnlyTensorToDictWrapper`,
   the raw ManiSkill keys ``sensor_data`` (per camera) and ``agent`` /
   ``extra`` are preserved instead of the flattened ``obs["rgb"]`` /
   ``obs["proprio"]``.  In that case the per-camera RGB is reachable at
   ``obs["sensor_data"]["base_camera"]["rgb"]`` and
   ``obs["sensor_data"]["hand_camera"]["rgb"]``, each ``(B, 128, 128, 3)``
   uint8.  All published datasets and the benchmark runner expect the
   *canonical* layout above — prefer it for any VLA pipeline.

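The sentinel value ``4242424242``, which ``StateOnlyTensorToDictWrapper``
writes into ``task_cue`` and ``oracle_info`` when an environment does not
expose them, is described in the *Sentinel Values* section below.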

Canonical VLA Proprioception and Action
----------------------------------------

The online wrapped environment and the published VLA datasets use the same
7D proprioception and action format. After
:func:`~mikasa_robo_suite.vla.utils.apply_wrappers.apply_mikasa_vla_wrappers`,
use ``obs["proprio"]`` as the VLA proprioception input. It is **not** a vector
of arm joint angles and it is **not** an action delta. For one environment the
semantic layout is::

   obs["proprio"] =
   [
       eef_x,
       eef_y,
       eef_z,
       eef_roll,
       eef_pitch,
       eef_yaw,
       gripper_opening,
   ]

The batched online tensor has shape ``(B, 7)`` and dtype ``float32``; the
dataset signal has shape ``[T, 7]`` and dtype ``float32``. The values are
constructed from the absolute Panda TCP pose and finger qpos values by
:class:`~mikasa_robo_suite.vla.utils.wrappers.ConvertJointsToEEFXyzRpyGripperWrapper`.

.. list-table:: ``obs["proprio"]`` fields
   :header-rows: 1
   :widths: 18 30 52

   * - Fields
     - Range / units
     - Meaning
   * - ``[0:3]``
     - Absolute xyz in metres; not normalized; no finite wrapper clamp.
     - Absolute end-effector TCP position in the ManiSkill scene frame.
   * - ``[3]`` and ``[5]``
     - Radians, ``[-pi, pi]`` from quaternion-to-Euler conversion.
     - Absolute TCP roll and yaw.
   * - ``[4]``
     - Radians, ``[-pi/2, pi/2]`` from quaternion-to-Euler conversion.
     - Absolute TCP pitch.
   * - ``[6]``
     - Metres, Panda physical range ``[0.0, 0.08]`` after summing the two
       finger qpos values; not normalized.
     - Current gripper opening. ``0`` is closed and about ``0.08`` is open.

The action sent to ``env.step(action)`` is also 7D, but it has different
semantics. For one environment it is::

   action =
   [
       delta_eef_x,
       delta_eef_y,
       delta_eef_z,
       delta_eef_roll,
       delta_eef_pitch,
       delta_eef_yaw,
       gripper_command,
   ]

Use dtype ``float32``. A single action has shape ``(7,)`` and a batched action
has shape ``(B, 7)``. The action values stored in VLA datasets are in this
normalized ``pd_ee_delta_pose`` environment action space.

.. list-table:: ``action`` fields for the Panda ``pd_ee_delta_pose`` controller
   :header-rows: 1
   :widths: 18 30 52

   * - Fields
     - User-facing range
     - Meaning
   * - ``[0:3]``
     - Each value in ``[-1, 1]``. Panda maps this normalized range to
       xyz deltas in ``[-0.1, 0.1]`` metres per control action.
     - Relative end-effector translation command, not an absolute xyz pose.
   * - ``[3:6]``
     - Each value is supplied in ``[-1, 1]``. The Panda EEF-pose controller
       clips the rotation vector norm and applies its ``0.1`` radian rotation
       limit per control action.
     - Relative end-effector orientation command, not an absolute rpy pose.
   * - ``[6]``
     - ``[-1, 1]``. Panda maps ``-1`` toward the closed gripper target and
       ``+1`` toward the open gripper target.
     - Normalized gripper **position command**. It is not
       ``delta_gripper_opening`` and is not the same quantity as
       ``obs["proprio"][..., 6]``.

The Panda gripper has two symmetric fingers.  Each finger's qpos range is
``[0.0, 0.04]`` m, so ``obs["proprio"][..., 6]`` (the sum of both) is in
``[0.0, 0.08]`` m.  The gripper controller target is sent per-finger in
``[-0.01, 0.04]`` m; the slightly negative lower bound helps apply closing
force against a grasped object.  This internal controller detail does not
affect the normalized action interface: ``action[6]`` is always in ``[-1, 1]``.
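
The arithmetic implied by the ranges above can be sketched as follows.  Two
assumptions are made here and are not confirmed by this page: out-of-range
inputs are clipped to ``[-1, 1]``, and the gripper command is mapped linearly
onto the per-finger target range.  Rotation is left out because the
rotation-norm clipping happens inside the controller.  ``describe_action`` is
a hypothetical helper.

.. code-block:: python

   import numpy as np

   POS_LIMIT_M = 0.1                # xyz delta per control action at |a| = 1
   FINGER_TARGET_M = (-0.01, 0.04)  # per-finger controller target range

   def describe_action(action):
       """Return (xyz delta in metres, per-finger target in metres)."""
       a = np.clip(np.asarray(action, dtype=np.float32), -1.0, 1.0)
       delta_xyz_m = a[..., 0:3] * POS_LIMIT_M
       low, high = FINGER_TARGET_M
       # Assumed linear map: -1 -> low (closed), +1 -> high (open).
       finger_target_m = low + (a[..., 6] + 1.0) * 0.5 * (high - low)
       return delta_xyz_m, finger_target_m

   delta, finger = describe_action([0.5, -1.0, 0.0, 0.0, 0.0, 0.0, 1.0])
   print(delta, finger)

Under these assumptions a translation command of ``0.5`` corresponds to a
5 cm step and ``action[6] = 1`` to a per-finger target of 0.04 m.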

Dataset Signals
---------------

In published NPZ, RLDS, and LeRobot VLA datasets, every episode exposes the
same signals with the same semantics (``T`` is the number of timesteps):

.. list-table::
   :header-rows: 1
   :widths: 24 20 56

   * - Signal
     - Shape/dtype
     - Description
   * - ``rgb``
     - ``[T, 128, 128, 6]`` uint8
     - Top and wrist RGB images **pre-concatenated** on the channel axis
       (``[..., :3]`` = ``base_camera``, ``[..., 3:6]`` = ``hand_camera``);
       this matches the online wrapped ``obs["rgb"]`` byte-for-byte, so a
       model trained on the dataset can be evaluated without any reshape.
   * - ``proprio``
     - ``[T, 7]`` float32
     - Same absolute vector as online ``obs["proprio"]``.
   * - ``action``
     - ``[T, 7]`` float32
     - Same normalized ``pd_ee_delta_pose`` action vector accepted by
       ``env.step(action)``.
   * - ``language_instruction``
     - string
     - Natural-language task instruction.
   * - ``reward``
     - ``[T]`` float32
     - Per-step reward in the chosen reward mode.
   * - ``success``
     - ``[T]`` bool
     - Whether the episode success condition was met at this step.

Always pass ``control_mode="pd_ee_delta_pose"`` when constructing
environments for dataset collection or VLA evaluation.
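
A loaded episode can be checked against the signal table with a short
validator.  The sketch below assumes the episode has already been loaded into
a dict of NumPy arrays keyed by the signal names in the table; the actual
on-disk key names of each format may differ, and ``check_episode`` is a
hypothetical helper.

.. code-block:: python

   import numpy as np

   # Per-timestep shape and dtype for each array signal in the table.
   EXPECTED = {
       "rgb": ((128, 128, 6), np.uint8),
       "proprio": ((7,), np.float32),
       "action": ((7,), np.float32),
       "reward": ((), np.float32),
       "success": ((), np.bool_),
   }

   def check_episode(episode):
       """Return a list of problems; an empty list means the layout matches."""
       problems = []
       lengths = set()
       for key, (tail, dtype) in EXPECTED.items():
           if key not in episode:
               problems.append(f"missing key: {key}")
               continue
           arr = np.asarray(episode[key])
           if arr.ndim == 0 or arr.shape[1:] != tail:
               problems.append(f"{key}: unexpected shape {arr.shape}")
               continue
           if arr.dtype != dtype:
               problems.append(f"{key}: unexpected dtype {arr.dtype}")
           lengths.add(arr.shape[0])
       if len(lengths) > 1:
           problems.append(f"signals disagree on T: {sorted(lengths)}")
       return problems

   # Synthetic episode with T = 5 timesteps.
   T = 5
   episode = {
       "rgb": np.zeros((T, 128, 128, 6), dtype=np.uint8),
       "proprio": np.zeros((T, 7), dtype=np.float32),
       "action": np.zeros((T, 7), dtype=np.float32),
       "reward": np.zeros((T,), dtype=np.float32),
       "success": np.zeros((T,), dtype=bool),
   }
   print(check_episode(episode))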

Reward Modes
------------

Pass ``reward_mode`` to ``gym.make`` (or the PPO script) to select the
reward signal:

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - Mode
     - Description
   * - ``sparse``
     - Binary reward: 1.0 on task success, 0.0 otherwise.
       Standard for imitation learning; not used during PPO training.
   * - ``dense``
     - Shaped reward summing task-specific sub-goal bonuses.
       Values vary per environment.
   * - ``normalized_dense``
     - Dense reward normalized to ``[0, 1]`` per environment.
       Recommended for PPO training as it keeps learning rates
       comparable across environments.
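
The ``sparse`` row can be read as a direct cast of the success flag.  A
minimal sketch, assuming per-step boolean ``success`` flags like those stored
in the datasets; ``sparse_reward`` is a hypothetical helper, and the flag
sequence is made up for illustration.

.. code-block:: python

   import numpy as np

   def sparse_reward(success):
       """1.0 where the success flag is set, 0.0 elsewhere."""
       return np.asarray(success, dtype=bool).astype(np.float32)

   success = np.array([False, False, True, True])
   reward = sparse_reward(success)

   # An episode counts as solved if the flag was set at any step.
   episode_solved = bool(success.any())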

Sentinel Values
---------------

``StateOnlyTensorToDictWrapper`` uses the sentinel value **4242424242** for
``task_cue`` and ``oracle_info`` when the underlying environment does not
expose them.  Later wrappers in the canonical chain then drop the keys
that are not meaningful for VLA policies:

- ``oracle_info`` is removed for every task — it is privileged and never
  visible to a VLA policy in canonical evaluation.
- ``task_cue`` is kept **only for the Rotate\* tasks** (the rotation
  angle in degrees), since PPO oracles trained from RGB cannot recover
  the target angle from images alone.  For all other tasks the key is
  dropped after the canonical chain.

Check for ``task_cue`` in the final wrapped observation by key presence.  VLA
pipelines that want strictly *images + proprioception* can simply not read
the key::

   task_cue = obs.get("task_cue")   # None for non-Rotate tasks
   # VLA policies may ignore task_cue: the same information is already
   # encoded in info["language_instruction"].
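
When working directly with ``StateOnlyTensorToDictWrapper`` output, the
sentinel may need to be detected explicitly.  One detail worth knowing is that
``4242424242`` is not exactly representable in ``float32``; the nearest value
is ``4242424320``.  The sketch below therefore rounds the sentinel to the
array's own dtype before comparing.  The dtype in which the wrapper stores the
sentinel is an assumption here, and ``is_sentinel`` is a hypothetical helper.

.. code-block:: python

   import numpy as np

   SENTINEL = 4242424242

   def is_sentinel(values):
       """Elementwise test for the sentinel, robust to float rounding."""
       values = np.asarray(values)
       if np.issubdtype(values.dtype, np.floating):
           # Round the sentinel to the array's own dtype before comparing.
           return values == values.dtype.type(SENTINEL)
       return values == SENTINEL

   # Synthetic cue column: one real angle, one sentinel placeholder.
   cue = np.array([[90.0], [4242424242.0]], dtype=np.float32)
   mask = is_sentinel(cue)
   print(mask.ravel())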