mjlab.envs

RL environment classes.

class mjlab.envs.ManagerBasedRlEnv

Bases: object

Manager-based RL environment.

__init__(cfg: ManagerBasedRlEnvCfg, device: str, render_mode: str | None = None, **kwargs) -> None
close() -> None
property device: str

Device for computation.

is_vector_env = True
load_managers() -> None

Load and initialize all managers.

Load order matters: event and command managers must be loaded first, then action and observation managers, then the remaining RL managers.

property max_episode_length: int

Maximum episode length in steps.

property max_episode_length_s: float

Maximum episode length in seconds.

metadata = {'mujoco_version': '3.6.0', 'render_modes': [None, 'rgb_array'], 'warp_version': warp.config.version}
property num_envs: int

Number of parallel environments.

property physics_dt: float

Physics simulation step size.

render() -> ndarray | None
reset(*, seed: int | None = None, env_ids: Tensor | None = None, options: dict[str, Any] | None = None) -> tuple[Dict[str, Tensor | Dict[str, Tensor]], dict]
static seed(seed: int = -1) -> int
setup_manager_visualizers() -> None
step(action: Tensor) -> tuple[Dict[str, Tensor | Dict[str, Tensor]], Tensor, Tensor, Tensor, dict]

Run one environment step: apply actions, simulate, compute RL signals.

Forward-call placement. MuJoCo’s mj_step runs forward kinematics before integration, so after stepping, derived quantities (xpos, xquat, site_xpos, cvel, sensordata) lag qpos/qvel by one physics substep. Rather than calling sim.forward() twice (once after the decimation loop and once after the reset block), this method calls it once, right before observation computation. This single call refreshes derived quantities for all envs: non-reset envs pick up post-decimation kinematics, reset envs pick up post-reset kinematics.

The tradeoff is that termination and reward managers see derived quantities that are stale by one physics substep (the last mj_step ran mj_forward from pre-integration qpos). In practice, the staleness is negligible for reward shaping and termination checks. Critically, the staleness is consistent: every env, every step, always sees the same lag, so the MDP is well-defined and the value function can learn the correct mapping.
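
The lag described above can be seen in a toy one-dimensional model. This is an illustrative sketch only: ToySim, its fields, and its methods are hypothetical stand-ins, not MuJoCo or mjlab APIs, and the derived quantity xpos is simply a copy of qpos.

```python
class ToySim:
    """Hypothetical 1-D simulator used only to illustrate the lag."""

    def __init__(self, qpos: float = 0.0, qvel: float = 1.0, dt: float = 0.5):
        self.qpos = qpos
        self.qvel = qvel
        self.dt = dt
        self.xpos = qpos  # derived quantity, refreshed only by forward()

    def forward(self) -> None:
        # Recompute derived quantities from the current state.
        self.xpos = self.qpos

    def step(self) -> None:
        # Like the behavior described for mj_step: forward first, then integrate.
        self.forward()
        self.qpos += self.qvel * self.dt


sim = ToySim()
sim.step()
stale_xpos = sim.xpos  # still reflects the pre-integration qpos
sim.forward()
fresh_xpos = sim.xpos  # now matches the post-integration qpos
print(stale_xpos, fresh_xpos, sim.qpos)
```

In this toy model the derived value always trails the state by exactly one substep until forward() is called, which mirrors the consistent lag that the termination and reward managers see.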

Note

Event and command authors do not need to call sim.forward() themselves. This method handles it. The only constraint is: do not read derived quantities (root_link_pose_w, body_link_vel_w, etc.) in the same function that writes state (write_root_state_to_sim, write_joint_state_to_sim, etc.). See FAQ & Troubleshooting for details.
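
A minimal interaction loop might look like the sketch below. It assumes the five-element tuple returned by step follows the Gymnasium ordering (observations, reward, terminated, truncated, extras); the signature above gives only the types, so that ordering is an assumption. StubEnv and rollout are hypothetical names used to keep the sketch self-contained; a real ManagerBasedRlEnv would return tensors batched over num_envs rather than Python scalars.

```python
class StubEnv:
    """Hypothetical stand-in that mimics the reset/step return shapes."""

    def reset(self):
        # (observations, extras)
        return {"actor": [0.0]}, {}

    def step(self, action):
        # Assumed ordering: (observations, reward, terminated, truncated, extras)
        return {"actor": [0.0]}, 1.0, False, False, {}


def rollout(env, policy, num_steps: int) -> list:
    """Step the environment num_steps times and collect the rewards."""
    obs, extras = env.reset()
    rewards = []
    for _ in range(num_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, extras = env.step(action)
        rewards.append(reward)
    return rewards


collected = rollout(StubEnv(), policy=lambda obs: 0.0, num_steps=3)
print(collected)
```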

property step_dt: float

Environment step size (physics_dt * decimation).

property unwrapped: ManagerBasedRlEnv

Get the unwrapped environment (base case for wrapper chains).

update_visualizers(visualizer: DebugVisualizer) -> None
cfg: ManagerBasedRlEnvCfg

class mjlab.envs.ManagerBasedRlEnvCfg

Bases: object

Configuration for a manager-based RL environment.

This config defines all aspects of an RL environment: the physical scene, observations, actions, rewards, terminations, and optional features like commands and curriculum learning.

The environment step size is sim.mujoco.timestep * decimation. For example, with a 2ms physics timestep and decimation=10, the environment runs at 50Hz.
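
The arithmetic in that example can be checked directly. The variable names below are illustrative and are not mjlab attributes.

```python
# Values taken from the example above.
physics_timestep = 0.002  # seconds; stands in for sim.mujoco.timestep
decimation = 10           # physics steps per environment step

step_dt = physics_timestep * decimation  # environment step size in seconds
control_hz = 1.0 / step_dt               # environment (control) frequency

print(f"step_dt = {step_dt:.3f} s, control rate = {control_hz:.1f} Hz")
```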

__init__(*, decimation: int, scene: mjlab.scene.scene.SceneCfg, observations: dict[str, mjlab.managers.observation_manager.ObservationGroupCfg] = <factory>, actions: dict[str, mjlab.managers.action_manager.ActionTermCfg] = <factory>, events: dict[str, mjlab.managers.event_manager.EventTermCfg] = <factory>, seed: int | None = None, sim: mjlab.sim.sim.SimulationCfg = <factory>, viewer: mjlab.viewer.viewer_config.ViewerConfig = <factory>, episode_length_s: float = 0.0, rewards: dict[str, mjlab.managers.reward_manager.RewardTermCfg] = <factory>, terminations: dict[str, mjlab.managers.termination_manager.TerminationTermCfg] = <factory>, commands: dict[str, mjlab.managers.command_manager.CommandTermCfg] = <factory>, curriculum: dict[str, mjlab.managers.curriculum_manager.CurriculumTermCfg] = <factory>, metrics: dict[str, mjlab.managers.metrics_manager.MetricsTermCfg] = <factory>, is_finite_horizon: bool = False, scale_rewards_by_dt: bool = True) -> None
episode_length_s: float = 0.0

Duration of an episode (in seconds).

Episode length in steps is computed as:

ceil(episode_length_s / (sim.mujoco.timestep * decimation))
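
As a worked instance of this formula, the sketch below uses a hypothetical helper, episode_length_steps, with made-up values chosen so that the division is not exact and the rounding-up is visible.

```python
import math


def episode_length_steps(episode_length_s: float, timestep: float, decimation: int) -> int:
    """Apply the documented formula: ceil(episode_length_s / (timestep * decimation))."""
    step_dt = timestep * decimation
    return math.ceil(episode_length_s / step_dt)


# 1.0 s episode, 3 ms physics timestep, decimation 10 -> step_dt = 0.03 s.
# 1.0 / 0.03 is about 33.3, so the episode is rounded up to 34 steps.
num_steps = episode_length_steps(1.0, 0.003, 10)
print(num_steps)
```

Because the formula rounds up, the last environment step may extend slightly past episode_length_s.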

is_finite_horizon: bool = False

Whether the task has a finite or infinite horizon. Defaults to False (infinite).

  • Finite horizon (True): The time limit defines the task boundary. When reached, no future value exists beyond it, so the agent receives a terminal done signal.

  • Infinite horizon (False): The time limit is an artificial cutoff. The agent receives a truncated done signal to bootstrap the value of continuing beyond the limit.
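
The practical difference between the two cases shows up in how a learner forms its value target at the time limit. The sketch below is a generic one-step TD target, not mjlab or any particular RL library's code; time_limit_flags and td_target are hypothetical helpers.

```python
def time_limit_flags(is_finite_horizon: bool) -> tuple[bool, bool]:
    """Return (terminated, truncated) for a step that hits the time limit."""
    return (True, False) if is_finite_horizon else (False, True)


def td_target(reward: float, next_value: float, terminated: bool, gamma: float = 0.99) -> float:
    """One-step TD target: bootstrap from next_value unless the episode terminated."""
    return reward + (0.0 if terminated else gamma * next_value)


# Finite horizon: the time limit is terminal, so no future value is added.
terminated, truncated = time_limit_flags(is_finite_horizon=True)
finite_target = td_target(1.0, 10.0, terminated, gamma=0.5)

# Infinite horizon: the time limit truncates, so the target still bootstraps.
terminated, truncated = time_limit_flags(is_finite_horizon=False)
infinite_target = td_target(1.0, 10.0, terminated, gamma=0.5)

print(finite_target, infinite_target)
```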

scale_rewards_by_dt: bool = True

Whether to multiply rewards by the environment step duration (dt).

When True (default), reward values are scaled by step_dt to normalize cumulative episodic rewards across different simulation frequencies. Set to False for algorithms that expect unscaled reward signals (e.g., HER, static reward scaling).
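
To see why the scaling normalizes episodic returns, consider a constant raw reward of 1.0 per step over a 10-second episode at two control rates. The helper episode_return is hypothetical and only mimics the multiplication by step_dt described above.

```python
def episode_return(step_dt: float, episode_length_s: float,
                   scale_rewards_by_dt: bool, raw_reward: float = 1.0) -> float:
    """Sum a constant per-step reward over an episode, optionally scaled by step_dt."""
    num_steps = round(episode_length_s / step_dt)
    per_step = raw_reward * step_dt if scale_rewards_by_dt else raw_reward
    return num_steps * per_step


for hz in (50, 100):
    dt = 1.0 / hz
    unscaled = episode_return(dt, 10.0, scale_rewards_by_dt=False)
    scaled = episode_return(dt, 10.0, scale_rewards_by_dt=True)
    print(f"{hz} Hz: unscaled={unscaled:.1f}, scaled={scaled:.1f}")
```

Without scaling, the return doubles when the control rate doubles; with scaling, both rates give approximately the same return.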

seed: int | None = None

Random seed for reproducibility. If None, a random seed is used. The actual seed used is stored back into this field after initialization.

decimation: int

Number of physics simulation steps per environment step. Higher values mean a lower control frequency. Environment step duration = physics_dt * decimation.

scene: SceneCfg

Scene configuration defining terrain, entities, and sensors. The scene specifies num_envs, the number of parallel environments.

observations: dict[str, ObservationGroupCfg]

Observation groups configuration. Each group (e.g., “actor”, “critic”) contains observation terms that are concatenated. Groups can have different settings for noise, history, and delay.

actions: dict[str, ActionTermCfg]

Action terms configuration. Each term controls a specific entity/aspect (e.g., joint positions). Action dimensions are concatenated across terms.
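
One way to picture the concatenation is that each term owns a contiguous slice of the flat action vector. The sketch below illustrates that idea with plain lists; split_actions and the term names are hypothetical, and the actual slicing performed by the action manager may differ.

```python
def split_actions(flat_action: list, term_dims: dict[str, int]) -> dict[str, list]:
    """Slice a flat action vector into per-term chunks, in dict order."""
    chunks = {}
    start = 0
    for name, dim in term_dims.items():
        chunks[name] = flat_action[start:start + dim]
        start += dim
    if start != len(flat_action):
        raise ValueError(f"expected {start} action values, got {len(flat_action)}")
    return chunks


# Two hypothetical terms with 3 and 2 dimensions give a 5-dimensional action.
per_term = split_actions([0.1, 0.2, 0.3, 0.4, 0.5], {"leg_joints": 3, "arm_joints": 2})
print(per_term)
```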

events: dict[str, EventTermCfg]

Event terms for domain randomization and state resets. The default includes reset_scene_to_default, which resets entities to their initial state. Set this to an empty dict to disable all events, including the default reset.

sim: SimulationCfg

Simulation configuration including physics timestep, solver iterations, contact parameters, and NaN guarding.

viewer: ViewerConfig

Viewer configuration for rendering (camera position, resolution, etc.).

rewards: dict[str, RewardTermCfg]

Reward terms configuration.

terminations: dict[str, TerminationTermCfg]

Termination terms configuration. If empty, episodes never reset. Use mdp.time_out with time_out=True for episode time limits.

commands: dict[str, CommandTermCfg]

Command generator terms (e.g., velocity targets).

curriculum: dict[str, CurriculumTermCfg]

Curriculum terms for adaptive difficulty.

metrics: dict[str, MetricsTermCfg]

Custom metric terms for logging per-step values as episode averages.