mjlab.envs#
RL environment classes.
Classes:
ManagerBasedRlEnv: Manager-based RL environment.
ManagerBasedRlEnvCfg: Configuration for a manager-based RL environment.
- class mjlab.envs.ManagerBasedRlEnv[source]#
Bases: object
Manager-based RL environment.
Attributes:
Number of parallel environments.
Physics simulation step size.
Environment step size (physics_dt * decimation).
Device for computation.
Maximum episode length in seconds.
Maximum episode length in steps.
Get the unwrapped environment (base case for wrapper chains).
Methods:
__init__(cfg, device[, render_mode]): Load and initialize all managers.
reset(*[, seed, env_ids, options])
step(action): Run one environment step: apply actions, simulate, compute RL signals.
render()
close()
seed([seed])
update_visualizers(visualizer)
- is_vector_env = True#
- metadata = {'mujoco_version': '3.4.1', 'render_modes': [None, 'rgb_array'], 'warp_version': warp.config.version}#
- __init__(cfg: ManagerBasedRlEnvCfg, device: str, render_mode: str | None = None, **kwargs) → None[source]#
- property unwrapped: ManagerBasedRlEnv#
Get the unwrapped environment (base case for wrapper chains).
- load_managers() None[source]#
Load and initialize all managers.
Order is important! Event and command managers must be loaded first, then action and observation managers, then other RL managers.
- reset(*, seed: int | None = None, env_ids: Tensor | None = None, options: dict[str, Any] | None = None) → tuple[Dict[str, Tensor | Dict[str, Tensor]], dict][source]#
- step(action: Tensor) → tuple[Dict[str, Tensor | Dict[str, Tensor]], Tensor, Tensor, Tensor, dict][source]#
Run one environment step: apply actions, simulate, compute RL signals.
Forward-call placement. MuJoCo’s mj_step runs forward kinematics before integration, so after stepping, derived quantities (xpos, xquat, site_xpos, cvel, sensordata) lag qpos/qvel by one physics substep. Rather than calling sim.forward() twice (once after the decimation loop and once after the reset block), this method calls it once, right before observation computation. This single call refreshes derived quantities for all envs: non-reset envs pick up post-decimation kinematics, reset envs pick up post-reset kinematics.
The tradeoff is that the termination and reward managers see derived quantities that are stale by one physics substep (the last mj_step ran mj_forward from pre-integration qpos). In practice, the staleness is negligible for reward shaping and termination checks. Critically, the staleness is consistent: every env, every step, always sees the same lag, so the MDP is well-defined and the value function can learn the correct mapping.
Note
Event and command authors do not need to call sim.forward() themselves; this method handles it. The only constraint is: do not read derived quantities (root_link_pose_w, body_link_vel_w, etc.) in the same function that writes state (write_root_state_to_sim, write_joint_state_to_sim, etc.). See FAQ & Troubleshooting for details.
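To make the note's constraint concrete, here is a do/don't sketch for a reset event term. The term signature, the env.scene["robot"] lookup, and the data attributes other than those named above are illustrative assumptions rather than the documented mjlab API.

```python
import torch


def safe_reset_event(env, env_ids: torch.Tensor) -> None:
    """OK: only writes state; never reads derived quantities."""
    robot = env.scene["robot"]  # illustrative entity lookup
    root_state = robot.data.default_root_state[env_ids].clone()  # assumed field
    root_state[:, 2] += 0.05  # nudge the root 5 cm upward before writing
    robot.write_root_state_to_sim(root_state, env_ids=env_ids)
    # No sim.forward() needed here: step() refreshes derived quantities
    # once, right before observations are computed.


def unsafe_reset_event(env, env_ids: torch.Tensor) -> None:
    """Not OK: reads a derived quantity in the same function that writes state."""
    robot = env.scene["robot"]
    robot.write_root_state_to_sim(
        robot.data.default_root_state[env_ids], env_ids=env_ids
    )
    # root_link_pose_w is derived from qpos and stays stale until the next
    # forward call, so this read still reflects the pre-reset state.
    stale_pose = robot.data.root_link_pose_w[env_ids]  # don't do this
```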
- class mjlab.envs.ManagerBasedRlEnvCfg[source]#
Bases: object
Configuration for a manager-based RL environment.
This config defines all aspects of an RL environment: the physical scene, observations, actions, rewards, terminations, and optional features like commands and curriculum learning.
The environment step size is sim.mujoco.timestep * decimation. For example, with a 2 ms physics timestep and decimation=10, the environment runs at 50 Hz.
Attributes:
Number of physics simulation steps per environment step.
Scene configuration defining terrain, entities, and sensors.
Observation groups configuration.
Action terms configuration.
Event terms for domain randomization and state resets.
Random seed for reproducibility.
Simulation configuration including physics timestep, solver iterations, contact parameters, and NaN guarding.
Viewer configuration for rendering (camera position, resolution, etc.).
Duration of an episode (in seconds).
Reward terms configuration.
Termination terms configuration.
Command generator terms (e.g., velocity targets).
Curriculum terms for adaptive difficulty.
Custom metric terms for logging per-step values as episode averages.
Whether the task has a finite or infinite horizon.
Whether to multiply rewards by the environment step duration (dt).
Methods:
__init__(*, decimation, scene[, ...])
- decimation: int#
Number of physics simulation steps per environment step. Higher values mean coarser control frequency. Environment step duration = physics_dt * decimation.
- scene: SceneCfg#
Scene configuration defining terrain, entities, and sensors. The scene specifies num_envs, the number of parallel environments.
- observations: dict[str, ObservationGroupCfg]#
Observation groups configuration. Each group (e.g., “actor”, “critic”) contains observation terms that are concatenated. Groups can have different settings for noise, history, and delay.
- actions: dict[str, ActionTermCfg]#
Action terms configuration. Each term controls a specific entity/aspect (e.g., joint positions). Action dimensions are concatenated across terms.
- events: dict[str, EventTermCfg]#
Event terms for domain randomization and state resets. Default includes reset_scene_to_default, which resets entities to their initial state. Can be set to empty to disable all events, including the default reset.
- seed: int | None = None#
Random seed for reproducibility. If None, a random seed is used. The actual seed used is stored back into this field after initialization.
- sim: SimulationCfg#
Simulation configuration including physics timestep, solver iterations, contact parameters, and NaN guarding.
- viewer: ViewerConfig#
Viewer configuration for rendering (camera position, resolution, etc.).
- episode_length_s: float = 0.0#
Duration of an episode (in seconds).
Episode length in steps is computed as:
ceil(episode_length_s / (sim.mujoco.timestep * decimation))
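For instance, with the 2 ms timestep and decimation=10 example from the class docstring, a 20-second episode works out to 1000 steps:

```python
import math

episode_length_s = 20.0   # desired episode duration (s)
physics_timestep = 0.002  # sim.mujoco.timestep (s)
decimation = 10           # physics steps per environment step

steps = math.ceil(episode_length_s / (physics_timestep * decimation))
assert steps == 1000      # 20 s of 20 ms environment steps
```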
- rewards: dict[str, RewardTermCfg]#
Reward terms configuration.
- terminations: dict[str, TerminationTermCfg]#
Termination terms configuration. If empty, episodes never reset. Use mdp.time_out with time_out=True for episode time limits.
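A time-limit-only setup might look like the sketch below. The TerminationTermCfg import path is taken from the __init__ signature at the bottom of this page, but the func/time_out constructor fields and the mdp import location are assumptions inferred from the description above; consult the managers reference for the exact API.

```python
from mjlab.managers.termination_manager import TerminationTermCfg
from mjlab.envs import mdp  # assumed location of the built-in mdp terms

terminations = {
    # Truncate (time out) rather than terminate when the episode clock expires.
    "time_out": TerminationTermCfg(func=mdp.time_out, time_out=True),
}
```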
- commands: dict[str, CommandTermCfg]#
Command generator terms (e.g., velocity targets).
- curriculum: dict[str, CurriculumTermCfg]#
Curriculum terms for adaptive difficulty.
- metrics: dict[str, MetricsTermCfg]#
Custom metric terms for logging per-step values as episode averages.
- is_finite_horizon: bool = False#
Whether the task has a finite or infinite horizon. Defaults to False (infinite).
Finite horizon (True): The time limit defines the task boundary. When reached, no future value exists beyond it, so the agent receives a terminal done signal.
Infinite horizon (False): The time limit is an artificial cutoff. The agent receives a truncated done signal to bootstrap the value of continuing beyond the limit.
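The flag therefore determines whether the time limit is reported as terminated or truncated, which in turn decides whether the learner bootstraps past it. A generic sketch of that distinction (standard Gymnasium-style semantics, not mjlab code):

```python
import torch


def td_target(
    reward: torch.Tensor,
    next_value: torch.Tensor,
    terminated: torch.Tensor,
    truncated: torch.Tensor,
    gamma: float = 0.99,
) -> torch.Tensor:
    """One-step TD target that distinguishes terminated from truncated.

    terminated: true episode end (finite-horizon limit or failure), so
        there is no future value to bootstrap from.
    truncated: artificial time-limit cutoff (infinite-horizon task); the
        env resets, but the next state's value is still bootstrapped.
    """
    del truncated  # truncation resets the env but keeps the bootstrap term
    return reward + gamma * next_value * (~terminated).float()
```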
- scale_rewards_by_dt: bool = True#
Whether to multiply rewards by the environment step duration (dt).
When True (default), reward values are scaled by step_dt to normalize cumulative episodic rewards across different simulation frequencies. Set to False for algorithms that expect unscaled reward signals (e.g., HER, static reward scaling).
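A small plain-Python illustration of the normalization (not mjlab code): a constant per-step reward over a fixed 20-second episode yields the same scaled return at different control frequencies, while the unscaled return doubles when the frequency doubles.

```python
# Same 20 s episode with a constant raw reward of 1.0 per step.
for hz in (25, 50):
    step_dt = 1.0 / hz
    num_steps = round(20.0 / step_dt)
    unscaled_return = num_steps * 1.0           # 500 at 25 Hz, 1000 at 50 Hz
    scaled_return = num_steps * 1.0 * step_dt   # ~20.0 at both frequencies
```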
- __init__(*, decimation: int, scene: mjlab.scene.scene.SceneCfg, observations: dict[str, mjlab.managers.observation_manager.ObservationGroupCfg] = <factory>, actions: dict[str, mjlab.managers.action_manager.ActionTermCfg] = <factory>, events: dict[str, mjlab.managers.event_manager.EventTermCfg] = <factory>, seed: int | None = None, sim: mjlab.sim.sim.SimulationCfg = <factory>, viewer: mjlab.viewer.viewer_config.ViewerConfig = <factory>, episode_length_s: float = 0.0, rewards: dict[str, mjlab.managers.reward_manager.RewardTermCfg] = <factory>, terminations: dict[str, mjlab.managers.termination_manager.TerminationTermCfg] = <factory>, commands: dict[str, mjlab.managers.command_manager.CommandTermCfg] = <factory>, curriculum: dict[str, mjlab.managers.curriculum_manager.CurriculumTermCfg] = <factory>, metrics: dict[str, mjlab.managers.metrics_manager.MetricsTermCfg] = <factory>, is_finite_horizon: bool = False, scale_rewards_by_dt: bool = True) → None#
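A minimal construction sketch, assuming the default simulation config uses a 2 ms physics timestep and that SceneCfg (imported from the module path shown in the signature above) accepts num_envs as described under scene; all other fields keep their defaults.

```python
from mjlab.envs import ManagerBasedRlEnv, ManagerBasedRlEnvCfg
from mjlab.scene.scene import SceneCfg  # module path from the signature above

cfg = ManagerBasedRlEnvCfg(
    decimation=10,                  # 2 ms physics steps -> 50 Hz env steps
    scene=SceneCfg(num_envs=4096),  # num_envs assumed to be a SceneCfg field
    episode_length_s=20.0,          # ceil(20 / 0.02) = 1000 steps per episode
)

env = ManagerBasedRlEnv(cfg=cfg, device="cuda:0")
obs, extras = env.reset(seed=42)
env.close()
```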