Environment Configuration#
A single ManagerBasedRlEnvCfg dataclass fully specifies an mjlab
environment: the physical world, the agent’s interface to it, and the
MDP defined on top. Because everything lives in one flat
object, an environment can be inspected, copied, and modified without
navigating a class hierarchy.
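That flatness is easy to see with plain dataclass tooling. A minimal sketch using a stand-in config (not the real ManagerBasedRlEnvCfg):

```python
from dataclasses import dataclass, field, fields, replace

# Stand-in config for illustration, not the real ManagerBasedRlEnvCfg.
@dataclass
class EnvCfg:
    decimation: int = 4
    episode_length_s: float = 20.0
    rewards: dict = field(default_factory=dict)

cfg = EnvCfg()

# Every field is visible in one place -- no class hierarchy to walk.
names = [f.name for f in fields(cfg)]

# replace() yields a modified copy without mutating the original.
fast_cfg = replace(cfg, decimation=2)
assert (cfg.decimation, fast_cfg.decimation) == (4, 2)
```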
For a broad orientation to mjlab before reading this page, start with Architecture Overview.
Annotated skeleton#
The complete set of fields on ManagerBasedRlEnvCfg is shown below with
inline comments. The fields marked with ... must be provided; all others
have defaults.
from dataclasses import dataclass, field
from mjlab.envs import ManagerBasedRlEnvCfg
from mjlab.envs.mdp import reset_scene_to_default  # built-in reset event (import path assumed)
from mjlab.managers.action_manager import ActionTermCfg
from mjlab.managers.command_manager import CommandTermCfg
from mjlab.managers.curriculum_manager import CurriculumTermCfg
from mjlab.managers.event_manager import EventTermCfg
from mjlab.managers.metrics_manager import MetricsTermCfg
from mjlab.managers.observation_manager import ObservationGroupCfg
from mjlab.managers.reward_manager import RewardTermCfg
from mjlab.managers.termination_manager import TerminationTermCfg
from mjlab.scene.scene import SceneCfg
from mjlab.sim.sim import SimulationCfg
from mjlab.viewer.viewer_config import ViewerConfig
@dataclass
class MyEnvCfg(ManagerBasedRlEnvCfg):
    # --- Physics ---
    decimation: int = 4
    # Number of physics steps per policy step.
    # Environment step duration = sim.mujoco.timestep * decimation.
    sim: SimulationCfg = field(default_factory=SimulationCfg)
    # Physics parameters: timestep, integrator, solver, contact settings.
    # Default timestep is 0.002 s (500 Hz). Override with MujocoCfg.
    scene: SceneCfg = ...
    # Terrain, entities, and sensors. Also sets num_envs.
    # Required; there is no default.

    # --- Episode ---
    episode_length_s: float = 20.0
    # Episode duration in seconds.
    # Steps = ceil(episode_length_s / (sim.mujoco.timestep * decimation)).
    is_finite_horizon: bool = False
    # False (default): time limit is an artificial cutoff. The agent
    # receives a truncated signal and bootstraps value beyond the limit.
    # True: time limit defines the task boundary. The agent receives a
    # terminal done signal with no future value beyond it.
    scale_rewards_by_dt: bool = True
    # When True (default), each reward term is multiplied by step_dt so
    # that cumulative episodic sums are invariant to simulation frequency.
    # Set to False for algorithms that expect unscaled reward signals.

    # --- Managers ---
    observations: dict[str, ObservationGroupCfg] = field(default_factory=dict)
    # Observation groups. Each key is a group name (e.g. "actor", "critic").
    # Groups can differ in noise, history, delay, and concatenation.
    actions: dict[str, ActionTermCfg] = field(default_factory=dict)
    # Action terms. Each term controls one slice of the policy output
    # and routes it to a specific entity's actuators.
    rewards: dict[str, RewardTermCfg] = field(default_factory=dict)
    # Reward terms. The manager computes a weighted sum each step.
    terminations: dict[str, TerminationTermCfg] = field(default_factory=dict)
    # Termination conditions. If empty, episodes never terminate early.
    # Add a time_out term to enforce the episode length limit.
    events: dict[str, EventTermCfg] = field(
        default_factory=lambda: {
            "reset_scene_to_default": EventTermCfg(
                func=reset_scene_to_default,
                mode="reset",
            )
        }
    )
    # Event terms for domain randomization and state resets.
    # The default includes reset_scene_to_default, which resets all
    # entities to their initial pose each episode. Override this dict
    # to replace or extend the default reset behavior.
    commands: dict[str, CommandTermCfg] = field(default_factory=dict)
    # Command generators (e.g. velocity targets for locomotion).
    # Commands are resampled at configurable intervals and on reset.
    curriculum: dict[str, CurriculumTermCfg] = field(default_factory=dict)
    # Curriculum terms that adjust training conditions based on performance.
    metrics: dict[str, MetricsTermCfg] = field(default_factory=dict)
    # Custom metrics logged as episode averages alongside reward terms.

    # --- Misc ---
    seed: int | None = None
    # Random seed for reproducibility. If None, a random seed is chosen
    # and stored back into this field after initialization.
    viewer: ViewerConfig = field(default_factory=ViewerConfig)
    # Camera position, resolution, and tracking target for rendering.
Term configuration pattern#
All manager dictionaries follow the same pattern. Each entry maps a string
name to a term configuration object. The configuration always carries at
minimum a func field pointing to the callable that implements the term,
and a params dict of extra keyword arguments forwarded to that callable.
The manager calls func(env, **params) each step; when func is a class, the
manager instantiates it once and then calls the instance with the same
signature. Term names are arbitrary; they appear in training logs and are
used only for identification.
Reward terms
from mjlab.envs import mdp
from mjlab.managers.reward_manager import RewardTermCfg
from mjlab.managers.scene_entity_config import SceneEntityCfg
rewards = {
    "alive": RewardTermCfg(
        func=mdp.is_alive,
        weight=1.0,
    ),
    "joint_torques": RewardTermCfg(
        func=mdp.joint_torques_l2,
        weight=-1e-4,
        params={"asset_cfg": SceneEntityCfg("robot")},
    ),
    "action_rate": RewardTermCfg(
        func=mdp.action_rate_l2,
        weight=-0.1,
    ),
}
weight scales the function’s output before it is summed into the total
reward. Negative weights produce penalties.
params maps to keyword arguments of the function. For example,
mdp.joint_torques_l2(env, asset_cfg=...) receives asset_cfg from the
params dict. Any argument not listed in params must have a default
value in the function signature.
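That rule can be illustrated with a hypothetical reward term (joint_effort_penalty, FakeEnv, and the env.torques field are made up for this sketch): the argument supplied via params has no default, while the one with a default may be omitted from the dict.

```python
import torch

# Hypothetical reward term: `scale` is supplied via params (no default);
# `eps` has a default, so it may be omitted from the params dict.
def joint_effort_penalty(env, scale: float, eps: float = 1e-8) -> torch.Tensor:
    torques = env.torques  # stand-in for the real data access
    return -scale * torch.sum(torques**2 + eps, dim=-1)

# The manager effectively does: term_cfg.func(env, **term_cfg.params)
class FakeEnv:
    torques = torch.ones(2, 3)  # (num_envs, num_joints)

out = joint_effort_penalty(FakeEnv(), **{"scale": 0.01})
assert out.shape == (2,)  # one reward value per environment
```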
Termination terms
from mjlab.envs import mdp
from mjlab.managers.termination_manager import TerminationTermCfg
terminations = {
    "time_out": TerminationTermCfg(
        func=mdp.time_out,
        time_out=True,  # marks this as a truncation, not a failure
    ),
    "fell_over": TerminationTermCfg(
        func=mdp.bad_orientation,
        params={"limit_angle": 1.22},  # ~70 degrees in radians
    ),
}
The time_out flag on TerminationTermCfg tells the manager to treat
this condition as a truncation rather than a terminal failure. Truncations
map to the truncated signal in the Gym interface; failures map to
terminated. This distinction matters for value bootstrapping in RL
algorithms.
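Why the distinction matters can be sketched with a one-step TD target. This is generic RL bookkeeping, not mjlab code: on a true termination the future value is zeroed out; on a truncation the critic still bootstraps.

```python
# Illustrative one-step TD target; generic RL bookkeeping, not mjlab code.
def td_target(reward: float, next_value: float, terminated: bool,
              truncated: bool, gamma: float = 0.99) -> float:
    if terminated:
        # True failure: no future value exists beyond this state.
        return reward
    # Truncated episodes (time_out=True) and ongoing steps both
    # bootstrap from the critic's estimate of the next state's value.
    return reward + gamma * next_value

# Same transition, different flags, different targets:
assert td_target(1.0, 10.0, terminated=True, truncated=False) == 1.0
assert td_target(1.0, 10.0, terminated=False, truncated=True) == 1.0 + 0.99 * 10.0
```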
Event terms
from mjlab.envs import mdp
from mjlab.managers.event_manager import EventTermCfg

events = {
    "reset_base": EventTermCfg(
        func=mdp.reset_root_state_uniform,
        mode="reset",
        params={
            "pose_range": {"yaw": (-3.14, 3.14)},
            "velocity_range": {},
        },
    ),
}
The mode field on EventTermCfg controls when the term fires:
at startup, on episode reset, or at regular intervals. See Events
for the full treatment of lifecycle modes, built-in event functions, and
the relationship between events and domain randomization.
Function-based vs. class-based terms
Terms can be plain functions or classes. Functions are suitable for stateless computations; classes are useful when a term needs to cache expensive setup or maintain state across steps.
A function-based term has the signature func(env, **params) -> Tensor. A
class-based term is instantiated once with (cfg, env) and then called with
the same signature. Classes can optionally implement a reset(env_ids) hook
for per-episode state clearing.
import torch

# Function-based (stateless)
RewardTermCfg(func=mdp.joint_torques_l2, weight=-0.01)

# Class-based (caches joint indices at init)
class MyReward:
    def __init__(self, cfg, env):
        # resolve_joint_ids / compute_reward are placeholder helpers.
        self.joint_ids = resolve_joint_ids(cfg.params, env)
    def __call__(self, env) -> torch.Tensor:
        return compute_reward(env, self.joint_ids)
    def reset(self, env_ids) -> None:
        # Optional hook: clear per-episode state for the given envs.
        pass

RewardTermCfg(func=MyReward, weight=1.0)
Timing: decimation, timestep, and episode length#
Three parameters jointly determine the temporal structure of the environment.
sim.mujoco.timestep
The physics integration step in seconds. The default is 0.002 s (500 Hz). This is one of the most important parameters in any environment: smaller values produce more stable physics but slow down simulation. See the MuJoCo performance tuning guide for practical advice on choosing timesteps and solver settings.
decimation
The number of physics steps executed per policy step. The policy runs at 1 / (timestep * decimation) Hz.
episode_length_s
The episode duration in seconds. The maximum number of policy steps per episode is ceil(episode_length_s / (timestep * decimation)).
Concrete example. The velocity task uses timestep=0.005 (200 Hz
physics) and decimation=4, giving a policy frequency of 50 Hz. With
episode_length_s=20.0, each episode runs for exactly 1000 policy steps.
physics_dt = 0.005 # seconds per physics step (200 Hz)
decimation = 4 # physics steps per policy step
step_dt = 0.005 * 4 # = 0.02 s per policy step (50 Hz)
episode_len = 20.0 / 0.02 # = 1000 policy steps per episode
To read these values at runtime, use the environment properties:
env.physics_dt # = cfg.sim.mujoco.timestep
env.step_dt # = cfg.sim.mujoco.timestep * cfg.decimation
env.max_episode_length # steps (int)
env.max_episode_length_s # seconds (float)
When scale_rewards_by_dt=True (the default), each reward term is
multiplied by step_dt before being returned. A reward function that
returns a constant value of 1.0 contributes step_dt per step and
approximately episode_length_s over a full episode, regardless of how
decimation and timestep are set. With this scaling enabled, changing the
simulation frequency leaves episodic reward sums unchanged, even though
the per-step reward magnitudes scale with step_dt.
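The invariance is quick to verify with plain arithmetic (no mjlab imports needed):

```python
# A reward term returning a constant 1.0, scaled by step_dt, sums to
# ~episode_length_s at any control frequency.
episode_length_s = 20.0

for timestep, decimation in [(0.005, 4), (0.002, 4), (0.001, 10)]:
    step_dt = timestep * decimation
    num_steps = round(episode_length_s / step_dt)
    episodic_sum = sum(1.0 * step_dt for _ in range(num_steps))
    assert abs(episodic_sum - episode_length_s) < 1e-6
```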
Subclassing pattern#
mjlab uses plain dataclass inheritance rather than deeply nested class
hierarchies. To build a task-specific configuration, subclass
ManagerBasedRlEnvCfg and override fields.
The recommended approach is to define the full configuration in a factory
function, then call it from robot-specific configs that override only the
fields that differ. The velocity task uses this pattern: make_velocity_env_cfg
returns a fully assembled ManagerBasedRlEnvCfg, and each robot
configuration calls the factory and patches in robot-specific values such
as the scene, joint name patterns, and action scale.
A condensed version of the factory illustrates the full assembly pattern:
import math
from dataclasses import replace
from mjlab.envs import ManagerBasedRlEnvCfg
from mjlab.envs.mdp import dr
from mjlab.envs.mdp.actions import JointPositionActionCfg
from mjlab.managers.event_manager import EventTermCfg
from mjlab.managers.observation_manager import ObservationGroupCfg, ObservationTermCfg
from mjlab.managers.reward_manager import RewardTermCfg
from mjlab.managers.scene_entity_config import SceneEntityCfg
from mjlab.managers.termination_manager import TerminationTermCfg
from mjlab.scene import SceneCfg
from mjlab.sim import MujocoCfg, SimulationCfg
from mjlab.tasks.velocity import mdp
from mjlab.tasks.velocity.mdp import UniformVelocityCommandCfg
from mjlab.terrains import TerrainEntityCfg
from mjlab.terrains.config import ROUGH_TERRAINS_CFG
from mjlab.viewer import ViewerConfig
def make_velocity_env_cfg() -> ManagerBasedRlEnvCfg:
    observations = {
        "actor": ObservationGroupCfg(
            terms={
                "base_lin_vel": ObservationTermCfg(
                    func=mdp.builtin_sensor,
                    params={"sensor_name": "robot/imu_lin_vel"},
                ),
                "joint_pos": ObservationTermCfg(func=mdp.joint_pos_rel),
                "command": ObservationTermCfg(
                    func=mdp.generated_commands,
                    params={"command_name": "twist"},
                ),
                # additional terms omitted for brevity
            },
            concatenate_terms=True,
            enable_corruption=True,
        ),
        "critic": ObservationGroupCfg(
            terms={...},
            concatenate_terms=True,
            enable_corruption=False,
        ),
    }
    actions = {
        "joint_pos": JointPositionActionCfg(
            entity_name="robot",
            actuator_names=(".*",),
            scale=0.5,
            use_default_offset=True,
        )
    }
    commands = {
        "twist": UniformVelocityCommandCfg(
            entity_name="robot",
            resampling_time_range=(3.0, 8.0),
            ranges=UniformVelocityCommandCfg.Ranges(
                lin_vel_x=(-1.0, 1.0),
                lin_vel_y=(-1.0, 1.0),
                ang_vel_z=(-0.5, 0.5),
                heading=(-math.pi, math.pi),
            ),
        )
    }
    events = {
        "reset_base": EventTermCfg(
            func=mdp.reset_root_state_uniform,
            mode="reset",
            params={
                "pose_range": {"x": (-0.5, 0.5), "y": (-0.5, 0.5), "yaw": (-3.14, 3.14)},
                "velocity_range": {},
            },
        ),
        "foot_friction": EventTermCfg(
            mode="startup",
            func=dr.geom_friction,
            params={
                "asset_cfg": SceneEntityCfg("robot", geom_names=[]),
                "operation": "abs",
                "ranges": (0.3, 1.2),
            },
        ),
        "push_robot": EventTermCfg(
            func=mdp.push_by_setting_velocity,
            mode="interval",
            interval_range_s=(1.0, 3.0),
            params={"velocity_range": {"x": (-0.5, 0.5), "y": (-0.5, 0.5)}},
        ),
    }
    rewards = {
        "track_linear_velocity": RewardTermCfg(
            func=mdp.track_linear_velocity,
            weight=2.0,
            params={"command_name": "twist", "std": math.sqrt(0.25)},
        ),
        "dof_pos_limits": RewardTermCfg(func=mdp.joint_pos_limits, weight=-1.0),
        "action_rate_l2": RewardTermCfg(func=mdp.action_rate_l2, weight=-0.1),
    }
    terminations = {
        "time_out": TerminationTermCfg(func=mdp.time_out, time_out=True),
        "fell_over": TerminationTermCfg(
            func=mdp.bad_orientation,
            params={"limit_angle": math.radians(70.0)},
        ),
    }
    return ManagerBasedRlEnvCfg(
        decimation=4,
        episode_length_s=20.0,
        sim=SimulationCfg(
            nconmax=35,
            njmax=1500,
            mujoco=MujocoCfg(timestep=0.005, iterations=10, ls_iterations=20),
        ),
        scene=SceneCfg(
            terrain=TerrainEntityCfg(
                terrain_type="generator",
                terrain_generator=replace(ROUGH_TERRAINS_CFG),
                max_init_terrain_level=5,
            ),
            num_envs=1,
        ),
        observations=observations,
        actions=actions,
        commands=commands,
        events=events,
        rewards=rewards,
        terminations=terminations,
    )
Robot-specific configs call this factory and patch fields using
dataclasses.replace or direct assignment. Common per-robot overrides
include scene (to add the robot entity and sensors), joint name patterns
inside SceneEntityCfg, action scale, and body names for reward terms.
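A condensed sketch of that per-robot override pattern, using simplified stand-in dataclasses rather than the real mjlab types (make_my_robot_cfg and the field values here are hypothetical):

```python
from dataclasses import dataclass, field, replace

# Simplified stand-ins for the real mjlab config types.
@dataclass
class ActionCfg:
    scale: float = 0.5

@dataclass
class EnvCfg:
    decimation: int = 4
    action: ActionCfg = field(default_factory=ActionCfg)

def make_velocity_env_cfg() -> EnvCfg:
    return EnvCfg()

def make_my_robot_cfg() -> EnvCfg:
    cfg = make_velocity_env_cfg()
    # Patch only the fields that differ for this robot.
    cfg.action = replace(cfg.action, scale=0.25)
    cfg.decimation = 2
    return cfg

robot_cfg = make_my_robot_cfg()
assert robot_cfg.action.scale == 0.25 and robot_cfg.decimation == 2
```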
Note
Isaac Lab uses deeply nested __post_init__ overrides for configuration
inheritance. mjlab avoids that pattern: each ManagerBasedRlEnvCfg is a
flat, inspectable dataclass. A misspelled field name raises a TypeError
at construction rather than silently creating a new attribute. See
Migrating from Isaac Lab for a full comparison.
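The fail-fast behavior follows from how Python dataclasses generate __init__, and is easy to demonstrate with a stand-in config:

```python
from dataclasses import dataclass

# Stand-in config, not the real ManagerBasedRlEnvCfg.
@dataclass
class Cfg:
    decimation: int = 4

caught = False
try:
    Cfg(decimaton=2)  # misspelled field name
except TypeError:
    caught = True  # __init__ rejects unknown keyword arguments
assert caught
```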
Where to go next#
The remaining pages in the Manager Layer section cover each manager in detail:
Observations: observation groups, the processing pipeline (clip, scale, noise, delay, history), and built-in observation functions.
Actions: action types and how the action manager routes policy output to actuators.
Rewards: reward terms and scaling by dt.
Terminations: episode end conditions and the truncation/failure distinction.
Commands: command generators and goal-conditioned task setup.
Events: the event manager lifecycle (startup, reset, interval).
Domain Randomization: the full dr module for domain randomization.
Curriculum: difficulty progression based on policy performance.
Metrics: custom per-step metrics logged as episode averages.