Rewards
Rewards are the training signal that shapes policy behavior. Each reward term is a function that returns a per-environment scalar every step. The reward manager computes a weighted sum of all terms and returns it to the training framework.
Each term is registered by name with a RewardTermCfg that carries
the callable and a weight. Negative weights produce penalties.
Additional keyword arguments are supplied through params.
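from mjlab.envs.mdp import rewards
from mjlab.managers.reward_manager import RewardTermCfg
from mjlab.managers.scene_entity_config import SceneEntityCfg

# Each key is the term name; each value carries the callable and its weight.
rewards_cfg = {
    "alive": RewardTermCfg(func=rewards.is_alive, weight=1.0),
    # Negative weight: this term acts as a penalty.
    "joint_torques": RewardTermCfg(
        func=rewards.joint_torques_l2,
        weight=-1e-4,
        # Extra keyword arguments forwarded to the reward function.
        params={"asset_cfg": SceneEntityCfg("robot")},
    ),
}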
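The weighted sum described above can be sketched outside mjlab as follows. This is a minimal illustration, not the reward manager's actual implementation: `ToyTerm`, `compute_total_reward`, and the two toy term functions are hypothetical names, the `env` is a plain dictionary, and NumPy arrays stand in for the tensors that real terms return.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

import numpy as np


@dataclass
class ToyTerm:
    """Hypothetical stand-in for RewardTermCfg: callable, weight, extra params."""

    func: Callable[..., np.ndarray]
    weight: float
    params: dict[str, Any] = field(default_factory=dict)


def toy_is_alive(env: dict) -> np.ndarray:
    # 1.0 for every environment that has not terminated, 0.0 otherwise.
    return (~env["terminated"]).astype(np.float64)


def toy_torque_l2(env: dict) -> np.ndarray:
    # Sum of squared torques over the joint axis -> shape [num_envs].
    return np.sum(env["torques"] ** 2, axis=1)


def compute_total_reward(env: dict, terms: dict[str, ToyTerm]) -> np.ndarray:
    """Weighted sum of all terms; each term returns one scalar per environment."""
    total = np.zeros(env["terminated"].shape[0])
    for term in terms.values():
        total += term.weight * term.func(env, **term.params)
    return total


# Three toy environments with two joints each; the third has terminated.
env = {
    "terminated": np.array([False, False, True]),
    "torques": np.array([[1.0, 2.0], [0.0, 0.0], [3.0, 4.0]]),
}
terms = {
    "alive": ToyTerm(func=toy_is_alive, weight=1.0),
    "joint_torques": ToyTerm(func=toy_torque_l2, weight=-0.1),
}
total = compute_total_reward(env, terms)  # shape [num_envs]
```

The negative weight on the torque term turns its positive output into a penalty, which is why the terminated, high-torque environment ends up with a negative total.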
Built-in reward functions
The functions below are available in mjlab.envs.mdp.rewards and are
shared across tasks. Individual tasks also define their own reward
functions tailored to the task objective (e.g. velocity tracking for
locomotion). All reward functions return a tensor of shape
[num_envs].
| Function | Description |
|---|---|
| | Returns |
| | Returns |
| | Sum of squared actuator forces. Penalizes energy-intensive actions. |
| | Sum of squared joint velocities. |
| | Sum of squared joint accelerations. |
| | Sum of squared differences between the current and previous action. Penalizes rapid changes in the policy output. |
| | Sum of squared second-order action differences. Penalizes high-frequency jitter in the action signal. |
| | Penalty for joint positions exceeding the soft limits. Zero when all joints are within limits. |
| | Exponential kernel measuring deviation from the default joint positions: |
| | Sum of positive mechanical power consumed by actuators. Regenerative power is not penalized. |
| | Sum of squares of the x and y components of the projected gravity vector in the base frame. Zero when perfectly upright. |
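A few of the quantities described in the table can be written out directly. The sketch below uses NumPy on hand-made arrays and hypothetical `toy_*` helper names; it mirrors the descriptions above rather than the library's source, and the power term assumes that mechanical power is torque times joint velocity.

```python
import numpy as np


def toy_action_rate(action, prev_action):
    # Sum of squared differences between the current and previous action.
    return np.sum((action - prev_action) ** 2, axis=1)


def toy_action_acc(action, prev_action, prev_prev_action):
    # Sum of squared second-order differences: a_t - 2 a_{t-1} + a_{t-2}.
    second_diff = action - 2.0 * prev_action + prev_prev_action
    return np.sum(second_diff ** 2, axis=1)


def toy_positive_power(torque, joint_vel):
    # Only positive power counts; negative (regenerative) power is clipped to 0.
    return np.sum(np.clip(torque * joint_vel, 0.0, None), axis=1)


def toy_flat_orientation(projected_gravity):
    # Squared x and y components of the gravity vector in the base frame.
    return np.sum(projected_gravity[:, :2] ** 2, axis=1)


# Two environments, two action dimensions / joints each.
action = np.array([[1.0, 0.0], [0.0, 0.0]])
prev_action = np.array([[0.0, 0.0], [0.0, 0.0]])
prev_prev_action = np.array([[1.0, 1.0], [0.0, 0.0]])
torque = np.array([[2.0, -1.0], [1.0, 1.0]])
joint_vel = np.array([[3.0, 4.0], [-1.0, 2.0]])
gravity = np.array([[0.0, 0.0, -1.0], [0.6, 0.0, -0.8]])

rate = toy_action_rate(action, prev_action)
acc = toy_action_acc(action, prev_action, prev_prev_action)
power = toy_positive_power(torque, joint_vel)
tilt = toy_flat_orientation(gravity)
```

Every helper reduces over the last axis, so each result has shape `[num_envs]`, matching the contract stated for all reward functions.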
Reward scaling by dt
ManagerBasedRlEnvCfg.scale_rewards_by_dt is True by default.
When enabled, the reward manager multiplies each term by the environment
step duration before accumulating it. This makes episodic reward totals
invariant to simulation frequency: a task running at 50 Hz produces the
same expected episode return as the same task at 200 Hz, because each
step at the higher rate contributes proportionally less to the total.
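The invariance claim can be checked with simple arithmetic. The sketch below is not mjlab code: `episode_return` is a hypothetical helper that assumes a constant raw reward at every step and a fixed episode length in seconds.

```python
import math


def episode_return(
    step_hz: float, episode_s: float, raw_reward: float, scale_by_dt: bool
) -> float:
    """Sum a constant per-step reward over one episode, optionally scaled by dt."""
    dt = 1.0 / step_hz
    num_steps = round(episode_s * step_hz)
    per_step = raw_reward * dt if scale_by_dt else raw_reward
    return per_step * num_steps


# Same 10-second episode and raw reward, stepped at 50 Hz and at 200 Hz.
scaled_50 = episode_return(50.0, 10.0, 1.0, scale_by_dt=True)
scaled_200 = episode_return(200.0, 10.0, 1.0, scale_by_dt=True)
unscaled_50 = episode_return(50.0, 10.0, 1.0, scale_by_dt=False)
unscaled_200 = episode_return(200.0, 10.0, 1.0, scale_by_dt=False)
```

With scaling, both step rates give a return of about 10. Without it, the 200 Hz run accumulates four times as much as the 50 Hz run, so term weights would have to be retuned whenever the step rate changes.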
Per-term episodic sums are logged as Episode_Reward/<term_name> and
are always divided by the episode duration, giving a reward rate that is
comparable across runs with different episode lengths.
Writing custom reward functions
A reward function accepts env as its first argument and returns a
[num_envs] tensor. Additional parameters are declared as function
arguments and supplied via RewardTermCfg(params={...}). When a term
needs to cache setup work or maintain per-episode state, implement it as
a class. See Term configuration pattern for the general pattern.
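As a rough illustration of both styles, the sketch below defines a function-based term and a class-based term and calls them on a stub environment. Everything in it is hypothetical: `StubEnv`, its `base_height` and `num_envs` attributes, `height_tracking`, and `HeightChangePenalty` are illustrative names, and the class interface shown (a constructor taking `env`, a `reset` method, and `__call__`) is an assumption rather than mjlab's documented contract.

```python
import torch


class StubEnv:
    """Hypothetical environment exposing only what the terms below read."""

    def __init__(self, base_height: torch.Tensor):
        self.base_height = base_height  # shape [num_envs]
        self.num_envs = base_height.shape[0]


def height_tracking(env: StubEnv, target: float, std: float) -> torch.Tensor:
    """Function-based term: exponential kernel on the squared height error."""
    error = torch.square(env.base_height - target)
    return torch.exp(-error / std**2)  # shape [num_envs]


class HeightChangePenalty:
    """Class-based term: stores the previous height to measure per-step change."""

    def __init__(self, env: StubEnv):
        # Setup work done once, when the term is created.
        self.prev_height = env.base_height.clone()

    def reset(self, env_ids: torch.Tensor, env: StubEnv) -> None:
        # Re-initialize the stored state for environments that were reset.
        self.prev_height[env_ids] = env.base_height[env_ids]

    def __call__(self, env: StubEnv) -> torch.Tensor:
        change = torch.square(env.base_height - self.prev_height)
        self.prev_height = env.base_height.clone()
        return change  # shape [num_envs]


env = StubEnv(torch.tensor([0.5, 0.7, 0.9]))
tracking = height_tracking(env, target=0.7, std=0.25)

penalty_term = HeightChangePenalty(env)
env.base_height = torch.tensor([0.6, 0.7, 0.9])  # only the first env moved
penalty = penalty_term(env)
```

In a real configuration, the extra arguments of the function-based term would be supplied through the term config, for example `RewardTermCfg(func=height_tracking, weight=1.0, params={"target": 0.7, "std": 0.25})`.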