Distributed Training#
mjlab supports multi-GPU distributed training using torchrunx. Distributed training parallelizes RL workloads across multiple GPUs by running independent rollouts on each device and synchronizing gradients during policy updates. Throughput scales nearly linearly with GPU count.
TL;DR#
Single GPU (default):
uv run train <task-name> <task-specific CLI args>
# or explicitly: --gpu-ids 0
Multi-GPU:
uv run train <task-name> \
--gpu-ids 0 1 \
<task-specific CLI args>
All GPUs:
uv run train <task-name> \
--gpu-ids all \
<task-specific CLI args>
CPU mode:
uv run train <task-name> \
--gpu-ids None \
<task-specific CLI args>
# or: CUDA_VISIBLE_DEVICES="" uv run train <task-name> ...
Key points:
- --gpu-ids specifies GPU indices (e.g., --gpu-ids 0 1 for 2 GPUs)
- GPU indices are relative to CUDA_VISIBLE_DEVICES if set: CUDA_VISIBLE_DEVICES=2,3 uv run train ... --gpu-ids 0 1 uses physical GPUs 2 and 3
- Each GPU runs the full num-envs count (e.g., 2 GPUs × 4096 envs = 8192 total)
- Single-GPU and CPU modes run directly; multi-GPU uses torchrunx for process spawning
Configuration#
torchrunx Logging:
By default, torchrunx process logs are saved to {log_dir}/torchrunx/. You can
customize this:
# Disable torchrunx file logging.
uv run train <task-name> --gpu-ids 0 1 --torchrunx-log-dir ""
# Custom log directory.
uv run train <task-name> --gpu-ids 0 1 --torchrunx-log-dir /path/to/logs
# Or use environment variable (takes precedence over flag).
TORCHRUNX_LOG_DIR=/tmp/logs uv run train <task-name> --gpu-ids 0 1
The priority order is: the TORCHRUNX_LOG_DIR environment variable, then the
--torchrunx-log-dir flag, then the default {log_dir}/torchrunx.
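A rough sketch of that precedence logic (the function and argument names here are illustrative, not mjlab's actual internals):

import os

def resolve_torchrunx_log_dir(flag_value: str | None, default_log_dir: str) -> str | None:
    # TORCHRUNX_LOG_DIR wins if set, then the CLI flag, then the default.
    env_value = os.environ.get("TORCHRUNX_LOG_DIR")
    if env_value is not None:
        return env_value
    if flag_value is not None:
        return flag_value or None  # an empty string disables file logging
    return os.path.join(default_log_dir, "torchrunx")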
Single-Writer Operations:
Only rank 0 performs file I/O operations (config files, videos, wandb logging) to avoid race conditions. All workers participate in training, but logging artifacts are written once by the main process.
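A minimal sketch of such a rank-0 guard (illustrative, not mjlab's exact code):

import torch.distributed as dist

def is_rank_zero() -> bool:
    # Treat non-distributed runs as rank 0 so single-GPU training still writes artifacts.
    return not dist.is_available() or not dist.is_initialized() or dist.get_rank() == 0

if is_rank_zero():
    # Write config files, save videos, log to wandb, etc.
    pass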
How It Works#
mjlab’s role is simple: isolate mjwarp simulations on each GPU using
wp.ScopedDevice. This ensures each process’s environments stay on their
assigned device. torchrunx handles the rest.
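The idea looks roughly like this (a sketch assuming LOCAL_RANK has already been set by the launcher; not mjlab's exact code):

import os

import warp as wp

local_rank = int(os.environ.get("LOCAL_RANK", "0"))

with wp.ScopedDevice(f"cuda:{local_rank}"):
    # Warp arrays and kernels created here are bound to this process's GPU,
    # so this worker's mjwarp simulation never touches another device.
    ...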
Process spawning. Multi-GPU training uses torchrunx.Launcher(...).run(...)
to spawn N independent processes (one per GPU); torchrunx sets environment
variables (RANK, LOCAL_RANK, WORLD_SIZE) to coordinate them. Each process
executes the training function on its assigned GPU.
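From each worker's perspective, that coordination boils down to a few environment variables (illustrative):

import os

rank = int(os.environ["RANK"])              # global process index
local_rank = int(os.environ["LOCAL_RANK"])  # GPU index on this host
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

device = f"cuda:{local_rank}"
print(f"worker {rank}/{world_size} running on {device}")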
Independent rollouts. Each process maintains its own:
- Environment instances (with num-envs parallel environments), isolated on its assigned GPU via wp.ScopedDevice
- Policy network copy
- Experience buffer (sized num_steps_per_env × num-envs)
Each process uses seed = cfg.seed + local_rank to ensure different random
experiences across GPUs, increasing sample diversity.
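A small sketch of that per-rank seeding (base_seed stands in for cfg.seed here):

import os
import random

import torch

base_seed = 42  # stands in for cfg.seed
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
seed = base_seed + local_rank  # each GPU gets a distinct seed

random.seed(seed)
torch.manual_seed(seed)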
Gradient synchronization. During the update phase, rsl_rl synchronizes
gradients after each mini-batch through its reduce_parameters() method:
- Each process computes gradients independently on its local mini-batch
- All policy gradients are flattened into a single tensor
- torch.distributed.all_reduce averages gradients across all GPUs
- Averaged gradients are copied back to each parameter, keeping policies synchronized
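A condensed sketch of that flatten / all-reduce / copy-back pattern (it mirrors the idea behind reduce_parameters(), not rsl_rl's exact code):

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Flatten local gradients, average them across all ranks, then copy back.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat /= dist.get_world_size()
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()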