Cloud Training
This guide walks through launching training jobs on Lambda Cloud using SkyPilot. SkyPilot provisions a GPU instance, syncs your code, runs the job, and tears down the machine when it finishes.
Two SkyPilot task files live in scripts/cloud/:
| File | Description |
|---|---|
| train.yaml | Installs mjlab directly with uv. |
| train-docker.yaml | Pulls the pre-built Docker image from GHCR for a reproducible environment. |
Prerequisites
1. Install SkyPilot
SkyPilot is a local CLI tool, not a project dependency. Install it with:
uv tool install "skypilot[lambda]"
2. Lambda Cloud API key
Generate a key at Lambda Cloud API keys. Name it after your
machine (e.g. kevins-macbook) so you can tell keys apart later.
mkdir -p ~/.lambda_cloud && chmod 700 ~/.lambda_cloud
echo "api_key = <your-api-key>" > ~/.lambda_cloud/lambda_keys
chmod 600 ~/.lambda_cloud/lambda_keys
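Before moving on, it can help to confirm that the key file is well formed. The sketch below is a local sanity check, not part of SkyPilot or mjlab: `lambda_key_file_ok` is a hypothetical helper name, and it only checks that the file contains an `api_key = ...` line whose value is not the unreplaced `<your-api-key>` placeholder. It does not validate the key against Lambda.

```shell
# Hypothetical helper (not part of SkyPilot or mjlab).
# Returns 0 when the file has an "api_key = <value>" line with a real value,
# and non-zero when the file is missing or still holds the placeholder.
lambda_key_file_ok() {
  key_file="${1:-$HOME/.lambda_cloud/lambda_keys}"
  [ -f "$key_file" ] || return 1
  # The value must start with a character that is neither whitespace nor "<".
  grep -Eq '^api_key[[:space:]]*=[[:space:]]*[^[:space:]<]+' "$key_file"
}
```

Calling `lambda_key_file_ok` with no argument checks the default path used above; `sky check lambda` in the next step remains the authoritative test.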
3. Verify setup
sky check lambda
You should see Lambda listed as an enabled cloud.
4. W&B credentials (optional)
If you log to Weights & Biases, install the wandb CLI and log in:
uv tool install wandb
wandb login
This stores your credentials in ~/.netrc. The SkyPilot task files
mount this file onto the remote instance via file_mounts so that
wandb authenticates automatically; no environment variable is needed.
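Because the task files rely on ~/.netrc being present, a quick local check before launching can save a failed run. This is a minimal sketch, assuming wandb login wrote an entry for the default host api.wandb.ai (a self-hosted W&B server would use a different host); `has_wandb_netrc` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when the netrc file has an entry for the
# W&B host. The default host api.wandb.ai is an assumption; pass a second
# argument to check a different host.
has_wandb_netrc() {
  netrc_file="${1:-$HOME/.netrc}"
  wandb_host="${2:-api.wandb.ai}"
  [ -f "$netrc_file" ] || return 1
  grep -Fq "machine $wandb_host" "$netrc_file"
}
```

For example, `has_wandb_netrc || wandb login` would prompt for a login only when no entry is found.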
Quick start
From the repo root:
sky launch scripts/cloud/train.yaml \
--env TASK=Mjlab-Velocity-Flat-Unitree-G1
# Or with Docker:
sky launch scripts/cloud/train-docker.yaml \
--env TASK=Mjlab-Velocity-Flat-Unitree-G1
What happens behind the scenes:
1. SkyPilot finds an available Lambda instance with the requested GPU.
2. It provisions the instance and uploads your local code via rsync.
3. The setup step runs (uv install or Docker pull).
4. The run step runs (training).
5. After 5 minutes of idle time the instance is terminated automatically.
Warning
Lambda instances can only be launched or terminated. There is
no pause or suspend. Do not run sudo shutdown from inside the
instance; it will put the machine in an alert state and billing will
continue. Always use sky down to terminate.
Common operations
List available GPUs
sky show-gpus --infra lambda
Choose a different GPU
sky launch scripts/cloud/train.yaml --gpus H100:1 # 1x H100
sky launch scripts/cloud/train.yaml --gpus A100:8 # 8x A100
sky launch scripts/cloud/train.yaml --gpus A10:1 # 1x A10 (cheaper)
Note
Both task files pass --gpu-ids all, so multi-GPU instances
automatically use distributed training.
When requesting more than one GPU, consider scaling
MAX_ITERATIONS down proportionally. See
Distributed Training for details on scaling behavior.
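As a rough illustration of the proportional scaling mentioned in the note, the sketch below divides a single-GPU iteration budget by the GPU count, rounding up. The helper name `scaled_iterations` and the figure 10000 are illustrative assumptions; proportional scaling is only the starting point the note suggests, not a guarantee of equivalent results.

```shell
# Hypothetical helper: divide a single-GPU iteration budget by the number
# of GPUs, rounding up so the result is never zero.
scaled_iterations() {
  base_iterations="$1"
  num_gpus="$2"
  echo $(( (base_iterations + num_gpus - 1) / num_gpus ))
}

# Example: a 10000-iteration single-GPU budget spread over 8 GPUs.
MAX_ITERATIONS=$(scaled_iterations 10000 8)
echo "$MAX_ITERATIONS"   # prints 1250
```

The result can then be passed to the job with `--env MAX_ITERATIONS=1250` alongside `--gpus A100:8`.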
Override training parameters
Every variable in the YAML envs block can be overridden from the
command line with --env:
sky launch scripts/cloud/train.yaml \
--env TASK=Mjlab-Velocity-Flat-Unitree-Go1 \
--env NUM_ENVS=8192 \
--env MAX_ITERATIONS=10000
Run your own task
sky launch scripts/cloud/train.yaml \
--env TASK=Mjlab-Velocity-Flat-Unitree-Go1
To see all registered tasks:
uv run list_envs
uv run list_envs --keyword Velocity # filter by keyword
Hyperparameter sweeps
Use W&B Sweeps with SkyPilot to search hyperparameters across a multi-GPU instance. The sweep controller lives on the W&B servers; each GPU on the instance runs an independent sweep agent that pulls a hyperparameter configuration, trains, and reports metrics.
The example uses method: random, where each agent samples
independently. Bayesian search also works well with parallel agents.
Agents report results back as they finish and the controller updates its
model between rounds. If you use Bayesian search, set run_cap high
enough for the optimizer to go through several rounds.
Four files are involved:
| File | Description |
|---|---|
| scripts/cloud/sweep.yaml | W&B sweep configuration (parameters, search method, metric). |
| scripts/cloud/sweep-cluster.yaml | SkyPilot cluster definition (resources, setup, no run section). |
| scripts/cloud/sweep-agent.yaml | SkyPilot job definition that runs one sweep agent. |
| scripts/cloud/sweep-launch.sh | Convenience script that creates the sweep, provisions the cluster, and submits one agent per GPU. |
Quick start
./scripts/cloud/sweep-launch.sh A100:8 # 8 agents on an 8xA100
This creates a W&B sweep, provisions a cluster, and submits one agent per GPU. Each agent runs training with a different set of hyperparameters sampled by the sweep controller.
Manual steps (if you prefer more control):
# 1. Create the sweep (returns a SWEEP_ID).
wandb sweep scripts/cloud/sweep.yaml
# 2. Provision the cluster (runs setup, no agents yet).
sky launch scripts/cloud/sweep-cluster.yaml \
-c mjlab-sweep --gpus A100:8
# 3. Submit one agent per GPU.
sky exec mjlab-sweep scripts/cloud/sweep-agent.yaml \
--gpus A100:1 --env SWEEP_ID=<entity/project/sweep_id> -d
Monitor progress on the W&B dashboard or with sky queue mjlab-sweep.
When done, tear down the cluster with sky down mjlab-sweep.
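Step 3 of the manual steps submits a single agent, so filling an 8-GPU cluster means repeating it once per GPU. A minimal sketch of that loop follows. `submit_agents` and the `SKY_CMD` override are hypothetical names introduced here (the actual sweep-launch.sh may do this differently), and the A100:1 request mirrors the manual step above.

```shell
# Hypothetical helper: submit one detached sweep agent per GPU.
# Set SKY_CMD="echo sky" to print the commands instead of running them.
submit_agents() {
  cluster="$1"
  num_gpus="$2"
  sweep_id="$3"
  i=1
  while [ "$i" -le "$num_gpus" ]; do
    # -d detaches, so the next submission does not wait on this agent.
    ${SKY_CMD:-sky} exec "$cluster" scripts/cloud/sweep-agent.yaml \
      --gpus A100:1 --env SWEEP_ID="$sweep_id" -d
    i=$((i + 1))
  done
}
```

With the sweep ID from step 1, `submit_agents mjlab-sweep 8 "$SWEEP_ID"` would queue eight agents; running it first with `SKY_CMD="echo sky"` is a cheap way to review the commands.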
Monitoring
Provisioning can take five minutes or more while Lambda allocates the instance. Open a second terminal to keep an eye on things:
sky status # cluster state (INIT, UP, ...)
sky logs sky-<cluster-name> # stream logs in real time
sky logs sky-<cluster-name> --no-follow # print current logs and exit
sky queue sky-<cluster-name> # job queue for the cluster
Tip
If the cluster stays in INIT for a long time, the GPU type is
likely sold out. Cancel with sky down and try a different GPU, or
add --retry-until-up to let SkyPilot keep polling until capacity
opens up.
sky down sky-<cluster-name>
sky launch scripts/cloud/train.yaml --retry-until-up
Iterating on a failed job
When a job fails, the cluster keeps running (and billing). You can fix the problem locally and resubmit without waiting for a new instance:
sky exec sky-<cluster-name> scripts/cloud/train.yaml
Important
sky exec rsyncs your latest code and reruns the run step
only. It does not rerun setup. If your fix involves
dependency changes, use sky launch again or SSH in and run the
setup commands manually.
Other useful commands:
sky down sky-<cluster-name> # terminate the instance immediately
ssh sky-<cluster-name> # SSH in (SkyPilot configures this for you)
Cost management
Warning
Always run sky status after each session to confirm nothing is
still running. Forgotten instances are the most common source of
unexpected charges. To terminate everything at once: sky down -a.
- Instances auto-terminate after 5 minutes of idle time by default. You can change this in the YAML (idle_minutes) or at launch time with --idle-minutes-to-autostop.
- The down: true setting in the YAML means the instance is fully terminated when it stops, not just paused. Billing stops completely.
Troubleshooting
No instances available
Lambda GPUs sell out frequently. A few things to try:
- Use --retry-until-up to poll automatically.
- Try a different GPU type: --gpus A100:1, --gpus A10:1, etc.
- If you have credentials for other clouds (GCP, AWS), SkyPilot can fall back to them automatically.
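If you would rather fail over between GPU types than wait on one, trying a different GPU type can be scripted. The sketch below tries each GPU spec in order and stops at the first launch that succeeds. `launch_first_available` and `SKY_CMD` are hypothetical names, and the sketch assumes sky launch exits non-zero when provisioning fails. It passes -y to skip the confirmation prompt and -d to detach once the job is submitted, so a later training failure does not trigger another launch.

```shell
# Hypothetical helper: try each GPU spec in order and stop at the first
# successful launch. Prints the GPU spec that worked on stdout.
launch_first_available() {
  task_yaml="$1"
  shift
  for gpu in "$@"; do
    echo "Trying $gpu..." >&2
    # Assumption: a failed provisioning attempt gives a non-zero exit code.
    if ${SKY_CMD:-sky} launch -y -d "$task_yaml" --gpus "$gpu"; then
      echo "$gpu"
      return 0
    fi
  done
  echo "No requested GPU type was available." >&2
  return 1
}
```

For example, `launch_first_available scripts/cloud/train.yaml H100:1 A100:1 A10:1` prefers the H100 and falls back to cheaper options. Check `sky status` afterwards to confirm that only one cluster is running.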