This doc is still under construction.
This document describes the various components that power AXLearn training on the public cloud.
The Bastion is a simple orchestrator which supports flexible job scheduling and quota management. It is general purpose (i.e., it supports scheduling jobs with arbitrary bash workloads), but also provides out-of-the-box functionality for provisioning, running, and monitoring large TPU slices.
While the bastion currently only supports Google Cloud Platform (GCP) jobs, its design is cloud agnostic, and in theory, it can be extended to run on other cloud providers.
Its main dependencies are:
- A cloud bucket for reading and writing job metadata and logs.
    - This should be compatible with `tensorflow.io` (see the sketch after this list).
    - This should be writable by the bastion and anyone who intends to submit jobs.
- A cloud bucket (possibly the same bucket) for reading quota information.
    - This should also be compatible with `tensorflow.io`.
    - This should be readable by the bastion and anyone who intends to submit jobs.
- A docker repo to pull the bastion container image from.
- A single VM to run on.
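For example, a quick way to verify that a bucket satisfies the `tensorflow.io` requirement is to round-trip a small file through it. This is a minimal sketch, assuming a placeholder bucket name:

```python
import tensorflow as tf

# Placeholder bucket; substitute the bucket you plan to use.
probe = "gs://my-bastion-bucket/axlearn-probe.txt"

# Write, read back, and clean up a small file. Both the bastion and anyone
# submitting jobs need this level of access for the metadata/log bucket.
with tf.io.gfile.GFile(probe, "w") as f:
    f.write("ok")
with tf.io.gfile.GFile(probe, "r") as f:
    assert f.read() == "ok"
tf.io.gfile.remove(probe)
```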
Please see the end of this document for a high-level diagram of bastion job submission.
The following sections guide you through setting up and launching a job via Bastion.
We assume you have:
- Followed the getting started setup.
- A docker repo that can be accessed from Bastion. For the purposes of this doc, we assume this repo lives in Artifact Registry.
We also assume you have "activated" a project config using `axlearn gcp config activate`. This is mainly a convenience so that you do not have to specify mundane flags like `--project` and `--zone` in every command. Please refer to the CLI docs for more details.
This section will set up a bastion named `$USER-bastion`, but feel free to use a different name, as long as it isn't too long (VM names are capped at 63 chars) and ends with `-bastion`.
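These naming constraints can be checked mechanically; a trivial sketch (the helper below is hypothetical, not part of AXLearn):

```python
def is_valid_bastion_name(name: str) -> bool:
    # Hypothetical helper: VM names are capped at 63 chars, and bastion
    # names are expected to end with "-bastion".
    return len(name) <= 63 and name.endswith("-bastion")

assert is_valid_bastion_name("alice-bastion")
```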
As usual, please make sure you have authenticated to GCP:
```bash
# Authenticate to GCP.
axlearn gcp auth
```
We first build a Docker image using the default `Dockerfile` in the repo:
```bash
# Create and push a bastion image to the Artifact Registry.
axlearn gcp bundle --name=$USER-bastion --bundler_type=artifactregistry \
    --bundler_spec=image=base --bundler_spec=dockerfile=Dockerfile --bundler_spec=target=bastion
```
For more details behind the `--bundler_*` flags, please refer to the CLI docs.
Next, we create a quota file, which defines things like quota groups and membership. A quota file has the following format:
```toml
[toml-schema]
version = "1"

[total_resources]
v4 = 1024

[project_membership]
project1 = ["user1", "user2"]
project2 = ["user1", "user3"]

[project_resources.project1]
v4 = 0.6

[project_resources.project2]
v4 = 0.4
```
In the above example, we configure a pool of 1024 v4 TPU cores, where 60% of the pool is reserved for `project1` and 40% is reserved for `project2`. Note that these are not hard limits -- if `project1` underutilizes its share of the pool, `project2` can utilize the spare resources on a "best effort" basis. If utilization goes back up for `project1`, some jobs in `project2` may be preempted.
In short, the quota file is a `toml` file that specifies:
- The total quota pool by resource type;
- The membership by project;
- The per-project resources, expressed as fractions of the total pool.
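To make the fractions concrete: with the file above, `project1` reserves 0.6 * 1024 = ~614 v4 cores and `project2` reserves 0.4 * 1024 = ~410. A minimal sketch of parsing the file and computing these numbers (assuming Python 3.11+ for `tomllib`):

```python
import tomllib

# Parse the quota file shown above (the local path is a placeholder).
with open("project-quotas.config", "rb") as f:
    quota = tomllib.load(f)

total_v4 = quota["total_resources"]["v4"]  # 1024
for project, resources in quota["project_resources"].items():
    reserved = resources["v4"] * total_v4
    print(f"{project}: {reserved:.0f} of {total_v4} v4 cores reserved")
```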
Once ready, upload the file to the `private_bucket` configured when preparing the CLI, under the following path:

```
gs://PRIVATE_BUCKET/$USER-bastion/project-quotas/project-quotas.config
```
Note that `$USER-bastion` must match the bastion name that you picked above.
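Since the bucket is required to be `tensorflow.io`-compatible, one way to upload the file is via `tf.io.gfile` (a sketch; the bucket and bastion name below are placeholders), though any tool such as `gsutil cp` works too:

```python
import tensorflow as tf

# Placeholders: substitute your private bucket and bastion name.
src = "project-quotas.config"
dst = "gs://PRIVATE_BUCKET/alice-bastion/project-quotas/project-quotas.config"
tf.io.gfile.copy(src, dst, overwrite=True)
```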
Finally, launch the bastion (which runs as a single VM):
```bash
# Creates a VM and runs the bastion image on it.
axlearn gcp bastion create --name=$USER-bastion
```
Once again, the name of the bastion should match the name of the bundle produced above, as well as the path in the quota file.
When the bastion is booted, you can view logs at:

```bash
gsutil cat gs://PERMANENT_BUCKET/$USER-bastion/logs/$USER-bastion
```

Here, the bucket name comes from the `permanent_bucket` configured when preparing the CLI.
For more details on useful log paths, run `axlearn gcp bastion --help`.
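As a sketch, you can also enumerate what the bastion has written under its log directory programmatically (the bucket and bastion name below are placeholders):

```python
import tensorflow as tf

# List entries under the bastion's log directory.
for entry in tf.io.gfile.listdir("gs://PERMANENT_BUCKET/alice-bastion/logs/"):
    print(entry)
```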
Once the bastion is up and running, you can submit arbitrary jobs for it to run. These jobs are essentially `BastionJobSpec`s serialized as JSON blobs, typically constructed via Python scripting using the bastion API[1].
In most cases, you can use the `axlearn gcp launch` command, which handles most of these details for you.
For example, to launch a Python command with `axlearn gcp launch`:
```bash
# Launch a v4-32 job via Bastion.
# Note: the "'...'" quotes are important.
axlearn gcp launch --instance_type=tpu-v4-32 --bastion=$USER-bastion -- python3 -c "'import jax; print(jax.devices())'"
```
This submits a `BastionJobSpec` with the command `python3 -c 'import jax; print(jax.devices())'`, to be scheduled and run on a TPU v4-32 slice.
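For illustration, constructing and serializing such a spec by hand might look roughly like the following. This is a sketch only: the module paths, field names, and signatures are assumptions based on the API referenced in [1], and `axlearn gcp launch` normally handles all of this for you.

```python
from datetime import datetime, timezone

# Assumed module paths and signatures; see [1] for the actual API.
from axlearn.cloud.common.bastion import new_jobspec, serialize_jobspec
from axlearn.cloud.common.types import JobMetadata

spec = new_jobspec(
    name="alice-demo-job",  # Hypothetical job name.
    command="python3 -c 'import jax; print(jax.devices())'",
    metadata=JobMetadata(
        user_id="user1",
        project_id="project1",
        creation_time=datetime.now(timezone.utc),
        resources={"v4": 32},  # Matches --instance_type=tpu-v4-32.
    ),
)
serialize_jobspec(spec, "/tmp/jobspec.json")
```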
The job itself will have two different types of logs:
- The bastion log: this contains logs visible from the bastion, such as the job and TPU statuses.
- The TPU log(s): this contains logs visible from the actual TPUs, such as the actual job execution outputs.
To launch a command using a specific project's quota, run:
```bash
# Launch a v4-32 job via Bastion under project1.
# Note: the "'...'" quotes are important.
axlearn gcp launch --instance_type=tpu-v4-32 --bastion=$USER-bastion --project_id=project1 --user_id=user1 -- python3 -c "'import jax; print(jax.devices())'"
```
For more details on the launch command, run `axlearn gcp launch --help`.
[1] Refer to `new_jobspec` and `serialize_jobspec` as a starting point.
```mermaid
%% elk seems to be more maintained, see: https://github.com/mermaid-js/mermaid/issues/1969
%% N.B. elk doesn't stop rendering invisible edge operator, i.e. ~~~
%%{init: {"flowchart": {"defaultRenderer": "elk"}} }%%
flowchart TB
subgraph UserMachine ["User's dev machine (e.g. MacBook Pro)"]
localAXLearnPackage(["axlearn package"]):::fileCSS
end
localAXLearnPackage --"
Bundle/upload
the user's axlearn dir
(minus excluded paths)"--> bastionPrimaryStore
localAXLearnPackage =="
Submit a bastion job
(serialized as a job spec)"==> bastionPrimaryStore
subgraph PublicCloud ["Public Cloud (e.g. Google Cloud Platform)"]
subgraph BastionVM_shared ["Bastion VM (e.g. 'shared-bastion')"]
bastionScheduler_1["Bastion \n Scheduler"]
bastionVmAXLearnPackage(["axlearn package \n (running on shared docker image)"]):::fileCSS
bastionJob_1["Bastion job 1 \n name: notebook-tpu-alice-a59ce1"]
bastionJob_2["Bastion job 2 \n name: notebook-tpu-bob-b3b5f1"]
bastionScheduler_1 --"Spawn/kill"--> bastionJob_1 & bastionJob_2
end
bastionPrimaryStore[("Data Store \n (e.g. Google Storage)")]:::storeCSS
bastionPrimaryStore =="Download \n Bastion job specs"==> bastionScheduler_1
bastionPrimaryStore --"Download Alice's \n axlearn bundle"--> WorkerVM_1
bastionPrimaryStore --"Download Bob's \n axlearn bundle"--> WorkerVM_2
bastionJob_1 --"Provisions/Monitors \n (using Alice's bundle)"--> WorkerVM_1
bastionJob_2 --"Provisions/Monitors \n (using Bob's bundle)"--> WorkerVM_2
subgraph WorkerVM_1 ["Worker VM 1 (name: notebook-tpu-alice-a59ce1)"]
workerProcess_1["User-specified process \n e.g. `jupyter lab --port=12345`"]
accelerator_1[/"hardware accelerators \n (e.g. TPU v4-8)"\]:::chipCSS
bastionWorkerAXLearnPackage_1(["axlearn package \n (built by Alice)"]):::fileCSS
workerProcess_1 --> accelerator_1
end
subgraph WorkerVM_2 ["Worker VM 2"]
workerProcess_2["..."]
accelerator_2[/"..."\]:::chipCSS
bastionWorkerAXLearnPackage_2(["..."]):::fileCSS
workerProcess_2 --> accelerator_2
end
bastionLogStore[("Log Store \n (e.g. Google Storage)")]:::storeCSS
WorkerVM_1 & WorkerVM_2 --"Sync logs"--> bastionLogStore
end
bastionLogStore--"Download logs for debug"-->UserMachine
%% Public Github doesn't apply CSS, but keeping these enables other environments
%% (e.g. editor previews such as in VS Code) to use more colors
classDef chipCSS stroke:#333,fill:#163
classDef fileCSS stroke:#111,fill:#333
classDef storeCSS fill:#37d
```