This tool is a wrapper around kfp and google-cloud-aiplatform that allows you to check, compile, upload, run, and schedule Vertex Pipelines in a standardized manner.
Three use cases:
- CI: Check pipeline validity.
- Dev mode: Quickly iterate over your pipelines by compiling and running them in multiple environments (test, dev, staging, etc.) without duplicating code or searching for the right kfp/aiplatform snippet.
- CD: Deploy your pipelines to Vertex Pipelines in a standardized manner in your CD with Cloud Build or GitHub Actions.
Two main commands:
- `check`: Check your pipelines (imports, compile, check configs validity against pipeline definition).
- `deploy`: Compile, upload to Artifact Registry, run, and schedule your pipelines.
- Unix-like environment (Linux, macOS, WSL, etc.)
- Python 3.8 to 3.10
- Google Cloud SDK
- A GCP project with Vertex Pipelines enabled
pip install vertex-deployer
Stable version:
pip install git+https://github.com/artefactory/vertex-pipelines-deployer.git@main
Develop version:
pip install git+https://github.com/artefactory/vertex-pipelines-deployer.git@develop
If you want to test this package on examples from this repo:
git clone git@github.com:artefactory/vertex-pipelines-deployer.git
cd vertex-pipelines-deployer
poetry install
poetry shell # if you want to activate the virtual environment
cd example
- Set up your GCP environment:
export PROJECT_ID=<gcp_project_id>
gcloud config set project $PROJECT_ID
gcloud auth login
gcloud auth application-default login
- You need the following APIs to be enabled:
- Cloud Build API
- Artifact Registry API
- Cloud Storage API
- Vertex AI API
gcloud services enable \
cloudbuild.googleapis.com \
artifactregistry.googleapis.com \
storage.googleapis.com \
aiplatform.googleapis.com
- Create an artifact registry repository for your base images (Docker format):
export GAR_DOCKER_REPO_ID=<your_gar_repo_id_for_images>
export GAR_LOCATION=<your_gar_location>
gcloud artifacts repositories create ${GAR_DOCKER_REPO_ID} \
--location=${GAR_LOCATION} \
--repository-format=docker
- Build and upload your base images to the repository. To do so, please follow Google Cloud Build documentation.
- Create an artifact registry repository for your pipelines (KFP format):
export GAR_PIPELINES_REPO_ID=<your_gar_repo_id_for_pipelines>
gcloud artifacts repositories create ${GAR_PIPELINES_REPO_ID} \
--location=${GAR_LOCATION} \
--repository-format=kfp
- Create a GCS bucket for Vertex Pipelines staging:
export GCP_REGION=<your_gcp_region>
export VERTEX_STAGING_BUCKET_NAME=<your_bucket_name>
gcloud storage buckets create gs://${VERTEX_STAGING_BUCKET_NAME} --location=${GCP_REGION}
- Create a service account for Vertex Pipelines:
export VERTEX_SERVICE_ACCOUNT_NAME=foobar
export VERTEX_SERVICE_ACCOUNT="${VERTEX_SERVICE_ACCOUNT_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud iam service-accounts create ${VERTEX_SERVICE_ACCOUNT_NAME}
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
--role="roles/aiplatform.user"
gcloud storage buckets add-iam-policy-binding gs://${VERTEX_STAGING_BUCKET_NAME} \
--member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
--role="roles/storage.objectUser"
gcloud artifacts repositories add-iam-policy-binding ${GAR_PIPELINES_REPO_ID} \
--location=${GAR_LOCATION} \
--member="serviceAccount:${VERTEX_SERVICE_ACCOUNT}" \
--role="roles/artifactregistry.admin"
You can use the deployer CLI (see example below) or import `VertexPipelineDeployer` in your code (try it yourself).
You must respect the following folder structure. If you already follow the Vertex Pipelines Starter Kit folder structure, it should be pretty smooth to use this tool:
vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
   └─ {pipeline_name}.py
!!! tip "About folder structure"
    You must have at least these files. If you need to share some config elements between pipelines, you can have a `shared` folder in `configs` and import them in your pipeline configs.

    If you're following a different folder structure, you can change the default paths in the `pyproject.toml` file.
    See [Configuration](#configuration) section for more information.
Your file `{pipeline_name}.py` must contain a function called `{pipeline_name}` decorated using `kfp.dsl.pipeline`.
In previous versions, the function / object used to be called `pipeline`, but it was changed to `{pipeline_name}` to avoid confusion with the `kfp.dsl.pipeline` decorator.
# vertex/pipelines/dummy_pipeline.py
import kfp.dsl

# New name to avoid confusion with the kfp.dsl.pipeline decorator
@kfp.dsl.pipeline()
def dummy_pipeline():
    ...

# Old name
@kfp.dsl.pipeline()
def pipeline():
    ...
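To illustrate the naming rule, here is a minimal sketch of how a tool could look up the pipeline object from the file name. It is not the deployer's actual implementation: `load_pipeline` is a hypothetical helper, and the demo module uses a plain function instead of a `kfp.dsl.pipeline`-decorated one so that it runs without kfp installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_pipeline(pipeline_file: Path):
    """Import a pipeline module and return the object named after the file (hypothetical helper)."""
    pipeline_name = pipeline_file.stem
    spec = importlib.util.spec_from_file_location(pipeline_name, pipeline_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    try:
        return getattr(module, pipeline_name)
    except AttributeError:
        raise ImportError(
            f"{pipeline_file} must define an object named {pipeline_name!r}"
        ) from None


# Demo: write a stand-in pipeline file and load it by name.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "dummy_pipeline.py"
    demo_file.write_text("def dummy_pipeline():\n    return 'ok'\n")
    pipeline_func = load_pipeline(demo_file)
    result = pipeline_func()
```

A file that still uses the old `pipeline` name would raise the `ImportError` in this sketch, which is one way to surface the rename early.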
Config files can be in `.py`, `.json`, `.toml` or `.yaml` format.
They must be located in the `configs/{pipeline_name}` folder.
Why multiple formats?
`.py` files are useful to define complex configs (e.g. a list of dicts) while `.json` / `.toml` / `.yaml` files are useful to define simple configs (e.g. a string).
It also adds flexibility to the user and allows you to use the deployer with almost no migration cost.
How to format them?
- `.py` files must be valid python files with two important elements:
    - `parameter_values` to pass arguments to your pipeline
    - `input_artifacts` if you want to retrieve and create input artifacts for your pipeline. See Vertex Documentation for more information.
- `.json` files must be valid json files containing only one dict of key: value representing parameter values.
- `.toml` files must be the same. Please note that TOML sections will be flattened, except for inline tables. Section names will be joined using the `"_"` separator and this is not configurable at the moment. Example:

    === "TOML file"

            [modeling]
            model_name = "my-model"
            params = { lambda = 0.1 }

    === "Resulting parameter values"

            {
                "modeling_model_name": "my-model",
                "modeling_params": { "lambda": 0.1 }
            }

- `.yaml` files must be valid yaml files containing only one dict of key: value representing parameter values.
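As an illustration of the `.py` format, a config file could look like the sketch below. The file name and every key inside `parameter_values` are made up for the example; only the two top-level names, `parameter_values` and `input_artifacts`, come from the description above.

```python
# vertex/configs/dummy_pipeline/config_test.py (hypothetical file name)

# Arguments passed to the pipeline function.
# The keys below are illustrative and must match your own pipeline signature.
parameter_values = {
    "model_name": "my-model",
    "preprocessing_steps": [
        {"name": "impute", "strategy": "median"},
        {"name": "scale", "method": "standard"},
    ],
}

# Left empty in this sketch: the expected contents depend on your pipeline
# and are described in the Vertex documentation.
input_artifacts = {}
```

A list of dicts such as `preprocessing_steps` is the kind of value that is easier to express in Python than in a flat key/value file.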
??? question "Why are sections flattened when using TOML config files?"
    Vertex Pipelines parameter validation and parameter logging to Vertex Experiments are based on the parameter name.
    If you do not flatten your sections, you'll only be able to validate the section names and that their values are of type `dict`.
    Not very useful.
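The flattening rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the deployer's code: `InlineTable` is a hypothetical marker class standing in for whatever the TOML parser uses to distinguish inline tables from regular sections, and `flatten_sections` is a made-up helper name.

```python
class InlineTable(dict):
    """Hypothetical marker for a TOML inline table, which must not be flattened."""


def flatten_sections(table, prefix=""):
    """Join nested section names with "_" and keep inline tables as dict values."""
    flat = {}
    for key, value in table.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict) and not isinstance(value, InlineTable):
            # Regular section: recurse and prefix its keys with the section name.
            flat.update(flatten_sections(value, name))
        elif isinstance(value, InlineTable):
            # Inline table: keep it as a single dict-valued parameter.
            flat[name] = dict(value)
        else:
            flat[name] = value
    return flat


# Parsed form of the TOML example from this section.
parsed = {
    "modeling": {
        "model_name": "my-model",
        "params": InlineTable({"lambda": 0.1}),
    }
}
flat = flatten_sections(parsed)
```

With this shape, each flattened name such as `modeling_model_name` can be matched against a pipeline parameter of the same name.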
??? question "Why aren't `input_artifacts` supported in TOML / JSON config files?"
    Because it's low on the priority list. Feel free to open a PR if you want to add it.
How to name them?
`{config_name}.py` or `{config_name}.json` or `{config_name}.toml`. `config_name` is free but must be unique for a given pipeline.
You will also need the following ENV variables, either exported or in a `.env` file (see example in `example.env`):
PROJECT_ID=YOUR_PROJECT_ID # GCP Project ID
GCP_REGION=europe-west1 # GCP Region
GAR_LOCATION=europe-west1 # Google Artifact Registry Location
GAR_PIPELINES_REPO_ID=YOUR_GAR_KFP_REPO_ID # Google Artifact Registry Repo ID (KFP format)
VERTEX_STAGING_BUCKET_NAME=YOUR_VERTEX_STAGING_BUCKET_NAME # GCS Bucket for Vertex Pipelines staging
VERTEX_SERVICE_ACCOUNT=YOUR_VERTEX_SERVICE_ACCOUNT # Vertex Pipelines Service Account
!!! note "About env files"
    We're using env files and dotenv to load the environment variables.
    No default value for the `--env-file` argument is provided to ensure that you don't accidentally deploy to the wrong project.
    An `example.env` file is provided in this repo.
    This also allows you to work with multiple environments thanks to env files (`test.env`, `dev.env`, `prod.env`, etc.).
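Before deploying, it can help to verify that an env file defines every variable listed above. The sketch below is a simplified stand-in for dotenv parsing (no quoting or variable expansion), and `parse_env_text` and `missing_vars` are hypothetical helper names, not part of the deployer.

```python
REQUIRED_VARS = (
    "PROJECT_ID",
    "GCP_REGION",
    "GAR_LOCATION",
    "GAR_PIPELINES_REPO_ID",
    "VERTEX_STAGING_BUCKET_NAME",
    "VERTEX_SERVICE_ACCOUNT",
)


def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments (simplified)."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_vars(values):
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS if not values.get(name)]


# Demo with an incomplete env file.
example = """
PROJECT_ID=my-project # GCP Project ID
GCP_REGION=europe-west1 # GCP Region
GAR_LOCATION=europe-west1
"""
values = parse_env_text(example)
missing = missing_vars(values)
```

Running the same check against `test.env`, `dev.env`, and `prod.env` is a cheap way to catch a missing variable before a deployment starts.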
Let's say you defined a pipeline in `dummy_pipeline.py` and a config file named `config_test.json`. You can deploy your pipeline using the following command:
vertex-deployer deploy dummy_pipeline \
--compile \
--upload \
--run \
--env-file example.env \
--tags my-tag \
--config-filepath vertex/configs/dummy_pipeline/config_test.json \
--experiment-name my-experiment \
--enable-caching \
--skip-validation
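If you deploy the same pipeline to several environments, the command can be assembled from Python, one env file at a time. This is a sketch that only uses flags shown in the command above; `build_deploy_command` is a hypothetical helper, and the env file names are examples.

```python
import shlex


def build_deploy_command(pipeline_name, env_file, config_filepath, tag="my-tag"):
    """Build the argument list for one `vertex-deployer deploy` call (hypothetical helper)."""
    return [
        "vertex-deployer", "deploy", pipeline_name,
        "--compile",
        "--upload",
        "--run",
        "--env-file", env_file,
        "--tags", tag,
        "--config-filepath", config_filepath,
    ]


# One command per environment; only the env file changes.
commands = [
    build_deploy_command(
        "dummy_pipeline",
        env_file,
        "vertex/configs/dummy_pipeline/config_test.json",
    )
    for env_file in ("test.env", "dev.env")
]

for command in commands:
    # Print the shell form; pass `command` to subprocess.run(command, check=True) to execute it.
    print(shlex.join(command))
```

Keeping the arguments as a list avoids shell quoting issues when the command is eventually run with `subprocess`.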
To check that your pipelines are valid, you can use the `check` command. It uses a pydantic model to:
- check that your pipeline imports and definition are valid
- check that your pipeline can be compiled
- check that all configs related to the pipeline are respecting the pipeline definition (using a Pydantic model based on pipeline signature)
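The idea behind the last check, validating a config against the pipeline signature, can be sketched with the standard library. Note that this swaps pydantic for plain `inspect` and `isinstance` checks, so it is stricter and less capable than a real Pydantic model; `check_config` and the stand-in `dummy_pipeline` are hypothetical.

```python
import inspect


def dummy_pipeline(model_name: str, lambda_: float = 0.1):
    """Stand-in for a pipeline function (no kfp decorator in this sketch)."""


def check_config(pipeline_func, parameter_values):
    """Return a list of mismatches between a config and the pipeline signature."""
    errors = []
    params = inspect.signature(pipeline_func).parameters
    # Config keys that the pipeline does not accept.
    for name in parameter_values:
        if name not in params:
            errors.append(f"unexpected parameter: {name}")
    for name, param in params.items():
        if name not in parameter_values:
            # Only parameters without a default are mandatory.
            if param.default is inspect.Parameter.empty:
                errors.append(f"missing parameter: {name}")
            continue
        expected = param.annotation
        value = parameter_values[name]
        # Strict type check; unlike pydantic, no coercion is attempted.
        if isinstance(expected, type) and not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


good = check_config(dummy_pipeline, {"model_name": "my-model", "lambda_": 0.5})
bad = check_config(dummy_pipeline, {"lambda_": "high", "extra": 1})
```

Here `good` is empty, while `bad` reports the unexpected key, the missing `model_name`, and the wrongly typed `lambda_`.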
To validate one or multiple pipeline(s):
vertex-deployer check dummy_pipeline <other pipeline name>
To validate all pipelines in the `vertex/pipelines` folder:
vertex-deployer check --all
You can check your `vertex-deployer` configuration options using the `config` command.
Fields set in `pyproject.toml` will overwrite default values and will be displayed differently:
vertex-deployer config --all
You can create all files needed for a pipeline using the `create` command:
vertex-deployer create my_new_pipeline --config-type py
This will create a `my_new_pipeline.py` file in the `vertex/pipelines` folder and a `vertex/configs/my_new_pipeline/` folder with multiple config files in it.
To initialize the deployer with default settings and folder structure, use the `init` command:
vertex-deployer init
$ vertex-deployer init
Welcome to Vertex Deployer!
This command will help you getting fired up.
Do you want to configure the deployer? [y/n]: n
Do you want to build default folder structure [y/n]: n
Do you want to create a pipeline? [y/n]: n
All done ✨
You can list all pipelines in the `vertex/pipelines` folder using the `list` command:
vertex-deployer list --with-configs
vertex-deployer --help
To see package version:
vertex-deployer --version
To adapt the log level, use the `--log-level` option. Default is `INFO`.
vertex-deployer --log-level DEBUG deploy ...
You can configure the deployer using the `pyproject.toml` file to better fit your needs.
This will overwrite default values. It can be useful if you always use the same options, e.g. always the same `--scheduler-timezone`:
[tool.vertex_deployer]
vertex_folder_path = "my/path/to/vertex"
log_level = "INFO"
[tool.vertex_deployer.deploy]
scheduler_timezone = "Europe/Paris"
You can display all the configurable parameters with their default values by running:
$ vertex-deployer config --all
'*' means the value was set in config file
* vertex_folder_path=my/path/to/vertex
* log_level=INFO
deploy
env_file=None
compile=True
upload=False
run=False
schedule=False
cron=None
delete_last_schedule=False
* scheduler_timezone=Europe/Paris
tags=['latest']
config_filepath=None
config_name=None
enable_caching=False
experiment_name=None
check
all=False
config_filepath=None
raise_error=False
list
with_configs=True
create
config_type=json
├─ .github
│  ├─ ISSUE_TEMPLATE/
│  ├─ workflows
│  │  ├─ ci.yaml
│  │  ├─ pr_agent.yaml
│  │  └─ release.yaml
│  ├─ CODEOWNERS
│  └─ PULL_REQUEST_TEMPLATE.md
├─ deployer                     # Source code
│  ├─ __init__.py
│  ├─ cli.py
│  ├─ constants.py
│  ├─ pipeline_checks.py
│  ├─ pipeline_deployer.py
│  ├─ settings.py
│  └─ utils
│     ├─ config.py
│     ├─ console.py
│     ├─ exceptions.py
│     ├─ logging.py
│     ├─ models.py
│     └─ utils.py
├─ docs/                        # Documentation folder (mkdocs)
├─ templates/                   # Semantic Release templates
├─ tests/
├─ example                      # Example folder with dummy pipeline and config
│  ├─ example.env
│  └─ vertex
│     ├─ components
│     │  └─ dummy.py
│     ├─ configs
│     │  ├─ broken_pipeline
│     │  │  └─ config_test.json
│     │  └─ dummy_pipeline
│     │     ├─ config_test.json
│     │     ├─ config.py
│     │     └─ config.toml
│     ├─ deployment
│     ├─ lib
│     └─ pipelines
│        ├─ broken_pipeline.py
│        └─ dummy_pipeline.py
├─ .gitignore
├─ .pre-commit-config.yaml
├─ catalog-info.yaml            # Roadie integration configuration
├─ CHANGELOG.md
├─ CONTRIBUTING.md
├─ LICENSE
├─ Makefile
├─ mkdocs.yml                   # Mkdocs configuration
├─ pyproject.toml
└─ README.md