Linking to container image via --template when creating AZ Batch pool #89

Merged
merged 27 commits into from
Nov 22, 2024
c55f1cd
Using template
gvegayon Oct 22, 2024
0acafd2
Adding the azure batch-cli-extensions extension
gvegayon Oct 23, 2024
3fe2743
Update NEWS.md
gvegayon Oct 23, 2024
5ff4b80
Missing quotes in secret passed to config file
gvegayon Oct 23, 2024
ad311de
Fixing JSON file
gvegayon Oct 23, 2024
62b44b4
Adding missing properties in JSON (but will switch to Python or other…
gvegayon Oct 23, 2024
667469f
Adding azure pool config file (expected to fail) [skip ci]
gvegayon Oct 30, 2024
6cafe44
Adding missing entries in the config file
gvegayon Nov 13, 2024
94a0ea2
Merge branch 'main' into 59-link-built-pool-to-image-in-acr
gvegayon Nov 13, 2024
caf9fbd
Adding an install of azure and toml explicitly
gvegayon Nov 13, 2024
c2260a6
Using pip to install
gvegayon Nov 13, 2024
1ab3a1f
Trying with container python 3.12
gvegayon Nov 13, 2024
88d0a89
Adding requirements
gvegayon Nov 13, 2024
bf97656
Adding freeze [skip ci]
gvegayon Nov 13, 2024
74a9a4a
Fixing how file is read
gvegayon Nov 13, 2024
21825d9
Extra quotes
gvegayon Nov 13, 2024
1053ef0
Cat the secrets directly [skip ci]
gvegayon Nov 20, 2024
783d8b6
Adding a comment on the confi file
gvegayon Nov 20, 2024
849d4d0
Minor error in env.tag
gvegayon Nov 20, 2024
14459fe
Correcting path to autoscale formula
gvegayon Nov 20, 2024
799d096
Wrong length in sysargv
gvegayon Nov 20, 2024
2c0d93d
Wrong path?
gvegayon Nov 21, 2024
6ca5fdd
Adding instructions about the configuration and secrets
gvegayon Nov 22, 2024
c762735
Adding a template
gvegayon Nov 22, 2024
7c14d17
Merge branch '59-link-built-pool-to-image-in-acr' of https://github.c…
gvegayon Nov 22, 2024
888ed2f
Updating README
gvegayon Nov 22, 2024
102802b
Merge branch 'main' into 59-link-built-pool-to-image-in-acr
gvegayon Nov 22, 2024
42 changes: 27 additions & 15 deletions .github/workflows/containers-and-az-pool.yaml
@@ -115,7 +115,7 @@ jobs:
         with:
           push: true # This can be toggled manually for tweaking.
           tags: |
-            ${{ env.REGISTRY}}/${{ env.IMAGE_NAME }}:test-${{ needs.build-dependencies-image.outputs.tag }}
+            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:test-${{ needs.build-dependencies-image.outputs.tag }}
           file: ./Dockerfile
           build-args: |
             TAG=${{ needs.build-dependencies-image.outputs.tag }}
@@ -125,6 +125,7 @@ jobs:
     name: Create Batch Pool and Submit Jobs
     runs-on: cfa-cdcgov
     needs: build-pipeline-image
+    container: python:3.12

     permissions:
       contents: read
@@ -148,6 +149,25 @@ jobs:
         id: checkout_repo
         uses: actions/checkout@v4

+      # This step is only needed during the action to write the
+      # config file. Users can have a config file stored in their VAP
+      # sessions. In the future, we will have the config.toml file
+      # distributed with the repo (encrypted).
+      - name: Writing out config file
+        run: |
+          cat <<EOF > pool-config-${{ github.sha }}.toml
+          ${{ secrets.POOL_CONFIG_TOML }}
+          EOF
+
+          # Replacing placeholders in the config file
+          sed -i 's|{{ IMAGE_NAME }}|${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:test-${{ env.TAG }}|g' pool-config-${{ github.sha }}.toml
+          sed -i 's|{{ VM_SIZE }}|${{ env.VM_SIZE }}|g' pool-config-${{ github.sha }}.toml
+          sed -i 's|{{ POOL_ID }}|${{ env.POOL_ID }}|g' pool-config-${{ github.sha }}.toml
+
+      - name: Ensuring the Azure CLI is installed
+        run: |
+          apt-get update && apt-get install -y --no-install-recommends azure-cli
+
       - name: Login to Azure with NNH Service Principal
         id: azure_login_2
         uses: azure/login@v2
Expand Down Expand Up @@ -187,20 +207,12 @@ jobs:

# The call to the az cli that actually generates the pool
run: |
az batch account login \
--resource-group ${{ secrets.PRD_RESOURCE_GROUP }} \
--name "${{ env.BATCH_ACCOUNT }}"

az batch pool create \
--account-endpoint "${{ env.BATCH_ENDPOINT }}" \
--id "${{ env.POOL_ID }}" \
--image "${{ env.VM_IMAGE_TAG }}" \
--node-agent-sku-id "${{ env.NODE_AGENT_SKU_ID }}" \
--vm-size "${{ env.VM_SIZE }}"

az batch pool autoscale enable \
--pool-id ${{ env.POOL_ID }} \
--auto-scale-formula "$(cat './batch-autoscale-formula.txt')"
# Running the python script azure/pool.py passing the config file
# as an argument
pip install -r azure/requirements.txt
python3 azure/pool.py \
pool-config-${{ github.sha }}.toml \
batch-autoscale-formula.txt


#########################################################################
Expand Down
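The workflow step above calls `azure/pool.py` with two positional paths (the config file and the autoscale formula), and the script checks `len(sys.argv) == 3` by hand. As a minimal sketch of the same interface, assuming one wanted usage messages for free, the standard-library `argparse` module could be used. The names `parse_args`, `config_file` and `autoscale_file` are hypothetical and do not appear in the PR.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional paths that the workflow passes to pool.py."""
    parser = argparse.ArgumentParser(
        description="Create an Azure Batch pool from a TOML config file."
    )
    # Hypothetical argument names; the PR reads sys.argv[1] and sys.argv[2].
    parser.add_argument("config_file", help="Path to the pool config TOML file")
    parser.add_argument(
        "autoscale_file", help="Path to the autoscale formula text file"
    )
    return parser.parse_args(argv)


# Example invocation with the same shape as the workflow call.
args = parse_args(["pool-config-abc123.toml", "batch-autoscale-formula.txt"])
print(args.config_file, args.autoscale_file)
```

With this shape, a missing argument makes `argparse` print a usage line and exit with a non-zero status, which would also fail the CI step.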
1 change: 1 addition & 0 deletions .gitignore
@@ -384,3 +384,4 @@ docs
 # cfa-epinow2-pool-config.json
 # for now... will have to gpg encrypt
 cfa-epinow2-batch-pool-config.json
+azure/*.toml
2 changes: 1 addition & 1 deletion NEWS.md
@@ -1,6 +1,6 @@
 # CFAEpiNow2Pipeline (development version)

-
+* Adding a script to setup the Azure Batch Pool to link the container.
* Adding new action to post a comment on PRs with a link to the rendered pkgdown site.
* Add inner pipeline responsible for running the model fitting process
* Re-organizing GitHub workflows.
2 changes: 1 addition & 1 deletion README.md
@@ -112,7 +112,7 @@ The project has multiple GitHub Actions workflows to automate the CI/CD process.

 - **Create Batch Pool and Submit Jobs** (`batch-pool`): This final job creates a new Azure batch pool with id `cfa-epinow2-pool-[branch name]` if it doesn't already exist. Additionally, if the commit message contains the string "`[delete pool]`", the pool is deleted.

-Both container tags and pool ids are based on the branch name, making it compatible with having multiple pipelines running simultaneously.
+Both container tags and pool ids are based on the branch name, making it compatible with having multiple pipelines running simultaneously. The pool creation depends on Azure's Python SDK (see the file [azure/pool.py](azure/pool.py)), with the necessary configuration in a toml file stored as a secret in the repository (`POOL_CONFIG_TOML`). A template of the configuration file can be found at [azure/pool-config-template.toml](azure/pool-config-template.toml). The current configuration file is stored in the project's Azure datalake under the name `cfa-epinow2-pipeline-config.toml.toml`.

> [!IMPORTANT]
> The CI will fail with branch names that are not valid tag names for containers. For more information, see the official Azure documentation [here](https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules#microsoftcontainerregistry).
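
The note above says CI fails when a branch name is not a valid container tag. As a rough local pre-check, here is a minimal sketch that assumes the usual Docker tag grammar: at most 128 characters drawn from letters, digits, underscore, period and hyphen, not starting with a period or hyphen. The helper `is_valid_tag` is hypothetical, and the rules the registry actually enforces are the authority.

```python
import re

# Assumed Docker tag grammar: 1 to 128 characters from [A-Za-z0-9_.-],
# where the first character is not a period or a hyphen.
TAG_PATTERN = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def is_valid_tag(name: str) -> bool:
    """Return True when the whole string matches the assumed tag grammar."""
    return TAG_PATTERN.fullmatch(name) is not None


# Branch names are used as tags, so a slash in the branch name is a problem.
for branch in ["59-link-built-pool-to-image-in-acr", "feature/new-pool", "main"]:
    print(branch, is_valid_tag(branch))
```

Under this assumed grammar, a branch such as `feature/new-pool` is rejected because of the slash, while this PR's branch name passes.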
28 changes: 28 additions & 0 deletions azure/pool-config-template.toml
@@ -0,0 +1,28 @@
[Authentication]
vault_url = ""
vault_sp_secret_id= ""
tenant_id = ""
application_id = ""
subscription_id = ""
user_assigned_identity = ""
client_id = ""
principal_id = ""
subnet_id = ""
resource_group = ""

[Storage]
storage_account_url = ""
storage_account_name = ""
user_assigned_identity = ""

[Container]
container_registry_url = ""
container_registry_username = ""
container_registry_password = ""
container_registry_server = ""
container_image_name = "{{ IMAGE_NAME }}"

[Batch]
pool_vm_size = "{{ VM_SIZE }}"
pool_id = "{{ POOL_ID }}"
batch_account_name = ""
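
The template leaves three `{{ NAME }}` placeholders that the workflow fills in with `sed`. For filling them outside CI, the same substitution can be sketched in Python. This is an illustration under stated assumptions, not part of the PR: `render_template` is a hypothetical helper, the values are made up, and unlike the `sed` expressions it tolerates any amount of whitespace inside the braces.

```python
import re

# Matches {{ NAME }} with optional whitespace around the name.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(text: str, values: dict) -> str:
    """Replace every {{ NAME }} placeholder; raise on names with no value."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"No value supplied for placeholder {name!r}")
        return values[name]

    return PLACEHOLDER.sub(substitute, text)


template = '''[Batch]
pool_vm_size = "{{ VM_SIZE }}"
pool_id = "{{ POOL_ID }}"
'''

# Illustrative values only; real ones come from the workflow environment.
rendered = render_template(
    template,
    {"VM_SIZE": "example-vm-size", "POOL_ID": "cfa-epinow2-pool-main"},
)
print(rendered)
```

Raising on an unknown placeholder is a deliberate choice in this sketch: a leftover `{{ ... }}` in the rendered file would otherwise be passed to Azure as a literal string.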
218 changes: 218 additions & 0 deletions azure/pool.py
@@ -0,0 +1,218 @@
import datetime
import os
import sys
import toml

from azure.identity import (
    ChainedTokenCredential,
    EnvironmentCredential,
    AzureCliCredential,
    ClientSecretCredential,
)
from azure.keyvault.secrets import SecretClient
from azure.storage.blob import BlobServiceClient
from azure.mgmt.batch import BatchManagementClient
from azure.core.exceptions import HttpResponseError


def create_container(blob_service_client: BlobServiceClient, container_name: str):
    container_client = blob_service_client.get_container_client(
        container=container_name
    )
    if not container_client.exists():
        container_client.create_container()
        print("Container [{}] created.".format(container_name))
    else:
        print("Container [{}] already exists.".format(container_name))


def get_autoscale_formula(fn):
    with open(fn, "r") as autoscale_text:
        return autoscale_text.read()


if __name__ == "__main__":
    start_time = datetime.datetime.now()

    # Reading a configuration file from the command line
    if len(sys.argv) == 3:
        config_file = sys.argv[1]
        config = toml.load(config_file)
        autoscale_fn = sys.argv[2]
    else:
        raise Exception(
            "The function needs two arguments, a path to the config.toml "
            "and a path to the autoscale formula file."
        )

    # # Load configuration
    # config = toml.load("run_azure_batch/configuration.toml")

    # Get credential
    # First use user credential to access the key vault
    credential_order = (EnvironmentCredential(), AzureCliCredential())
    user_credential = ChainedTokenCredential(*credential_order)
    secret_client = SecretClient(
        vault_url=config["Authentication"]["vault_url"],
        credential=user_credential,
    )
    sp_secret = secret_client.get_secret(
        config["Authentication"]["vault_sp_secret_id"]
    ).value
    # Get Service Principal credential from key vault
    sp_credential = ClientSecretCredential(
        tenant_id=config["Authentication"]["tenant_id"],
        client_id=config["Authentication"]["application_id"],
        client_secret=sp_secret,
    )

    # Create the Azure Storage Blob Service Client
    blob_service_client = BlobServiceClient(
        account_url=config["Storage"]["storage_account_url"],
        credential=sp_credential,
    )

    # Create the blob storage container for this batch job
    # [2024-11-20 George] We are not using this now, so it should be
    # removed before merging the PR.
    input_container_name = "nnh-rt-input"
    create_container(blob_service_client, input_container_name)
    output_container_name = "nnh-rt-output"
    create_container(blob_service_client, output_container_name)

    # Create the Azure Batch Management client
    batch_mgmt_client = BatchManagementClient(
        credential=sp_credential,
        subscription_id=config["Authentication"]["subscription_id"],
    )

    # Define the JSON for the batch pool creation request

    # User-assigned identity for the pool
    user_identity = {
        "type": "UserAssigned",
        "userAssignedIdentities": {
            config["Authentication"]["user_assigned_identity"]: {
                "clientId": config["Authentication"]["client_id"],
                "principalId": config["Authentication"]["principal_id"],
            }
        },
    }

    # Network configuration with no public IP and virtual network
    network_config = {
        "subnetId": config["Authentication"]["subnet_id"],
        "publicIPAddressConfiguration": {"provision": "NoPublicIPAddresses"},
        "dynamicVnetAssignmentScope": "None",
    }

    # Virtual machine configuration
    deployment_config = {
        "virtualMachineConfiguration": {
            "imageReference": {
                "publisher": "microsoft-azure-batch",
                "offer": "ubuntu-server-container",
                "sku": "20-04-lts",
                "version": "latest",
            },
            "nodeAgentSkuId": "batch.node.ubuntu 20.04",
            "containerConfiguration": {
                "type": "dockercompatible",
                "containerImageNames": [config["Container"]["container_image_name"]],
                "containerRegistries": [
                    {
                        "userName": config["Container"]["container_registry_username"],
                        "password": config["Container"]["container_registry_password"],
                        "registryServer": config["Container"][
                            "container_registry_server"
                        ],
                        # "registryServer": config["Container"]["container_registry_url"],
                        # "identityReference": {
                        #     "resourceId": config["Authentication"][
                        #         "user_assigned_identity"
                        #     ]
                        # },
                    }
                ],
            },
        }
    }

    # Mount configuration
    mount_config = [
        {
            "azureBlobFileSystemConfiguration": {
                "accountName": config["Storage"]["storage_account_name"],
                "identityReference": {
                    "resourceId": config["Authentication"]["user_assigned_identity"]
                },
                "containerName": "nnh-rt-input",
                "blobfuseOptions": "-o direct_io",
                "relativeMountPath": "input",
            }
        },
        {
            "azureBlobFileSystemConfiguration": {
                "accountName": config["Storage"]["storage_account_name"],
                "identityReference": {
                    "resourceId": config["Authentication"]["user_assigned_identity"]
                },
                "containerName": "nnh-rt-output",
                "blobfuseOptions": "-o direct_io",
                "relativeMountPath": "output",
            }
        },
    ]

    # Assemble the pool parameters JSON
    pool_parameters = {
        "identity": user_identity,
        "properties": {
            "vmSize": config["Batch"]["pool_vm_size"],
            "interNodeCommunication": "Disabled",
            "taskSlotsPerNode": 1,
            "taskSchedulingPolicy": {"nodeFillType": "Spread"},
            "deploymentConfiguration": deployment_config,
            "networkConfiguration": network_config,
            "scaleSettings": {
                # "fixedScale": {
                #     "targetDedicatedNodes": 1,
                #     "targetLowPriorityNodes": 0,
                #     "resizeTimeout": "PT15M"
                # }
                "autoScale": {
                    "evaluationInterval": "PT5M",
                    "formula": get_autoscale_formula(autoscale_fn),
                }
            },
            "resizeOperationStatus": {
                "targetDedicatedNodes": 1,
                "nodeDeallocationOption": "Requeue",
                "resizeTimeout": "PT15M",
                "startTime": "2023-07-05T13:18:25.7572321Z",
            },
            "currentDedicatedNodes": 1,
            "currentLowPriorityNodes": 0,
            "targetNodeCommunicationMode": "Simplified",
            "currentNodeCommunicationMode": "Simplified",
            "mountConfiguration": mount_config,
        },
    }

    pool_id = config["Batch"]["pool_id"]
    account_name = config["Batch"]["batch_account_name"]
    resource_group_name = config["Authentication"]["resource_group"]

    try:
        batch_mgmt_client.pool.create(
            resource_group_name=resource_group_name,
            account_name=account_name,
            pool_name=pool_id,
            parameters=pool_parameters,
        )
        print(f"Pool {pool_id!r} created")
    except HttpResponseError as error:
        if "PropertyCannotBeUpdated" in error.message:
            print(f"Pool {pool_id!r} already exists")
        else:
            raise
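
`pool.py` reads its settings with nested lookups such as `config["Batch"]["pool_id"]`, so a missing entry surfaces as a bare `KeyError` only when that line runs, possibly after some Azure calls have already been made. A minimal sketch of an upfront check, assuming the parsed TOML is a plain nested dict: `REQUIRED_KEYS` and `find_missing` are hypothetical names, and the key list is a subset of the template chosen for illustration.

```python
# Illustrative subset of the sections and keys from the config template.
REQUIRED_KEYS = {
    "Authentication": ["vault_url", "tenant_id", "subscription_id", "resource_group"],
    "Storage": ["storage_account_url", "storage_account_name"],
    "Container": ["container_registry_server", "container_image_name"],
    "Batch": ["pool_vm_size", "pool_id", "batch_account_name"],
}


def find_missing(config: dict) -> list:
    """Return 'Section.key' names that are absent or set to an empty value."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        table = config.get(section, {})
        for key in keys:
            if not table.get(key):
                missing.append(f"{section}.{key}")
    return missing


# A deliberately incomplete config, standing in for the parsed TOML file.
example_config = {
    "Authentication": {"vault_url": "https://example.invalid"},
    "Batch": {"pool_id": ""},
}
print(find_missing(example_config))
```

Running such a check right after loading the file would report every gap at once, including template values left as `""`.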
11 changes: 11 additions & 0 deletions azure/requirements.txt
@@ -0,0 +1,11 @@
azure-common==1.1.28
azure-core==1.32.0
azure-identity==1.19.0
azure-keyvault==4.2.0
azure-keyvault-certificates==4.9.0
azure-keyvault-keys==4.10.0
azure-keyvault-secrets==4.9.0
azure-mgmt-batch==18.0.0
azure-mgmt-core==1.5.0
azure-storage-blob==12.24.0
toml==0.10.2