All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- update MWAA to 2.10.1
- update MWAA dependencies
- update ray modules to use kubectl handler role & update CDK to 2.166.0
- update IDF module versions to 1.13.0
- pin MWAA requirements file version
- added mlflow-ai-gw-image module
- changed ray-image to pull from AWS Public ECR to avoid docker pull rate limits
- changed ray-orchestrator to not retrieve full training job logs and avoid States.DataLimitExceeded
- update ray-on-eks manifest cluster resources
- added GitHub as code repository option along with AWS CodeCommit for sagemaker templates batch_inference, finetune_llm_evaluation, hf_import_models and xgboost_abalone
- added ray-orchestrator module
- added GitHub as alternate option for code repository support along with AWS CodeCommit for sagemaker-templates-service-catalog module
- added SageMaker Ground Truth labeling module
- updated manifests to IDF release 1.12.0
- added new manifest manifests/fine-tuning-6B
- updated mlflow version to 2.16.0 to support LLM tracing
- remove CDK overhead from mlflow-image module
- renamed mlflow manifests and updated README.MD
- added head tolerations & node labels for flexible Ray cluster pod scheduling
- added documentation for MWAA SageMaker training DAG manifest
- added documentation for Ray on EKS manifests
- added network isolation and inter-container encryption for xgboost template
- added partition support for modules:
  - fmops/sagemaker-jumpstart-fm-endpoint
  - sagemaker/sagemaker-endpoint
  - sagemaker/sagemaker-notebook
  - sagemaker/sagemaker-studio
- added Bedrock fine-tuning manifest
- added accelerate as extra for transformers in finetune llm template
- limited bucket name length in templates to avoid pipeline failures when using long project names
- increased timeout on finetune_llm_evaluation project from 1 hour (default) to 4 hours
- pin ray-operator, ray-cluster, and ray-image module versions
- pin module versions for all manifests
- the sagemaker/sagemaker-model-package-promote-pipeline module no longer generates a Docker image
- lowercase fine-tuning-6b deployment name due to CDK resource naming constraints
- add a workflow specific to changes to requirements-dev.txt so all static checks are run
- add ray-cluster module based on kuberay-helm charts
- added FSx for Lustre to ray-on-eks manifest & persistent volume claim to ray-cluster module
- added worker tolerations to ray-cluster module
- add integration tests for sagemaker-studio
- bump ecr module version to 1.10.0 to consume auto-delete images feature
- add service account to kuberay
- updated get-modules workflow to only run tests against changed files in modules/**
- Updated the sagemaker-templates-service-catalog module documentation to match the code layout.
- Modernize sagemaker-templates-service-catalog packaging and remove unused dependencies.
- remove custom manifests via dataFiles from ray-on-eks
- refactor ray-on-eks to ray-cluster and ray-operator modules
- downscope ray-operator service account permissions
- add an example custom ray-image
- document available manifests in readme
- add permission for SM studio to describe apps when domain resource isolation is enabled
- updated ray-on-eks manifest to use latest EKS IDF release
- added ray-on-eks and manifests/ray-on-eks manifests
- added a sagemaker-model-monitoring-module module with an example of data quality, model quality, model bias, and model explainability monitoring of a SageMaker Endpoint
- added an option to enable data capture in the sagemaker-endpoint-module
- added a personas example module to deploy various roles required for an AI/ML project
- added sagemaker-model-cicd module
- added sagemaker_domain_arn as an optional input for multiple modules; resources created are tagged with the domain ARN to support domain resource isolation
- added enable_network_isolation as an optional input for the sagemaker-endpoint module, defaults to true
- added enable_domain_resource_isolation as an optional input for the sagemaker-studio module; adds an IAM policy to Studio roles preventing access to resources from outside the domain, defaults to true
- added StudioDomainArn as an output from the sagemaker-studio module
- added enable_network_isolation as a parameter for the model_deploy template
- remove explicit module manifest account/region mappings from fmops-qna-rag
- moved CI/CD infra to a separate repository and added a self-mutation pipeline to provision infra for the sagemaker-templates-service-catalog module
- changed ECR encryption to KMS_MANAGED
- changed encryption for each bucket to KMS_MANAGED
- refactor airflow-dags module to use Pydantic
- fix bedrock-finetuning module inputs not working
- add retention-type argument for the bucket in the bedrock-finetuning module
- fix broken dependencies for examples/airflow-dags
- use add_dependency to avoid deprecation warnings from CDK
- various typo fixes
- various clean-ups to the SageMaker Service Catalog templates
- fix opensearch removal policy
- update MWAA to 2.9.2
- update mwaa constraints
- limit length of id in model name to prevent model name becoming too long
- add permission for get secret value in hf_import_models template
- add manifests/tags parameters to one-click-template
- add integration tests for mlflow-image
- added multi-account sagemaker-mlops manifest example
- fixed model deploy cross-account permissions
- added bucket and model package group names as stack outputs in the sagemaker-templates module
- refactor inputs for the following modules to use Pydantic:
  - mlflow-fargate
  - mlflow-image
  - sagemaker-studio
  - sagemaker-endpoint
  - sagemaker-templates-service-catalog
  - sagemaker-custom-kernel
  - qna-rag
- add CDK nag to qna-rag module
- rename seedfarmer project name to aiops
- chore: adding some missing auto_delete attributes
- chore: Add auto_delete to mlflow-fargate elb access logs bucket
- updating storage/ecr module to latest pending v1.8.0 of IDF
- enabled ECR image scan on push
- added managed autoscaling config to sagemaker-endpoint module
- added SSO support in sagemaker-studio module
- added VPC/subnets/sg config for multi-account project template to sagemaker-templates-service-catalog module
- added sagemaker-custom-kernel module
- added batch inference project template to sagemaker-templates-service-catalog module
- added EFS removal policy to mlflow-fargate module
- added mwaa module with an example DAG that demonstrates MLOps in Airflow
- added sagemaker-model-event-bus module
- added sagemaker-model-package-group module
- added sagemaker-model-package-promote-pipeline module
- added sagemaker-hugging-face-endpoint module
- added hf_import_models template to import Hugging Face models
- added qna-rag module
- added bedrock-finetuning module
- reorganized manifests by use case
- add account/region props for project templates in sagemaker-templates-service-catalog module
- fix sagemaker-templates-service-catalog model deploy role lookup issue & abalone_xgboost model registry permissions
- update sagemaker-custom-kernel module IAM permissions
- split xgboost_abalone and model_deploy project templates in sagemaker-templates-service-catalog module
- add support for other AWS partitions
- update MySQL instance to use T3 instance type
- upgrade cdk_ecr_deployment version to fix the deprecated go1.x lambda runtime
- remove AmazonSageMakerFullAccess from multi_account_basic template in the sagemaker-templates-service-catalog module
- remove AmazonSageMakerFullAccess from sagemaker-endpoint module
- added sagemaker-templates-service-catalog module with multi_account_basic project template
- bump cdk & ecr deployment version to fix deprecated custom resource runtimes issue in mlflow-image
- added sagemaker-jumpstart-fm-endpoint module
- added RDS persistence layer to MLFlow modules
- added mlflow-image and mlflow-fargate modules
- added sagemaker-studio module
- added sagemaker-endpoint module
- added sagemaker-notebook module
- refactor validation script to use ruff instead of black and isort