From 9063f609026143722c470008ae69d3f6e8496287 Mon Sep 17 00:00:00 2001 From: Kevin Stofan Date: Thu, 15 Feb 2024 03:35:34 -0500 Subject: [PATCH] Artifacts TTL example (#503) * Added artifacts ttl .ipynb * clean up and refactor * add Noa's comment --------- Co-authored-by: Thomas Capelle --- ...tifacts_Time_to_live_TTL_Walkthrough.ipynb | 389 ++++++++++++++++++ 1 file changed, 389 insertions(+) create mode 100644 colabs/wandb-artifacts/WandB_Artifacts_Time_to_live_TTL_Walkthrough.ipynb diff --git a/colabs/wandb-artifacts/WandB_Artifacts_Time_to_live_TTL_Walkthrough.ipynb b/colabs/wandb-artifacts/WandB_Artifacts_Time_to_live_TTL_Walkthrough.ipynb new file mode 100644 index 00000000..cb29b5f6 --- /dev/null +++ b/colabs/wandb-artifacts/WandB_Artifacts_Time_to_live_TTL_Walkthrough.ipynb @@ -0,0 +1,389 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\"Open\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Weights & Biases Artifacts Time-to-live (TTL) Walkthrough\n", + "W&B Artifacts now supports setting time-to-live policies on each version of an Artifact. The feature is currently available in W&B SaaS Cloud and will be released to Enterprise customers using W&B Server in version 0.42.0. The following examples show the use TTL policy in a common Artifact logging workflow. We'll cover:\n", + "\n", + "- Setting a TTL policy when creating an Artifact\n", + "- Retroactively setting TTL for a specific Artifact aliases\n", + "- Using the W&B API to set a TTL for all versions of an Artifact" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setup\n", + "Let's do a few things before we get started. Below we will:\n", + "\n", + "- Install the wandb library and download a dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!pip install wandb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "log to wandb" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import wandb\n", + "wandb.login()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Image Sampling\n", + "For the purposes of the walkthrough, we will sample from the Imagenette dataset and organize them into training and validation directories in our Colab session. The block below:\n", + "\n", + "- Creates folders for our sampled images if they don't already exist\n", + "- Selects a random sample of images from the Imagenette dataset\n", + "- Organizes the samples into training and validation directories\n", + "\n", + "*Note: we overwrite the files every time we execute this so we get new Artifact versions.*\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "imagenette_url = \"https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-160.tgz\"\n", + "!wget {imagenette_url} -O \"imagenette.tgz\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def untar_file(file_path, dest_path):\n", + " import tarfile\n", + " with tarfile.open(file_path, \"r:gz\") as tar:\n", + " tar.extractall(dest_path)\n", + "\n", + "untar_file(\"imagenette.tgz\", \"./\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are going to use Imagenette dataset for this example. [Imagenette](https://github.com/fastai/imagenette) is a subset of 10 easily classified classes from Imagenet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute). It was created by Jeremy Howard and is a great dataset to experiment with." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "from pathlib import Path\n", + "\n", + "dataset_dir = Path(\"imagenette2-160\")\n", + "\n", + "# let's keep 5% of the images\n", + "for image in dataset_dir.rglob(\"*.JPEG\"):\n", + " if random.random() > 0.05:\n", + " image.unlink()\n", + "\n", + "# we get two image folders: train and validation \n", + "train_source_dir = Path(\"imagenette2-160/train\")\n", + "val_source_dir = Path(\"imagenette2-160/val\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Image Preview\n", + "Quick block to view some of the images in the sampled dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from PIL import Image\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def show_sample_images(img_dir, num_images=5):\n", + " images = list(img_dir.rglob(\"*.JPEG\"))[:num_images]\n", + " fig, axes = plt.subplots(1, len(images), figsize=(15, 5))\n", + "\n", + " # Iterate over the images and display them\n", + " for i, img_path in enumerate(images):\n", + " img = Image.open(img_path)\n", + " axes[i].imshow(img)\n", + " axes[i].axis('off') # Turn off axis labels\n", + "\n", + " plt.tight_layout()\n", + " plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "show_sample_images(train_source_dir)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting TTL on New Artifacts\n", + "Below we create two new Artifacts for our real and fake data. Because we have internal retention policies in hypothetical organization we'd like to remove any Artifact that has real data (potentially containing personal data). Below we:\n", + "\n", + "- Create a W&B Run to track the logging of these raw data Artifacts\n", + "- Set the ttl attribute on the real raw data\n", + "- Log our two Artifacts\n", + "\n", + "> We will use the train dataset as our real data and the validation dataset as our fake data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import timedelta\n", + "\n", + "with wandb.init(entity=\"wandb-smle\", project=\"artifacts-ttl-demo\", job_type=\"raw-data\") as run:\n", + " raw_real_art = wandb.Artifact(\n", + " \"real-raw\", type=\"dataset\",\n", + " description=\"Raw sample train Imagenette\"\n", + " )\n", + "\n", + " raw_real_art.add_dir(train_source_dir)\n", + " raw_real_art.ttl = timedelta(days=10)\n", + " run.log_artifact(raw_real_art)\n", + "\n", + " raw_fake_art = wandb.Artifact(\n", + " \"fake-raw\", type=\"dataset\",\n", + " description=\"Raw sample from val Imagenette\"\n", + " )\n", + "\n", + " raw_fake_art.add_dir(val_source_dir)\n", + " run.log_artifact(raw_fake_art)\n", + "\n", + " run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Updating/Retroactively Setting TTL on Artifacts\n", + "In our hypothetical organization we've been given approval to retain a specific version of our data indefinitely. We've also been given approval to extend the retention date of an additional dataset. Below we'll:\n", + "\n", + "- Extend the TTL of an Artifact tagged with the `extended` alias\n", + "- Remove the TTL of an Artifact tagged with the `compliant` alias\n", + "- Programmatically check the status of these two Artifacts" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "with wandb.init(entity=\"wandb-smle\", project=\"artifacts-ttl-demo\", job_type=\"modify-ttl\") as run:\n", + " extended_art = run.use_artifact(\"wandb-smle/artifacts-ttl-demo/real-raw:extended\")\n", + " extended_art.ttl = timedelta(days=365) # Delete in a year\n", + " extended_art.save()\n", + "\n", + " compliant_art = run.use_artifact(\"wandb-smle/artifacts-ttl-demo/real-raw:compliant\")\n", + " compliant_art.ttl = None\n", + " compliant_art.save()\n", + "\n", + " print(extended_art.ttl)\n", + " print(compliant_art.ttl)\n", + "\n", + " run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Use W&B Import/Export API to Iterate Artifact Versions and Set TTL\n", + "Let's say we've received approval to retain all of the data within a given Artifact and we'd like to remove all TTL policies for every version of an Artifact. Below we:\n", + "\n", + "- Use the W&B API to get a list of all Runs in a project\n", + "- Get a list of all versions of a specific Artifact (e.g. `fake-raw`)\n", + "- Iterate over each version and remove any existing TTL policy associated with the version" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Artifact metadata extraction\n", + "api = wandb.Api()\n", + "\n", + "# Define entity and project\n", + "entity, project = \"wandb-smle\", \"artifacts-ttl-demo\"\n", + "\n", + "runs = api.runs(entity + \"/\" + project)\n", + "\n", + "version_names = []\n", + "for run in runs:\n", + " for artifact in iter(run.logged_artifacts()):\n", + " if \"fake-raw\" in artifact.name:\n", + " # Can be edited to just display individual elements\n", + " version_names.append(f\"{artifact.name}/{artifact.version}\")\n", + "\n", + "with wandb.init(entity=\"wandb-smle\", project=\"artifacts-ttl-demo\", job_type=\"modify-ttl\") as run:\n", + " for version in version_names:\n", + " version_art = run.use_artifact(f\"wandb-smle/artifacts-ttl-demo/{'/'.join(version.split('/')[:-1])}\")\n", + " version_art.ttl = None\n", + " version_art.save()\n", + " print(version_art.ttl)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "> To apply a TTL policy to all artifacts within a team's projects, team admins can set default TTL policies for their team. The default will be applied to both existing and future artifacts logged to projects as long as no custom policies have been set. To learn more about configuring a team default TTL, visit [this](https://docs.wandb.ai/guides/artifacts/ttl#set-default-ttl-policies-for-a-team) section of the W&B documentation." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Traverse an Artifact Graph to Set Downstream TTL\n", + "In this last section, we'll do some preprocessing on our images and log those as downstream Artifacts. Once again we'll use the W&B Import/Export API to set a TTL policy on our downstream images for images that originated from our \"real\" dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preprocess and log a new Artifact" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "real_prepro_dir = Path(\"data/prepro/real\")\n", + "real_prepro_dir.mkdir(parents=True, exist_ok=True)\n", + "\n", + "def preprocess_image(image_path):\n", + " \"Resize the image to 64x64\"\n", + " return Image.open(image_path).resize((64, 64))\n", + "\n", + "with wandb.init(entity=\"wandb-smle\", project=\"artifacts-ttl-demo\", job_type=\"preprocessing\") as run:\n", + " real_art = run.use_artifact(\"wandb-smle/artifacts-ttl-demo/real-raw:latest\")\n", + " real_images = Path(real_art.download())\n", + "\n", + " for image_path in real_images.rglob(\"*.JPEG\"):\n", + " print(f\"Preprocessing {image_path.name}\")\n", + " preprocessed_image = preprocess_image(image_path)\n", + " preprocessed_image.save(real_prepro_dir / image_path.name)\n", + "\n", + " prepro_real_art = wandb.Artifact(\n", + " \"real-prepro\", type=\"dataset\",\n", + " description=\"Preprocessed images from CIFAR\"\n", + " )\n", + "\n", + " prepro_real_art.add_dir(real_prepro_dir)\n", + " run.log_artifact(prepro_real_art)\n", + " run.finish()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Traverse the Artifact Graph and Set TTL\n", + "Let's take a look at the original real dataset and traverse downstream runs and Artifacts to set a TTL policy on anything that originated from the real dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "api = wandb.Api()\n", + "\n", + "# For demo purposes we'll just do this on the latest version of the real dataset\n", + "artifact = api.artifact(\"wandb-smle/artifacts-ttl-demo/real-raw:latest\")\n", + "consumer_runs = artifact.used_by()\n", + "\n", + "# Same pattern from above to get all downstream versions\n", + "version_names = []\n", + "for run in consumer_runs:\n", + " for artifact in iter(run.logged_artifacts()):\n", + " # filter for datasets only\n", + " if artifact.type == \"dataset\":\n", + " # Can be edited to just display individual elements\n", + " version_names.append(f\"{artifact.name}/{artifact.version}\")\n", + "\n", + "with wandb.init(entity=\"wandb-smle\", project=\"artifacts-ttl-demo\", job_type=\"modify-ttl\") as run:\n", + " for version in version_names:\n", + " version_art = run.use_artifact(f\"wandb-smle/artifacts-ttl-demo/{'/'.join(version.split('/')[:-1])}\")\n", + " # set ttl to a random integer so we can see changes in the UI after we run this\n", + " version_art.ttl = timedelta(days=random.randint(1,100))\n", + " version_art.save()\n", + " run.finish()" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "include_colab_link": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}