Backlink: https://docs.atlas.dessa.com/en/latest/tutorials/image_segmentation_tutorial/
Estimated time: ~20 minutes
For this tutorial we recommend using a powerful machine, as it will help you run the code faster. A good option is to run Atlas on an AWS P2 instance. Here is a tutorial on how to set up Atlas with AWS.
If you don't have access to AWS, you can run Atlas on your personal machine locally.
Prerequisites
- Docker version >18.09 (Docker installation: Mac | Windows)
- Python >3.6 (Anaconda installation)
- >5GB of free machine storage
- The atlas_installer.py file.
- Get it from the GitHub Releases.
Installation Steps
See the documentation for installation steps.
FAQ: How to upgrade an older version of Atlas?
- Stop the Atlas server using:
atlas-server stop
- Remove the Docker images related to Atlas by running the following in your terminal:
docker images | grep atlas-ce | awk '{print $3}' | xargs docker rmi -f
- Remove the environment in which you installed Atlas (or pip uninstall Atlas):
conda env remove -n your_env_name
This tutorial demonstrates how to make use of the features of Atlas. Note that any machine learning job can be run in Atlas without modification. However, with minimal changes to the code we can take advantage of Atlas features that will enable us to:
- view artifacts such as plots and tensorboard logs, alongside model performance metrics
- launch many training jobs at once
- organize model experiments more systematically
The dataset used for this tutorial is the Oxford-IIIT Pet Dataset, created by Parkhi et al. The dataset consists of images, their corresponding labels, and pixel-wise masks. The masks are essentially per-pixel labels: each pixel is assigned one of three categories:
- Class 1: Pixel belonging to the pet.
- Class 2: Pixel bordering the pet.
- Class 3: None of the above / surrounding pixel.
If you have already cloned this repo, download the processed data here. Place the downloaded file named train_data.npz under the data directory of the Image-segmentation-tutorial project. Otherwise, follow the instructions under Clone the Tutorial.
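Once the file is in place, you can sanity-check it from a Python shell. This is a minimal sketch; the array names stored inside train_data.npz are assumptions and may differ:

import numpy as np

# Load the archive and list the arrays it contains (the key names below are not guaranteed).
data = np.load('./data/train_data.npz', allow_pickle=True)
print(data.files)
# Hypothetical key: if a 'masks' array exists, its pixel values should fall in {1, 2, 3}.
# masks = data['masks']
# print(np.unique(masks))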
The model being used here is a modified U-Net. A U-Net consists of an encoder (downsampler) and a decoder (upsampler). In order to learn robust features and reduce the number of trainable parameters, a pretrained model can be used as the encoder. Thus, the encoder for this task will be a pretrained MobileNetV2 model, whose intermediate outputs will be used, and the decoder will be the upsample block already implemented in TensorFlow Examples in the Pix2pix tutorial.
The model outputs three channels because there are three possible labels for each pixel. Think of this as multi-class classification in which each pixel is assigned to one of three classes.
As mentioned, the encoder will be a pretrained MobileNetV2 model which is prepared and ready to use in tf.keras.applications. The encoder consists of specific outputs from intermediate layers in the model. Note that the encoder will not be trained during the training process.
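For reference, below is a sketch of how such a frozen MobileNetV2 encoder can be assembled with tf.keras.applications. It follows the TensorFlow segmentation tutorial; the exact layer names and input size used in this repository's main.py may differ.

import tensorflow as tf

# Pretrained MobileNetV2 backbone; selected intermediate activations act as the encoder outputs.
base_model = tf.keras.applications.MobileNetV2(input_shape=[128, 128, 3], include_top=False)

# Layer names as used in the TensorFlow segmentation tutorial (assumed, not taken from this repo).
layer_names = [
    'block_1_expand_relu',   # 64x64
    'block_3_expand_relu',   # 32x32
    'block_6_expand_relu',   # 16x16
    'block_13_expand_relu',  # 8x8
    'block_16_project',      # 4x4
]
layers = [base_model.get_layer(name).output for name in layer_names]

# The encoder (down_stack) is frozen, so only the decoder is trained.
down_stack = tf.keras.Model(inputs=base_model.input, outputs=layers)
down_stack.trainable = False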
In the following sections, we will describe how to use this repository and train your own image-segmentation ML model in just a few steps.
Clone this repository by running:
git clone https://github.com/dessa-research/Image-segmentation-tutorial.git
and then cd Image-segmentation-tutorial
in the terminal to make this your current directory.
Make sure that train_data.npz is under Image-segmentation-tutorial/data; otherwise, run:
cd data; wget https://dl-shareable.s3.amazonaws.com/train_data.npz; cd ..
Skip this step if you are using the Atlas CE AMI. Otherwise, activate the conda environment in which Foundations Atlas is installed (by running conda activate your_env
in the terminal). Then, if you are using a machine without a GPU, run atlas-server start
in a new terminal tab; otherwise, run atlas-server start -g
. Validate that the GUI has started by accessing it at http://localhost:5555/projects.
If you are using the cloud, the GUI should already be accessible at http://<instance_IP>:5555/projects
instead.
Activate the environment in which you have Foundations Atlas installed (if you are using the Atlas CE AMI, it should already be activated), then from inside the project directory (Image-segmentation-tutorial) run the following command:
foundations submit scheduler . code/main.py
Notice that you didn't need to install any other packages to run your job, because Foundations Atlas already takes care of that. This works because you have a requirements.txt
file in your main directory that specifies the Python packages needed by your project. Foundations Atlas uses that file to install your requirements before executing your codebase. If you take a look at the Atlas dashboard, you can see basic information about the job you ran, such as its start time, status, and job ID. You can also check the logs of your job by clicking the expand button at the right end of each job row.
Congrats! Your code is now tracked by Foundations Atlas! Let's move on to explore the magic of Atlas.
The Atlas features include:
- Experiment reproducibility
- Job status monitoring (e.g. running, killed, etc.) from the GUI
- Job metrics and hyperparameters analysis in the GUI
- Saving and viewing of any artifacts such as images, audio or video from the GUI
- Automatic job scheduling
- Live logs for running jobs and saved logs for finished or failed jobs, accessible from the GUI
- Hyperparameter search
- Tensorboard integration to analyze deep learning models
- Running jobs in Docker containers
Inside the code
directory, you are provided with the following python scripts:
- main.py: the main script, which prepares the data, trains a U-Net model, then evaluates the model on the test set.
To enable Atlas features, we only need to make a few changes. Let's start by importing foundations at the beginning of main.py
, where we will make most of our changes:
import foundations
When training machine learning models, it is always good practice to keep a record of the different architectures and parameters that were tried. Example parameters include the number of layers, the number of neurons per layer, the dataset used, or other parameters specific to the experiment.
To do that, Atlas enables any job parameters to be logged in the GUI using foundations.log_params(), which accepts key-value pairs.
Look for the comment:
# TODO Add foundations.log_params(hyper_params)
replace this with:
foundations.log_params(hyper_params)
Here, hyper_params
is a dictionary in which keys are parameter names and values are parameter values.
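For illustration only, a call might look like the sketch below; the actual dictionary and its values are the ones already defined in main.py.

import foundations

# Hypothetical values -- use the hyper_params already defined in main.py instead.
hyper_params = {'batch_size': 16,
                'epochs': 10,
                'learning_rate': 0.0001}
foundations.log_params(hyper_params)  # each key-value pair shows up as a column in the Atlas GUI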
In addition to keeping track of an experiment's parameters, it is also good practice to record the outcome of the experiment, typically called metrics. Example metrics include accuracy, precision, or other scores useful for analyzing the problem. In our case, the last line of main.py
outputs the training and validation accuracy. After these statements, we will call the function foundations.log_metric()
. This function takes two arguments, a key and a value. Once the function call has been added and a job successfully completes, the logged metrics for each job will be visible in the Foundations GUI. Copy the following lines and replace the print statement with them.
Look for the comment:
# TODO Add foundations log_metrics here
replace this line with the lines below:
foundations.log_metric('train_accuracy', float(train_acc))
foundations.log_metric('val_accuracy', float(val_acc))
We want to monitor the progress of our model during training by looking at the predicted masks for a given training image. With Atlas, we can save any artifact, such as images, audio, video, or any other files, to the GUI with just one line. Note that, in order to save an artifact to the Atlas dashboard, the artifact needs to be saved to disk first; the path of the file on disk is then used to log the artifact to the GUI.
Look for the comment:
# TODO Add foundations artifact i.e. foundations.save_artifact(f"sample_{name}.png", key=f"sample_{name}")
and replace it with:
foundations.save_artifact(f"sample_{name}.png", key=f"sample_{name}")
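To make the save-to-disk-first pattern concrete, here is a self-contained sketch using matplotlib and dummy data; the variable names and the figure content in main.py are different.

import numpy as np
import matplotlib.pyplot as plt
import foundations

# Dummy stand-in for a predicted mask (illustrative only).
name = 'epoch_0'
dummy_mask = np.random.randint(1, 4, size=(128, 128))

plt.imshow(dummy_mask)
plt.title(f"Predicted mask ({name})")
plt.savefig(f"sample_{name}.png")                # 1) write the artifact to disk
foundations.save_artifact(f"sample_{name}.png",  # 2) log the file path to the Atlas GUI
                          key=f"sample_{name}")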
Moreover, you can save the trained model checkpoint files as an artifact.
Look for the comment:
# TODO Add foundations save_artifacts here to save the trained model
and replace it with:
foundations.save_artifact('trained_model.h5', key='trained_model')
This will allow you to download the trained model corresponding to any experiment directly from the GUI.
TensorBoard is a powerful model visualization tool that makes the analysis of your training very easy. Luckily, Foundations Atlas has full TensorBoard integration, and only requires you to point it to the folder where your TensorBoard files are saved.
# Add tensorboard dir for foundations here i.e. foundations.set_tensorboard_logdir('tflogs')
Replace this line with
foundations.set_tensorboard_logdir('tflogs')
to access TensorBoard directly from the Atlas GUI.
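For context, here is a minimal sketch of how the log directory ties together, assuming training is driven by model.fit with a standard Keras TensorBoard callback (the repository's actual callback setup, including the gradient histograms shown later, may differ).

import tensorflow as tf
import foundations

# Tell Atlas where the TensorBoard files will be written.
foundations.set_tensorboard_logdir('tflogs')

# Write TensorBoard logs (metrics and weight histograms) to the same directory.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir='tflogs', histogram_freq=1)
# model.fit(..., callbacks=[tensorboard_cb])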
Congrats! You have now enabled full Atlas features in your code.
Now run the same command as before, i.e. foundations submit scheduler . code/main.py
, from the Image-segmentation-tutorial
directory. This time, the job we ran holds the set of parameters used in the experiment, as well as the metrics representing its outcome. More details about the job can be accessed via the expansion icon at the right of the row. The detail window includes the job logs, as well as the artifacts saved during the experiment. It is also possible to add tags
using the detail window to mark specific jobs.
You can also select a job (row) from the jobs table in the GUI and use Send to Tensorboard
to benefit from all the features available in TensorBoard. It is usually a smart idea to do an in-depth analysis of models to understand where they fail. Please note that jobs whose TensorBoard files were tracked by Atlas are marked with a tensorboard tag.
You can recover your code for any job at any time in the future. To recover the code corresponding to any Foundations Atlas job_id, just execute
foundations get job scheduler <job_id>
which will recover your experiment's bundle from the job store. You can access the job_id of individual experiments via the GUI.
In the previous runs, Foundations Atlas installed the libraries listed in requirements.txt
every time before executing your codebase. To avoid this overhead on every new job, you can build a custom Docker image that Foundations Atlas will use to run the experiments. Run the following commands in the terminal:
cd custom_docker_image
nvidia-docker build . --tag image_seg:atlas
Since the custom_docker_image
folder already contains a Dockerfile
that builds a Docker image supporting both Foundations Atlas and the requirements of the project, you have now created a Docker image named image_seg:atlas
on your local computer that contains the Python environment required to run this job.
In Atlas, it is possible to create a configuration file in your working directory that specifies some base information about all jobs you want to run, such as the project name (defaults to the directory name when not set), the log level, the number of GPUs to use per job, or the Docker image to use for every job. Below is an example of a configuration file that you can use for this project.
First, create a file named job.config.yaml
inside the code
directory, and copy the text below into the file. We will also make use of the Docker image we have already built, image_seg:atlas:
# Project config #
project_name: 'Image-segmentation-tutorial'
log_level: INFO
# Worker config #
# Additional definitions for the worker can be found here: https://docker-py.readthedocs.io/en/stable/containers.html
num_gpus: 0
worker:
  image: image_seg:atlas  # name of your customized image
  volumes:
    /absolute/path/to/folder/containing/data:
      bind: /data/
      mode: rw
Note: If you don't want to use the custom Docker image, you can comment out or delete the whole image
line inside the worker
section of the config file shown above.
Make sure to provide the correct path to your data folder, as described below:
Under the volumes
section, you will need to replace /absolute/path/to/folder/containing/data
with the absolute path of the data folder on your host machine, so that your data volume is mounted inside the Foundations Atlas Docker container. To obtain the absolute data path, you can cd data
and then run pwd
in the terminal.
Since we will mount our data folder from the host to the container, we need to change the data path appropriately inside our codebase.
train_data = np.load('./data/train_data.npz', allow_pickle=True)
Replace the above block where the train_data.npz
is loaded with the line below:
train_data = np.load('/data/train_data.npz', allow_pickle=True)
More details on how this works inside Foundations Atlas are provided under the Configuration
section above in this document.
Go inside the code
directory and run the command below in your terminal (make sure you are in the foundations environment).
foundations submit scheduler . main.py
This time we are running main.py
from inside the code
directory. This way, Foundations Atlas packages only the code
folder, and the data
folder is mounted directly inside the Foundations Atlas Docker container (as specified in the configuration file above). The data is therefore not part of the job package, which makes packaging much faster and more memory efficient.
At any point, to clear the queue of submitted jobs:
foundations clear-queue scheduler
After running your most recent job, you can see that the validation accuracy is not very impressive. The predicted artifacts don't look similar to the true masks either.
Let's analyze the gradients using TensorBoard to understand what is happening with this subpar model.
First, click the checkbox for your most recent job and press the Send to Tensorboard
button.
This should open a new tab with Tensorboard up and running.
Find the histograms tab.
There you will see gradient plots such as below, where the first upsample layer has a range of gradients between 0.4 and -0.4:
[Gradient histogram plots: Final upsample layer | Previous layers | ... | First upsample layer]
As is apparent from the plots, the gradients for the first upsample layer are small and centered around zero. To prevent vanishing gradients in the earlier layers, you can try modifying the code appropriately. Feel free to check the hints within the code! Alternatively, the correct solution can be found below.
[Plots: Validation accuracy | Validation loss]
Modern architectures often benefit from skip connections and appropriate activation functions to avoid the vanishing gradients problem.
Looking at the function main.py/unet_model
reveals that the skip connections were not implemented, which prevents the gradients from finding an easy way back to the input layer (thus the gradients vanish).
After the line x = up(x)
add the lines below to fix this:
concat = tf.keras.layers.Concatenate()
x = concat([x, skip])
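If the fix feels abstract, the self-contained toy sketch below shows the same concatenation pattern on a miniature encoder/decoder; the layer sizes are illustrative and unrelated to the repository's actual unet_model.

import tensorflow as tf

# Miniature encoder/decoder to illustrate the skip-connection pattern (not the repo's model).
inputs = tf.keras.Input(shape=(128, 128, 3))
skip = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)  # 64x64
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(skip)       # 32x32

up = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same')
x = up(x)                              # upsample back to 64x64
concat = tf.keras.layers.Concatenate()
x = concat([x, skip])                  # skip connection: a short path for the gradients
outputs = tf.keras.layers.Conv2DTranspose(3, 3, strides=2, padding='same')(x)                # 128x128

model = tf.keras.Model(inputs, outputs)
model.summary()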
Another problem in the model is the use of the sigmoid activation in the function pix2pix.py/upsample
, which is prone to saturation if the pre-activation outputs have large absolute values. An easy yet practical solution is to replace the sigmoid activation functions with ReLU activations:
result.add(tf.keras.layers.Activation('sigmoid'))
Modify this line as below:
result.add(tf.keras.layers.ReLU())
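As a quick numerical check of why this helps (for illustration only): sigmoid gradients shrink towards zero for inputs of large magnitude, while ReLU keeps a gradient of 1 for positive inputs.

import tensorflow as tf

x = tf.constant([-6.0, 0.0, 6.0])

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.sigmoid(x)
print(tape.gradient(y, x).numpy())   # approximately [0.0025, 0.25, 0.0025] -> saturated at the ends

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.nn.relu(x)
print(tape.gradient(y, x).numpy())   # [0., 0., 1.] -> constant gradient for positive inputs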
Running another job with these changes results in significantly higher accuracy, with the gradient plots below, where the first upsample layer (conv2d_transpose_4x4_to_8x8
under grad_sequential
) has a significantly larger range of gradients:
[Gradient histogram plots: Final upsample layer | Previous layers | ... | First upsample layer]
[Plots: Validation accuracy | Validation loss]
Atlas makes running multiple experiments and tracking the results of a set of hyperparameters easy. Create a new file called 'hyperparameter_search.py' inside the code
directory and paste in the following code:
import os
import numpy as np
import foundations

NUM_JOBS = 10


def generate_params():
    hyper_params = {'batch_size': int(np.random.choice([8, 16, 32, 64])),
                    'epochs': int(np.random.choice([10, 20, 30])),
                    'learning_rate': np.random.choice([0.01, 0.001, 0.0001]),
                    'decoder_neurons': [np.random.randint(16, 512), np.random.randint(16, 512),
                                        np.random.randint(16, 512), np.random.randint(16, 512)],
                    }
    return hyper_params


for job_ in range(NUM_JOBS):
    print(f"packaging job {job_}")
    hyper_params = generate_params()
    foundations.submit(scheduler_config='scheduler', job_directory='.', command='main.py', params=hyper_params,
                       stream_job_logs=False)
This script samples hyperparameters uniformly from pre-defined ranges, then submits jobs using those hyperparameters. For a script that exerts more control over the hyperparameter sampling, check the end of the tutorial. The job execution code still comes from main.py; i.e. each experiment is submitted with and run by that script.
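As an aside, if you want tighter control than uniform random sampling, one possible variant is an exhaustive grid search. The sketch below uses illustrative parameter values; the tutorial's own "more control" script may look different.

import itertools
import foundations

# Illustrative search space; every combination below becomes one job.
search_space = {'batch_size': [16, 32],
                'epochs': [10, 20],
                'learning_rate': [0.001, 0.0001],
                'decoder_neurons': [[64, 128, 256, 512]]}

for values in itertools.product(*search_space.values()):
    hyper_params = dict(zip(search_space.keys(), values))
    print(f"packaging job with {hyper_params}")
    foundations.submit(scheduler_config='scheduler', job_directory='.', command='main.py',
                       params=hyper_params, stream_job_logs=False)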
In order to get this to work, a small modification needs to be made to main.py. In the code block where the hyperparameters are defined (indicated by the comment 'define hyperparameters'), we'll load the sampled hyperparameters instead of defining a fixed set of hyperparameters explicitly.
# define hyperparameters: Replace hyper_params by foundations.load_parameters()
hyper_params = {'batch_size': 16,
                'epochs': 10,
                'learning_rate': 0.0001,
                'decoder_neurons': ...}
Replace the above block with the following:
hyper_params = foundations.load_parameters()
Now, to run the hyperparameter search, from the code
directory simply run:
python hyperparameter_search.py
By looking at the GUI, you might notice that some jobs are running, others have already finished, while some are still queued, waiting for resources to become available before starting to run.
There are, however, some key features worth noting that Atlas provides to make hyperparameter search analysis easier:
- Sort parameters and metrics by value
- Filter out unwanted metrics/parameters to avoid information overflow in the GUI
- Parallel Coordinates Plot: A highly interactive plot that shows the correlation between parameters and metrics, or even the correlation between a set of metrics. It is possible to interact with the plot in real time to either select certain parameters/metrics, or to select specific jobs based on a range of metric values/parameter values. As such, one can easily detect the optimal parameters that contribute to the best metric values.
- Multi-job tensorboard comparison: It is very important to do an in-depth comparison of multiple different jobs using tensorboard to figure out the advantages and limitations of every architecture, as well as build an intuition about the required model type/complexity to solve the problem at hand.
That's it! You've completed the Image Segmentation Tutorial using Foundations Atlas CE!
Do you have any thoughts or feedback for Foundations Atlas? Join the Dessa Slack community and share your own projects that benefit from Foundations Atlas CE!
Copyright 2015-2020 Square, Inc.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
© 2020 Square, Inc. ATLAS, DESSA, the Dessa Logo, and others are trademarks of Square, Inc. All third party names and trademarks are properties of their respective owners and are used for identification purposes only.