From 643be314bd9c7be1ff648c05204cac663b77d32f Mon Sep 17 00:00:00 2001
From: vlukashenko
Date: Wed, 21 Feb 2024 20:17:10 +0100
Subject: [PATCH 1/2] deprecate gitlab ci from docker workshop => to be moved to challenges
---
 _episodes/08-gitlab-ci.md              | 236 -------------------------
 _episodes/09-containerized-analysis.md | 144 ---------------
 2 files changed, 380 deletions(-)
 delete mode 100644 _episodes/08-gitlab-ci.md
 delete mode 100755 _episodes/09-containerized-analysis.md

diff --git a/_episodes/08-gitlab-ci.md b/_episodes/08-gitlab-ci.md
deleted file mode 100644
index 5e91a98..0000000
--- a/_episodes/08-gitlab-ci.md
+++ /dev/null
@@ -1,236 +0,0 @@
---
title: "GitHub and Docker Hub for Automated Environment Preservation"
teaching: 20
exercises: 25
questions:
- "What do I need to do to enable this automated environment preservation on GitHub?"
objectives:
- "Learn how to write a Dockerfile to containerize your analysis code and environment."
- "Understand how to use GitHub + Docker Hub to enable automatic environment preservation."
keypoints:
- "The combination of GitHub and Docker Hub allows you to automatically rebuild your docker containers every time you push to a repository."
---

## Introduction
In this section, we learn how to combine the forces of Docker Hub and GitHub to automatically keep your analysis environment up-to-date.

We will be doing this using the [CMS OpenData HTauTau Analysis Payload](https://hsf-training.github.io/hsf-training-cms-analysis-webpage/). Specifically, we will be using two "snapshots" of this code, which are the repositories described on the [setup page](https://hsf-training.github.io/hsf-training-docker/setup.html) of this training. A walkthrough of how to set up those repositories can also be found [in this video](https://www.youtube.com/watch?v=krsBupoxoNI&list=PLKZ9c4ONm-VnqD5oN2_8tXO0Yb1H_s0sj&index=7). The "snapshot" repositories are available on GitHub ([skimmer repository](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot) and [statistics repository](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot-stats)). If you don't already have this set up, take a detour now, watch that video, and revisit the setup page.

### Writing your Dockerfile

The goal of automated environment preservation is to create a docker image in which you can **immediately** start executing your analysis code upon startup. Let's review the needed components for this:

 * Set up the OS, system libraries, and other dependencies that your code depends on,
 * Add your analysis code to the container, and
 * Build the code so that it can just be executed trivially inside the container.

As we've seen, all these components can be encoded in a Dockerfile. So the first step to set up automated image building is to add a Dockerfile to the repo specifying these components.

> ## The `rootproject/root` docker image
> In this tutorial, we build our analysis environments on top of the `rootproject/root` base image ([link to the project area on Docker Hub](https://hub.docker.com/r/rootproject/root)) with conda. This image comes with ROOT 6.22 and Python 3.8 pre-installed. It also comes with XRootD for downloading files from EOS.
> The `rootproject/root` image is itself built with a [Dockerfile](https://github.com/root-project/root-docker/blob/6.22.06-conda/conda/Dockerfile), which uses conda to install ROOT and Python on top of another base image (`condaforge/miniforge3`).
{: .callout}
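Before writing your own Dockerfile, it can help to poke at the base image directly to see what it provides. A quick check might look like the sketch below (if the ROOT environment is not picked up by a non-interactive shell, start an interactive one instead with `docker run -it --rm rootproject/root:6.22.06-conda /bin/bash`):

~~~bash
# Download the base image
docker pull rootproject/root:6.22.06-conda

# Check the ROOT and Python versions that come pre-installed
docker run --rm rootproject/root:6.22.06-conda bash -c 'root-config --version && python --version'
~~~
{: .source}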
> ## Exercise (15 min)
> Working from your bash shell, `cd` into the top level of the repo you use for skimming, that being the "event selection" snapshot of the CMS HTauTau analysis payload. Create an empty file named `Dockerfile`.
>
> ~~~bash
> touch Dockerfile
> ~~~
> {: .source}
>
> Now open the Dockerfile with a text editor and, starting with the following skeleton, fill in the FIXMEs to make a Dockerfile that fully specifies your analysis environment in this repo.
>
> ~~~dockerfile
> # Start from the rootproject/root:6.22.06-conda base image
> [FIXME]
>
> # Put the current repo (the one in which this Dockerfile resides) in the /analysis/skim directory
> # Note that this directory is created on the fly and does not need to reside in the repo already
> [FIXME]
>
> # Make /analysis/skim the default working directory (again, it will create the directory if it doesn't already exist)
> [FIXME]
>
> # Compile an executable named 'skim' from the skim.cxx source file
> RUN echo ">>> Compile skimming executable ..." && \
> COMPILER=[FIXME] && \
> FLAGS=[FIXME] && \
> [FIXME]
> ~~~
> {: .source}
>
> Hint: have a look at `skim.sh` if you are unsure about how to complete the last `RUN` statement!
>
> > ## Solution
> > ~~~dockerfile
> > # Start from the rootproject/root base image with conda
> > FROM rootproject/root:6.22.06-conda
> >
> > # Put the current repo (the one in which this Dockerfile resides) in the /analysis/skim directory
> > # Note that this directory is created on the fly and does not need to reside in the repo already
> > COPY . /analysis/skim
> >
> > # Make /analysis/skim the default working directory (again, it will create the directory if it doesn't already exist)
> > WORKDIR /analysis/skim
> >
> > # Compile an executable named 'skim' from the skim.cxx source file
> > RUN echo ">>> Compile skimming executable ..." && \
> > COMPILER=$(root-config --cxx) && \
> > FLAGS=$(root-config --cflags --libs) && \
> > $COMPILER -g -std=c++11 -O3 -Wall -Wextra -Wpedantic -o skim skim.cxx $FLAGS
> > ~~~
> > {: .source}
> {: .solution}
>
> Once you're happy with your Dockerfile, you can commit it to your repo and push it to GitHub.
{: .challenge}

> ## Hints
> As you're working, you can test whether the Dockerfile builds successfully using the `docker build` command, e.g.:
> ~~~bash
> docker build -t payload_analysis .
> ~~~
> {: .source}
>
> When your image builds successfully, you can `run` it and poke around to make sure it's set up exactly as you want, and that you can successfully run the executable you built:
> ~~~bash
> docker run -it --rm payload_analysis /bin/bash
> ~~~
> {: .source}
{: .callout}

## Automatic image building with GitLab CI

Now you can proceed with updating your `.gitlab-ci.yml` to actually build the container during the CI/CD pipeline and store it in the gitlab registry. You can later pull it from the gitlab registry just as you would any other container, but in this case using your CERN credentials.

> ## Not from CERN?
> If you do not have a CERN computing account with access to [gitlab.cern.ch](https://gitlab.cern.ch), then everything discussed here is also available on [gitlab.com](https://gitlab.com), which offers CI/CD tools, including the docker builder.
> Furthermore, you can achieve the same with GitHub + GitHub Container Registry.
> To learn more about these methods, see the next subsections.
{: .callout}
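If your `.gitlab-ci.yml` already defines other stages, the only structural change needed before adding the job below is a dedicated `build` stage. A minimal sketch (the `skim` and `fit` stage names here are hypothetical placeholders for whatever stages your pipeline already has):

~~~yaml
stages:
  - build   # new stage for the image build, listed first so it runs before the others
  - skim    # hypothetical existing stage
  - fit     # hypothetical existing stage
~~~
{: .source}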
Add the following lines at the end of the `.gitlab-ci.yml` file to build the image with Kaniko and save it to the docker registry.
For more details about building docker images on CERN's GitLab, see the [Building docker images](https://gitlab.docs.cern.ch/docs/Build%20your%20application/Packages%20&%20Registries/using-gitlab-container-registry#building-docker-images) docs page.

~~~yaml
build_image:
  stage: build
  variables:
    IMAGE_DESTINATION: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
  image:
    # The kaniko debug image is recommended because it has a shell, and a shell is required for an image to be used with GitLab CI/CD.
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # Prepare Kaniko configuration file
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    # Build and push the image from the Dockerfile at the root of the project.
    - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $IMAGE_DESTINATION
    # Print the full registry path of the pushed image
    - echo "Image pushed successfully to ${IMAGE_DESTINATION}"
~~~
{: .source}

Once this is done, you can commit and push the updated `.gitlab-ci.yml` file to your gitlab repo and check to make sure the pipeline passed. If it passed, the repo image built by the pipeline should now be stored in the docker registry, and be accessible as follows:

~~~bash
docker login gitlab-registry.cern.ch
docker pull gitlab-registry.cern.ch/[repo owner's username]/[skimming repo name]:[branch name]-[shortened commit SHA]
~~~
{: .source}

You can also go to the container registry in the GitLab UI to see all the images you've built:

[Image: ContainerRegistry]

Note what the `build_image` job does: it first writes the registry credentials that GitLab provides as predefined CI variables into a Kaniko config file, and then runs the Kaniko executor, which builds the image from the repo's Dockerfile and pushes it to the registry under the name given in `IMAGE_DESTINATION`.

> ## Recommended Tag Structure
> You'll notice the environment variable `IMAGE_DESTINATION` in the `.gitlab-ci.yml` script above. This controls the name of the Docker image that is produced in the CI step. Here, the image is tagged as `[branch name]-[shortened commit SHA]`. The shortened 8-character commit SHA ensures that each image created from a different commit will be unique, and you can easily go back and find images from previous commits for debugging, etc.
>
> As you'll see tomorrow, it's recommended when using your images as part of a REANA workflow to make a unique image for each gitlab commit, because REANA will only attempt to update an image that it's already pulled if it sees that there's a new tag associated with the image.
>
> If you feel it's overkill for your specific use case to save a unique image for every commit, the `-$CI_COMMIT_SHORT_SHA` can be removed. Then the `$CI_COMMIT_REF_SLUG` will at least ensure that images built from different branches will not overwrite each other, and tagged commits will correspond to tagged images.
{: .callout}
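To make the tag structure concrete, here is a sketch with made-up values (the username, repo name, and commit SHA below are hypothetical):

~~~bash
# Suppose the pipeline runs for user "jdoe", project "my-skimming-repo",
# on branch "master" at a commit starting with 1a2b3c4d. Then:
#   $CI_REGISTRY_IMAGE    -> gitlab-registry.cern.ch/jdoe/my-skimming-repo
#   $CI_COMMIT_REF_SLUG   -> master
#   $CI_COMMIT_SHORT_SHA  -> 1a2b3c4d
# and the image produced by the pipeline could be pulled as:
docker pull gitlab-registry.cern.ch/jdoe/my-skimming-repo:master-1a2b3c4d
~~~
{: .source}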
### Alternative: GitLab.com

This training module is rather CERN-centric and assumes you have a CERN computing account with access to [gitlab.cern.ch](https://gitlab.cern.ch). If this is not the case, then as with the [CICD training module](https://hsf-training.github.io/hsf-training-cicd/), everything can be carried out using [gitlab.com](https://gitlab.com) with a few slight modifications.
In particular, you will have to specify that the pipeline job that builds the image is executed on a special type of runner with the appropriate `services`. However, unlike at CERN, you can use the docker commands that you have seen in the previous episodes to build and push the docker images.

Add the following lines at the end of the `.gitlab-ci.yml` file to build the image and save it to the docker registry.

~~~yaml
build_image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  variables:
    IMAGE_DESTINATION: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
  script:
    - docker build -t $IMAGE_DESTINATION .
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push $IMAGE_DESTINATION
~~~
{: .source}

In this job, the specific `image: docker:latest`, along with the `docker:dind` entry under `services`, is needed to be able to execute docker commands. If you are curious to read about this in detail, refer to the [official gitlab documentation](https://docs.gitlab.com/ee/ci/docker/using_docker_build.html) or [this example](https://gitlab.com/gitlab-examples/docker).

In the `script` of this job there are three components:
 - [`docker build`](https://docs.docker.com/engine/reference/commandline/build/): performs the same build of our docker image as before, tagging it with the `[registry path]:[branch name]-[shortened commit SHA]` name stored in `IMAGE_DESTINATION`.
 - [`docker login`](https://docs.docker.com/engine/reference/commandline/login/): performs [an authentication of the user to the gitlab registry](https://docs.gitlab.com/ee/user/packages/container_registry/#authenticating-to-the-gitlab-container-registry) using a set of [predefined environment variables](https://docs.gitlab.com/ee/ci/variables/predefined_variables.html) that are automatically available in any gitlab repository.
 - [`docker push`](https://docs.docker.com/engine/reference/commandline/push/): pushes the docker image which exists locally on the runner to the gitlab.com registry associated with the repository against which we have performed the authentication in the previous step.

If the job runs successfully, then in the same way as described for [gitlab.cern.ch](https://gitlab.cern.ch) in the previous section, you will be able to find the `Container Registry` in the left-hand menu of your gitlab.com project and navigate to the image that was pushed to the registry. And voilà, that's it, exactly as at CERN!

You can also build Docker images on [github.com](https://github.com) and push them to the GitHub Container Registry ([ghcr.io](https://ghcr.io)) with the help of [GitHub Actions](https://github.com/features/actions).
The bonus episode [Building and deploying a Docker container to Github Packages](/hsf-training-docker/12-bonus/index.html) explains how to do so.

> ## Tag your docker image
> Notice that the image names above include a tag after the colon. A tag uniquely identifies a docker image and is usually used to identify different versions of the same image. The tag name has to be written with ASCII symbols.
{: .callout}
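If you want to experiment with tags locally, `docker tag` simply attaches an additional name to an image you already have. For example, using the `payload_analysis` image built in the Hints box earlier (the `v1` label is an arbitrary choice for this sketch):

~~~bash
# Give the locally built image an explicit version tag
# (both names point at the same image ID)
docker tag payload_analysis payload_analysis:v1

# List the image to see its tags
docker images payload_analysis
~~~
{: .source}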
## An updated version of `skim.sh`

> ## Exercise (10 min)
> Since we're now taking care of building the skimming executable during image building, let's make an updated version of `skim.sh` that excludes the step of building the `skim` executable.
>
> The updated script should just directly run the pre-existing `skim` executable on the input samples. You could call it e.g. `skim_prebuilt.sh`. We'll be using this updated script in an exercise later on, in which we'll be going through the full analysis in containers launched from the images.
>
> Once you're happy with the script, you can commit and push it to the repo.
>
> > ## Solution
> > ~~~bash
> > #!/bin/bash
> >
> > INPUT_DIR=$1
> > OUTPUT_DIR=$2
> >
> > # Sanitize the input path, XRootD breaks if we accidentally double a slash
> > if [ "${INPUT_DIR: -1}" = "/" ];
> > then
> >   INPUT_DIR=${INPUT_DIR::-1}
> > fi
> >
> > # Skim samples
> > while IFS=, read -r SAMPLE XSEC
> > do
> >   echo ">>> Skim sample ${SAMPLE}"
> >   INPUT=${INPUT_DIR}/${SAMPLE}.root
> >   OUTPUT=${OUTPUT_DIR}/${SAMPLE}Skim.root
> >   LUMI=11467.0 # Integrated luminosity of the unscaled dataset
> >   SCALE=0.1 # Same fraction as used to down-size the analysis
> >   ./skim $INPUT $OUTPUT $XSEC $LUMI $SCALE
> > done < skim.csv
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

{% include links.md %}

diff --git a/_episodes/09-containerized-analysis.md b/_episodes/09-containerized-analysis.md
deleted file mode 100755
index ce24dfa..0000000
--- a/_episodes/09-containerized-analysis.md
+++ /dev/null
@@ -1,144 +0,0 @@
---
title: "Running our Containerized Analysis"
teaching: 10
exercises: 35
questions:
- "How do I run my full analysis chain inside docker containers?"
objectives:
- "Try running your entire analysis workflow in containerized environments."
- "Gain an appreciation for the convenience of automating containerized workflows."
keypoints:
- "Containerized analysis environments allow for fully reproducible code testing and development, with the convenience of working on your local machine."
- "Fortunately, there are tools to help you automate all of this."
---

## Introduction

To bring it all together, we can also preserve our fitting framework in its own docker image, then run our full analysis workflow within these containerized environments.

## Preserve the Fitting Repo Environment

> ## Exercise (10 min)
> Just as we did for the analysis repo, `cd` into your repo containing your statistical fitting code and create a Dockerfile to preserve the environment. You can again start from the `rootproject/root:6.22.06-conda` base image.
>
> **Note:** Since the fitting code just runs a python script, there's no need to pre-compile any executables in this Dockerfile. It's sufficient to add the source code to the base image and make the directory containing the code your default working directory.
>
> Once you're happy with the Dockerfile, commit and push the new file to the fitting repo.
>
> **Note:** Since we're now moving between repos, you can quickly double-check that you're in the desired repo using e.g. `git remote -v`.
> > ## Solution
> > ~~~dockerfile
> > FROM rootproject/root:6.22.06-conda
> > COPY . /fit
> > WORKDIR /fit
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

> ## Exercise (5 min)
> Now, add the same image-building job to the `.gitlab-ci.yml` of the fitting repo as we added for the skimming repo.
>
> **Note:** I would suggest listing the `- build` stage before the other stages so it will run first.
> This way, even if the other stages fail for whatever reason, the image can still be built by the `- build` stage.
>
> Once you're happy with the `.gitlab-ci.yml`, commit and push the new file to the fitting repo.
> > ## Solution
> > ~~~yaml
> > stages:
> >   - build
> >   - [... any other stages]
> >
> > build_image:
> >   stage: build
> >   variables:
> >     IMAGE_DESTINATION: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
> >   image:
> >     # The kaniko debug image is recommended because it has a shell, and a shell is required for an image to be used with GitLab CI/CD.
> >     name: gcr.io/kaniko-project/executor:debug
> >     entrypoint: [""]
> >   script:
> >     # Prepare Kaniko configuration file
> >     - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
> >     # Build and push the image from the Dockerfile at the root of the project.
> >     - /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile --destination $IMAGE_DESTINATION
> >     # Print the full registry path of the pushed image
> >     - echo "Image pushed successfully to ${IMAGE_DESTINATION}"
> >
> > [... rest of .gitlab-ci.yml]
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

If the image-building completes successfully, you should be able to pull your fitting container, just as you did the skimming container:

~~~bash
docker login gitlab-registry.cern.ch
docker pull gitlab-registry.cern.ch/[repo owner's username]/[fitting repo name]:[branch name]-[shortened commit SHA]
~~~
{: .source}

## Running the Containerized Workflow

Now that we've preserved our full analysis environment in docker images, let's try running the workflow in these containers all the way from input samples to final fit result. To add to the fun, you can try doing the analysis in a friend's containers!

> ## Friend Time Activity (20 min)
>
> ### Part 1: Skimming
> Make a directory, e.g. `containerized_workflow`, from which to do the analysis. `cd` into the directory and make sub-directories to contain the skimming and fitting output:
>
> ~~~bash
> mkdir containerized_workflow
> cd containerized_workflow
> mkdir skimming_output
> mkdir fitting_output
> ~~~
> {: .source}
>
> Find a partner and pull the image they've built for their skimming repo from the gitlab registry. Launch a container using your partner's image. Try to run the analysis code to produce the `histograms.root` file that will be input to the fitting repo, using the `skim_prebuilt.sh` script we created in the previous lesson for the first skimming step. You can follow the skimming instructions in [step 1](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot#step-1-skimming) and [step 2](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot#step-2-histograms) of the README.
>
> **Note:** We'll need to pass the output from the skimming stage to the fitting stage. To enable this, you can volume mount the `skimming_output` directory into the container. Then, as long as you save the skimming output to the volume-mounted location in the container, it will also be available locally under `skimming_output`.
>
> ### Part 2: Fitting
> Now, pull your partner's fitting image and use it to produce the final fit result. Remember to volume-mount the `skimming_output` and `fitting_output` so the container has access to both. At the end, the `fitting_output` directory on your local machine should contain the final fit results.
> You can follow the instructions in [step 4](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot#step-4-fit) of the README.
>
> > ## Solution
> > ### Part 1: Skimming
> > ~~~bash
> > # Pull the image for the skimming repo
> > docker pull [your_partners_username]/[skimming repo image name]:[tag]
> >
> > # Start up the container and volume-mount the skimming_output directory into it
> > docker run --rm -it -v ${PWD}/skimming_output:/skimming_output [your_partners_username]/[skimming repo image name]:[tag] /bin/bash
> >
> > # Run the skimming code
> > bash skim_prebuilt.sh root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/ /skimming_output
> > bash histograms.sh /skimming_output /skimming_output
> > ~~~
> > {: .source}
> >
> > ### Part 2: Fitting
> > ~~~bash
> > # Pull the image for the fitting repo
> > docker pull [your_partners_username]/[fitting repo image name]:[tag]
> >
> > # Start up the container and volume-mount the skimming_output and fitting_output directories into it
> > docker run --rm -it -v ${PWD}/skimming_output:/skimming_output -v ${PWD}/fitting_output:/fitting_output [your_partners_username]/[fitting repo image name]:[tag] /bin/bash
> >
> > # Run the fitting code
> > bash fit.sh /skimming_output/histograms.root /fitting_output
> > ~~~
> > {: .source}
> {: .solution}
{: .testimonial}

> ## Containerized Workflow Automation
> At this point, you may already have come to appreciate that it could get a bit tedious having to manually start up the containers and keep track of the mounted volumes every time you want to develop and test your containerized workflow. It would be pretty nice to have something to automate all of this.
>
> [Image: BeachBoys]
>
> Fortunately, containerized workflow automation tools such as [yadage](https://yadage.github.io/tutorial/) have been developed to do exactly this. Yadage was developed by Lukas Heinrich specifically for HEP applications, and is now used widely in ATLAS for designing re-interpretable analyses.
{: .callout}

{% include links.md %}

From d9003e755cac801682979603f64fb527dbb976e9 Mon Sep 17 00:00:00 2001
From: vlukashenko
Date: Fri, 23 Feb 2024 14:16:00 +0100
Subject: [PATCH 2/2] should remove deprecated lessons
---
 _episodes/depricated-08-gitlab-ci.md              | 253 ++++++++++++++++++
 .../depricated-09-containerized-analysis.md       | 135 ++++++++++
 2 files changed, 388 insertions(+)
 create mode 100644 _episodes/depricated-08-gitlab-ci.md
 create mode 100755 _episodes/depricated-09-containerized-analysis.md

diff --git a/_episodes/depricated-08-gitlab-ci.md b/_episodes/depricated-08-gitlab-ci.md
new file mode 100644
index 0000000..3b7574f
--- /dev/null
+++ b/_episodes/depricated-08-gitlab-ci.md
@@ -0,0 +1,253 @@
---
title: "GitLab CI for Automated Environment Preservation"
teaching: 20
exercises: 25
questions:
- "How can GitLab CI and docker work together to automatically preserve my analysis environment?"
- "What do I need to add to my gitlab repo(s) to enable this automated environment preservation?"
objectives:
- "Learn how to write a Dockerfile to containerize your analysis code and environment."
- "Understand what needs to be added to your `.gitlab-ci.yml` file to keep the containerized environment continuously up to date for your repo."
keypoints:
- "GitLab CI allows you to rebuild a container that encapsulates the environment each time new commits are pushed to the analysis repo."
+- "This functionality is enabled by adding a Dockerfile to your repo that specifies how to build the environment, and an image-building stage to the `.gitlab-ci.yml` file." +--- + + +## Introduction +In this section, we learn how to combine the forces of docker and gitlab CI to automatically keep your analysis environment up-to-date. This is accomplished by adding an extra stage to the CI pipeline for each analysis repo, which builds a container image that includes all aspects of the environment needed to run the code. + +We will be doing this using the [CMS OpenData HTauTau Analysis Payload](https://hsf-training.github.io/hsf-training-cms-analysis-webpage/). Specifically, we will be using two "snapshots" of this code which are the repositories described on the [setup page](https://hsf-training.github.io/hsf-training-docker/setup.html) of this training. A walkthrough of how to setup those repositories can also be found [on this video](https://www.youtube.com/watch?v=krsBupoxoNI&list=PLKZ9c4ONm-VnqD5oN2_8tXO0Yb1H_s0sj&index=7). The "snapshot" repositories are available on GitHub ([skimmer repository](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot) and [statistics repository](https://github.com/hsf-training/hsf-training-cms-analysis-snapshot-stats) ). If you don't already have this setup, take a detour now and watch that video and revisit the setup page. + + +### Writing your Dockerfile + +The goal of automated environment preservation is to create a docker image that you can **immediately** start executing your analysis code inside upon startup. Let's review the needed components for this. + + * Set up the OS, system libraries, and other dependencies that your code depends on, + * Add your analysis code to the container, and + * Build the code so that it can just be executed trivially inside the container. + +As we've seen, all these components can be encoded in a Dockerfile. So the first step to set up automated image building is to add a Dockerfile to the repo specifying these components. + +> ## The `rootproject/root` docker image +> In this tutorial, we build our analysis environments on top of the `rootproject/root` base image ([link to project area on docker hub](https://hub.docker.com/r/rootproject/root)) with conda. This image comes with root 6.22 and python 3.7 pre-installed. It also comes with XrootD for downloading files from eos. +> The `rootproject/root` is itself built with a [Dockerfile](https://github.com/root-project/root-docker/blob/6.22.06-conda/conda/Dockerfile), which uses conda to install root and python on top of another base image (`continuumio/miniconda3`). +{: .callout} + +> ## Exercise (15 min) +> Working from your bash shell, `cd` into the top level of the repo you use for skimming, that being the "event selection" snapshot of the CMS HTauTau analysis payload. Create an empty file named `Dockerfile`. +> +> ~~~bash +> touch Dockerfile +> ~~~ +> {: .source} +> +> Now open the Dockerfile with a text editor and, starting with the following skeleton, fill in the FIXMEs to make a Dockerfile that fully specifies your analysis environment in this repo. 
+> +> ~~~yaml +> # Start from the rootproject/root:6.22.06-conda base image +> [FIXME] +> +> # Put the current repo (the one in which this Dockerfile resides) in the /analysis/skim directory +> # Note that this directory is created on the fly and does not need to reside in the repo already +> [FIXME] +> +> # Make /analysis/skim the default working directory (again, it will create the directory if it doesn't already exist) +> [FIXME] +> +> # Compile an executable named 'skim' from the skim.cxx source file +> RUN echo ">>> Compile skimming executable ..." && \ +> COMPILER=[FIXME] && \ +> FLAGS=[FIXME] && \ +> [FIXME] +> ~~~ +> {: .source} +> +> > ## Solution +> > ~~~yaml +> > # Start from the rootproject/root base image with conda +> > FROM rootproject/root:6.22.06-conda +> > +> > # Put the current repo (the one in which this Dockerfile resides) in the /analysis/skim directory +> > # Note that this directory is created on the fly and does not need to reside in the repo already +> > COPY . /analysis/skim +> > +> > # Make /analysis/skim the default working directory (again, it will create the directory if it doesn't already exist) +> > WORKDIR /analysis/skim +> > +> > # Compile an executable named 'skim' from the skim.cxx source file +> > RUN echo ">>> Compile skimming executable ..." && \ +> > COMPILER=$(root-config --cxx) && \ +> > FLAGS=$(root-config --cflags --libs) && \ +> > $COMPILER -g -std=c++11 -O3 -Wall -Wextra -Wpedantic -o skim skim.cxx $FLAGS +> > ~~~ +> > {: .source} +> {: .solution} +> +> Once you're happy with your Dockerfile, you can commit it to your repo and push it to github. +{: .challenge} + +> ## Hints +> As you're working, you can test whether the Dockerfile builds successfully using the `docker build` command. Eg. +> ~~~bash +> docker build -t payload_analysis . +> ~~~ +> {: .source} +> +> When your image builds successfully, you can `run` it and poke around to make sure it's set up exactly as you want, and that you can successfully run the executable you built: +> ~~~bash +> docker run -it --rm payload_analysis /bin/bash +> ~~~ +> {: .source} +{: .callout} + +## Add docker building to your gitlab CI + +Now, you can proceed with updating your `.gitlab-ci.yml` to actually build the container during the CI/CD pipeline and store it in the gitlab registry. You can later pull it from the gitlab registry just as you would any other container, but in this case using your CERN credentials. + +> ## Not from CERN? +> If you do not have a CERN computing account with access to [gitlab.cern.ch](https://[gitlab.cern.ch), then everything discussed here is also available on [gitlab.com](https://gitlab.com) offers CI/CD tools, including the docker builder. Furthermore, you can do the same with github + dockerhub as explained in the next subsection. +{: .callout} + +Add the following lines at the end of the `.gitlab-ci.yml` file to build the image and save it to the docker registry. + +~~~yaml +build_image: + stage: build + variables: + TO: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA + tags: + - docker-image-build + script: + - ignore +~~~ +{: .source} + + + +Once this is done, you can commit and push the updated `.gitlab-ci.yml` file to your gitlab repo and check to make sure the pipeline passed. 
If it passed, the repo image built by the pipeline should now be stored in the docker registry, and be accessible as follows:

~~~bash
docker login gitlab-registry.cern.ch
docker pull gitlab-registry.cern.ch/[repo owner's username]/[skimming repo name]:[branch name]-[shortened commit SHA]
~~~
{: .source}

You can also go to the container registry in the GitLab UI to see all the images you've built:

[Image: ContainerRegistry]

Notice that the script to run is just a dummy 'ignore' command. This is because, when using the `docker-image-build` tag, the jobs always land on special runners managed by CERN IT, which run a custom script in the background. You can safely ignore the details.

> ## Recommended Tag Structure
> You'll notice the environment variable `TO` in the `.gitlab-ci.yml` script above. This controls the name of the Docker image that is produced in the CI step. Here, the image is tagged as `[branch name]-[shortened commit SHA]`. The shortened 8-character commit SHA ensures that each image created from a different commit will be unique, and you can easily go back and find images from previous commits for debugging, etc.
>
> As you'll see tomorrow, it's recommended when using your images as part of a REANA workflow to make a unique image for each gitlab commit, because REANA will only attempt to update an image that it's already pulled if it sees that there's a new tag associated with the image.
>
> If you feel it's overkill for your specific use case to save a unique image for every commit, the `-$CI_COMMIT_SHORT_SHA` can be removed. Then the `$CI_COMMIT_REF_SLUG` will at least ensure that images built from different branches will not overwrite each other, and tagged commits will correspond to tagged images.
{: .callout}

### Alternative: GitLab.com

This training module is rather CERN-centric and assumes you have a CERN computing account with access to [gitlab.cern.ch](https://gitlab.cern.ch). If this is not the case, then as with the [CICD training module](https://hsf-training.github.io/hsf-training-cicd/), everything can be carried out using [gitlab.com](https://gitlab.com) with a few slight modifications. These changes largely concern the syntax; the concept remains that you will have to specify that the pipeline job that builds the image is executed on a special type of runner with the appropriate `services`. However, unlike at CERN, there is no pre-defined `script` that runs on these runners and pushes to your registry, so you will have to write this script yourself, but it involves little more than the commands you have already seen in previous sections of this training, like `docker build`.

Add the following lines at the end of the `.gitlab-ci.yml` file to build the image and save it to the docker registry.

~~~yaml
build image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker build -t registry.gitlab.com/burakh/docker-training .
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push registry.gitlab.com/burakh/docker-training
~~~
{: .source}

In this job, the specific `image: docker:latest`, along with the `docker:dind` entry under `services`, is equivalent to requesting the `docker-image-build` tag on [gitlab.cern.ch](https://gitlab.cern.ch). If you are curious to read about this in detail, refer to the [official gitlab documentation](https://docs.gitlab.com/ee/ci/docker/using_docker_build.html) or [this example](https://gitlab.com/gitlab-examples/docker).

In the `script` of this job there are three components:
 - [`docker build`](https://docs.docker.com/engine/reference/commandline/build/): performs the same build of our docker image as before, to the tagged image which we here call `registry.gitlab.com/burakh/docker-training`.
 - [`docker login`](https://docs.docker.com/engine/reference/commandline/login/): performs [an authentication of the user to the gitlab registry](https://docs.gitlab.com/ee/user/packages/container_registry/#authenticating-to-the-gitlab-container-registry) using a set of [predefined environment variables](https://docs.gitlab.com/ee/ci/variables/predefined_variables.html) that are automatically available in any gitlab repository.
 - [`docker push`](https://docs.docker.com/engine/reference/commandline/push/): pushes the docker image which exists locally on the runner to the gitlab.com registry associated with the repository against which we have performed the authentication in the previous step.

If the job runs successfully, then in the same way as described for [gitlab.cern.ch](https://gitlab.cern.ch) in the previous section, you will be able to find the `Container Registry` in the left-hand menu of your gitlab.com project and navigate to the image that was pushed to the registry. And voilà, that's it, exactly as at CERN!
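The job above hard-codes the image name `registry.gitlab.com/burakh/docker-training`. A sketch of a more reusable variant, which uses the same predefined GitLab CI variables discussed above so the image always lands in the registry of whichever project runs the pipeline, could look like this:

~~~yaml
build image:
  stage: build
  image: docker:latest
  services:
    - docker:dind
  variables:
    # Name the image after this project's own registry path, branch and commit
    IMAGE_DESTINATION: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
  script:
    - docker build -t $IMAGE_DESTINATION .
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker push $IMAGE_DESTINATION
~~~
{: .source}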
### Alternative: Automatic image building with GitHub + Docker Hub

If you don't have access to [gitlab.cern.ch](https://gitlab.cern.ch), you can still automatically build a docker image every time you push to a repository, using GitHub and Docker Hub.

1. Create a clone of the skim and the fitting repository on your private github. You can use the [GitHub Importer](https://docs.github.com/en/github/importing-your-projects-to-github/importing-a-repository-with-github-importer) for this. It's up to you whether you want to make this repository public or private.
2. Create a free account on [dockerhub](http://hub.docker.com/).
3. Once you have confirmed your email, head to ``Settings`` > ``Linked Accounts`` and connect your github account.
4. Go back to the home screen (click the dockerhub icon top left) and click ``Create Repository``.
5. Choose a name of your liking, then click on the github icon in the ``Build settings``. Select your account name as organization and select your repository.
6. Click on the ``+`` next to ``Build rules``. The default one does fine.
7. Click ``Create & Build``.

That's it! Back on the home screen your repository should appear. Click on it and select the ``Builds`` tab to watch your image getting built (it will probably take a couple of minutes before this starts). If something goes wrong, check the logs.

[Image: DockerHub]

Once the build is completed, you can pull your image in the usual way:

~~~bash
# If you made your docker repository private, you first need to login,
# else you can skip the following line
docker login
# Now pull
docker pull [your dockerhub username]/[image name]:[tag]
~~~
{: .source}

## An updated version of `skim.sh`

> ## Exercise (10 min)
> Since we're now taking care of building the skimming executable during image building, let's make an updated version of `skim.sh` that excludes the step of building the `skim` executable.
>
> The updated script should just directly run the pre-existing `skim` executable on the input samples. You could call it e.g. `skim_prebuilt.sh`.
> We'll be using this updated script in an exercise later on, in which we'll be going through the full analysis in containers launched from the images we create with GitLab CI.
>
> Once you're happy with the script, you can commit and push it to the repo.
>
> > ## Solution
> > ~~~bash
> > #!/bin/bash
> >
> > INPUT_DIR=$1
> > OUTPUT_DIR=$2
> >
> > # Sanitize the input path, XRootD breaks if we accidentally double a slash
> > if [ "${INPUT_DIR: -1}" = "/" ];
> > then
> >   INPUT_DIR=${INPUT_DIR::-1}
> > fi
> >
> > # Skim samples
> > while IFS=, read -r SAMPLE XSEC
> > do
> >   echo ">>> Skim sample ${SAMPLE}"
> >   INPUT=${INPUT_DIR}/${SAMPLE}.root
> >   OUTPUT=${OUTPUT_DIR}/${SAMPLE}Skim.root
> >   LUMI=11467.0 # Integrated luminosity of the unscaled dataset
> >   SCALE=0.1 # Same fraction as used to down-size the analysis
> >   ./skim $INPUT $OUTPUT $XSEC $LUMI $SCALE
> > done < skim.csv
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

{% include links.md %}

diff --git a/_episodes/depricated-09-containerized-analysis.md b/_episodes/depricated-09-containerized-analysis.md
new file mode 100755
index 0000000..7e55b6b
--- /dev/null
+++ b/_episodes/depricated-09-containerized-analysis.md
@@ -0,0 +1,135 @@
---
title: "Running our Containerized Analysis"
teaching: 10
exercises: 35
questions:
- "How do I run my full analysis chain inside docker containers?"
objectives:
- "Try running your entire analysis workflow in containerized environments."
- "Gain an appreciation for the convenience of automating containerized workflows."
keypoints:
- "Containerized analysis environments allow for fully reproducible code testing and development, with the convenience of working on your local machine."
- "Fortunately, there are tools to help you automate all of this."
---

## Introduction

To bring it all together, we can also preserve our fitting framework in its own docker image, then run our full analysis workflow within these containerized environments.

## Preserve the Fitting Repo Environment

> ## Exercise (10 min)
> Just as we did for the analysis repo, `cd` into your repo containing your statistical fitting code and create a Dockerfile to preserve the environment. You can again start from the `rootproject/root:6.22.06-conda` base image.
>
> **Note:** Since the fitting code just runs a python script, there's no need to pre-compile any executables in this Dockerfile. It's sufficient to add the source code to the base image and make the directory containing the code your default working directory.
>
> Once you're happy with the Dockerfile, commit and push the new file to the fitting repo.
>
> **Note:** Since we're now moving between repos, you can quickly double-check that you're in the desired repo using e.g. `git remote -v`.
> > ## Solution
> > ~~~dockerfile
> > FROM rootproject/root:6.22.06-conda
> > COPY . /fit
> > WORKDIR /fit
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

> ## Exercise (5 min)
> Now, add the same image-building stage to the `.gitlab-ci.yml` file as we added for the skimming repo. You will also need to add a `- build` stage at the top, in addition to any other stages.
>
> **Note:** I would suggest listing the `- build` stage before the other stages so it will run first. This way, even if the other stages fail for whatever reason, the image can still be built by the `- build` stage.
>
> Once you're happy with the `.gitlab-ci.yml`, commit and push the new file to the fitting repo.
> > ## Solution
> > ~~~yaml
> > stages:
> >   - build
> >   - [... any other stages]
> >
> > build_image:
> >   stage: build
> >   variables:
> >     TO: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
> >   tags:
> >     - docker-image-build
> >   script:
> >     - ignore
> >
> > [... rest of .gitlab-ci.yml]
> > ~~~
> > {: .source}
> {: .solution}
{: .challenge}

If the image-building completes successfully, you should be able to pull your fitting container, just as you did the skimming container:

~~~bash
docker login gitlab-registry.cern.ch
docker pull gitlab-registry.cern.ch/[repo owner's username]/[fitting repo name]:[branch name]-[shortened commit SHA]
~~~
{: .source}

## Running the Containerized Workflow

Now that we've preserved our full analysis environment in docker images, let's try running the workflow in these containers all the way from input samples to final fit result. To add to the fun, you can try doing the analysis in a friend's containers!

> ## Friend Time Activity (20 min)
>
> ### Part 1: Skimming
> Make a directory, e.g. `containerized_workflow`, from which to do the analysis. `cd` into the directory and make sub-directories to contain the skimming and fitting output:
>
> ~~~bash
> mkdir containerized_workflow
> cd containerized_workflow
> mkdir skimming_output
> mkdir fitting_output
> ~~~
> {: .source}
>
> Find a partner and pull the image they've built for their skimming repo from the gitlab registry. Launch a container using your partner's image. Try to run the analysis code to produce the `histograms.root` file that will be input to the fitting repo, using the `skim_prebuilt.sh` script we created in the previous lesson for the first skimming step. You can follow the skimming instructions in [step 1](https://gitlab.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage2/blob/master/README.md#step-1-skimming) and [step 2](https://gitlab.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage2/blob/master/README.md#step-2-histograms) of the README.
>
> **Note:** We'll need to pass the output from the skimming stage to the fitting stage. To enable this, you can volume mount the `skimming_output` directory into the container. Then, as long as you save the skimming output to the volume-mounted location in the container, it will also be available locally under `skimming_output`.
>
> ### Part 2: Fitting
> Now, pull your partner's fitting image and use it to produce the final fit result. Remember to volume-mount the `skimming_output` and `fitting_output` so the container has access to both. At the end, the `fitting_output` directory on your local machine should contain the final fit results. You can follow the instructions in [step 4](https://gitlab.cern.ch/awesome-workshop/awesome-analysis-eventselection-stage2/blob/master/README.md#step-4-fit) of the README.
> > ## Solution
> > ### Part 1: Skimming
> > ~~~bash
> > # Pull the image for the skimming repo
> > docker pull gitlab-registry.cern.ch/[your_partners_username]/[skimming repo name]:[branch name]-[shortened commit SHA]
> >
> > # Start up the container and volume-mount the skimming_output directory into it
> > docker run --rm -it -v ${PWD}/skimming_output:/skimming_output gitlab-registry.cern.ch/[your_partners_username]/[skimming repo name]:[branch name]-[shortened commit SHA] /bin/bash
> >
> > # Run the skimming code
> > bash skim_prebuilt.sh root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/ /skimming_output
> > bash histograms.sh /skimming_output /skimming_output
> > ~~~
> > {: .source}
> >
> > ### Part 2: Fitting
> > ~~~bash
> > # Pull the image for the fitting repo
> > docker pull gitlab-registry.cern.ch/[your_partners_username]/[fitting repo name]:[branch name]-[shortened commit SHA]
> >
> > # Start up the container and volume-mount the skimming_output and fitting_output directories into it
> > docker run --rm -it -v ${PWD}/skimming_output:/skimming_output -v ${PWD}/fitting_output:/fitting_output gitlab-registry.cern.ch/[your_partners_username]/[fitting repo name]:[branch name]-[shortened commit SHA] /bin/bash
> >
> > # Run the fitting code
> > bash fit.sh /skimming_output/histograms.root /fitting_output
> > ~~~
> > {: .source}
> {: .solution}
{: .testimonial}

> ## Containerized Workflow Automation
> At this point, you may already have come to appreciate that it could get a bit tedious having to manually start up the containers and keep track of the mounted volumes every time you want to develop and test your containerized workflow. It would be pretty nice to have something to automate all of this.
>
> [Image: BeachBoys]
>
> Fortunately, containerized workflow automation tools such as [yadage](https://yadage.github.io/tutorial/) have been developed to do exactly this. Yadage was developed by Lukas Heinrich specifically for HEP applications, and is now used widely in ATLAS for designing re-interpretable analyses.
{: .callout}

{% include links.md %}
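As a final note on the Friend Time activity above, the manual steps from its solution can be collected into a single helper script. This is only a sketch: the two image paths are placeholders for the images you or your partner actually built, and it assumes `skim_prebuilt.sh`, `histograms.sh` and `fit.sh` live in the images' default working directories, as set up in the previous episodes.

~~~bash
#!/bin/bash
# run_containerized_workflow.sh -- chain the skimming and fitting containers end to end.
# Replace the two image paths below with real paths from your GitLab registry.
SKIM_IMAGE="gitlab-registry.cern.ch/[username]/[skimming repo name]:[tag]"
FIT_IMAGE="gitlab-registry.cern.ch/[username]/[fitting repo name]:[tag]"

mkdir -p skimming_output fitting_output

# Steps 1+2: skim the input samples and fill the histograms
docker run --rm -v ${PWD}/skimming_output:/skimming_output "${SKIM_IMAGE}" \
  /bin/bash -c "bash skim_prebuilt.sh root://eospublic.cern.ch//eos/root-eos/HiggsTauTauReduced/ /skimming_output && bash histograms.sh /skimming_output /skimming_output"

# Step 4: run the fit on the combined histograms
docker run --rm -v ${PWD}/skimming_output:/skimming_output -v ${PWD}/fitting_output:/fitting_output "${FIT_IMAGE}" \
  /bin/bash -c "bash fit.sh /skimming_output/histograms.root /fitting_output"

echo "Fit results written to ${PWD}/fitting_output"
~~~
{: .source}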