This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov.
There are a lot of resources available to learn about GCP, which can be overwhelming. NIH Cloud Lab’s goal is to make cloud very easy and accessible for you, so that you can stop wasting time on administrative tasks and focus on your research.
Use this repository to learn about how to use GCP by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with this Jumpstart section. If you already have foundational knowledge of GCP and cloud, feel free to skip ahead to the tutorials section for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis.
- Getting Started
- Overview
- Command Line Tools
- Ingest and Store Data
- Virtual Machines
- Disk Images
- Jupyter Notebooks
- Creating Conda Environments
- Serverless Functionality
- Clusters
- Billing and Benchmarking
- Getting Support
- Additional Training
You can learn a lot about what is possible on GCP from the GCP Getting Started page. There you can find links to documentation for common GCP tools and resources, as well as short videos on various subjects called Cloud Minutes.
You can also view the following 'Essentials' playlists from Google to help you get started.
- Google Cloud Essentials Playlist - This playlist includes shorter videos (less than 10 minutes each) on specific topics useful to novice GCP users, including:
- Machine Learning on Google Cloud
- How to Run Code on Google Cloud
- How to Store Data on Google Cloud
- How to use the Google Cloud Console
- Error Reporting
- Cloud Logging
- Platform overview - Code & build tools
- Cloud Bytes Playlist - This playlist consists of very short videos (less than 2 minutes each) that give a quick overview of GCP products, including:
- Google Kubernetes Engine in a minute
- Compute Engine in a minute
- Cloud Storage in a minute
- BigQuery ML in a minute
- Public Datasets in a minute
Even with a wealth of resources, it can be difficult to know where to start learning how to use the cloud. To help you, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore migrating your research to the cloud.
There are three primary ways you can run analyses on GCP: virtual machines, Jupyter Notebook instances, and serverless services. We give a brief overview of each here and go into more detail in the sections below. Virtual machines are like your desktop computer, but you access them through the cloud console and you decide what resources that computer has, such as CPU and memory. In GCP, the service that hosts these virtual machines is called Compute Engine. You access VMs via SSH (secure remote connections), either through the console or via the command line. Jupyter Notebook instances are virtual machines with JupyterLab preloaded onto them. On GCP these are run through Vertex AI. You decide what kind of virtual machine you want to 'spin up', and then you can run Jupyter notebooks on that virtual machine. You access these notebooks through the console, similar to the way you interact with Jupyter locally. Finally, serverless services are services that allow you to run things (an analysis, an app, a website) without having to manage your own servers (VMs). There are still servers running somewhere; you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most common serverless feature you will work with here is the Life Sciences API. Typically, these workflows are run from the command line, either from a VM, Cloud Shell, or your local terminal.
One other task that will enable everything below is installing and configuring the Google Cloud SDK command line tools, which allow you to interact with instances and Cloud Storage buckets from your local terminal. Command line interface (CLI) tools are those you use directly in a terminal/shell, as opposed to clicking within a graphical user interface (UI). Instructions for installing the CLI can be found here. Along the same lines, it is important to familiarize yourself with the two main CLI commands: gcloud and gsutil. There are other commands you may come across in some circumstances, such as kubectl. If you have trouble installing the CLI on your local computer, you can still use the same commands from a virtual machine or from Cloud Shell, a terminal environment available to users in the GCP console.
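As a quick reference, here is a minimal sketch of what first-time setup might look like once the SDK is installed (the project ID is a placeholder you would replace with your own):

```bash
# Authenticate and walk through initial configuration (opens a browser window)
gcloud init
gcloud auth login

# Point the CLI at your Cloud Lab project (replace with your own project ID)
gcloud config set project my-cloudlab-project

# Sanity checks: list your VMs and your Cloud Storage buckets
gcloud compute instances list
gsutil ls
```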
Data can be stored in two places on the cloud: in a cloud storage bucket, which on GCP is called Google Cloud Storage (GCS), or on an instance, which uses a persistent disk for block storage. In general, you want to keep your compute and storage separate, so you should aim to store data in GCS, copy only the data you need to a particular instance to run an analysis, and then copy the results back to GCS. In addition, the data on an instance is only available while the instance is running, whereas the data in GCS is always available. Here is a great tutorial on how to use GCS; it is worth going through to learn how it all works.
We also wanted to give you a few other tips that may be helpful when it comes to moving and storing data. If your end goal is to move data to a GCS bucket, you can do that using the UI by clicking the `Upload Files` button from within a bucket, or you can use the command line by typing `gsutil cp <FILE> <gs://BUCKET>`. Of course, you need to first create the bucket, which you can do using the instructions in the tutorial linked above. If you want to move a whole folder, use the recursive flag: `gsutil cp -r <DIR> <gs://BUCKET>`. The same applies whether you are moving data from your local directory or from a VM. Likewise, you can move data from GCS back to your local machine or your VM with `gsutil cp <gs://BUCKET/FILE> <DESTINATION/PATH>`. To multithread a gsutil action, use the `-m` flag, for example: `gsutil -m cp -r <DIR> <gs://BUCKET>`. Finally, if you are trying to move data from the Sequence Read Archive (SRA) to an instance or to GCS, you can use `fasterq-dump` from the SRA toolkit. The best approach is usually to install the toolkit on an instance, download the data to the instance's local disk, and then optionally copy it to GCS for backup or use elsewhere. See our notebook for an example.
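To tie these commands together, here is a sketch of a typical round trip; the bucket name, file and directory names, and SRA accession below are all placeholders:

```bash
# Create a bucket (bucket names must be globally unique)
gsutil mb gs://my-cloudlab-bucket

# Copy a single file, then a whole directory (recursively and multithreaded), to the bucket
gsutil cp results.csv gs://my-cloudlab-bucket/
gsutil -m cp -r my_results_dir gs://my-cloudlab-bucket/

# Copy results back down to a VM or your local machine
gsutil cp gs://my-cloudlab-bucket/results.csv ./

# Pull a run from the SRA onto the instance with the SRA toolkit, then back it up to GCS
fasterq-dump SRR000001 --outdir ./fastq
gsutil -m cp -r ./fastq gs://my-cloudlab-bucket/fastq/
```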
Another important aspect of managing data storage costs is to be strategic about storing data in GCS vs. on your instances. When you spin up a VM, you have already paid for the storage on the VM because you pay for the size of its disk, whereas bucket storage is charged based on how much data you put into GCS. This is something to think about when copying results files back to GCS, for example. If they are not files you will need later, leave them on the VM's local storage and save your money for the more important data you put in GCS. Just make sure you are always either backing up the VM by creating a machine image (see below) or keeping the data you can't live without in cloud storage.
Google and other sources have a lot of great resources on how to spin up and use a VM. The first place we will point you is the NIH Common Data Fund resource, which lays out how to spin up a VM, SSH into it, make a bucket, and move data around, similar to what we did in the example notebooks above. One thing worth noting is that the NIH tutorial has you SSH into your instance using a gcloud command in the shell. This works well, but it is a lot easier to just click the SSH button next to the instance name on the Compute Engine instances page. You can find the GCP-specific documentation on how to spin up an instance here. If you want to start a Windows VM, read the documentation. Windows VMs can have extra costs associated with them, so make sure you understand what you are getting into before starting one. We encourage you to follow our auto-shutdown instructions to prevent leaving machines running.
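If you prefer the command line to the console, a VM can also be created and accessed with gcloud. This is only a sketch; the instance name, zone, machine type, and image family are examples you would adjust:

```bash
# Create a small Debian VM (adjust the machine type, zone, and image to your needs)
gcloud compute instances create my-test-vm \
    --zone=us-east4-a \
    --machine-type=e2-standard-4 \
    --image-family=debian-11 \
    --image-project=debian-cloud

# SSH into the VM (the command-line equivalent of the console's SSH button)
gcloud compute ssh my-test-vm --zone=us-east4-a
```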
Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to reuse data or installed programs on your next VM instead of reinventing the wheel. One solution is to use disk (or machine) images, where you copy your existing virtual disk to a machine image that can serve as a backup or be used to launch a new instance with the programs and data from the previous instance.
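As a sketch of that workflow on the command line (the instance and image names are placeholders):

```bash
# Capture an existing VM (its disks, installed software, and configuration) as a machine image
gcloud compute machine-images create my-backup-image \
    --source-instance=my-test-vm \
    --source-instance-zone=us-east4-a

# Later, launch a new VM from that machine image
gcloud compute instances create my-restored-vm \
    --zone=us-east4-a \
    --source-machine-image=my-backup-image
```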
Jupyter notebooks are web-based, interactive coding environments. On GCP, notebooks are launched through the Vertex AI platform. Here we are going to launch a JupyterLab environment on GCP and then import a custom notebook from this repo to walk through running commands in Vertex AI. Vertex AI is where Google is heading with its machine learning and artificial intelligence workflows. You can read more in the Vertex AI overview and in the technical documentation and tutorials.
To begin, click on the hamburger menu (the three horizontal lines in the top left of your console). Go to `Artificial Intelligence > Vertex AI > Workbench`. Click `New Notebook` and select `R 4.1` for the kernel, although note that you can use a variety of environments including Python, R, PyTorch, TensorFlow, and others. This can also be changed later. Give your notebook a globally unique name (note that in GCP you can only use dashes, not underscores). For region, select the region closest to where you live, or else the region where your cloud storage bucket is located.

Now click the pencil icon next to `Notebook properties`. For operating system select 'Debian 10'; for 'Environment' select your desired environment (this is where you can change it if you selected something different before). Under `Machine configuration > Machine type`, select your machine type. For this tutorial you can get away with using `e2-standard-4`, but you will likely want a more powerful machine for other workflows. Read more about machine families on GCP here, and about the specifics of general purpose machine types within machine families here. You can follow the links in those doc pages for compute-, memory-, or accelerator-optimized machine types as well. You can figure out the cost of your selected machine here. Remember that as long as your notebook is running (and not stopped) you will be charged per second of use. This is especially important to remember for GPU machines, as these will consume your budget quickly. Consider installing an auto-shutdown script to prevent this. Leave all other settings as default and click `Create`.
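If you would rather script this than click through the console, a user-managed notebook can also be created with gcloud. This is only a sketch: the instance name and zone are placeholders, and the image family shown is an assumption, so check which families are currently available before relying on it.

```bash
# List the notebook image families currently published by Google
gcloud compute images list --project deeplearning-platform-release

# Create a user-managed Vertex AI Workbench notebook
# (the --vm-image-family value below is an assumption; pick one from the list above)
gcloud notebooks instances create my-notebook \
    --location=us-east4-a \
    --machine-type=e2-standard-4 \
    --vm-image-project=deeplearning-platform-release \
    --vm-image-family=common-cpu-notebooks
```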
It will take a few minutes for your new notebook environment to spin up, so go brew some coffee and come back. Once the status changes from a blue spinning ball to `OPEN JUPYTERLAB`, your VM is ready. You may need to click `Refresh` at the top of the page to see the status change. That is a good rule of thumb on GCP: if you are waiting on something to spin up, try clicking refresh; it may already be done.
At the time of writing, Google had just rolled out a new feature in Vertex AI called `Managed Notebooks`, which differ from the `User-Managed Notebooks`. You can use either one for this tutorial, but the nice thing about the new `Managed Notebooks` is that you can schedule them, or just execute them, similar to submitting a job to a Slurm cluster. Read the documentation for scheduling a notebook. Note that scheduled notebooks are run on remote compute resources, so you need to treat them like a fresh install: copy your data in, install all packages, etc. Don't expect that because you have data or dependencies in your current environment they will be present when your scheduled notebook runs. Also, when you spin up a managed notebook, make sure you select `Single user only` rather than `Service Account` to avoid permission conflicts. You can also resize the machine on the fly (without shutting down), and there are some extra compute environments available. However, we have observed some strange behavior with conda on these Google-managed notebooks, so if you decide to try them and run into issues with conda, switch back to the User-Managed Notebooks.
Once you have opened your notebook instance, note that on the left side of the page you will see `tutorials`. You can explore these example notebooks to get a feel for the environment and learn some best practices for notebooks. Now, from the base directory, click the git icon on the middle-left bar; it looks a bit like the letter 'T' with a tilt. Click `Clone a Repository`, then paste the following URL into the box:

https://github.com/STRIDES/NIHCloudLabGCP.git

(Alternatively, open a terminal and run `git clone https://github.com/STRIDES/NIHCloudLabGCP.git`.)
Now you have the NIHCloudLabGCP directory available. Navigate to NIHCloudLabGCP > tutorials > notebooks > GWASCoatColor > GWAS_coat_color.ipynb. Explore this notebook and see how data moves in and out of the Vertex AI environment. You can also manually add files, whether notebooks or data, using the up arrow in the top-left navigation menu. You can easily switch between different kernels in the top right. If you had selected Python 3 when starting the instance, you would only have access to Python and would need a different instance to open or create an R notebook; if you start with R, however, you can switch between R and Python. After finishing this notebook, move on to the SRA_and_BigQuery notebook to learn some key GCP skills, such as importing (SRA) data, making a cloud storage bucket and moving data in and out of it, and querying VCF files with BigQuery.
Here are a few tips if you are new to notebooks. The navigation menu in the top left opens the file browser panel, which is the equivalent of your directory structure. The menu bar above the notebook itself controls the notebook options. Most of these are obvious, but a few you will use often are:
- the plus sign to add a cell
- the scissors to cut a cell
- stop to stop a running process
- run a cell with the play button or use shift + enter/return. You can also use CMD + Enter, but it will only run the current cell and not move to the next cell.
It is also worth noting that when you run a cell, sometimes it doesn't produce any output, but things are running in the background. If the brackets next to a cell have an * then it is still running. You can also look at the bottom where the kernel is listed (e.g., Python 3 | status) and it will show either Idle or Busy depending on if anything is running or not.
Virtual environments allow you to manage package versions without having package conflicts. For example, if you needed Python 3 for one analysis, but Python 2.7 for another, you could create separate environments to use the two versions of Python. One of the most popular package managers used for creating virtual environments is the conda package manager.
Mamba is a reimplementation of conda written in C++ that runs much faster than legacy conda. Conda environments are created either from configuration files written in YAML format or by listing the packages to install directly after the conda command. To create a conda environment on a virtual machine, follow this guide. We walk you through generating a generic conda environment on a virtual machine, as well as how to create a custom kernel for a notebook instance.
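As a minimal sketch (the environment name and packages listed are just examples), an environment can be defined in a YAML file and then created with conda or mamba:

```bash
# Install mamba into the base conda environment (optional, but much faster than plain conda)
conda install -n base -c conda-forge mamba

# Write a simple environment file; the packages are placeholders for whatever your analysis needs
cat > environment.yml <<'EOF'
name: my-analysis
channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - samtools
  - pandas
EOF

# Create and activate the environment (swap "mamba" for "conda" if you skipped the mamba install)
mamba env create -f environment.yml
conda activate my-analysis
```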
Serverless services are those that allow you to run things (an analysis, an app, a website, etc.) without having to manage servers (VMs). There are still servers running somewhere; you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most relevant serverless feature on GCP for Cloud Lab users (especially for 'omics' analyses) is the Google Cloud Life Sciences API. You can walk through a tutorial of this service using this notebook. Those doing health informatics should look into the Google Cloud Healthcare Data Engine. You can find a variety of other tutorials that leverage the Life Sciences API for life sciences applications, but we will point out that most of the examples are related to genomics. If you are doing other biomedical research related to imaging, NLP, or other fields, look at the tutorials section of this repo for inspiration; some of these also leverage the API from within the notebooks. Besides the Google-specific examples, you can use the Life Sciences API to run workflows with other workflow managers such as Snakemake or Nextflow. Google has also just released a new service in preview called Batch, a scheduling service that should feel similar to submitting jobs to a cluster. Eventually it will likely replace the Life Sciences API, but for now you should experiment with both services for submitting jobs.
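As a hedged example of what a minimal Life Sciences API submission looks like from the command line (the bucket, region, and container image are placeholders):

```bash
# Submit a simple containerized command to the Life Sciences API, logging to a bucket
gcloud beta lifesciences pipelines run \
    --regions=us-east4 \
    --docker-image=ubuntu:20.04 \
    --command-line='echo "Hello from the Life Sciences API"' \
    --logging=gs://my-cloudlab-bucket/logs/hello.log

# Check on the operation using the ID returned by the command above
gcloud beta lifesciences operations describe OPERATION_ID
```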
One great thing about the cloud is its ability to scale with demand. When you submit a job to a traditional computing cluster (a set of computers that work together to execute a function), you have to specify up front how many CPUs and how much memory to give your job, and you may over- or under-utilize these resources. On the cloud, by contrast, you can leverage a feature called autoscaling, where the compute resources scale up or down with demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (think Black Friday shopping on a website). For most Cloud Lab users, the best way to leverage scaling is to use a service like the Life Sciences API or Google Batch, but in some cases, perhaps for a whole lab group or a large project, it may make sense to spin up a Kubernetes cluster and submit jobs to it using a workflow manager like Snakemake.
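If you do go the Kubernetes route, node autoscaling can be enabled when the cluster is created. This is only a sketch, with placeholder names, zone, and node counts:

```bash
# Create a small GKE cluster whose node pool autoscales between 1 and 4 nodes
gcloud container clusters create my-analysis-cluster \
    --zone=us-east4-a \
    --num-nodes=1 \
    --enable-autoscaling \
    --min-nodes=1 \
    --max-nodes=4

# Delete the cluster when the project is finished to stop incurring charges
gcloud container clusters delete my-analysis-cluster --zone=us-east4-a
```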
Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of time and cost required for a larger scale project. In terms of cost, the best way to estimate costs is to use the GCP cost calculator for an initial figure. The calculator is a tool that estimates costs based on location, VM type/size, and runtime. Then, you can run some benchmarks and double check that everything is acting as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage to run a workflow, then you can extrapolate out how much everything may cost using the calculator (e.g., compute engine + cloud storage).
To get a more precise estimate, you can assign labels to your workflows and then generate a report for a specific label. You can learn how to do that in our docs. Note that it can take up to 24 hours for the billing account to update, so you may need to wait a few hours after running an analysis before you get an accurate report.
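Labels can be attached from the console or the command line; as a sketch (the instance, bucket, and label values are placeholders):

```bash
# Label an existing VM so its costs can be filtered in billing reports
gcloud compute instances update my-test-vm \
    --zone=us-east4-a \
    --update-labels=project=benchmarking,sample=pilot-run

# Labels work on buckets too
gsutil label ch -l project:benchmarking gs://my-cloudlab-bucket
```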
As you go through the tutorials, you can keep costs down by stopping and/or deleting resources (e.g., VMs or buckets) you no longer need. Another strategy is to make sure you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can check whether they are all being utilized using Cloud Monitoring. If you are really only using 8 CPUs, for example, change your machine size to fit the analysis. Finally, you can experiment with Spot instances for running workflows and end up saving a lot of money.
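A few commands that help with these housekeeping tasks, as a sketch with placeholder names (note that a VM must be stopped before its machine type can be changed):

```bash
# Stop a VM you are not actively using (you still pay for its disk, but not for compute)
gcloud compute instances stop my-test-vm --zone=us-east4-a

# Resize an over-provisioned VM to a smaller machine type, then start it again
gcloud compute instances set-machine-type my-test-vm \
    --zone=us-east4-a \
    --machine-type=e2-standard-8
gcloud compute instances start my-test-vm --zone=us-east4-a

# Remove a bucket and its contents once the data is no longer needed
gsutil -m rm -r gs://my-old-bucket
```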
As part of your participation in Cloud Lab you will be added to the Cloud Lab Teams channel, where you can chat with other Cloud Lab users and get support from the Cloud Lab team. NIH Intramural users can submit a support ticket to Service Now. All other users can reach out to the Cloud Lab email with questions at [email protected]. For both tickets and email, please be sure to use a clear subject line, such as 'GCP help with Nextflow and the Life Sciences API'. For issues our team is unable to resolve, you can reach out to GCP support directly by clicking the question mark in the top right part of the console.
This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities, please visit the STRIDES Training page. For more information on the STRIDES Initiative at the NIH, visit our website or contact the NIH STRIDES team at [email protected].