GATK Best Practice on AWS

Summary

The analysis platforms in today's genomics industry can be roughly divided into three categories: single node, server-based cluster (classic HPC cluster), and Docker-based cluster (Kubernetes cluster). Most analysis applications are still built on classic HPC clusters, and many of them run on-premises.

However, on-premises HPC brings the following challenges:

  • Building a cluster takes several months, which is too slow to keep up with the business.
  • It is hard to operate the cluster and achieve high availability.
  • It is hard to implement data life-cycle management.
  • It is difficult to control resources accurately.
  • ...

Beyond these, many other challenges can get in your way. To help overcome them, we have designed a turnkey solution that creates a complete HPC cluster with one click. By replicating the template, life-science users can also build clusters on the cloud while minimizing the cost of local testing, migration, and cluster setup. A safe, reliable, efficient, and low-cost HPC cluster frees developers and operators from trivial chores and lets them focus on more creative work.

The core service of this solution is AWS ParallelCluster, which ships with jobwatcher and checks the SGE, Slurm, or Torque queue every minute to decide when compute nodes need to be elastically scaled in or out. This elasticity alone can bring roughly 30% cost savings.
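
The elasticity is driven by a few keys in the cluster configuration (the full sample config appears in the Quick Start below):

initial_queue_size = 1    # compute nodes started with the cluster
max_queue_size = 4        # upper bound jobwatcher can scale out to
scaledown_idletime = 5    # minutes a node may sit idle before it is terminated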

For more about AWS ParallelCluster, see the AWS ParallelCluster User Guide.


Features

Situations this solution can address:

  • Production and testing environments mixed in one cluster.

  • No disaster recovery (DR).

  • Difficulty estimating the required resources.

  • Cluster operation requires many operators, and there is no version control.

  • Other industry challenges:

    • Project management relies on manual tracking, with no information system or a low degree of informatization.
    • It is hard to calculate the cost of the whole pipeline.
    • Data volumes are huge and complex, making life-cycle management difficult.

Solutions

Cluster Solution

End-to-end Solution

Migration Guide

Quick Start

Cluster

This guide walks you through building a complete cluster, including a master node, compute nodes, shared storage, and a job scheduler (SGE, Slurm, or Torque are supported; the sample config below uses Slurm). The AMI contains GATK-related tools such as BWA, Samtools, GATK4, and so on. The GATK public databases and other testing data are stored in an EBS snapshot, which can be mounted at the /genomes folder once the cluster has been created.

P.S.: This guide is based on the Ningxia (cn-northwest-1) region and requires several permissions in your account (you can use admin access for a first try).

Install awscli & pip

Open the AWS console and click My Security Credentials

Create and save aws_access_key_id and aws_secret_access_key





#Install pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python get-pip.py

#Install awscli
sudo pip install awscli

#Config awscli
#Set AK, SK, region and output type
aws configure 
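
To confirm that the credentials and region were picked up correctly, you can run two read-only checks (both standard awscli commands):

#Verify the configured profile and the identity behind the keys
aws configure list
aws sts get-caller-identity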

Install pcluster

sudo pip install aws-parallelcluster
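
To confirm the installation, check the client version; this guide and the AMI table below target ParallelCluster 2.x:

pcluster version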

Config pcluster

Preparation
  • VPC

Assign or create a VPC and subnet from the VPC console, and record the vpc_id and master_subnet_id (see the CLI sketch after this list)






  • EC2 key pair

Assign or create an EC2 key pair on the console, and record its name (also covered in the sketch below)
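
If you prefer the CLI to the console, the same information can be looked up or created with standard awscli calls; the key-pair name below is only an example:

#List VPCs, then the subnets in the chosen VPC
aws ec2 describe-vpcs --query 'Vpcs[].VpcId'
aws ec2 describe-subnets --filters Name=vpc-id,Values=<vpc_id> --query 'Subnets[].SubnetId'

#Create a key pair and save the private key locally (example name)
aws ec2 create-key-pair --key-name my-pcluster-key --query 'KeyMaterial' --output text > my-pcluster-key.pem
chmod 400 my-pcluster-key.pem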


pcluster config (reference blog)
#Create a config template with the interactive wizard, then adjust it
pcluster configure

#Use vim to edit
vim ~/.parallelcluster/config

#Copy the content below, and paste it into your ~/.parallelcluster/config
[aws]
aws_region_name = cn-northwest-1

[global]
update_check = true
sanity_check = true
cluster_template = GATK-pipeline

[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}

[cluster GATK-pipeline]
base_os = alinux
custom_ami = ami-005db8a58ebd4e9a4 #Modify as needed
vpc_settings = public
scheduler = slurm
key_name = ZHY_key  #Need to modify
compute_instance_type = m5.xlarge
master_instance_type = m5.xlarge
compute_root_volume_size = 50
master_root_volume_size = 50
ebs_settings = genomes
scaling_settings = GATK-ASG
initial_queue_size = 1
max_queue_size = 4
maintain_initial_size = false
extra_json = { "cluster" : { "cfn_scheduler_slots" : "cores" } }

[vpc public]
vpc_id = vpc-a817aac5  #Need to modify
master_subnet_id = subnet-26fcc86cd  #Need to modify

[ebs genomes]
shared_dir = genomes
ebs_snapshot_id = snap-040c71fd2bb5d4236 #Modify as needed
volume_type = gp2
volume_size =  1024

[scaling GATK-ASG]
scaledown_idletime = 5

Launch Cluster

pcluster create GATK-pipeline
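
Cluster creation takes several minutes. With ParallelCluster 2.x you can watch progress, list clusters, and tear everything down when finished:

pcluster status GATK-pipeline   #CloudFormation stack status; shows the master public IP when ready
pcluster list                   #All clusters in the current region
pcluster delete GATK-pipeline   #Delete the cluster when you are done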

SSH to Master Node

#You can find the login info after the cluster has launched
#ssh -i <private key_name> <username>@<public ip>
ssh -i <private key_name> ec2-user@master-public-ip #alinux
ssh -i <private key_name> ubuntu@master-public-ip #ubuntu
ssh -i <private key_name> centos@master-public-ip #centos
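
Alternatively, the ssh alias in the config lets ParallelCluster 2.x fill in the user and address for you; once logged in, you can confirm that the snapshot volume is mounted:

pcluster ssh GATK-pipeline -i <private key_name>

#On the master node, verify the shared volume restored from the snapshot
df -h /genomes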

Submit Sample Task

SGE sample:

echo "sleep 180" | qsub
echo "sh run.sh" | qsub -l vf=2G,s_core=1 -q all.q
for((i=1;i<=10;i++));do echo "sh /genomes/temp/run.sh $i" | qsub -cwd -S /bin/bash -l vf=2G,s_core=1 -q all.q;done

SLURM sample:

sbatch -n 4 run.sh  #4 physical cores
squeue
sinfo
scancel jobid

PBS sample:

echo "sleep 180" | qsub
echo "sh run.sh" | qsub -l nodes=1,walltime=2:00:00,mem=2gb -q batch
for((i=1;i<=10;i++));do echo "sh /genomes/temp/run.sh $i" | qsub -l nodes=1,walltime=2:00:00,mem=2gb -q batch;done
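
All three schedulers above submit a run.sh script. A minimal sketch of what such a script might contain, assuming the tools noted earlier (BWA, Samtools, GATK4) are on the AMI and using hypothetical reference and FASTQ paths under /genomes:

#!/bin/bash
#Per-sample sketch: the reference and FASTQ paths are hypothetical, adjust them to your data layout
set -euo pipefail

SAMPLE=$1
REF=/genomes/reference/hg38/Homo_sapiens_assembly38.fasta
FQ=/genomes/temp/fastq
OUT=/genomes/temp/output
mkdir -p "$OUT"

#Align, attach a read group (required by GATK), then sort and index
bwa mem -t 4 -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA" \
    "$REF" "$FQ/${SAMPLE}_1.fastq.gz" "$FQ/${SAMPLE}_2.fastq.gz" \
  | samtools sort -@ 4 -o "$OUT/${SAMPLE}.sorted.bam" -
samtools index "$OUT/${SAMPLE}.sorted.bam"

#Per-sample variant calling with GATK4
gatk HaplotypeCaller -R "$REF" -I "$OUT/${SAMPLE}.sorted.bam" \
    -O "$OUT/${SAMPLE}.g.vcf.gz" -ERC GVCF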

AMI

P.S.: You can customize your own AMI by following the official documentation.
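
One common approach (a sketch of one method; the official documentation covers others) is to launch an instance from one of the base AMIs below, install your extra tools on it, and then snapshot it into a new AMI with the standard EC2 call; the instance ID and AMI name here are placeholders:

#After installing your tools on a running instance launched from a base AMI:
aws ec2 create-image --instance-id <instance-id> --name "gatk-custom-ami" \
    --description "base AMI plus site-specific tools" --region cn-northwest-1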

DEMO

Reference:

• AMI versions:

| System | Version | pcluster version | AMI ID | Description | Region | Public | Available | Remark |
|---|---|---|---|---|---|---|---|---|
| alinux | 0.2 | 2.5.1 | ami-08872563ba80e5a5a | basic tools | BJS | Y | Y | |
| alinux | 0.2 | 2.5.1 | ami-0c699afa91eb1d073 | basic tools | ZHY | Y | Y | |
| alinux-base | | 2.3.1 | ami-0e58e06d5b958ccb6 | basic AMI | BJS | Y | Y | |
| ubuntu-base | 16.04 | 2.3.1 | ami-0a9c1879e6583621e | basic AMI | BJS | Y | Y | |
| alinux | 0.1 | 2.3.1 | ami-0997595bce93c6e7b | basic tools | BJS | Y | Y | |
| alinux | 0.2 | 2.3.1 | ami-0cad4e9d804bd9c15 | basic tools + Golang tool + goofys; fixed pip issue; installed awscli; fixed the goofys mount failure and installed the fuse dependency lib | BJS | Y | Y | |
| alinux | 0.2 | 2.4.0 | ami-0b876120ec98b9a7c | basic tools | BJS | Y | Y | |
| ubuntu | 0.1 | 2.3.1 | ami-097d3bf901991372e | basic tools | BJS | Y | Y | |
| ubuntu | 0.2 | 2.3.1 | ami-041e4a3bce09385b9 | changed shell (dash) to bash | BJS | Y | Y | stopped updating |
| ubuntu | 0.2-a | 2.3.1 | ami-026882b56146cdc1b | basic tools + Golang tool + goofys | BJS | Y | Y | |
| alinux | 0.1 | 2.3.1 | ami-007f6ed61542ae017 | basic tools | ZHY | Y | Y | |
| alinux | 0.2 | 2.4.0 | ami-005db8a58ebd4e9a4 | basic tools | ZHY | Y | Y | |
| ubuntu | 0.1 | 2.3.1 | ami-0a1d99c2c70e3f86c | basic tools | ZHY | Y | N | |
| ubuntu | 0.2 | 2.3.1 | ami-071aa7a2927cc02a8 | changed shell (dash) to bash | ZHY | Y | N | stopped updating |
| ubuntu | 0.2-a | 2.3.1 | ami-015f3a018cc98b6cc | basic tools + Golang tool + goofys | ZHY | Y | N | |

• EBS snapshot versions:

| Name | Version | Snapshot ID | Size | Description | Region |
|---|---|---|---|---|---|
| gatk-reference v0.1 | 0.1 | snap-09c16ac9809cf4359 | 100 GB | basic tools snapshot, including the hg19 database | BJS |
| gatk-reference v0.2 | 0.2 | snap-06f5e874571e44510 | 100 GB | added hg38 and the GATK database | BJS |
| gatk-reference-v0.3 | 0.3 | snap-08a4b975a2f40736f | 1 TB | added testing files and GATK-TEST-DATA | BJS |
| gatk-reference-v0.3 | 0.3 | snap-040c71fd2bb5d4236 | 1 TB | added testing files and GATK-TEST-DATA | ZHY |

FAQ