Skip to content

4. Demo: SNP heritability estimation for Biobank

Kangcheng Hou edited this page Feb 27, 2019 · 2 revisions

SNP-heritability estimation for Biobank

We demonstrate a walkthrough on how to estimate SNP-heritability for GWAS in Biobank scale (e.g. UK Biobank) starting from genotype and phenotype data.

Data

Prepare the following data:

  • all_bfile: Genotype for the unrelated individuals in UK Biobank after quality control. This contains genome-wide data in one file.
  • pheno_file: the phenotype file.
  • covar_file: the covariates which will be used for association studies.

Note on the data size

GRE is designed for estimating SNP-heritability at Biobank scale. Only genotyped SNPs should be included. Also make sure that number of individuals > number of snps on each chromosome. In our simulation studies, we find when the #indiv=337k, genome-wide #snps=600k, largest chromosome #snps=60k, GRE works great.

Note on the scripts

We will provide two version of scripts, one is parallel version which you want to use to submit to the clusters. The other one is sequential version for better illustration. The memory and time requested shown in the scripts are ones we use to experiment with #indiv=337k, genome-wide snps=600k. You might need to tune them according to the size of the data. If you have questions, feel free to contact us.

You can change the name used in the following scripts at your convenience. But be sure to note the consistency.

Sumstats

To use GRE, be sure to perform the association studies with the exactly same genotype you use to compute the LD. We leave GRE with reference panel LD for future work.

We compute the OLS sumstats using plink.

sequential version

num_cols=$(head -1 $covar_file | awk '{print NF}')
num_cols=$((num_cols-2))

for chr_i in $(seq 1 22)
do
plink \
  --allow-no-sex \
  --bfile ${all_bfile} \
  --chr ${chr_i} \
  --ci 0.95 \
  --covar ${covar_file} \
  --covar-number 1-${num_cols} \
  --linear hide-covar \
  --out ./assoc/chr${chr_i} \ 
  --pheno ${pheno_file}
done

parallel version

assoc.sh

#!/bin/sh
#$ -l h_data=20G,h_rt=24:00:00,highp
#$ -cwd
#$ -j y
#$ -o ./job_out
#$ -t 1-22:1

source /u/local/Modules/default/init/modules.sh
module load plink

chr_i=$SGE_TASK_ID
num_cols=$(head -1 $covar_file | awk '{print NF}')
num_cols=$((num_cols-2))

plink --bfile $all_bfile \
  --chr $chr_i \
  --linear hide-covar \
  --ci 0.95 \
  --pheno $pheno_file \
  --allow-no-sex \
  --covar $covar_file \
  --covar-number 1-${num_cols} \
  --out ./assoc/chr${chr_i}
qsub assoc.sh

You should now get 22 association file. Each contains the OLS sumstats for SNPs in 22 chromosome.

Estimating SNP-h2g using GRE

Download GRE

Note that GRE is now in active development. It is recommended to update it to the latest version and check this wiki when using. Feel free to contact us if you have any questions or open an issue.

cd /path/to/working/directory
git clone https://github.com/bogdanlab/h2-GRE.git

Divide the whole genotype into 22 parts by chromosome using plink

mkdir genotype

for chr_i in $(seq 1 22)
do
    plink \
    --bfile $all_bfile
    --chr ${chr_i}
    --make-bed
    --out genotype/chr${chr_i}
done

Create 22 directory to store intermediate files used by GRE

mkdir ld
for chr_i in $(seq 1 22)
do
    mkdir ld/chr${chr_i}
done

Compute the mean and standard deviation for each genotype file

sequential version

gre_path=./h2-GRE/gre/gre.py
for chr_i in $(seq 1 22)
do
    python ${gre_path} mean-std --bfile ./genotype/chr${chr_i} --chunk-size 500 --ld-dir ./ld/chr${chr_i}
done

parallel version

mean_std.sh

#!/bin/sh
#$ -cwd
#$ -j y
#$ -l h_data=8G,h_rt=1:00:00
#$ -o ./job_out
#$ -t 1-22

chr_i=$SGE_TASK_ID
bfile=./genotype/chr${chr_i}
ld_dir=./ld/chr${chr_i}
chunk_size=300
gre_path=./h2-GRE/gre/gre.py
python ${gre_path} mean-std --bfile=${bfile} --chunk-size=${chunk_size} --ld-dir=${ld_dir}
qsub mean_std.sh

This step will produce a file containing the mean and standard deviation for each genotype file. chunk_size specifies how much data we read in at a time.

In the following few steps, we compute the LD matrix efficiently. We achieve this by seperating the big task, e.g., computing the 60k x 60k SNP covariance matrix into computing 3600 1k x 1k SNP covariance matrix.

Cutting the task of computing LD matrix into several tasks

for chr_i in $(seq 1 22)
do
    python ${gre_path} cut-ld --bfile ./genotype/chr${chr_i} --chunk-size 1000 --ld-dir ./ld/chr${chr_i}
    python ${gre_path} check-ld ./ld/chr${chr_i}
done

This will output two files part.info, part.missing in each directory ./ld/chr${chr_i}, which specifies the sub tasks needed to compute the LD.

Computing each part of covariance matrix

sequential version

for chr_i in $(seq 1 22)
do
    bfile=./genotype/chr${chr_i}
    ld_dir=./ld/chr${chr_i}
    line_num=$(wc -l < ${ld_dir}/part.missing)
    for line in $(seq 1 ${line_num})
    do
        task_id=$(awk 'NR == n' n=$line ${ld_dir}/part.missing)
        python ${gre_path} calc-ld --bfile=${bfile} --part-i=${task_id} --ld-dir=${ld_dir}
    done
done
done

parallel version

calc_ld.sh

#!/bin/sh
#$ -cwd
#$ -l h_data=16G,h_rt=0:40:00
#$ -j y
#$ -o ./job_out

chr_i=$1
batch_size=$2

bfile=./genotype/chr${chr_i}
ld_dir=./ld/${chr_i}
gre_path=./h2-GRE/gre/gre.py
for i in $(seq 1 ${batch_size})
do
    line=$((SGE_TASK_ID + i - 1))
    # part.missing contains undone tasks
    task_id=$(awk 'NR == n' n=$line ${ld_dir}/part.missing)
    python ${gre_path} calc-ld --bfile=${bfile} --part-i=${task_id} --ld-dir=${ld_dir}
done

run_calc_ld.sh

batch_size=25
for chr_i in $(seq 1 22)
do
    ld_dir=./ld/${chr_i}
    part_num=$(wc -l < ${ld_dir}/part.missing)
    if [ ${part_num} -ne 0 ]; then
        qsub -t 1-${part_num}:${batch_size} calc_ld.sh ${chr_i} ${batch_size}
    fi
done
done
bash run_calc_ld.sh

check if jobs are done and resubmit the jobs if necessary

Some of jobs might not finish due to slow nodes on the cluster. We update the part.missing list first.

for chr_i in $(seq 1 22)
do
    python ${gre_path} check-ld --ld-dir ./ld/chr${chr_i}
done

And calculate the ld for each part by bash run_calc_ld.sh. Repeat this until you see no tasks being submitted.

Merging parts of the covariance matrix

Now we are close to finishing it! We will merge each parts of LD and compute the pseudo-inverse of LD.

sequential version

for chr_i in $(seq 1 22)
do
    python ${gre_path} merge-ld --bfile=./genotype/chr${chr_i} --ld-dir ./ld/chr${chr_i}
done

parallel version

merge_ld.sh

#!/bin/sh
#$ -cwd
#$ -j y
#$ -o ./job_out
#$ -m a

bfile=$1
ld_dir=$2
gre_path=./h2-GRE/gre/gre.py
python ${gre_path} merge-ld --bfile=${bfile} --ld-dir=${ld_dir}
for chr_i in $(seq 1 22)
do
    ld_dir=./ld/chr${chr_i}
    bfile=./genotype/chr${chr_i}
    if [ $chr_i -le 6 ]
    then
        qsub -l h_data=60G,h_rt=4:00:00,highp merge_ld.sh ${bfile} ${ld_dir}
    elif [ $chr_i -ge 7 -a $chr_i -le 12 ]
    then
        qsub -l h_data=40G,h_rt=3:00:00,highp merge_ld.sh ${bfile} ${ld_dir}
    elif [ $chr_i -ge 13 -a $chr_i -le 20 ]
    then
        qsub -l h_data=24G,h_rt=2:00:00,highp merge_ld.sh ${bfile} ${ld_dir}
    else
        qsub -l h_data=10G,h_rt=1:00:00,highp merge_ld.sh ${bfile} ${ld_dir}
    fi
done

Now we have all the ingredients needed for GRE and we are ready to get the SNP-heritability.

Estimating SNP-heritability

Converting the sumstats in plink format into LDSC format.

for chr_i in $(seq 1 22)
do
    python ./h2-GRE/utils/utils.py sumstats-plink2ldsc --plink-sumstats ./assoc/chr${chr_i}.assoc.linear --legend ./genotype/chr${chr_i}.bim --out ./assoc/chr${chr_i}.ldsc.txt
done

Estimate SNP-heritability

sequential version

for chr_i in $(seq 1 22)
do
    legend=./genotype/chr${chr_i}.bim
    ld_dir=./ld/chr${chr_i}
    sumstats=./assoc/chr${chr_i}.ldsc.txt
    python gre.py estimate --legend=${legend} --ld-dir=${ld_dir} --sumstats=${sumstats} > ./estimate/chr${chr_i}.txt
done

parallel version

estimate.sh

#!/bin/sh
#$ -cwd
#$ -l h_data=24G,h_rt=2:00:00,highp
#$ -j y
#$ -o ./job_out
#$ -t 1-22:1

chr_i=${SGE_TASK_ID}
legend=./genotype/chr${chr_i}.bim
ld_dir=./ld/chr${chr_i}
sumstats=./assoc/chr${chr_i}.ldsc.txt
python gre.py estimate --legend=${legend} --ld-dir=${ld_dir} --sumstats=${sumstats} > ./estimate/chr${chr_i}.txt
qsub estimate.sh

./estimate/chr${chr_i}.txt contains the estimate and standard deviation of GRE for each chromosome. Adding these up provides you the genome-wide SNP-heritability!