This pipeline includes the software tools for estimating pairwise kinship coefficients starting from the sequencing datasets. It is based on . The following steps are performed:
- Variant calling
To address missing data and genotype uncertainty, we call variants with samtools mpileup and then process them with the LD-based genotype calling algorithm in BEAGLE.
- Ancestry estimation
For each sequenced genome (in BAM), we use LASER to estimate individual ancestry background given an external ancestry reference panel.
- Kinship estimation
We use SEEKIN to estimate pairwise kinship coefficients for homogeneous/heterogeneous samples.
To estimate kinship coefficients with this pipeline, first prepare a configuration file; see example/test.conf for an example. In this file, each line specifies one parameter, followed by the parameter value.
- BAM_LIST: aligned sequencing reads in BAM format. Each BAM file should contain one sample per subject and must be indexed using samtools index or an equivalent tool. See example/sample.bam.lst for an example.
- VCF_SITE_FILE: candidate variant sites in VCF format. This file specifies the regions for which the samtools mpileup output is generated, and it can include markers from the 1000 Genomes Project. See example/EAS.panel.sites.vcf.gz for an example.
- BEAGLE_REF_LST: list of external reference panel files for BEAGLE imputation (one VCF file per chromosome). See example/EAS_file_list.txt for an example.
The other parameters are explained by their accompanying comments. More details can be found in the SEEKIN, LASER and BEAGLE manuals. Remember to update the software paths so that they point to the installations on your own machine.
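As a rough illustration of the "one parameter per line, followed by its value" layout, the sketch below parses such a file into a dictionary and reports which of the three parameters described above are absent. The exact delimiter and comment syntax of test.conf are assumptions here (whitespace or "=" as separator, "#" for comments), and parse_conf, missing_params and DOCUMENTED_PARAMS are hypothetical names that are not part of the pipeline.

```python
# Parameters described in this README; whether others are mandatory is not assumed.
DOCUMENTED_PARAMS = ("BAM_LIST", "VCF_SITE_FILE", "BEAGLE_REF_LST")


def parse_conf(text):
    """Parse 'NAME value' lines into a dict (file syntax is an assumption)."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed: '#' starts a comment
        if not line:
            continue
        # Accept either "NAME value" or "NAME=value".
        parts = line.replace("=", " ", 1).split(None, 1)
        params[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return params


def missing_params(params):
    """Return the documented parameters that are absent or empty."""
    return [name for name in DOCUMENTED_PARAMS if not params.get(name)]
```

Running missing_params(parse_conf(open("test.conf").read())) before generating the job files could catch an incomplete configuration early.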
Then generate the job files by running:
$ python $pipelinePath/lib/GetConf.py -c test.conf -o run.yaml
After this step, all the jobs that need to be run can be found in the folder ./jobfiles.
To perform variant calling, run:
$ snakemake -s $pipelinePath/Snakefile --jobs 100 varCall --rerun-incomplete --timestamp --printshellcmds --stats logs/snakemake.stats --configfile run.yaml --latency-wait 60 --cluster-config cluster.GIS.yaml --drmaa " -pe OpenMP {threads} -l mem_free={cluster.mem} -l h_rt={cluster.time} -cwd -v PATH -e logs -o logs -w n" --jobname "SEEKIN.slave.{rulename}.{jobid}.sh" >> logs/snakemake.log 2>&1
The generated genotype file will be available at ./snp/Beagle.gp.vcf.gz.
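A quick sanity check on the genotype file is to count its samples and variant records. The sketch below does this with the Python standard library only; it relies on the generic VCF layout (nine fixed columns before the sample columns) rather than on anything specific to this pipeline, and summarize_vcf is a hypothetical helper name.

```python
import gzip


def summarize_vcf(path):
    """Return (number of samples, number of variant records) in a gzipped VCF."""
    n_samples = 0
    n_variants = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            if line.startswith("#CHROM"):
                # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
                n_samples = max(len(line.rstrip("\n").split("\t")) - 9, 0)
            elif line.strip():
                n_variants += 1
    return n_samples, n_variants
```

For example, summarize_vcf("./snp/Beagle.gp.vcf.gz") should report as many samples as there are entries in the BAM list, assuming every sample passed variant calling.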
To perform ancestry estimation, run:
$ snakemake -s $pipelinePath/Snakefile --jobs 10 laser --rerun-incomplete --timestamp --printshellcmds --stats logs/snakemake.stats --configfile run.yaml --latency-wait 60 --cluster-config cluster.GIS.yaml --drmaa " -pe OpenMP {threads} -l mem_free={cluster.mem} -l h_rt={cluster.time} -cwd -v PATH -e logs -o logs -w n" --jobname "SEEKIN.slave.{rulename}.{jobid}.sh" >> logs/snakemake.log 2>&1
The generated PCA coordinate file of the study samples will be available at ./laser/laser.seqPC.coord.
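To inspect the ancestry coordinates programmatically, a small reader along the following lines may help. It assumes the coordinate file is a whitespace-delimited table with a header row that contains an indivID column and one or more columns whose names start with PC; check this against the LASER manual before relying on it. The function name read_pc_coords is hypothetical.

```python
def read_pc_coords(text):
    """Map individual ID -> list of PC coordinates from a coordinate table."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    id_col = header.index("indivID")  # assumed column name
    # Assumed: principal-component columns are named PC1, PC2, ...
    pc_cols = [i for i, name in enumerate(header) if name.startswith("PC")]
    coords = {}
    for ln in lines[1:]:
        fields = ln.split()
        coords[fields[id_col]] = [float(fields[i]) for i in pc_cols]
    return coords
```

Calling read_pc_coords(open("./laser/laser.seqPC.coord").read()) would then give one coordinate list per study sample, ready for plotting against the reference panel.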
To perform kinship estimation, run:
$ snakemake -s $pipelinePath/Snakefile --jobs 1 seekin --rerun-incomplete --timestamp --printshellcmds --stats logs/snakemake.stats --configfile run.yaml --latency-wait 60 --cluster-config cluster.GIS.yaml --drmaa " -pe OpenMP {threads} -l mem_free={cluster.mem} -l h_rt={cluster.time} -cwd -v PATH -e logs -o logs -w n" --jobname "SEEKIN.slave.{rulename}.{jobid}.sh" >> logs/snakemake.log 2>&1
The generated output will be available at ./seekin.
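The three snakemake invocations above differ only in the target rule (varCall, laser, seekin) and the --jobs value (100, 10, 1). The sketch below builds the shared argument list once so the steps can be run in order from a script. All flags are copied from the commands above; snakemake_cmd, STEPS and the /path/to/pipeline placeholder are hypothetical, and the log redirection (>> logs/snakemake.log 2>&1) is left to the caller.

```python
import shlex


def snakemake_cmd(target, jobs, pipeline_path, configfile="run.yaml"):
    """Build the snakemake argument list for one pipeline step."""
    # Plain strings: the braces are placeholders filled in by snakemake itself.
    drmaa = (" -pe OpenMP {threads} -l mem_free={cluster.mem} "
             "-l h_rt={cluster.time} -cwd -v PATH -e logs -o logs -w n")
    return [
        "snakemake", "-s", f"{pipeline_path}/Snakefile",
        "--jobs", str(jobs), target,
        "--rerun-incomplete", "--timestamp", "--printshellcmds",
        "--stats", "logs/snakemake.stats",
        "--configfile", configfile,
        "--latency-wait", "60",
        "--cluster-config", "cluster.GIS.yaml",
        "--drmaa", drmaa,
        "--jobname", "SEEKIN.slave.{rulename}.{jobid}.sh",
    ]


# (target rule, --jobs value) in the order the steps are described above.
STEPS = [("varCall", 100), ("laser", 10), ("seekin", 1)]

for target, jobs in STEPS:
    # Replace the placeholder with the real pipeline location.
    print(shlex.join(snakemake_cmd(target, jobs, "/path/to/pipeline")))
```

Passing each list to subprocess.run(..., check=True) instead of printing it would stop the sequence as soon as one step fails, so that later steps do not start from incomplete output.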
For further questions, please contact Jinzhuang Dou ([email protected]) and Chaolong Wang ([email protected]).