-
Notifications
You must be signed in to change notification settings - Fork 3
/
08_GATKrunReport.tex
406 lines (343 loc) · 18 KB
/
08_GATKrunReport.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
%% LyX 2.0.6 created this file. For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass{article}\usepackage{graphicx, color}
%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
\ifdim\Gin@nat@width>\linewidth
\linewidth
\else
\Gin@nat@width
\fi
}
\makeatother
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\definecolor{fgcolor}{rgb}{0.2, 0.2, 0.2}
\newcommand{\hlnumber}[1]{\textcolor[rgb]{0,0,0}{#1}}%
\newcommand{\hlfunctioncall}[1]{\textcolor[rgb]{0.501960784313725,0,0.329411764705882}{\textbf{#1}}}%
\newcommand{\hlstring}[1]{\textcolor[rgb]{0.6,0.6,1}{#1}}%
\newcommand{\hlkeyword}[1]{\textcolor[rgb]{0,0,0}{\textbf{#1}}}%
\newcommand{\hlargument}[1]{\textcolor[rgb]{0.690196078431373,0.250980392156863,0.0196078431372549}{#1}}%
\newcommand{\hlcomment}[1]{\textcolor[rgb]{0.180392156862745,0.6,0.341176470588235}{#1}}%
\newcommand{\hlroxygencomment}[1]{\textcolor[rgb]{0.43921568627451,0.47843137254902,0.701960784313725}{#1}}%
\newcommand{\hlformalargs}[1]{\textcolor[rgb]{0.690196078431373,0.250980392156863,0.0196078431372549}{#1}}%
\newcommand{\hleqformalargs}[1]{\textcolor[rgb]{0.690196078431373,0.250980392156863,0.0196078431372549}{#1}}%
\newcommand{\hlassignement}[1]{\textcolor[rgb]{0,0,0}{\textbf{#1}}}%
\newcommand{\hlpackage}[1]{\textcolor[rgb]{0.588235294117647,0.709803921568627,0.145098039215686}{#1}}%
\newcommand{\hlslot}[1]{\textit{#1}}%
\newcommand{\hlsymbol}[1]{\textcolor[rgb]{0,0,0}{#1}}%
\newcommand{\hlprompt}[1]{\textcolor[rgb]{0.2,0.2,0.2}{#1}}%
\usepackage{framed}
\makeatletter
\newenvironment{kframe}{%
\def\at@end@of@kframe{}%
\ifinner\ifhmode%
\def\at@end@of@kframe{\end{minipage}}%
\begin{minipage}{\columnwidth}%
\fi\fi%
\def\FrameCommand##1{\hskip\@totalleftmargin \hskip-\fboxsep
\colorbox{shadecolor}{##1}\hskip-\fboxsep
% There is no \\@totalrightmargin, so:
\hskip-\linewidth \hskip-\@totalleftmargin \hskip\columnwidth}%
\MakeFramed {\advance\hsize-\width
\@totalleftmargin\z@ \linewidth\hsize
\@setminipage}}%
{\par\unskip\endMakeFramed%
\at@end@of@kframe}
\makeatother
\definecolor{shadecolor}{rgb}{.97, .97, .97}
\definecolor{messagecolor}{rgb}{0, 0, 0}
\definecolor{warningcolor}{rgb}{1, 0, 1}
\definecolor{errorcolor}{rgb}{1, 0, 0}
\newenvironment{knitrout}{}{} % an empty environment to be redefined in TeX
\usepackage{alltt}
\usepackage[sc]{mathpazo}
\usepackage[T1]{fontenc}
\usepackage{geometry}
\geometry{verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=2.5cm,rmargin=2.5cm}
\setcounter{secnumdepth}{1}
\setcounter{tocdepth}{1}
\usepackage{url}
\usepackage[unicode=true,pdfusetitle,
bookmarks=true,bookmarksnumbered=true,bookmarksopen=true,bookmarksopenlevel=2,
breaklinks=false,pdfborder={0 0 1},backref=false,colorlinks=false]
{hyperref}
\hypersetup{
pdfstartview={XYZ null null 1}}
\usepackage{breakurl}
\usepackage{listings}
\lstloadlanguages{Python}
\usepackage{pgfplotstable}
\usepackage{csvsimple}
\begin{document}
\definecolor{keywords}{RGB}{255,0,90}
\definecolor{comments}{RGB}{0,0,113}
\definecolor{red}{RGB}{160,0,0}
\definecolor{green}{RGB}{0,150,0}
\lstset{language=Python,
basicstyle=\ttfamily\small,
keywordstyle=\color{keywords},
commentstyle=\color{comments},
stringstyle=\color{red},
showstringspaces=false,
identifierstyle=\color{green},
procnamekeys={def,class}}
\renewcommand\thesection{\arabic{section}}
%%% for TOC with numbering
\begin
\title{GATK run report}
\author{Vikas Gupta, Niraj Shah and Stig U. Andersen}
\maketitle
\setcounter{secnumdepth}{6}
\setcounter{tocdepth}{6}
\tableofcontents
\newpage
GATK is an acronym for Genome analysis toolkit, a detailed documentation can be found at \url {http://www.broadinstitute.org/gatk/guide/}. \textbf{GATK} pipeline can used for detecting the variants from multi-sample data. One of the problem with the variant detection is the misaligned reads, this issue has been addressed in GATK by considering the per-base quality score which represents the probability that the base called in the read is true sequenced base. These per-base quality scores are quite inaccurate (Mark A DePristo et.al 2011) and depends on the experimental setup and technology used. These errors are eliminated from the GATK using empirically accuarate per-base error model.
\newline
% add dot after number
\makeatletter
\g@addto@macro\thesection.
\makeatother
\section{Introduction}
Here we have utilized sequencing data from MG20, Gifu, Burttii and 28 additional accessions. Using this data we aim to annotate the genetic variants such as SNPs and Indels in the \textit {Lotus japonicus}.
\section{WorkFlow}
\begin{figure}[hb]
\centering
\includegraphics[scale=0.60]{/Users/vgupta/Desktop/03_Lotus_annotation/2013_week35/BPP-wRR-white_small.png}
\caption{\label{GATK workflow}Three sections of GATK pipeline}
\end{figure}
We have divided work here into four major points:
\newline 1. Initial read mapping
\newline 2. Local realignment around indels
\newline 3. Base quality score recalibration
\newline 4. SNP and indel detection
\subsection{Fastq Mapping}
Fastq files were mapped to the Lotus genome 3.0 using the BWA version-0.7.5a-r405 with default parameters and with appropriate insert size for the pair-end libraries. GATK pipeline does not support files without read groups so these were added whenever necessary with Picard \textit {AddOrReplaceReadgroup} function. All the mapped files are placed at genome.au.dk.
\url {/home/vgupta/LotusGenome/100_data/01_Jin_BamFile/02_withReadGroup}.
\subsection{Duplicate Marking}
Duplicates were filtered using Picard toolkit version 1.96. Two things to remember when using Picard for duplicate filtering:
\newline 1. Set \textit{VALIDATION\_STRINGENCY=LENIENT} otherwise Picard will not accept umapped read positions.
\newline 2. Use \textit{TMP\_DIR} variable to a folder where you have sufficient space.\newline
\subsection{Realiging the reads}
Duplicate filtered reads are mapped back on the genome using RealignerTargetCreator and IndelRealigner commands from the GATK. All the fastq files must follow the sanger quality encoding. IndelRealigner locally aligns the reads to minimize the mismatches. Also many reads are misaligned due to presence of insertions and deletions, leading to many false discoveries of SNPs.
\subsection{Unified genotyper}
Realigned bam files are parsed through the unified genotyper to procude a primary list of snp and indel list. -glm BOTH option must be used to predict indels together with SNPs. A higher degree of calling confidence cut-off can be used to insure minimal false positives. Unified genotyper uses a Bayesian genotype likelihood model to find genotypes and allele frequency in a population of N samples. It provides a posterior probability of there being a segregating variant allele at each locus as well as for the genotype of each sample.
\subsection{Base Recalibrator}
In this step, base quality scores are recalibrated. Given a set of high confidence SNPs from the earlier step, program calculates an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations.
% <<plot_PhredQualDistri>>=
% d <- read.table(
% pipe("cut -f6 ~/Desktop/03_Lotus_annotation/2013_week35/snp.90.vcf | grep -v '^#'")
% ,sep='\t',header=T)
% length(d[which(d$QUAL>5000),])
% tdens=density(log10(d$QUAL))
% xlab = "Log10 PhredQual"
% ylab = "Density"
% main = "PhredQual distribution"
% plot(tdens, lwd = 5, col = "red", ylab = ylab, xlab = xlab, main = main,
% cex.main = 2, cex.lab = 2, cex.axis = 2, mar = c(0.5, 0.5, 0.5, 0.5))
% @
\pagebreak
\section{Callable loci}
Genome-wide Callable loci destribution
%
% <<SNP destribution Sliding Window>>=
% jpeg("~/Desktop/03_Lotus_annotation/2013_week35/CallLociWindow.jpg", width=2600,
% height=2000, quality=90, units="px")
% chr=c("chr1","chr2","chr3","chr4","chr5","chr6")
% #chr=c("chr1")
% max_chr_length=67000000
% par(mfcol=c(6,1), mar=c(2.5,2.5,2.5,2.5), oma=c(1,2,1,1))
% tmp <- read.table(
% '~/Desktop/03_Lotus_annotation/2013_week35/20130905.snp.vcf.MG20filteredMoreThanMG20coverage.tbl2.nochr0.w1000.s1000',
% header=F, sep="\t", ,comment.char = "#")
%
% head(tmp)
% names(tmp)
%
% for(i in 1:length(chr)) {
% tmp_chr <- tmp[which(chr[i]==tmp$V1),]
% snpden_smooth <- lowess(tmp_chr$V2,tmp_chr$V3,f=1/10)
% plot(tmp_chr$V2,tmp_chr$V3, type="p", main="snp Density",
% xlim=c(0,max_chr_length), col="red", las=1, ann="FALSE", cex=0.3,
% cex.axis=2)
% #abline(h=0, lwd=2)
% legend("topright",chr[i],cex=2.8,col=1:20,lty=1:20)
% lines(snpden_smooth, lwd=5, col="blue")
% }
% dev.off()
% @
\pagebreak
\section{SNP destribution}
Genome-wide SNP density destribution
% <<SNP destribution>>=
% jpeg("~/Desktop/03_Lotus_annotation/2013_week35/snpDenWindow.jpg", width=2600,
% height=2000, quality=90, units="px")
% chr=c("chr1","chr2","chr3","chr4","chr5","chr6")
% max_chr_length=67000000
% par(mfcol=c(6,1), mar=c(2.5,2.5,2.5,2.5), oma=c(1,2,1,1))
% tmp <- read.table(
% '~/Desktop/03_Lotus_annotation/2013_week35/20130905.snp.vcf.MG20filteredMoreThanMG20coverage.tbl2.nochr0.w1000.s1000',
% header=F, sep="\t", ,comment.char = "#")
%
% head(tmp)
% names(tmp)
%
% for(i in 1:length(chr)) {
% tmp_chr <- tmp[which(chr[i]==tmp$V1),]
% snpden_smooth <- lowess(tmp_chr$V2,tmp_chr$V4,f=1/1000)
% plot(tmp_chr$V2,tmp_chr$V4, type="p", main="snp Density",
% xlim=c(0,max_chr_length), ylim = c(0,0.8), col="red", las=1, ann="FALSE", cex=0.3,
% cex.axis=2)
% #abline(h=0, lwd=2)
% legend("topright",chr[i],cex=2.8,col=1:20,lty=1:20)
% lines(snpden_smooth, lwd=5, col="blue")
% }
% dev.off()
% @
\begin{figure}[hb]
\centering
\includegraphics[scale=0.20]{/Users/vgupta/Desktop/03_Lotus_annotation/2013_week35/snpDen.jpg}
\caption{\label{SNP distribution}SNP distribution}
\end{figure}
\pagebreak
% <<SNP destribution Sliding Window>>=
% jpeg("~/Desktop/03_Lotus_annotation/2013_week35/snpDenWindow.jpg", width=2600,
% height=2000, quality=90, units="px")
% chr=c("chr1","chr2","chr3","chr4","chr5","chr6")
% max_chr_length=67000000
% par(mfcol=c(6,1), mar=c(2.5,2.5,2.5,2.5), oma=c(1,2,1,1))
% tmp <- read.table(
% '~/Desktop/03_Lotus_annotation/2013_week35/snp.90.PhredQual_5000.vcf.snpden',
% header=T, sep="\t", ,comment.char = "#")
%
% head(tmp)
% names(tmp)
%
% for(i in 1:length(chr)) {
% tmp_chr <- tmp[which(chr[i]==tmp$CHROM),]
% snpden <- cbind(tmp_chr$BIN_START,tmp_chr$SNP_COUNT)
% snpden_smooth <- lowess(tmp_chr$BIN_START,tmp_chr$SNP_COUNT,f=1/1000)
% plot(tmp_chr$BIN_START, tmp_chr$SNP_COUNT, type="p", main="snp Density",
% xlim=c(0,max_chr_length), col="red", las=1, ann="FALSE", cex=0.3,
% cex.axis=2)
% #abline(h=0, lwd=2)
% legend("topright",chr[i],cex=2.8,col=1:20,lty=1:20)
% lines(snpden_smooth, lwd=10, col="blue")
% }
% dev.off()
% @
\pagebreak
\section{Heterozygosity destribution}
Genome-wide heterozygosity distribution
% <<Heterozygosity destribution>>=
%
% tmp <- read.table(
% pipe("cut -f 1,2,3,4,5 ~/Desktop/03_Lotus_annotation/2013_week35/sample.hwe"),
% header=T, comment.char = "#",row.names=NULL)
%
% jpeg("~/Desktop/03_Lotus_annotation/2013_week35/Heterozygosity.jpg",
% width=4600, height=6000, quality=90, units="px")
% chr=c("chr1","chr2","chr3","chr4","chr5","chr6")
% max_chr_length=67000000
% par(mfcol=c(18,1), mar=c(2.5,2.5,2.5,2.5), oma=c(1,2,1,1))
% head(tmp)
% names(tmp)
%
% for(i in 1:length(chr)) {
% tmp_chr <- tmp[which(chr[i]==tmp$CHR),]
% homo <- tmp_chr$OBS.HOM1 + tmp_chr$HOM2.
% total_count <- tmp_chr$OBS.HOM1 + tmp_chr$HET + tmp_chr$HOM2.
% fract <- tmp_chr$HET / total_count
%
% length(fract)
%
% plot(tmp_chr$POS, fract, type="p", main="snp Density", xlim=c(0,max_chr_length),
% ylim=c(0,1.2), col="red", las=1, ann="FALSE", cex=0.3, cex.axis=2)
% abline(h=0.5, lwd=2)
% snpden_smooth <- lowess(tmp_chr$POS,tmp_chr$HET,f=1/10000)
% plot(tmp_chr$POS, tmp_chr$HET, type="l", main="snp Density",
% xlim=c(0,max_chr_length), col="light blue", las=1, ann="FALSE", cex=0.3,
% cex.axis=2)
% lines(snpden_smooth, lwd=5, col="blue")
% snpden_smooth <- lowess(tmp_chr$POS,homo,f=1/10000)
% plot(tmp_chr$POS, homo, type="l", main="snp Density", xlim=c(0,max_chr_length),
% col="pink", las=1, ann="FALSE", cex=0.3, cex.axis=2)
% lines(snpden_smooth, lwd=5, col="red")
% #abline(h=0, lwd=2)
% legend("topright",chr[i],cex=2.8,col=1:20,lty=1:20)
% }
% dev.off()
% @
\begin{figure}[hb]
\centering
\includegraphics[scale=0.07]{/Users/vgupta/Desktop/03_Lotus_annotation/2013_week35/Heterozygosity.jpg}
\caption{\label{Heterozygosity distribution}Heterozygosity distribution}
\end{figure}
%\begin{table}
%\pgfplotstabletypeset[
% col sep=tab,
% string type,
% every head row/.style={before row=\hline,after row=\hline},
% every last row/.style={after row=\hline}]{/Users/vgupta/Desktop/03_Lotus_annotation/20130904_rawMappedReads.txt}
%\caption{\label{Reads mapped on Lotus genome}Reads mapped on Lotus genome}
%\end{table}
\pagebreak
\subsection{Mapping on Lotus Genome}
\begin{table}[!htbp]
\centering
\csvautotabular{/Users/vgupta/Desktop/03_Lotus_annotation/20130904_rawMappedReads.txt}
\caption{\label{Reads mapped on Lotus genome}Reads mapped on Lotus genome}
\end{table}
\pagebreak
\subsection{Average coverage on Lotus Genome}
\begin{table}[!htbp]
\centering
\csvautotabular{/Users/vgupta/Desktop/03_Lotus_annotation/2013_week35/Coverage.txt}
\caption{\label{Average Coverage of re-sequenced lines}Average Coverage of re-sequenced lines}
\end{table}
\pagebreak
\subsection{Inbreeding coefficient}
A good example of calculating the F (Inbreeding co-efficient) is explained at \url {http://www.uwyo.edu/dbmcd/popecol/maylects/fst.html}.
Hardy-Weinberg Equilibrium calculates values of p and q based on the dominant alleles and using p and q values calculates expected genotype counts. Based on the diffences on genotypic counts, we can catogerize a popultation as perfect fit, homozygtes excessive (inbreed) or homozygotes deficient (isolate breaking).
\section{Parsing VCF}
Note: Total depth on a given position doesn't add up to sum of individual sample read depth. Explanation is copied from \url {http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_annotator_Coverage.html}.
\newline While the sample-level (FORMAT) DP field describes the total depth of reads that passed the caller's internal quality control metrics (like MAPQ > 17, for example), the INFO field DP represents the unfiltered depth over all samples. Note though that the DP is affected by downsampling (-dcov), so the max value one can obtain for N samples with -dcov D is N * D.
\pagebreak
\subsection{Heterozygosity}
We have measured the level of heterozygosity using the final vcf file produced by GATK. It reports a total of 394,861,527 base-pairs which includes all the callable sites. 5,280,517 markers(SNPs and Indels) were detected. Expect a small fraction of 40,000, all 1.39 million makers from prior analysis were recovered. MG20 echotype has been inbreed for many generations and expected to be highly homozygous but due to error in the mapping and assembly, we found a high fraction of markers were heterozygous across the genome. To remove these potentially false positive markers, we filtered all the markers where genotype call was not 0/0 (reference based homozygous) for the MG20 resulted into a set of 3,989,842 markers among MG20, Gifu, Burttii and 28 re-sequenced. Lines in the analysis seems to highly homozygous hence can be analysed as haploid.
\begin{table}[!htbp]
\centering
\csvautotabular{/Users/vgupta/Desktop/03_Lotus_annotation/2013_week35/20130906_hetero.csv}
\caption{\label{Heterozygocity of all the lines}Heterozygocity of all the lines}
\end{table}
\pagebreak
\subsection{Marker filtering criteria}
Low quality mappings should be filtered before we consolidate a high confident marker list. We need a filtering criteria to remove false positive markers, similar filtering stradegy must be used for also considering the callable fraction of the genome. To startwith we are considering three parameters for the filtering, QUAL (phred-scaled quality score) values, FILTER categories and the MQ (RMS mapping quality).
\newline
\subsubsection{Counting filtering categories}
\newline [vgupta@fe1 02\_withReadGroup]$ cut -f 7\ 20130905.snp.vcf |\ sort |\ uniq -c
\newline 228,849,923\ .
\newline 166,011,565\ LowQual
\newline
\subsubsection{Average QUAL value}
\newline [vgupta@s01n14 02\_withReadGroup]$ grep -v '#' 20130905.snp.vcf | awk '{ if(min==""){min=max=$6}; if(($6)>max) {max=($6)}; if(($6)< min) {min=$6}; total+=($6); count+=1} END {print total,total/count, min, max}'
\newline Total: 3.86946e+12 Average:9,799.54 Min:. Max:34,359,763,362.97
\subsubsection{MG20 homozygous positions}
A total of 394,861,527 positions were reported using Unified Genotyper, of which, 307,385,387 positions were homozygous for MG20. We also filtered the vcf file for all the positions where MG20 was the only sample with the coverage, resulting positions will be refered as Callable Loci which consist a total of 301,306,971 genomic positions.
\subsubsection{Variant annotation}
Using SNPeff \url {http://snpeff.sourceforge.net/SnpEff_manual.html}, we have annotated 1,354,324 snps and 234,209 indels. VCF files for respective data are as following:
\newline \url {/u/pm/data/2013_09_11_lotus_snps/20130905.snp.vcf.markers.snps.inbreed.ref}
\newline \url {/u/pm/data/2013_09_11_lotus_snps/20130905.snp.vcf.markers.indels.inbreed.ref}
\pagebreak
\section{Python Script}
\lstinputlisting [numbers=left,style=Python,caption=GATK pipeline,
label={116_runGATK.py},
breaklines=true,]{/Users/vgupta/Desktop/script/python/116_runGATK.py}
\lstinputlisting [numbers=left,style=Python,caption=vcfParser,
label={119_vcfParser.py},
breaklines=true,]{/Users/vgupta/Desktop/script/python/119_vcfParser.py}
\lstinputlisting [numbers=left,style=Python,caption=classVCF,
label={classVCF.pyy},
breaklines=true,]{/Users/vgupta/Desktop/script/python/classVCF.py}
\end{document}