diff --git a/CHANGELOG.md b/CHANGELOG.md
index e67ce56..d27c8f3 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,21 +1,42 @@
# Change Log
+## [v1.0](https://github.com/IARCbioinfo/needlestack/tree/v1.0) (2016-08-03)
+[Full Changelog](https://github.com/IARCbioinfo/needlestack/compare/v0.3...v1.0)
+
+**Implemented enhancements:**
+
+- Manage the three possible genotypes in vcf [\#130](https://github.com/IARCbioinfo/needlestack/issues/130)
+- The graph showing AF vs log10\(qval\) should show phred-scaled qvalues [\#121](https://github.com/IARCbioinfo/needlestack/issues/121)
+
+**Fixed bugs:**
+
+- Contours seem to be incorrect [\#128](https://github.com/IARCbioinfo/needlestack/issues/128)
+- correct file name extraction for sample name [\#126](https://github.com/IARCbioinfo/needlestack/issues/126)
+- Let min\_qval be equal to 0 [\#119](https://github.com/IARCbioinfo/needlestack/issues/119)
+- plot improved error rate confidence interval [\#117](https://github.com/IARCbioinfo/needlestack/issues/117)
+
+**Closed issues:**
+
+- QUAL should not be reported as Inf in VCF when q-value=0 [\#125](https://github.com/IARCbioinfo/needlestack/issues/125)
+- Add pipeline execution DAG in README [\#123](https://github.com/IARCbioinfo/needlestack/issues/123)
+
## [v0.3](https://github.com/IARCbioinfo/needlestack/tree/v0.3) (2016-05-03)
[Full Changelog](https://github.com/IARCbioinfo/needlestack/compare/v0.2...v0.3)
**Implemented enhancements:**
+- color points by qvalues in regression plot [\#85](https://github.com/IARCbioinfo/needlestack/issues/85)
+- Add an option to directly input a region for calling in the command line [\#71](https://github.com/IARCbioinfo/needlestack/issues/71)
+- Improve the bed split method [\#47](https://github.com/IARCbioinfo/needlestack/issues/47)
- Change the number of entry in the INFO and FORMAT VCF fields [\#108](https://github.com/IARCbioinfo/needlestack/issues/108)
- Add contour lines for a set of qvalues in the plot [\#100](https://github.com/IARCbioinfo/needlestack/issues/100)
-- color points by qvalues in regression plot [\#85](https://github.com/IARCbioinfo/needlestack/issues/85)
- Add an option to choose output VCF file name \(--out\_vcf?\) [\#81](https://github.com/IARCbioinfo/needlestack/issues/81)
-- Add an option to directly input a region for calling in the command line [\#71](https://github.com/IARCbioinfo/needlestack/issues/71)
- Change the way we publish new version [\#69](https://github.com/IARCbioinfo/needlestack/issues/69)
- Make the stable docker file more stable [\#68](https://github.com/IARCbioinfo/needlestack/issues/68)
- Add more tests in CircleCI [\#55](https://github.com/IARCbioinfo/needlestack/issues/55)
- Remove unnecessary intermediate outputs [\#51](https://github.com/IARCbioinfo/needlestack/issues/51)
-- Improve the bed split method [\#47](https://github.com/IARCbioinfo/needlestack/issues/47)
- In the absence of a bed file the pipeline should run on the full reference genome [\#39](https://github.com/IARCbioinfo/needlestack/issues/39)
+- Improve R script command line parsing [\#38](https://github.com/IARCbioinfo/needlestack/issues/38)
- Add version numbers in VCF output [\#20](https://github.com/IARCbioinfo/needlestack/issues/20)
**Fixed bugs:**
@@ -34,7 +55,6 @@
**Implemented enhancements:**
- Add logo image [\#62](https://github.com/IARCbioinfo/needlestack/issues/62)
-- Improve R script command line parsing [\#38](https://github.com/IARCbioinfo/needlestack/issues/38)
- add --no\_indel option [\#56](https://github.com/IARCbioinfo/needlestack/issues/56)
- Correct english typos in readme, help and log [\#53](https://github.com/IARCbioinfo/needlestack/issues/53)
- The pipeline randomly crashes with java.nio.file.NoSuchFileException: XXX\_empty.pdf [\#49](https://github.com/IARCbioinfo/needlestack/issues/49)
diff --git a/README.md b/README.md
index ad0c05b..93db697 100644
--- a/README.md
+++ b/README.md
@@ -16,8 +16,9 @@ Contact: follm@iarc.fr
Needlestack is an ultra-sensitive multi-sample variant caller for Next Generation Sequencing (NGS) data. It is based on the idea that analysing several samples together can help estimate the distribution of sequencing errors to accurately identify variants. It has been initially developed for somatic variant calling using very deep NGS data from circulating free DNA, but is also applicable to lower coverage data like Whole Exome Sequencing (WES) or even Whole Genome Sequencing (WGS). It is a highly scalable and reproducible pipeline thanks to the use of [nextflow](http://www.nextflow.io/) and [docker](https://www.docker.com) technologies.
Here is a summary of the method:
+
- At each position and for each candidate variant, we model sequencing errors using a negative binomial regression with a linear link and a zero intercept. The data is extracted from the BAM files using [samtools](http://www.htslib.org).
-- Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use robust estimator for the negative binomial regression (published [here](http://www.ncbi.nlm.nih.gov/pubmed/25156188) with code available [here](https://github.com/williamaeberhard/glmrob.nb)).
+- Genetic variants are detected as being outliers from the error model. To avoid these outliers biasing the regression we use a robust estimator for the negative binomial regression (published [here](http://www.ncbi.nlm.nih.gov/pubmed/25156188) with code available [here](https://github.com/williamaeberhard/glmrob.nb)).
- We calculate for each sample a p-value for being a variant (outlier from the regression) that we further transform into q-values to account for multiple testing.
## Input
@@ -93,6 +94,8 @@ Needlestack works under most Linux distributions and Apple OS X.
## Detailed instructions
+### Nextflow and Docker
+
If you can't install [docker](https://www.docker.com) or don't want to use it, the pipeline will also work if you install [perl](https://www.perl.org), [bedtools](http://bedtools.readthedocs.org/en/latest/), [samtools](http://www.htslib.org) and Rscript from [R](https://www.r-project.org) and put them in your path (executables are assumed to be respectively called `perl`, `bedtools`, `samtools` and `Rscript`). In this case, remove the `-with-docker` option from step 5 above.
The exact same pipeline can be run on your computer or on a HPC cluster, by adding a [nextflow configuration file](http://www.nextflow.io/docs/latest/config.html) to choose an appropriate [executor](http://www.nextflow.io/docs/latest/executor.html). For example to work on a cluster using [SGE scheduler](https://en.wikipedia.org/wiki/Oracle_Grid_Engine), simply add a file named `nextflow.config` in the current directory (or `~/.nextflow/config` to make global changes) containing:
@@ -100,22 +103,21 @@ The exact same pipeline can be run on your computer or on a HPC cluster, by addi
process.executor = 'sge'
```
-Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation [here](http://www.nextflow.io/docs/latest/executor.html) for more details. Also have a look at the [other parameters for the executors](http://www.nextflow.io/docs/latest/config.html#scope-executor), in particular `queueSize` that defines the number of tasks the executor will handle in a parallel manner. Parallelism in needlestack is managed by splitting the genomic regions in pieces of equal sizes (`--nsplit`). Note that dealing with very large regions can take a large amount of memory, therefore splitting more is more memory-efficient. In nextflow the default number of tasks the executor will handle in a parallel is 100, which is certainly too high if you are executing it on your local machine (as if you use `--nsplit 100` the 100 pieces will run in parallel). In this case a good idea is to set it to the number of computing cores your local machine has. You can add this as an option at run time, by adding for example `-queue-size 4` in the `nextflow run` command if you have a machine with four cores. You can also permanently set it in the config file, here is a way to automatically obtain and add this information (works on Linux and Mac OS X):
-```bash
-echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` >> ~/.nextflow/config
-```
+Other popular schedulers such as LSF, SLURM, PBS, TORQUE etc. are also compatible. See the nextflow documentation [here](http://www.nextflow.io/docs/latest/executor.html) for more details. Also have a look at the [other parameters for the executors](http://www.nextflow.io/docs/latest/config.html#scope-executor), in particular `queueSize` that defines the number of tasks the executor will handle in a parallel manner. Parallelism in needlestack is managed by splitting the genomic regions in pieces of equal sizes (`--nsplit`). Note that dealing with very large regions can take a large amount of memory, therefore splitting more is more memory-efficient.
+
+### Parameters
-`--bam_folder` and `--fasta_ref` are compulsary. The optional parameters with default values are:
+Type `--help` to get the full list of options. `--bam_folder` and `--fasta_ref` are compulsary. The optional parameters with default values are:
| Parameter | Default value | Description |
|-----------|--------------:|-------------|
-| min_dp | 50 | Minimum coverage in at least one sample to consider a site |
-| min_ao | 5 | Minimum number of non-ref reads in at least one sample to consider a site |
+| min_dp | 30 | Minimum median coverage to consider a site. In addition, at least 10 samples have to be covered by min_dp. |
+| min_ao | 3 | Minimum number of non-ref reads in at least one sample to consider a site |
| nsplit | 1 | Split the bed file in nsplit pieces and run in parallel |
| min_qval | 50 | qvalue threshold in [Phred scale](https://en.wikipedia.org/wiki/Phred_quality_score) to consider a variant |
-| sb_type | SOR | Strand bias measure, either SOR or RVSB |
-| sb_snv | 100 | Strand bias threshold for SNVs (100 =no filter) |
-| sb_indel | 100 | Strand bias threshold for indels (100 = no filter)|
+| sb_type | SOR | Strand bias measure, either SOR, RVSB or FS |
+| sb_snv | 100 or 1000 | Strand bias threshold for SNVs (100 (1000 if FS) = no filter) |
+| sb_indel | 100 or 1000 | Strand bias threshold for indels (100 (1000 if FS) = no filter)|
| map_qual | 20 | Min mapping quality (passed to samtools) |
| base_qual | 20 | Min base quality (passed to samtools) |
| max_DP | 30000 | Downsample coverage per sample (passed to samtools) |
@@ -129,12 +131,39 @@ echo "executor.\$local.queueSize = "`getconf _NPROCESSORS_ONLN` >> ~/.nextflow/c
| out_vcf | all_variants.vcf | File name of final VCF |
| bed | | BED file containing a list of regions (or positions) where needlestack should be run |
| region | | A region in format CHR:START-END where calling should be done |
+| pairs_file | | A tab-delimited file containing two columns (normal and tumor sample names) for each sample in line. This enables matched tumor/normal pair calling features (see below) |
+| power_min_af | | Allelic fraction used to classify genotypes to 0/0 or ./. depending of the power to detect a variant at this fraction (see below) |
+| extra_robust_gl | | Add this argument to perform extra-robust regression (useful for common germline SNPs, see below) |
+| sigma_normal | 0.1 | Sigma parameter for negative binomial modeling of expected germline allelic fraction. We strongly recommend not to change this parameter unless you really know what it means |
+| input_vcf | | A VCF file (from GATK) where calling should be done. Needlestack will extract DP and AO from this VCF (DP and AD fields) and annotate it with phred q-value score (`FORMAT/QVAL` field), error rate (`INFO/ERR`) and over-dispersion sigma parameter (`INFO/SIG`). WARNING: by default, only work with split (coreutils) version > 8.13 |
By default, if neither `--bed` nor `--region` are provided, needlestack would run on whole reference, building a bed file from fasta index inputed.
If `--bed` and `--region` are both provided, it should run on the region only.
Simply add the parameters you want in the command line like `--min_dp 1000` for example to change the min coverage or `--all_SNVs` to output all sites.
-[Recommended values](http://gatkforums.broadinstitute.org/discussion/5533/strandoddsratio-computation) for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. For RVSB, a good starting point is to filter out variant with RVSB>0.85. There is no hard filter by default as this is easy to do afterward using [bcftools filter](http://samtools.github.io/bcftools/bcftools.html#filter) command.
+### Germline, somatic, matched Tumor-Normal pairs calling and contamination
+
+When using matched tumor/normal, Needlestack can classify variants (VCF `FORMAT/STATUS` field) according to the following table:
+![Status Table](STATUS_TABLE.png)
+
+For this one need to provide a tab-delimited file containing two columns with normal and tumor sample names using the `--pairs_file` option. The first line of this file is a header with `TUMOR` and `NORMAL` keywords. When one normal or one tumor is missing, one can write `NA`. In this mode, the parameter `power_min_af` defines the allelic fraction in the tumor one is trying to detect to classify genotypes as `./.` or `0/0` depending on the power to detect this allelic fraction. Variants found as somatic in a tumor, but germline in another sample of the series will be flagged as `POSSIBLE_CONTAMINATION`. We found this particularly important, as needlestack is very sensitive to low allelic fractions, to filter out contamination among samples for pooled exome capture.
+
+In other cases (when there is no `--pairs_file` parameter defined), genotypes are defined as `./.` or `0/0` assuming one is looking for allelic fractions expected for germline variants (negative binomial distribution centered at 0.5 with over-dispersion parameter sigma=`sigma_normal`, with `sigma_normal=0.1` by default). If you are looking for somatic variants without matched-normal and assuming you are interesting to correctly distinguish `./.` and `0/0`genotypes, you can set the `power_min_af` parameter to the lowest allelic fraction of somatic variants you are interested with (and your coverage allows you to find). Note that this is by far not the most common situation, and that in most cases you don't have to worry about the `power_min_af` parameter.
+
+## Notes
+
+### Common variants
+
+Needlestack is made to identify rare variants (i.e. only a few samples in your set of samples have a particular variant), because of the robust regression used. Therefore, common SNPs (>10%) or strong somatic hotspots will be missed. The optional `extra_robust_gl` can overcome partially this issue for common germline mutations: it first discard high allelic fraction (>20%, assuming these are likely true variants) before fitting the regression model when between 10% and 50% of sample have such a high allelic fraction. A flag is written in the VCF `INFO/WARN` field when this happened (`EXTRA_ROBUST_GL`). Additionally, when an other allele than the reference allele is the most common, it is taken as the reference allele and a flag is also written in the VCF (`INFO/WARN=INV_REF`).
+
+### Strand bias
+
+For conventional variant callers, GATK WES/WGS [recommended values](http://gatkforums.broadinstitute.org/discussion/5533/strandoddsratio-computation) for SOR strand bias are SOR < 4 for SNVs and < 10 for indels. We haven't found this particularly useful, but a more detailed evaluation is necessary. For amplicon based targeted sequencing, RVSB>0.85 seems to reduce some erros. There is no hard filter by default as this is easy to do afterward using [bcftools filter](http://samtools.github.io/bcftools/bcftools.html#filter) command.
+
+### Reproducibility
A good practice is to keep (and publish) the `.nextflow.log` file create during the pipeline process, as it contains useful information for reproducibility (and for debugging in case of problem). You should keep the `trace.txt` file containing even more information to keep for records. Nextflow also creates a nice processes execution timeline file (a web page) in `timeline.html`.
+
+## Pipeline execution DAG
+
diff --git a/STATUS_TABLE.png b/STATUS_TABLE.png
new file mode 100644
index 0000000..e864db3
Binary files /dev/null and b/STATUS_TABLE.png differ
diff --git a/STATUS_TABLE.tex b/STATUS_TABLE.tex
new file mode 100755
index 0000000..f228eb3
--- /dev/null
+++ b/STATUS_TABLE.tex
@@ -0,0 +1,130 @@
+%& -shell-escape
+\documentclass[11pt]{article}
+
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{lmodern}
+
+%\usepackage[latin1]{inputenc}
+\usepackage[english]{babel}
+\usepackage[paperwidth=18cm, paperheight=10cm,margin=0.5cm]{geometry}
+\usepackage{helvet}
+\usepackage{times}
+%\usepackage{subfigure}
+\usepackage{enumitem}
+
+%\usepackage{unnumberedcaption}
+\usepackage{multirow}
+\usepackage{graphics,graphicx,epsfig,rotating}
+\usepackage{amsfonts,amsmath,supertabular,tabularx,amstext}
+\usepackage[table]{xcolor}
+\usepackage{color}
+\usepackage{verbatim}
+\usepackage{stmaryrd}
+\usepackage{array}
+\usepackage{subfig}
+%\usepackage[noae]{Sweave}
+
+\usepackage{longtable}
+\usepackage{threeparttable}
+
+\usepackage[]{natbib}
+\bibpunct{(}{)}{;}{author-year}{}{,}
+
+\RequirePackage{fancyvrb}
+\RequirePackage{listings}
+
+\usepackage{hyperref}
+\usepackage[right]{lineno}
+
+\rmfamily
+
+\definecolor{lightgray}{rgb}{0.75,0.75,0.75}
+
+%\oddsidemargin 0cm
+%\evensidemargin 0cm
+%%\textwidth 17.2cm
+%\topmargin 0cm
+%\textheight 10.2cm
+%\textwidth 18cm
+%\marginparsep 0pt
+\renewcommand{\baselinestretch}{1}
+%color
+
+\usepackage{titlesec}
+\titleformat{\section}{\Large\bfseries}{}{0pt}{}
+\titleformat{\subsection}{\Large\itshape}{}{0pt}{}
+
+
+
+
+\begin{document}
+\definecolor{red2}{rgb}{1,0.5,0.5}
+\definecolor{lightred}{rgb}{1,0.8,0.8}
+
+\definecolor{lightblue}{rgb}{0,0.60,0.86}
+
+\definecolor{redblue}{rgb}{0.3,0.7,0.43}
+
+\definecolor{darkgreen}{rgb}{0.1,0.6,0.1}
+\definecolor{Lightgray}{rgb}{0.95,0.95,0.95}
+\definecolor{lightgray}{rgb}{0.75,0.75,0.75}
+\definecolor{darkgray}{rgb}{0.35,0.35,0.35}
+\definecolor{darkgray2}{rgb}{0.90,0.90,0.90}
+\definecolor{darkred}{rgb}{0.6,0.1,0.1}
+
+%\begin{titlepage}
+
+
+
+
+\setcounter{page}{1}
+
+%\tableofcontents
+%\newpage
+
+
+%\newpage
+%%%%%%%%%%%%%%%%% citations %%%%%%%%%%%%%%%%%%%
+\setcounter{equation}{0}
+\setcounter{figure}{0}
+\setcounter{section}{0}
+\renewcommand{\thefigure}{S\arabic{figure}}
+\renewcommand{\thetable}{S\arabic{table}}
+\renewcommand{\theequation}{S\arabic{equation}}
+
+%\newpage
+
+
+\begin{footnotesize}
+\begin{threeparttable}
+\begin{tabular}[htp!]{m{1.4cm} m{1.7cm} m{1.4cm} m{1.7cm} m{4.5cm} m{1.4cm} m{1.4cm}}
+\caption{\textbf{Table of variant STATUS and GT (genotype), as a function of variant detection and the power to detect variants in matched normal and tumor samples.}}\\
+\hline
+\textbf{Variant in Normal} & \textbf{Power to detect variant in Normal} & \textbf{Variant in Tumor} & \textbf{Power to detect variant in Tumor} & \textbf{STATUS} & \textbf{Normal GT} & \textbf{Tumor GT}\\
+\hline
+\rowcolor{lightgray}no & no & no& no& . & ./. & ./. \tnote{*}\\
+no & no & no& \textbf{YES}& . & ./. & 0/0 \tnote{*}\\
+\rowcolor{lightgray}no & no & \textbf{YES}& \textbf{YES} or no & UNKNOWN & ./. & 0/1 or 1/1 \\
+no & \textbf{YES} & no& no& . & 0/0 & ./. \tnote{*}\\
+\rowcolor{lightgray}no & \textbf{YES} & no& \textbf{YES}& . & 0/0 & 0/0 \tnote{*}\\
+no & \textbf{YES} & \textbf{YES}& \textbf{YES} or no & SOMATIC \tnote{\textdagger} & 0/0 & 0/1 or 1/1 \\
+\rowcolor{lightgray}\textbf{YES} & \textbf{YES} or no & no& no& GERMLINE\_UNCONFIRMABLE & 0/1 or 1/1 & ./. \\
+\textbf{YES} & \textbf{YES} or no & no& \textbf{YES}& GERMLINE\_UNCONFIRMED & 0/1 or 1/1 & 0/0 \\
+\rowcolor{lightgray}\textbf{YES} & \textbf{YES} or no & \textbf{YES}& \textbf{YES} or no & GERMLINE\_CONFIRMED & 0/1 or 1/1 & 0/1 or 1/1\\
+\hline
+\label{tab:sumstats}
+\end{tabular}
+\begin{tablenotes}
+\item[*] power is computed using a binomial distribution with mean $power\_min\_af$ (default value is 0.01)
+\item[\textdagger] only in Tumor; Normal STATUS is ``."
+\end{tablenotes}
+
+\end{threeparttable}
+
+\end{footnotesize}
+
+
+
+\end{document}
+
diff --git a/bin/annotate_vcf.r b/bin/annotate_vcf.r
new file mode 100755
index 0000000..6394030
--- /dev/null
+++ b/bin/annotate_vcf.r
@@ -0,0 +1,142 @@
+#! /usr/bin/env Rscript
+
+# Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see .
+
+args <- commandArgs(TRUE)
+parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
+argsL <- as.list(as.character(as.data.frame(do.call("rbind", parseArgs(args)))$V2))
+names(argsL) <- as.data.frame(do.call("rbind", parseArgs(args)))$V1
+args <- argsL;rm(argsL)
+
+if(is.null(args$input_vcf)) {stop("no input VCF file")} else {input_vcf = args$input_vcf}
+if(is.null(args$out_vcf)) {args$out_vcf = out_vcf = paste(gsub(".vcf.bgz","",input_vcf),"_annotated_needlestack.vcf",sep="")} else {out_vcf=args$out_vcf}
+if(nchar(args$out_vcf)==0) {out_vcf = paste(gsub(".vcf.bgz","",input_vcf),"_annotated_needlestack.vcf",sep="")}
+if(is.null(args$chunk_size)) {chunk_size = 1000} else {chunk_size = as.numeric(args$chunk_size)}
+if(is.null(args$do_plots)) {do_plots = TRUE} else {do_plots = as.logical(args$do_plots)}
+if(is.null(args$plot_labels)) {plot_labels = TRUE} else {plot_labels = as.logical(args$plot_labels)}
+if(is.null(args$add_contours)) {add_contours = TRUE} else {add_contours = as.logical(args$add_contours)}
+if(is.null(args$min_coverage)) {min_coverage = 50} else {min_coverage = as.numeric(args$min_coverage)}
+if(is.null(args$min_reads)) {min_reads = 5} else {min_reads = as.numeric(args$min_reads)}
+if(is.null(args$GQ_threshold)) {GQ_threshold=50} else {GQ_threshold = as.numeric(args$GQ_threshold)}
+if(is.null(args$SB_threshold)) {SB_threshold=100} else {SB_threshold = as.numeric(args$SB_threshold)}
+if(is.null(args$extra_rob)) {extra_rob=FALSE} else {extra_rob=as.logical(args$extra_rob)}
+
+source(paste(args$source_path,"glm_rob_nb.r",sep=""))
+source(paste(args$source_path,"plot_rob_nb.r",sep=""))
+library(VariantAnnotation)
+
+#initiate the first chunk
+vcf <- open(VcfFile(input_vcf, yieldSize=chunk_size))
+vcf_chunk = readVcf(vcf, "hg19")
+
+#and continue
+while(dim(vcf_chunk)[1] != 0) {
+ # coverage (matrix of integers)
+ DP_matrix = geno(vcf_chunk,"DP")
+ # AO counts (matrix of lists of integers)
+ AD_matrix = geno(vcf_chunk,"AD")
+
+ #compute regressions and qvals,err,sig
+ reg_list = lapply(1:dim(vcf_chunk)[1], function(var_line) { #for each line of the chunk return a list of reg for each AD
+ # replace NAs and integer(0) by correct number of 0 ADs
+ AD_matrix[var_line, which(is.na(AD_matrix[var_line,]))] = lapply(AD_matrix[var_line, which(is.na(AD_matrix[var_line,]))], function(x) x=as.vector(rep(0,max(lengths(AD_matrix[var_line,])))))
+ AD_matrix[var_line,] = lapply(AD_matrix[var_line,], function(x) {if(length(x)==0) { x=as.vector(rep(0, ifelse(max(lengths(AD_matrix[var_line,]),na.rm = T) >0, max(lengths(AD_matrix[var_line,]),na.rm = T), 2) )) } else {x=x} } )
+ lapply(2:max(lengths(AD_matrix[var_line,])), function(AD_index) { #for each alternative
+ DP=DP_matrix[var_line,]
+ AO=unlist(lapply(AD_matrix[var_line,],"[[",AD_index)) #AD_matrix[var_line,] is a list of AD for each sample, here return list of ADs(i) for alt i
+ if( sum( (AO/DP) > 0.8 , na.rm = T) > 0.5*length(AO) ){ #test reference switching
+ AO = DP - unlist(lapply(1:length(DP), function(i) sum(unlist(AD_matrix[var_line,i])[2:length(unlist(AD_matrix[var_line,i]))]))) #compute AO(ref)
+ inv_ref = T
+ } else { inv_ref = F }
+ reg_res=glmrob.nb(x=DP,y=AO,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob)
+ reg_res$inv_ref = inv_ref
+ if (do_plots) {
+ chr=as.character(seqnames(rowRanges(vcf_chunk,"seqnames"))[var_line])
+ loc=start(ranges(rowRanges(vcf_chunk,"seqnames"))[var_line])
+ ref=as.character(ref(vcf_chunk)[[var_line]])
+ alts=as.character(alt(vcf_chunk)[[var_line]])
+ alts_long_name = alts[nchar(alts)>20] #if long alt, take only extremities with a length depending on the index of the alt
+ alts[nchar(alts)>20]=paste(substr(alts_long_name,1,5+match(alts_long_name,alts)),substr(alts_long_name,nchar(alts_long_name)-(5+match(alts_long_name,alts)),nchar(alts_long_name)),sep="...")
+ alt=paste(alts,collapse = ",")
+ sbs=rep(NA,dim(vcf_chunk)[2])
+ pdf(paste(chr,"_",loc,"_",loc,"_",ref,"_",alt,"_",AD_index-1,ifelse(reg_res$inv_ref,"_inv_ref",""),ifelse(reg_res$extra_rob,"_extra_robust",""),".pdf",sep=""),7,6)
+ plot_rob_nb(reg_res, 10^-(GQ_threshold/10), plot_title=bquote(paste(.(chr),":",.(loc)," (",.(ref) %->% .(alt),"[",.(AD_index-1),"]",")",.(ifelse(reg_res$extra_rob," EXTRA ROBUST","")),sep="")), sbs=sbs, SB_threshold=SB_threshold,plot_labels=T,add_contours=T,names=samples(header(vcf_chunk)))
+ dev.off()
+ }
+ reg_res
+ })
+ })
+ qvals = lapply(reg_list, function(regs) {
+ lapply(regs, function(reg) (unlist(reg["GQ"])+0)) #here add +0 to avoid sprintf wrinting -0
+ })
+ err = lapply(reg_list, function(regs) {
+ lapply(regs, function(reg) unlist(reg$coef["slope"]))
+ })
+ sig = lapply(reg_list, function(regs) {
+ lapply(regs, function(reg) unlist(reg$coef["sigma"]))
+ })
+ extra_robust_gl = lapply(reg_list, function(regs) {
+ lapply(regs, function(reg) unlist(reg$extra_rob)) #if at least on regression at the position is extra_rob
+ })
+ inv_refs = lapply(reg_list, function(regs) {
+ lapply(regs, function(reg) unlist(reg$inv_ref)) #if at least on regression at the position is inv_ref
+ })
+
+ #annotate the header of the chunk
+ info(header(vcf_chunk))["ERR",]=list("A","Float","Error rate estimated by needlestack")
+ info(header(vcf_chunk))["SIG",]=list("A","Float","Dispertion parameter estimated by needlestack")
+ info(header(vcf_chunk))["WARN",]=list("A","Character","Warning message when position is processed specifically by needlestack")
+ geno(header(vcf_chunk))["QVAL",]=list("A","Float","Phred q-values computed by needlestack")
+ geno(header(vcf_chunk))["QVAL_INV",]=list("A","Float","Phred q-values computed by needlestack at a position where ")
+
+ #annotate the chunk with computed values
+ #add WARN INFO field if extra-robust or inverted-reference
+ info(vcf_chunk)$WARN = rep(NA, dim(vcf_chunk)[1])
+ extra_rob_pos = which(unlist(lapply(extra_robust_gl, function(l) Reduce("|",l))==TRUE))
+ info(vcf_chunk)$WARN[extra_rob_pos]=unlist(lapply(extra_rob_pos, function(i) {
+ ex=unlist(extra_robust_gl[i])
+ ex[which(ex==TRUE)]="EXTRA_ROBUST_GL"; ex[which(ex==FALSE)]="."
+ paste(ex, collapse = ",") } ))
+ inv_refs_pos = which(unlist(lapply(inv_refs, function(l) Reduce("|",l))==TRUE))
+ info(vcf_chunk)$WARN[inv_refs_pos]=unlist(lapply(inv_refs_pos, function(i) {
+ ex=unlist(inv_refs[i])
+ ex[which(ex==TRUE)]="INV_REF"; ex[which(ex==FALSE)]="."
+ paste(ex, collapse = ",") } ))
+ inv_refs_extra_rob_pos = which(unlist(lapply(inv_refs, function(l) Reduce("|",l))==TRUE) & unlist(lapply(inv_refs, function(l) Reduce("&",l))==TRUE))
+ info(vcf_chunk)$WARN[inv_refs_extra_rob_pos]=unlist(lapply(inv_refs_extra_rob_pos, function(i) {
+ ex=unlist(inv_refs[i]) #we know that if inv_ref == TRUE then extra_robust = TRUE
+ ex[which(ex==TRUE)]="EXTRA_ROBUST_GL/INV_REF"; ex[which(ex==FALSE)]="."
+ paste(ex, collapse = ",") } ))
+ #compute other fields
+ info(vcf_chunk)$ERR = NumericList(err)
+ info(vcf_chunk)$SIG = NumericList(sig)
+ geno(vcf_chunk)$QVAL = matrix(data = unlist(lapply(1:length(qvals), function(i) {
+ q=qvals[[i]]
+ if(i %in% inv_refs_pos) q=lapply(1:length(q), function(j) { x=q[[j]] ; if(inv_refs[[i]][[j]] == TRUE) x[]=NA ; x }) #replace QVAL by "." if INV_REF
+ as.list(data.frame(t(mapply(c,q))))
+ }),recursive = FALSE), nrow = dim(vcf_chunk)[1], byrow = TRUE)
+ geno(vcf_chunk)$QVAL_INV = matrix(data = unlist(lapply(1:length(qvals), function(i) {
+ q=qvals[[i]]
+ if(i %in% inv_refs_pos) q=lapply(1:length(q), function(j) { x=q[[j]] ; if(inv_refs[[i]][[j]] == FALSE) x[]=NA ; x }) #replace QVAL_INV by "." if not INV_REF
+ as.list(data.frame(t(mapply(c,q))))
+ }),recursive = FALSE), nrow = dim(vcf_chunk)[1], byrow = TRUE)
+
+ #write out the annotated VCF
+ con = file(out_vcf, open = "a")
+ writeVcf(vcf_chunk, con)
+ vcf_chunk = readVcf(vcf, "hg19")
+ close(con)
+}
diff --git a/bin/glm_rob_nb.r b/bin/glm_rob_nb.r
new file mode 100644
index 0000000..1f0e462
--- /dev/null
+++ b/bin/glm_rob_nb.r
@@ -0,0 +1,224 @@
+# Copyright (C) 2017 Matthieu Foll and Tiffany Delhomme
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see .
+
+glmrob.nb <- function(y,x,bounding.func='T/T',c.tukey.beta=5,c.tukey.sig=3,c.by.beta=4,weights.on.x='none',
+ minsig=1e-3,maxsig=10,minmu=1e-10,maxmu=1e5,maxit=30,maxit.sig=50,sig.prec=1e-8,tol=1e-6,
+ n_ai.sig.tukey=100,n_xout=10^4,min_coverage=1,min_reads=1,size_min=10,extra_rob=TRUE,min_af_extra_rob=0.2,min_prop_extra_rob=0.1,max_prop_extra_rob=0.5,...){
+
+ maxmu = max(x)
+
+ if(extra_rob & length(x[which( (y/x) > min_af_extra_rob)]) > min_prop_extra_rob*length(x) & length(x[which( (y/x) > min_af_extra_rob)]) < max_prop_extra_rob*length(x)){
+ extra_rob_out = TRUE
+ x_not_in_reg = x[which( (y/x) > min_af_extra_rob)]
+ y_not_in_reg = y[which( (y/x) > min_af_extra_rob)]
+ pos_not_in_reg = which( (y/x) > min_af_extra_rob)
+ if(sum(!(1:length(x) %in% pos_not_in_reg))>size_min){ #if not enought samples after removing, do not perform extra-robust regression
+ x = x[!(1:length(x) %in% pos_not_in_reg)]
+ y = y[!(1:length(y) %in% pos_not_in_reg)]
+ } else {extra_rob_out = FALSE}
+ } else {extra_rob_out = FALSE}
+
+ if(extra_rob_out) min_reads=0
+
+ if (median(x, na.rm=T)min_coverage, na.rm=T)0)==0 ) {
+ if(extra_rob_out) { #before return NA re-add removed samples
+ coverage = ma_count = rep(0,length(y)+length(pos_not_in_reg))
+ ma_count[setdiff(1:(length(y)+length(pos_not_in_reg)),pos_not_in_reg)]=y
+ coverage[setdiff(1:(length(x)+length(pos_not_in_reg)),pos_not_in_reg)]=x
+ ma_count[pos_not_in_reg]=y_not_in_reg
+ coverage[pos_not_in_reg]=x_not_in_reg
+ y=ma_count; x=coverage
+ }
+ return(res=list("coverage"=x, "ma_count"=y, "coef"=c(sigma=NA,slope=NA), "pvalues"=rep(1,l=length(y)), "qvalues"=rep(1,l=length(y)),"GQ"=rep(0,l=length(y)),"extra_rob"=extra_rob_out))
+ }
+ ### Written by William H. Aeberhard, February 2014
+ ## Disclaimer: Users of these routines are cautioned that, while due care has been taken and they are
+ ## believed accurate, they have not been rigorously tested and their use and results are
+ ## solely the responsibilities of the user.
+ ### Modified by Matthieu Foll and Tiffany Delhomme, 2015 (follm@iarc.fr)
+ #-------------------------------------------------------------------
+ # General set up
+ #-------------------------------------------------------------------
+ yy <- y
+ xx <- x
+ y <- y[which(x>0)]
+ x <- x[which(x>0)]
+ n <- length(y)
+ X <- matrix(log(x))
+ if (dim(X)[1]!=n){stop('length(y) does not match dim(X)[1].')}
+ onevec <- rep(1,n)
+ if (identical(X[,1],onevec)){X <- X[,-1]}
+ res <- list()
+ #-------------------------------------------------------------------
+ # initial estimates: MLEs for beta and sigma
+ #-------------------------------------------------------------------
+ invlink <- function(eta){exp(eta)}
+ derivlink <- function(mu){1/mu}
+ varfunc <- function(mu,sig){mu+sig*mu^2}
+ # initial mu computed through median outliers method
+ af=y/exp(X)
+ af[which(y==0)] <- 0
+ inislope <- mean(af[which(af<=quantile(af,0.75)+1.5*diff(quantile(af,c(0.25,0.75))))])
+ if(inislope<=1/(10*mean(exp(X)))){inislope<-1/(10*mean(exp(X)))}
+ mu <- inislope*exp(X)
+ mu[which(mu>maxmu)] <- maxmu
+ mu[which(mumaxsig){sig <- maxsig}
+ if (sigc.tukey,0,((r/c.tukey)^2-1)^2*r)
+ }
+ E.tukeypsi.1 <- function(mui,sig,c.tukey){
+ sqrtVmui <- sqrt(varfunc(mui,sig))
+ j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
+ j2 <- floor(mui+c.tukey*sqrtVmui)
+ if (j1>j2){0}
+ else {
+ if (j2-j1+1 > 2*n_ai.sig.tukey) {
+ j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
+ yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)*dnbinom(j12,mu=mui,size=1/sig)
+ integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]]/sqrtVmui
+ } else {
+ j12 <- j1:j2
+ sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)*dnbinom(j12,mu=mui,size=1/sig))/sqrtVmui
+ }
+ }
+ }
+ E.tukeypsi.2 <- function(mui,sig,c.tukey){
+ sqrtVmui <- sqrt(varfunc(mui,sig))
+ j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
+ j2 <- floor(mui+c.tukey*sqrtVmui)
+ if (j1>j2){0}
+ else {
+ if (j2-j1+1 > 2*n_ai.sig.tukey) {
+ j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
+ yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)^2*dnbinom(j12,mu=mui,size=1/sig)
+ integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]]/sqrtVmui
+ } else {
+ j12 <- j1:j2
+ sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)^2*dnbinom(j12,mu=mui,size=1/sig))/sqrtVmui
+ }
+ }
+ }
+ ai.sig.tukey <- function(mui,sig,c.tukey){
+ psi.sig.ML.mod <- function(j,mui,invsig){
+ digamma(j+invsig)-digamma(invsig)-log(mui/invsig+1)-(j-mui)/(mui+invsig)
+ }
+ sqrtVmui <- sqrt(mui*(sig*mui+1))
+ invsig <- 1/sig
+ j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
+ j2 <- floor(mui+c.tukey*sqrtVmui)
+ if (j1>j2){0}
+ else {
+ if (j2-j1+1 > 2*n_ai.sig.tukey) {
+ j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
+ yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*psi.sig.ML.mod(j=j12,mui=mui,invsig=invsig)*dnbinom(x=j12,mu=mui,size=invsig)
+ integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]] #xout=j1:j2
+ } else {
+ j12 <- j1:j2
+ sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*psi.sig.ML.mod(j=j12,mui=mui,invsig=invsig)*dnbinom(x=j12,mu=mui,size=invsig))
+ }
+ }
+ }
+ sig.rob.tukey <- function(sig,y,mu,c.tukey){
+ r <- (y-mu)/sqrt(varfunc(mu,sig))
+ wi <- tukeypsi(r=r,c.tukey=c.tukey)/r
+ sum(wi*psi.sig.ML(r=r,mu=mu,sig=sig)-sapply(X=mu,FUN=ai.sig.tukey,sig=sig,c.tukey=c.tukey))
+ }
+ sig0 <- sig+1+tol
+ beta11 <- log(inislope)
+ #beta11 <- log(median(y[which(y<=1.5*quantile(y,0.75))]/x[which(y<=1.5*quantile(y,0.75))])) #slope estimated with median
+ beta00 <- beta11+tol+1
+ it <- 0
+ while((abs(sig-sig0)>tol | abs(max(beta11-beta00))>tol) & itmaxsig){sig <- maxsig}
+ if (sigtol & it.mumaxmu)] <- maxmu
+ mu[which(mu==0)] <- minmu
+ eta <- log(mu)
+ it.mu <- it.mu+1
+ }
+ beta11 <- beta1
+ it <- it+1
+ }
+ #build the result
+ x <- xx
+ y <- yy
+ if(extra_rob_out) {
+ coverage = ma_count = rep(0,length(y)+length(pos_not_in_reg))
+ ma_count[setdiff(1:(length(y)+length(pos_not_in_reg)),pos_not_in_reg)]=y
+ coverage[setdiff(1:(length(x)+length(pos_not_in_reg)),pos_not_in_reg)]=x
+ ma_count[pos_not_in_reg]=y_not_in_reg
+ coverage[pos_not_in_reg]=x_not_in_reg
+ y=ma_count; x=coverage
+ }
+ res$coverage <- x
+ res$ma_count <- y
+ res$coef <- c(sigma=sig,slope=exp(beta1[[1]]))
+ if(res$coef[[2]] > 1) res$coef[[2]] = 1
+ res$pvalues <- dnbinom(y,size=1/res$coef[[1]],mu=res$coef[[2]]*x) + pnbinom(y,size=1/res$coef[[1]],mu=res$coef[[2]]*x,lower.tail = F)
+ res$qvalues=p.adjust(res$pvalues,method="BH")
+ res$GQ=-log10(res$qvalues)*10
+ res$GQ[res$GQ>1000]=1000 #here also manage qvalues=Inf
+ res$extra_rob=extra_rob_out #did we exclude samples for fitting?
+ } else {stop('Available bounding.func is "T/T"')}
+ return(res)
+}
diff --git a/bin/needlestack.r b/bin/needlestack.r
index 7cd2c31..8e60d6b 100755
--- a/bin/needlestack.r
+++ b/bin/needlestack.r
@@ -1,375 +1,21 @@
#! /usr/bin/env Rscript
# needlestack: a multi-sample somatic variant caller
-# Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme
+# Copyright (C) 2017 Matthieu Foll, Tiffany Delhomme and Nicolas Alcala
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
-#
+#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
-#
+#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see .
-glmrob.nb <- function(y,x,bounding.func='T/T',c.tukey.beta=5,c.tukey.sig=3,c.by.beta=4,weights.on.x='none',
- minsig=1e-3,maxsig=10,minmu=1e-10,maxmu=1e5,maxit=30,maxit.sig=50,sig.prec=1e-8,tol=1e-6,
- n_ai.sig.tukey=100,n_xout=10^4,min_coverage=1,min_reads=1,size_min=10,...){
-
- if (max(x)0)]
- x <- x[which(x>0)]
- if(length(x)<10) return(res=list("coverage"=x, "ma_count"=y, "coef"=c(sigma=NA,slope=NA), "pvalues"=rep(1,l=length(y)), "qvalues"=rep(1,l=length(y)),"GQ"=rep(0,l=length(y))))
- n <- length(y)
- X <- matrix(log(x))
- if (dim(X)[1]!=n){stop('length(y) does not match dim(X)[1].')}
- onevec <- rep(1,n)
- if (identical(X[,1],onevec)){X <- X[,-1]}
- res <- list()
- #-------------------------------------------------------------------
- # initial estimates: MLEs for beta and sigma
- #-------------------------------------------------------------------
- invlink <- function(eta){exp(eta)}
- derivlink <- function(mu){1/mu}
- varfunc <- function(mu,sig){mu+sig*mu^2}
- # initial mu computed through median outliers method
- af=y/exp(X)
- af[which(y==0)] <- 0
- inislope <- mean(af[which(af<=quantile(af,0.75)+1.5*diff(quantile(af,c(0.25,0.75))))])
- if(inislope<=1/(10*mean(exp(X)))){inislope<-1/(10*mean(exp(X)))}
- mu <- inislope*exp(X)
- mu[which(mu>maxmu)] <- maxmu
- mu[which(mumaxsig){sig <- maxsig}
- if (sigc.tukey,0,((r/c.tukey)^2-1)^2*r)
- }
- E.tukeypsi.1 <- function(mui,sig,c.tukey){
- sqrtVmui <- sqrt(varfunc(mui,sig))
- j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
- j2 <- floor(mui+c.tukey*sqrtVmui)
- if (j1>j2){0}
- else {
- if (j2-j1+1 > 2*n_ai.sig.tukey) {
- j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
- yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)*dnbinom(j12,mu=mui,size=1/sig)
- integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]]/sqrtVmui
- } else {
- j12 <- j1:j2
- sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)*dnbinom(j12,mu=mui,size=1/sig))/sqrtVmui
- }
- }
- }
- E.tukeypsi.2 <- function(mui,sig,c.tukey){
- sqrtVmui <- sqrt(varfunc(mui,sig))
- j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
- j2 <- floor(mui+c.tukey*sqrtVmui)
- if (j1>j2){0}
- else {
- if (j2-j1+1 > 2*n_ai.sig.tukey) {
- j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
- yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)^2*dnbinom(j12,mu=mui,size=1/sig)
- integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]]/sqrtVmui
- } else {
- j12 <- j1:j2
- sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*(j12-mui)^2*dnbinom(j12,mu=mui,size=1/sig))/sqrtVmui
- }
- }
- }
- ai.sig.tukey <- function(mui,sig,c.tukey){
- psi.sig.ML.mod <- function(j,mui,invsig){
- digamma(j+invsig)-digamma(invsig)-log(mui/invsig+1)-(j-mui)/(mui+invsig)
- }
- sqrtVmui <- sqrt(mui*(sig*mui+1))
- invsig <- 1/sig
- j1 <- max(c(ceiling(mui-c.tukey*sqrtVmui),0))
- j2 <- floor(mui+c.tukey*sqrtVmui)
- if (j1>j2){0}
- else {
- if (j2-j1+1 > 2*n_ai.sig.tukey) {
- j12 <- round(seq(j1,j2,l=n_ai.sig.tukey))
- yj12 <- (((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*psi.sig.ML.mod(j=j12,mui=mui,invsig=invsig)*dnbinom(x=j12,mu=mui,size=invsig)
- integrate(splinefun(x = j12, y=yj12,method = "natural"), lower = min(j12), upper = max(j12)+1)[[1]] #xout=j1:j2
- } else {
- j12 <- j1:j2
- sum((((j12-mui)/(c.tukey*sqrtVmui))^2-1)^2*psi.sig.ML.mod(j=j12,mui=mui,invsig=invsig)*dnbinom(x=j12,mu=mui,size=invsig))
- }
- }
- }
- sig.rob.tukey <- function(sig,y,mu,c.tukey){
- r <- (y-mu)/sqrt(varfunc(mu,sig))
- wi <- tukeypsi(r=r,c.tukey=c.tukey)/r
- sum(wi*psi.sig.ML(r=r,mu=mu,sig=sig)-sapply(X=mu,FUN=ai.sig.tukey,sig=sig,c.tukey=c.tukey))
- }
- sig0 <- sig+1+tol
- beta11 <- log(inislope)
- #beta11 <- log(median(y[which(y<=1.5*quantile(y,0.75))]/x[which(y<=1.5*quantile(y,0.75))])) #slope estimated with median
- beta00 <- beta11+tol+1
- it <- 0
- while((abs(sig-sig0)>tol | abs(max(beta11-beta00))>tol) & itmaxsig){sig <- maxsig}
- if (sigtol & it.mumaxmu)] <- maxmu
- mu[which(mu==0)] <- minmu
- eta <- log(mu)
- it.mu <- it.mu+1
- }
- beta11 <- beta1
- it <- it+1
- }
- #build the result
- x <- xx
- y <- yy
- res$coverage <- x
- res$ma_count <- y
- res$coef <- c(sigma=sig,slope=exp(beta1[[1]]))
- res$pvalues <- dnbinom(y,size=1/res$coef[[1]],mu=res$coef[[2]]*x) + pnbinom(y,size=1/res$coef[[1]],mu=res$coef[[2]]*x,lower.tail = F)
- res$qvalues=p.adjust(res$pvalues,method="BH")
- res$GQ=-log10(res$qvalues)*10
- } else {stop('Available bounding.func is "T/T"')}
- return(res)
-}
-
-plot_rob_nb <- function(rob_nb_res,qthreshold=0.01,plot_title=NULL,sbs,SB_threshold=Inf,names=NULL,plot_labels=FALSE,add_contours=FALSE){
- n=sum(rob_nb_res$qvalue>qthreshold)
- m=sum(rob_nb_res$qvalue<=qthreshold)
-
- cut_max_qvals=100
- cols=rep("black",length(rob_nb_res$coverage))
- palette=rev(rainbow(cut_max_qvals+1,start=0, end=4/6))
- col_indices = round(-10*log10(rob_nb_res$qvalues))+1
- col_indices[which(col_indices>cut_max_qvals)]=cut_max_qvals+1
- cols=palette[col_indices]
- outliers_color=cols
- outliers_color[which(sbs>SB_threshold)]="black"
-
- temp_title = bquote(e==.(format(rob_nb_res$coef[[2]],digits = 2)) ~ "," ~ sigma==.(format(rob_nb_res$coef[[1]],digits = 2)))
- plot(rob_nb_res$coverage, rob_nb_res$ma_count,
- pch=21,bg=cols,col=outliers_color,xlab="Coverage (DP)",ylab="Number of ALT reads (AO)",
- main=plot_title, xlim=c(0,max(rob_nb_res$coverage)),ylim=c(0,max(rob_nb_res$ma_count)))
- mtext(temp_title)
- #### labeling outliers
- if(!is.null(names) & plot_labels & length(names[which(rob_nb_res$qvalues<=qthreshold)]) > 0) {
- text(rob_nb_res$coverage[which(rob_nb_res$qvalues<=qthreshold)], rob_nb_res$ma_count[which(rob_nb_res$qvalues<=qthreshold)],
- labels=names[which(rob_nb_res$qvalues<=qthreshold)], cex= 0.6, pos=1)
- }
-
- #### plot the color palette
- plot_palette <- function(topright=FALSE) {
- xmin <- par("usr")[1]
- xmax <- par("usr")[2]
- ymin <- par("usr")[3]
- ymax <- par("usr")[4]
- xright=xmin+(xmax-xmin)*(1-0.9)
- xleft=xmin+(xmax-xmin)*(1-0.94)
- if(topright){xright=xmin+(xmax-xmin)*0.9;xleft=xmin+(xmax-xmin)*0.94}
- ybottom=ymin+(ymax-ymin)*0.72
- ytop=ymin+(ymax-ymin)*0.94
-
- rasterImage(as.raster(matrix(rev(palette), ncol=1)),xright ,ybottom ,xleft,ytop )
- rect(xright ,ybottom ,xleft,ytop )
- text(x=(xright+xleft)/2, y = ytop+(ytop-ybottom)*0.1, labels = "QVAL", cex=0.8)
- keep_labels=seq(0,cut_max_qvals,by=20)
- keep_labels_pos=seq(ybottom,ytop,l=length(keep_labels))
- tick_width=-(xleft-xright)/5
- for (i in 1:length(keep_labels)) {
- if (topright) {
- lines(c(xright,xright+tick_width),c(keep_labels_pos[i],keep_labels_pos[i]))
- } else {
- lines(c(xleft,xleft-tick_width),c(keep_labels_pos[i],keep_labels_pos[i]))
- }
- }
- if(topright) {
- text(x=xright+1.5*tick_width, y = keep_labels_pos, labels = keep_labels,adj = c(1,0.5), cex = 0.8)
- } else {
- text(x=(xright-(xright-xleft))*0.3, y = keep_labels_pos, labels = keep_labels,adj = c(1,0.5), cex = 0.8)
- }
- }
- plot_palette()
- ####
-
- if (!is.na(rob_nb_res$coef["slope"])) {
- ################### ADD CONTOURS ##################
- max_nb_grid_pts = 2500
- max_qvalue=100
- qlevels = c(10,30,50,70,100)
- # following function returns a qvalue for a new point by adding it in the set of observations, recomputing all qvalues and taking its corresponding value.
- toQvalue <- function(x,y){
- unlist(-10*log10(p.adjust((dnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x)) +
- pnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x),lower.tail = F)))
- [length(rob_nb_res$coverage)+1]))
- }
- #here we compute the dimension of the qvalue grid (ylength*xlength), with min(ylength)=5 (this avoids a too "flat" grid)
- #if needed to sampling (too large grid if dimension=max(AO)*max(DP)), we verify two equations: equality of ratios ylength/xlength before and after sampling and ylength*xlength=max_nb_grid_pts
- #### compute zoom y limit
- if(add_contours){
- nb_pts_zoom_computation=1000
- if(max(rob_nb_res$coverage) > nb_pts_zoom_computation){
- maxDP_AO = unique(sort(c(round(max(rob_nb_res$coverage)*rbeta(nb_pts_zoom_computation,1,100)),runif(100,1,max(rob_nb_res$coverage)))))
- } else {
- maxDP_AO = seq(1,max(rob_nb_res$coverage),by=1)
- }
- maxDP_qvals = unlist(lapply(maxDP_AO,function(AO){ toQvalue(x=max(rob_nb_res$coverage),y=AO) }))
- af_min_lim = log10(maxDP_AO[which(maxDP_qvals>=qlevels[1])[1]]/max(rob_nb_res$coverage))
- ylim_zoom = maxDP_AO[which(maxDP_qvals>=max_qvalue)[1]]
- ylim_zoom_cor=ifelse(is.na(ylim_zoom),max(rob_nb_res$ma_count),ylim_zoom)
- if(!is.na(ylim_zoom)){ #ylim_zoom is na iff we found at least one qvalue >= max_qvalue (if error rate closed to 1, only qvalues closed to 0)
- #### compute dim of the qvalue grid
- if(ylim_zoom*max(rob_nb_res$coverage) <= max_nb_grid_pts){
- xgrid = seq(0,max(rob_nb_res$coverage), by=1)
- ygrid = seq(0,ylim_zoom,by=1) #use by=1 to have integer, if not dnbinom not happy
- } else {
- if(ylim_zoom<=50){
- ygrid=seq(0,ylim_zoom,by=1)
- xlength = round(max_nb_grid_pts/length(ygrid))
- xgrid = round(seq(0,max(rob_nb_res$coverage),length=xlength))
- } else {
- ylength = 50
- xlength = round(max_nb_grid_pts/ylength)
- xgrid = round(seq(0,max(rob_nb_res$coverage),length=xlength))
- ygrid = round(seq(0,ylim_zoom,length=ylength))
- }
- }
- #here we initiate the grid with each case containing list=(DP,AO) from xgrid, ygrid
- matgrid = array(as.list(as.data.frame(t(expand.grid(xgrid,ygrid)))),dim=c(length(xgrid),length(ygrid)))
- #then we fill in the grid with qvalues for each pair of AO,DP taken from ygrid,xgrid vectors (we use initiated values to identify corresponding AO,DP). Finally we plot the contours.
- matgrid=matrix(sapply(matgrid,function(case) toQvalue(unlist(case)[1],unlist(case)[2])), length(xgrid),length(ygrid))
- #### plot the contour "by hands"
- for(qvalue in qlevels) {
- lines(xgrid, unlist(lapply(xgrid,function(DP,ygrid,xgrid){
- qval=min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
- AO=min(ygrid[which(matgrid[match(DP,xgrid),]==qval)])
- AO },ygrid,xgrid)),col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)],lwd=1.3,lty=3)
- }
- }
- }
- #### plot confidence interval + error rate
- xi=max(rob_nb_res$coverage)
- yi1=qnbinom(p=0.99, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*xi)
- yi2=qnbinom(p=0.01, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*xi)
- abline(a=0, b=yi1/xi, lwd=2, lty=3, col="blue")
- abline(a=0, b=yi2/xi, lwd=2, lty=3, col="blue")
- abline(a=0, b=rob_nb_res$coef[[2]], col="blue")
- #### plot zoom on max qvalue if add_contours, otherwise on 2*IC
- if(!add_contours) ylim_zoom_cor = 2*yi1
- plot(rob_nb_res$coverage, rob_nb_res$ma_count,
- pch=21,bg=cols,col=outliers_color,xlab="Coverage (DP)",ylab="Number of ALT reads (AO)",
- main=plot_title, ylim=c(0,ylim_zoom_cor), xlim=c(0,max(rob_nb_res$coverage)))
- #### plot the contour "by hands"
- if(add_contours){
- if(!is.na(ylim_zoom)){
- for(qvalue in qlevels) {
- lines(xgrid, unlist(lapply(xgrid,function(DP,ygrid,xgrid){
- qval=min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
- AO=min(ygrid[which(matgrid[match(DP,xgrid),]==qval)])
- AO },ygrid,xgrid)),col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)],lwd=1.3,lty=3)
- }
- }
- }
- #contour(xgrid, ygrid, matgrid, levels=qlevels , col = rev(rainbow(length(qlevels),start=0, end=4/6)), add=T, lwd = 1.3, labcex = 0.8, lty=3)
- mtext(paste("zoom on maximum q-value =",max_qvalue))
- #### labeling outliers and plot confidence interval + error rate
- if(!is.null(names) & plot_labels & length(names[which(rob_nb_res$qvalues<=qthreshold)]) > 0) {
- text(rob_nb_res$coverage[which(rob_nb_res$qvalues<=qthreshold)], rob_nb_res$ma_count[which(rob_nb_res$qvalues<=qthreshold)],
- labels=names[which(rob_nb_res$qvalues<=qthreshold)], cex= 0.6, pos=1)
- }
- abline(a=0, b=yi1/xi, lwd=2, lty=3, col="blue")
- abline(a=0, b=yi2/xi, lwd=2, lty=3, col="blue")
- abline(a=0, b=rob_nb_res$coef[[2]], col="blue")
- plot_palette()
-
- logqvals=log10(rob_nb_res$qvalues)
- logqvals[logqvals<=-9]=-9
- plot(logqvals,rob_nb_res$ma_count/rob_nb_res$coverage,pch=21,bg=cols,col=outliers_color,ylab="Allelic fraction (AF)",xlab=bquote("log"[10] ~ "(q-value)"),main="Allelic fraction effect")
- abline(v=log10(qthreshold),col="red",lwd=2)
- plot_palette(topright = TRUE)
- ylim_zoom_af = ifelse(add_contours, ylim_zoom_cor/max(rob_nb_res$coverage), (2*yi1)/xi)
- plot(logqvals,rob_nb_res$ma_count/rob_nb_res$coverage,pch=21,bg=cols,col=outliers_color,ylab="Allelic fraction (AF)",xlab=bquote("log"[10] ~ "(q-value)"),main="Allelic fraction effect",
- ylim=c(0,ylim_zoom_af))
- if(add_contours) { mtext(paste("zoom on maximum q-value =",max_qvalue)) } else { mtext("zoom on 99% confidence interval") }
- abline(v=log10(qthreshold),col="red",lwd=2)
- plot_palette(topright = TRUE)
- if(add_contours){
- if(!is.na(ylim_zoom)){
- #### plot min(af) ~ DP
- plot(1,type='n', ylim=c(1.1*af_min_lim,0), xlim=range(xgrid), xlab="DP", ylab=bquote("log"[10] ~ "[min(AF)]"))
- for(qvalue in qlevels) {
- lines(xgrid, unlist(lapply(xgrid,function(DP,ygrid,xgrid){
- qval=min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
- af=min(ygrid[which(matgrid[match(DP,xgrid),]==qval)]) / DP
- if(DP==0 || af>1) { af=1 } #af>1 if min(...)>DP
- log10(af) },ygrid,xgrid)),col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)])
- }
- plot_palette(topright = TRUE)
- #hist(rob_nb_res$pvalues,main="p-values distribution",ylab="Density",xlab="p-value",col="grey",freq=T,br=20,xlim=c(0,1))
- #hist(rob_nb_res$qvalues,main="q-values distribution",breaks=20,xlab="q-value",col="grey",freq=T,xlim=c(0,1))
- }
- }
- }
-}
-
############################## ARGUMENTS SECTION ##############################
args <- commandArgs(TRUE)
parseArgs <- function(x) strsplit(sub("^--", "", x), "=")
@@ -377,36 +23,41 @@ argsL <- as.list(as.character(as.data.frame(do.call("rbind", parseArgs(args)))$V
names(argsL) <- as.data.frame(do.call("rbind", parseArgs(args)))$V1
args <- argsL;rm(argsL)
-if("--help" %in% args | is.null(args$out_file) | is.null(args$fasta_ref)) {
+if("--help" %in% args | is.null(args$out_file) | is.null(args$fasta_ref) | is.null(args$source_path) ) {
cat("
The R Script arguments_section.R
-
+
Mandatory arguments:
--out_file=file_name - name of output vcf
+ --source_path=path - path to source files (glm_rob_nb.r, plot_rob_nb.r)
--fasta_ref=path - path of fasta ref
--help - print this text
-
+
Optionnal arguments:
--samtools=path - path of samtools, default=samtools
- --SB_type=SOR or RVSB - strand bias measure, default=SOR
+ --SB_type=SOR, RVSB or FS - strand bias measure, default=SOR
--SB_threshold_SNV=value - strand bias threshold for SNV, default=100
--SB_threshold_indel=value - strand bias threshold for indel, default=100
--min_coverage=value - minimum coverage in at least one sample to consider a site, default=50
--min_reads=value - minimum number of non-ref reads in at least one sample to consider a site, default=5
--GQ_threshold=value - phred scale qvalue threshold for variants, default=50
- --output_all_SNVs=boolean - output all SNVs, even when no variant is detected, default=FALSE
- --do_plots=boolean - output regression plots, default=TRUE
-
+ --output_all_SNVs=boolean - output all SNVs, even when no variant is detected, default=FALSE
+ --do_plots=boolean - output regression plots, default=TRUE
+ --extra_rob=boolean - perform an extra-robust regression, default=FALSE
+ --pairs_file=file_name - name of file containing the list of matched Tumor/Normal pairs for somatic variant calling
+ --afmin_power=value - minimum allelic fraction in mean for somatic mutations, default=0.01
+ --sigma=value - sigma parameter for negative binomial modeling germline mutations, default=0.1
+
WARNING : by default samtools has to be in your path
-
+
Example:
- pileup_nbrr_caller_vcf.r --out_file=test.vcf --fasta_ref=~/Documents/References/ \n\n")
-
+ needlestack.r --out_file=test.vcf --fasta_ref=~/Documents/References/ \n\n")
+
q(save="no")
}
if(is.null(args$samtools)) {args$samtools="samtools"}
-if(is.null(args$SB_type)) {args$SB_type="SOR"}
+if(is.null(args$SB_type)) {args$SB_type="SOR"}
if(is.null(args$SB_threshold_SNV)) {args$SB_threshold_SNV=100} else {args$SB_threshold_SNV=as.numeric(args$SB_threshold_SNV)}
if(is.null(args$SB_threshold_indel)) {args$SB_threshold_indel=100} else {args$SB_threshold_indel=as.numeric(args$SB_threshold_indel)}
if(is.null(args$min_coverage)) {args$min_coverage=50} else {args$min_coverage=as.numeric(args$min_coverage)}
@@ -416,6 +67,10 @@ if(is.null(args$output_all_SNVs)) {args$output_all_SNVs=FALSE} else {args$output
if(is.null(args$do_plots)) {args$do_plots=TRUE} else {args$do_plots=as.logical(args$do_plots)}
if(is.null(args$plot_labels)) {args$plot_labels=FALSE} else {args$plot_labels=as.logical(args$plot_labels)}
if(is.null(args$add_contours)) {args$add_contours=FALSE} else {args$add_contours=as.logical(args$add_contours)}
+if(is.null(args$extra_rob)) {args$extra_rob=FALSE} else {args$extra_rob=as.logical(args$extra_rob)}
+if(is.null(args$pairs_file)) {args$pairs_file=FALSE}
+if(is.null(args$afmin_power)) {args$afmin_power=-1} else {args$afmin_power=as.numeric(args$afmin_power)}
+if(is.null(args$sigma)) {args$sigma=0.1} else {args$sigma=as.numeric(args$sigma)}
samtools=args$samtools
out_file=args$out_file
@@ -432,6 +87,14 @@ output_all_SNVs=args$output_all_SNVs
do_plots=args$do_plots
plot_labels=args$plot_labels
add_contours=args$add_contours
+extra_rob=args$extra_rob
+pairs_file=args$pairs_file
+sigma=args$sigma
+afmin_power=args$afmin_power
+# in case the argument is not given
+
+source(paste(args$source_path,"glm_rob_nb.r",sep=""))
+source(paste(args$source_path,"plot_rob_nb.r",sep=""))
############################################################
@@ -440,7 +103,24 @@ options("scipen"=100)
indiv_run=read.table("names.txt",stringsAsFactors=F,colClasses = "character")
indiv_run[,2]=make.unique(indiv_run[,2],sep="_")
-pileups_files=paste(indiv_run[,1],".txt",sep="")
+isTNpairs = FALSE #activates Tumor-Normal pairs mode
+if(pairs_file != FALSE) { #if user gives a pairs_file to needlestack
+ isTNpairs = file.exists(pairs_file) #checks existence of tumour-normal pairs file => will be redundant once this is checked in the workflow
+ if( isTNpairs ){
+ if(afmin_power==-1) afmin_power = 0.1 #put default value
+ pairsname = scan(pairs_file,nmax = 2,what = "character")
+ TNpairs=read.table(pairs_file,h=T)
+ names(TNpairs)[grep("TU",pairsname,ignore.case =T)] = "TUMOR" #set columns names (to avoid problems due to spelling variations or typos)
+ names(TNpairs)[grep("NO",pairsname,ignore.case =T)] = "NORMAL"
+ onlyNindex = which( sapply(indiv_run[,2], function(x) x%in%TNpairs$NORMAL[is.na(TNpairs$TUMOR)] ) )
+ onlyTindex = which( sapply(indiv_run[,2], function(x) x%in%TNpairs$TUMOR[is.na(TNpairs$NORMAL)] ) )
+ TNpairs.complete = TNpairs[!(is.na(TNpairs$TUMOR)|is.na(TNpairs$NORMAL) ),] # all complete T-N pairs
+ Tindex = sapply( 1:nrow(TNpairs.complete) , function(k) return(which( indiv_run[,2]==TNpairs.complete$TUMOR[k])) )
+ Nindex = sapply( 1:nrow(TNpairs.complete) , function(k) return(which( indiv_run[,2]==TNpairs.complete$NORMAL[k])) )
+ }
+}
+
+pileups_files=paste("TABLE/",indiv_run[,1],".txt",sep="")
nindiv=nrow(indiv_run)
npos=length(readLines(pileups_files[1]))-1
@@ -502,17 +182,33 @@ common_annot=function() {
all_sor<<-SOR(sum(Rp),sum(Vp),sum(Rm),sum(Vm))
if (is.na(all_sor)) all_sor<<- (-1)
if (is.infinite(all_sor)) all_sor<<- 99
- #FisherStrand<<-fisher.test(matrix(c(Vp,Vm,Rp,Rm),nrow=2))$p.value
+ FisherStrand<<- -10*log10(unlist(lapply(1:nindiv, function(indiv) fisher.test(matrix(c(Vp[indiv],Vm[indiv],Rp[indiv],Rm[indiv]), nrow=2))$p.value ))) #col1=alt(V), col2=ref(R)
+ FisherStrand[which(FisherStrand>1000)] = 1000
+ FisherStrand_all<<--10*log10(fisher.test(matrix(c(sum(Vp),sum(Vm),sum(Rp),sum(Rm)), nrow=2))$p.value)
+ if(FisherStrand_all>1000) FisherStrand_all=1000
all_RO<<-sum(Rp+Rm)
}
-if(file.exists(out_file)) file.remove(out_file)
+## functions to compute the Qvalue of a function with a given AF
+toQvalueN <- function(x,rob_nb_res,sigma){ #change sigma value to change the departure from binomial distribution with parameter 0.5
+ y = qnbinom(0.01,size = 1/sigma,mu = 0.5*x)
+ unlist(-10*log10(p.adjust((dnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x)) +
+ pnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x),lower.tail = F)))[length(rob_nb_res$coverage)+1]))
+}
+
+toQvalueT <- function(x,rob_nb_res,afmin_power){ #change afmin_power to
+ y = qbinom(0.01,x,afmin_power,lower.tail = T)
+ unlist(-10*log10(p.adjust((dnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x)) +
+ pnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x),lower.tail = F)))[length(rob_nb_res$coverage)+1]))
+}
+
+if(file.exists(out_file)) file.remove(out_file)
write_out=function(...) {
cat(paste(...,sep=""),"\n",file=out_file,sep="",append=T)
}
write_out("##fileformat=VCFv4.1")
write_out("##fileDate=",format(Sys.Date(), "%Y%m%d"))
-write_out("##source=needlestack v0.3")
+write_out("##source=needlestack v1.0")
write_out("##reference=",fasta_ref)
write_out("##phasing=none")
write_out("##filter=\"QVAL > ",GQ_threshold," & ",SB_type,"_SNV < ",SB_threshold_SNV," & ",SB_type,"_INDEL < ",SB_threshold_indel," & min(AO) >= ",min_reads," & min(DP) >= ",min_coverage,"\"")
@@ -529,37 +225,96 @@ write_out("##INFO=")
write_out("##INFO=")
write_out("##INFO=")
+write_out("##INFO=")
write_out("##INFO=")
write_out("##INFO=")
write_out("##INFO=")
+write_out("##INFO=")
write_out("##FORMAT=")
-write_out("##FORMAT=")
-write_out("##FORMAT=")
+write_out("##FORMAT=")
+write_out("##FORMAT=")
+write_out("##FORMAT=")
write_out("##FORMAT=")
write_out("##FORMAT=")
write_out("##FORMAT=")
write_out("##FORMAT=")
write_out("##FORMAT=")
write_out("##FORMAT=")
+write_out("##FORMAT=")
+write_out("##FORMAT=")
+write_out("##FORMAT=")
+write_out("##FORMAT=")
write_out("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t",paste(indiv_run[,2],collapse = "\t"))
for (i in 1:npos) {
- if (is.element(pos_ref[i,"ref"],c("A","T","C","G"))) {
+ if (is.element(pos_ref[i,"ref"],c("A","T","C","G"))) {
# SNV
for (alt in non_ref_bases(pos_ref[i,"ref"])) {
+ ref=pos_ref[i,"ref"]
Vp=atcg_matrix[i,eval(as.name(paste(alt,"_cols",sep="")))]
Vm=atcg_matrix[i,eval(as.name(paste(tolower(alt),"_cols",sep="")))]
ma_count=Vp+Vm
DP=coverage_matrix[i,]
- reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads)
+ if( sum( (ma_count/DP) > 0.8 , na.rm = T) > 0.5*length(ma_count) ){ #here we need to reverse alt and ref for the regression (use ma_count of the ref)
+ alt_inv=pos_ref[i,"ref"]
+ Vp_inv=atcg_matrix[i,eval(as.name(paste(alt_inv,"_cols",sep="")))]
+ Vm_inv=atcg_matrix[i,eval(as.name(paste(tolower(alt_inv),"_cols",sep="")))]
+ ma_count_inv=Vp_inv+Vm_inv
+ ref_inv=TRUE
+ reg_res=glmrob.nb(x=DP,y=ma_count_inv,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob)
+ } else { ref_inv=FALSE; reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob) }
+
+ # compute Qval for minAF
+ qval_minAF= rep(0,nindiv)
+ somatic_status = rep(".",nindiv)
+ if(isTNpairs){#pairs file supplied
+ qval_minAF[Nindex] = sapply(1:length(Nindex),function(ii) toQvalueN(DP[Nindex][ii],reg_res,sigma) )
+ if(sum(reg_res$GQ[Nindex]>GQ_threshold)>0) qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueN(DP[Tindex][ii],reg_res,sigma) ) #GERMLINE VARIANT, check if power to CONFIRM in tumors
+ else qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueT(DP[Tindex][ii],reg_res,afmin_power) ) #no germline variant, check if power to call SOMATIC in tumors
+ if( length(onlyNindex)>0 ) qval_minAF[onlyNindex] = sapply(1:length(onlyNindex),function(ii) toQvalueN(DP[onlyNindex][ii],reg_res,sigma) )
+ if( length(onlyTindex)>0 ) qval_minAF[onlyTindex] = sapply(1:length(onlyTindex),function(ii) toQvalueT(DP[onlyTindex][ii],reg_res,afmin_power) )
+ #no matching normal -> UNKNOWN (impossible to call somatic status)
+ somatic_status[onlyTindex][(reg_res$GQ[onlyTindex] > GQ_threshold) ] = "UNKNOWN"
+ #no matching tumor -> GERMLINE_UNCONFIRMABLE (impossible to call somatic status)
+ somatic_status[onlyNindex][(reg_res$GQ[onlyNindex] > GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ #no tumor variant, no normal variant despite good power in both -> ".", confirmed; already set by default
+ #no tumor variant, no normal variant but low power in N and T-> ".", unknown; already set by default
+ #no tumor variant, no normal variant but low power in N or T-> ".", unconfirmed ; already set by default
+ #no tumor variant, normal variant with good power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMED"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ #no tumor variant, normal variant with low power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+
+ #tumor variant, no normal variant but without enough power -> UNKNOWN for both Tumor and Normal
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex] GQ_threshold)&(qval_minAF[Nindex] GERMLINE_CONFIRMED
+ somatic_status[Tindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ somatic_status[Nindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ #tumor variant, no normal variant despite good power -> SOMATIC
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex]>GQ_threshold)&(reg_res$GQ[Nindex]0 ) somatic_status[Tindex][somatic_status[Tindex] == "SOMATIC"] = paste("POSSIBLE_CONTAMINATION_FROM", paste(indiv_run[wh.germ,2],sep="_",collapse="_"),sep="_")
+ }else{# no pairs file supplied
+ if(afmin_power==-1 ){#no minimum frequency supplied -> use a negative binomial distribution to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueN(DP[ii],reg_res,sigma) )
+ }else{#minimum frequency supplied -> use a binomial distribution centered around afmin_power to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueT(DP[ii],reg_res,afmin_power) )
+ }
+ }
+
if (output_all_SNVs | (!is.na(reg_res$coef["slope"]) & sum(reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0)) {
all_AO=sum(ma_count)
all_DP=sum(coverage_matrix[i,])
common_annot()
all_RO=sum(Rp+Rm)
- if (SB_type=="SOR") sbs=sors else sbs=rvsbs
+ if (SB_type=="SOR") sbs=sors else { if (SB_type=="RVSB") sbs=rvsbs else {sbs=FisherStrand} }
if (output_all_SNVs | (sum(sbs<=SB_threshold_SNV & reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0)) {
con=pipe(paste(samtools," faidx ",fasta_ref," ",pos_ref[i,"chr"],":",pos_ref[i,"loc"]-3,"-",pos_ref[i,"loc"]-1," | tail -n1",sep=""))
before=readLines(con)
@@ -569,20 +324,26 @@ for (i in 1:npos) {
close(con)
cat(pos_ref[i,"chr"],"\t",pos_ref[i,"loc"],"\t",".","\t",pos_ref[i,"ref"],"\t",alt,"\t",max(reg_res$GQ),"\t",".",sep = "",file=out_file,append=T)
# INFO field
- cat("\t","TYPE=snv;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),sep="",file=out_file,append=T)
+ cat("\t","TYPE=snv;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";FS=",FisherStrand_all,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),ifelse(reg_res$extra_rob & !(ref_inv),";WARN=EXTRA_ROBUST_GL",""),ifelse(reg_res$extra_rob & ref_inv,";WARN=EXTRA_ROBUST_GL/INV_REF",""),ifelse(ref_inv & !(reg_res$extra_rob),";WARN=INV_REF",""),sep="",file=out_file,append=T)
# FORMAT field
- cat("\t","GT:QVAL:DP:RO:AO:AF:SB:SOR:RVSB",sep = "",file=out_file,append=T)
+ cat("\t",paste("GT:",ifelse(ref_inv,"QVAL_INV","QVAL"),":DP:RO:AO:AF:SB:SOR:RVSB:FS:QVAL_minAF:STATUS",sep=""),sep = "",file=out_file,append=T)
+
# all samples
- genotype=rep("0/0",l=nindiv)
- variants=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV)
- genotype[variants]="0/1"
+ if(ref_inv) { genotype=rep("1/1",l=nindiv) } else { genotype=rep("0/0",l=nindiv) }
+ heterozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage < 0.75)
+ genotype[heterozygotes]="0/1"
+ homozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage >= 0.75)
+ if(ref_inv) { genotype[homozygotes]="0/0" } else { genotype[homozygotes]="1/1" }
+ #no tumor variant but low power -> "./."
+ genotype[(reg_res$GQ < GQ_threshold)&(qval_minAF% .(alt),")",sep="")), sbs=sbs, SB_threshold=SB_threshold_SNV,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
+ pdf(paste(pos_ref[i,"chr"],"_",pos_ref[i,"loc"],"_",pos_ref[i,"loc"],"_",ref,"_",alt,ifelse(ref_inv,"_inv_ref",""),ifelse(reg_res$extra_rob,"_extra_robust",""),".pdf",sep=""),7,6)
+ plot_rob_nb(reg_res, 10^-(GQ_threshold/10), plot_title=bquote(paste(.(pos_ref[i,"chr"]),":",.(pos_ref[i,"loc"])," (",.(ref) %->% .(alt),")",.(ifelse(ref_inv," INV REF","")),.(ifelse(reg_res$extra_rob," EXTRA ROBUST","")),sep="")), sbs=sbs, SB_threshold=SB_threshold_SNV,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
dev.off()
}
}
@@ -605,55 +366,115 @@ for (i in 1:npos) {
ma_p_cur_del=unlist(lapply(all_cur_del_p,function(x) {as.numeric(gsub(paste("([0-9]+):",cur_del,"$",sep=""),"\\1",x,ignore.case = T))}))
ma_m_cur_del=unlist(lapply(all_cur_del_m,function(x) {as.numeric(gsub(paste("([0-9]+):",cur_del,"$",sep=""),"\\1",x,ignore.case = T))}))
Vp[names(ma_p_cur_del)]=ma_p_cur_del
- Vm[names(ma_m_cur_del)]=ma_m_cur_del
+ Vm[names(ma_m_cur_del)]=ma_m_cur_del
ma_count=Vp+Vm
- DP=coverage_matrix[i,]+ma_count
- reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads)
+ DP=coverage_matrix[i,]
+ if( sum( (ma_count/DP) > 0.8 , na.rm = T) > 0.5*length(ma_count) ){ #here we need to reverse alt and ref for the regression (use ma_count of the ref)
+ ref=pos_ref[i,"ref"]
+ cur_del_inv=pos_ref[i,"ref"]
+ Vp_inv=atcg_matrix[i,eval(as.name(paste(cur_del_inv,"_cols",sep="")))]
+ Vm_inv=atcg_matrix[i,eval(as.name(paste(tolower(cur_del_inv),"_cols",sep="")))]
+ ma_count_inv=Vp_inv+Vm_inv
+ ref_inv=TRUE
+ reg_res=glmrob.nb(x=DP,y=ma_count_inv,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob)
+ } else { ref_inv=FALSE; reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob) }
+ # compute Qval20pc
+ qval_minAF = rep(0,nindiv)
+ somatic_status = rep(".",nindiv)
+ if(isTNpairs){
+ qval_minAF[Nindex] = sapply(1:length(Nindex),function(ii) toQvalueN(DP[Nindex][ii],reg_res,sigma) )
+ if(sum(reg_res$GQ[Nindex]>GQ_threshold)>0) qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueN(DP[Tindex][ii],reg_res,sigma) ) #GERMLINE VARIANT, check if power to CONFIRM in tumors
+ else qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueT(DP[Tindex][ii],reg_res,afmin_power) ) #no germline variant, check if power to call SOMATIC in tumors
+ if( length(onlyNindex)>0 ) qval_minAF[onlyNindex] = sapply(1:length(onlyNindex),function(ii) toQvalueN(DP[onlyNindex][ii],reg_res,sigma) )
+ if( length(onlyTindex)>0 ) qval_minAF[onlyTindex] = sapply(1:length(onlyTindex),function(ii) toQvalueT(DP[onlyTindex][ii],reg_res,afmin_power) )
+
+ #no matching normal -> UNKNOWN (impossible to call somatic status)
+ somatic_status[onlyTindex][(reg_res$GQ[onlyTindex] > GQ_threshold) ] = "UNKNOWN"
+ #no matching tumor -> GERMLINE_UNCONFIRMABLE (impossible to call somatic status)
+ somatic_status[onlyNindex][(reg_res$GQ[onlyNindex] > GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ #no tumor variant, no normal variant despite good power in both -> ".", confirmed; already set by default
+ #no tumor variant, no normal variant but low power in N and T-> ".", unknown; already set by default
+ #no tumor variant, no normal variant but low power in N or T-> ".", unconfirmed ; already set by default
+ #no tumor variant, normal variant with good power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMED"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ #no tumor variant, normal variant with low power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+
+ #tumor variant, no normal variant but without enough power -> UNKNOWN for both Tumor and Normal
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex] GQ_threshold)&(qval_minAF[Nindex] GERMLINE_CONFIRMED
+ somatic_status[Tindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ somatic_status[Nindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ #tumor variant, no normal variant despite good power -> SOMATIC
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex]>GQ_threshold)&(reg_res$GQ[Nindex]0 ) somatic_status[Tindex][somatic_status[Tindex] == "SOMATIC"] = paste("POSSIBLE_CONTAMINATION_FROM", paste(indiv_run[wh.germ,2],sep="_",collapse="_"),sep="_")
+ }else{# no pairs file supplied
+ if(afmin_power==-1 ){#no minimum frequency supplied -> use a negative binomial distribution to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueN(DP[ii],reg_res,sigma) )
+ }else{#minimum frequency supplied -> use a binomial distribution centered around afmin_power to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueT(DP[ii],reg_res,afmin_power) )
+ }
+ }
+
if (!is.na(reg_res$coef["slope"]) & sum(reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0) {
all_AO=sum(ma_count)
all_DP=sum(coverage_matrix[i,])+sum(ma_count)
common_annot()
all_RO=sum(Rp+Rm)
- if (SB_type=="SOR") sbs=sors else sbs=rvsbs
+ if (SB_type=="SOR") sbs=sors else { if (SB_type=="RVSB") sbs=rvsbs else {sbs=FisherStrand} }
if (sum(sbs<=SB_threshold_indel & reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0) {
con=pipe(paste(samtools," faidx ",fasta_ref," ",pos_ref[i,"chr"],":",pos_ref[i,"loc"]+1-3,"-",pos_ref[i,"loc"]+1-1," | tail -n1",sep=""))
- before=readLines(con)
+ before=toupper(readLines(con))
close(con)
con=pipe(paste(samtools," faidx ",fasta_ref," ",pos_ref[i,"chr"],":",pos_ref[i,"loc"]+1+nchar(cur_del),"-",pos_ref[i,"loc"]+1+3+nchar(cur_del)-1," | tail -n1",sep=""))
- after=readLines(con)
+ after=toupper(readLines(con))
close(con)
prev_bp=substr(before,3,3)
next_bp=substr(after,1,1)
cat(pos_ref[i,"chr"],"\t",pos_ref[i,"loc"],"\t",".","\t",paste(prev_bp,cur_del,sep=""),"\t",prev_bp,"\t",max(reg_res$GQ),"\t",".",sep = "",file=out_file,append=T)
# INFO field
- cat("\t","TYPE=del;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),sep="",file=out_file,append=T)
+ cat("\t","TYPE=del;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";FS=",FisherStrand_all,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),ifelse(reg_res$extra_rob & !(ref_inv),";WARN=EXTRA_ROBUST_GL",""),ifelse(reg_res$extra_rob & ref_inv,";WARN=EXTRA_ROBUST_GL/INV_REF",""),ifelse(ref_inv & !(reg_res$extra_rob),";WARN=INV_REF",""),sep="",file=out_file,append=T)
# FORMAT field
- cat("\t","GT:QVAL:DP:RO:AO:AF:SB:SOR:RVSB",sep = "",file=out_file,append=T)
+ cat("\t",paste("GT:",ifelse(ref_inv,"QVAL_INV","QVAL"),":DP:RO:AO:AF:SB:SOR:RVSB:FS:QVAL_minAF:STATUS",sep=""),sep = "",file=out_file,append=T)
# all samples
- genotype=rep("0/0",l=nindiv)
- variants=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_indel)
- genotype[variants]="0/1"
+ if(ref_inv) { genotype=rep("1/1",l=nindiv) } else { genotype=rep("0/0",l=nindiv) }
+ heterozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage < 0.75)
+ genotype[heterozygotes]="0/1"
+ homozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage >= 0.75)
+ if(ref_inv) { genotype[homozygotes]="0/0" } else { genotype[homozygotes]="1/1" }
+ #no tumor variant but low power -> "./."
+ genotype[(reg_res$GQ < GQ_threshold)&(qval_minAF% .("-"),")",sep="")),sbs=sbs, SB_threshold=SB_threshold_indel,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
+ if(!ref_inv & nchar(cur_del)>50) cur_del = paste(substr(cur_del,1,5+match(cur_del,uniq_del)),substr(cur_del,nchar(cur_del)-(5+match(cur_del,uniq_del)),nchar(cur_del)),sep="...")
+ if(ref_inv & nchar(ref)>50) ref = paste(substr(ref,1,5+match(ref,uniq_del)),substr(ref,nchar(ref)-(5+match(ref,uniq_del)),nchar(ref)),sep="...")
+ pdf(paste(pos_ref[i,"chr"],"_",pos_ref[i,"loc"],"_",pos_ref[i,"loc"]+nchar(cur_del)-1,"_",paste(prev_bp,cur_del,sep=""),"_",prev_bp,ifelse(ref_inv,"_inv_ref",""),ifelse(reg_res$extra_rob,"_extra_robust",""),".pdf",sep=""),7,6)
+ plot_rob_nb(reg_res, 10^-(GQ_threshold/10), plot_title=bquote(paste(.(pos_ref[i,"chr"]),":",.(pos_ref[i,"loc"])," (",.(paste(prev_bp,cur_del,sep="")) %->% .(prev_bp),")",.(ifelse(ref_inv," INV REF","")),.(ifelse(reg_res$extra_rob," EXTRA ROBUST","")),sep="")),sbs=sbs, SB_threshold=SB_threshold_indel,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
dev.off()
}
}
}
}
}
-
+
# INS
all_ins=ins[i,!is.na(ins[i,])]
if (length(all_ins)>0) {
all_ins=as.data.frame(strsplit(unlist(strsplit(paste(all_ins),split = "|",fixed=T)),split = ":",fixed=T),stringsAsFactors=F,)
uniq_ins=unique(toupper(as.character(all_ins[2,])))
- coverage_all_ins=sum(as.numeric(all_ins[1,]))
+ coverage_all_ins=sum(as.numeric(all_ins[1,]))
for (cur_ins in uniq_ins) {
Vp=rep(0,nindiv)
names(Vp)=indiv_run[,1]
@@ -664,40 +485,101 @@ for (i in 1:npos) {
ma_p_cur_ins=unlist(lapply(all_cur_ins_p,function(x) {as.numeric(gsub(paste("([0-9]+):",cur_ins,"$",sep=""),"\\1",x,ignore.case = T))}))
ma_m_cur_ins=unlist(lapply(all_cur_ins_m,function(x) {as.numeric(gsub(paste("([0-9]+):",cur_ins,"$",sep=""),"\\1",x,ignore.case = T))}))
Vp[names(ma_p_cur_ins)]=ma_p_cur_ins
- Vm[names(ma_m_cur_ins)]=ma_m_cur_ins
+ Vm[names(ma_m_cur_ins)]=ma_m_cur_ins
ma_count=Vp+Vm
- DP=coverage_matrix[i,]+ma_count[]
- reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads)
+ DP=coverage_matrix[i,]
+ if( sum( (ma_count/DP) > 0.8 , na.rm = T) > 0.5*length(ma_count) ){ #here we need to reverse alt and ref for the regression (use ma_count of the ref)
+ ref=pos_ref[i,"ref"]
+ cur_ins_inv=pos_ref[i,"ref"]
+ Vp_inv=atcg_matrix[i,eval(as.name(paste(cur_ins_inv,"_cols",sep="")))]
+ Vm_inv=atcg_matrix[i,eval(as.name(paste(tolower(cur_ins_inv),"_cols",sep="")))]
+ ma_count_inv=Vp_inv+Vm_inv
+ ref_inv=TRUE
+ reg_res=glmrob.nb(x=DP,y=ma_count_inv,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob)
+ } else { ref_inv=FALSE; reg_res=glmrob.nb(x=DP,y=ma_count,min_coverage=min_coverage,min_reads=min_reads,extra_rob=extra_rob) }
+ # compute Qval20pc
+ qval_minAF = rep(0,nindiv)
+ somatic_status = rep(".",nindiv)
+
+ if(isTNpairs){
+ qval_minAF[Nindex] = sapply(1:length(Nindex),function(ii) toQvalueN(DP[Nindex][ii],reg_res,sigma) )
+ if(sum(reg_res$GQ[Nindex]>GQ_threshold)>0) qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueN(DP[Tindex][ii],reg_res,sigma) ) #GERMLINE VARIANT, check if power to CONFIRM in tumors
+ else qval_minAF[Tindex] = sapply(1:length(Tindex),function(ii) toQvalueT(DP[Tindex][ii],reg_res,afmin_power) ) #no germline variant, check if power to call SOMATIC in tumors
+ if( length(onlyNindex)>0 ) qval_minAF[onlyNindex] = sapply(1:length(onlyNindex),function(ii) toQvalueN(DP[onlyNindex][ii],reg_res,sigma) )
+ if( length(onlyTindex)>0 ) qval_minAF[onlyTindex] = sapply(1:length(onlyTindex),function(ii) toQvalueT(DP[onlyTindex][ii],reg_res,afmin_power) )
+
+ #no matching normal -> UNKNOWN (impossible to call somatic status)
+ somatic_status[onlyTindex][(reg_res$GQ[onlyTindex] > GQ_threshold) ] = "UNKNOWN"
+ #no matching tumor -> GERMLINE_UNCONFIRMABLE (impossible to call somatic status)
+ somatic_status[onlyNindex][(reg_res$GQ[onlyNindex] > GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ #no tumor variant, no normal variant despite good power in both -> ".", confirmed; already set by default
+ #no tumor variant, no normal variant but low power in N and T-> ".", unknown; already set by default
+ #no tumor variant, no normal variant but low power in N or T-> ".", unconfirmed ; already set by default
+ #no tumor variant, normal variant with good power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMED"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]>GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_UNCONFIRMED"
+ #no tumor variant, normal variant with low power in T (with or without good power in N)-> "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Tindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+ somatic_status[Nindex][(reg_res$GQ[Tindex] < GQ_threshold)&(qval_minAF[Tindex]GQ_threshold) ] = "GERMLINE_UNCONFIRMABLE"
+
+ #tumor variant, no normal variant but without enough power -> UNKNOWN for both Tumor and Normal
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex] GQ_threshold)&(qval_minAF[Nindex] GERMLINE_CONFIRMED
+ somatic_status[Tindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ somatic_status[Nindex][ (reg_res$GQ[Tindex] > GQ_threshold)&(reg_res$GQ[Nindex]>GQ_threshold) ] = "GERMLINE_CONFIRMED"
+ #tumor variant, no normal variant despite good power -> SOMATIC
+ somatic_status[Tindex][(reg_res$GQ[Tindex] > GQ_threshold)&(qval_minAF[Nindex]>GQ_threshold)&(reg_res$GQ[Nindex]0 ) somatic_status[Tindex][somatic_status[Tindex] == "SOMATIC"] = paste("POSSIBLE_CONTAMINATION_FROM", paste(indiv_run[wh.germ,2],sep="_",collapse="_"),sep="_")
+ }else{# no pairs file supplied
+ if(afmin_power==-1 ){#no minimum frequency supplied -> use a negative binomial distribution to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueN(DP[ii],reg_res,sigma) )
+ }else{#minimum frequency supplied -> use a binomial distribution centered around afmin_power to check the power
+ qval_minAF = sapply(1:nindiv,function(ii) toQvalueT(DP[ii],reg_res,afmin_power) )
+ }
+ }
+
if (!is.na(reg_res$coef["slope"]) & sum(reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0) {
all_AO=sum(ma_count)
all_DP=sum(coverage_matrix[i,])+sum(ma_count)
common_annot()
all_RO=sum(Rp+Rm)
- if (SB_type=="SOR") sbs=sors else sbs=rvsbs
+ if (SB_type=="SOR") sbs=sors else { if (SB_type=="RVSB") sbs=rvsbs else {sbs=FisherStrand} }
if (sum(sbs<=SB_threshold_indel & reg_res$GQ>=GQ_threshold,na.rm=TRUE)>0) {
con=pipe(paste(samtools," faidx ",fasta_ref," ",pos_ref[i,"chr"],":",pos_ref[i,"loc"]-2,"-",pos_ref[i,"loc"]," | tail -n1",sep=""))
- before=readLines(con)
+ before=toupper(readLines(con))
close(con)
con=pipe(paste(samtools," faidx ",fasta_ref," ",pos_ref[i,"chr"],":",pos_ref[i,"loc"]+1,"-",pos_ref[i,"loc"]+3," | tail -n1",sep=""))
- after=readLines(con)
+ after=toupper(readLines(con))
close(con)
prev_bp=substr(before,3,3)
cat(pos_ref[i,"chr"],"\t",pos_ref[i,"loc"],"\t",".","\t",prev_bp,"\t",paste(prev_bp,cur_ins,sep=""),"\t",max(reg_res$GQ),"\t",".",sep = "",file=out_file,append=T)
# INFO field
- cat("\t","TYPE=ins;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),sep="",file=out_file,append=T)
+ cat("\t","TYPE=ins;NS=",sum(coverage_matrix[i,]>0),";AF=",sum(reg_res$GQ>=GQ_threshold)/sum(coverage_matrix[i,]>0),";DP=",all_DP,";RO=",all_RO,";AO=",all_AO,";SRF=",sum(Rp),";SRR=",sum(Rm),";SAF=",sum(Vp),";SAR=",sum(Vm),";SOR=",all_sor,";RVSB=",all_rvsb,";FS=",FisherStrand_all,";ERR=",reg_res$coef["slope"],";SIG=",reg_res$coef["sigma"],";CONT=",paste(before,after,sep="x"),ifelse(reg_res$extra_rob & !(ref_inv),";WARN=EXTRA_ROBUST_GL",""),ifelse(reg_res$extra_rob & ref_inv,";WARN=EXTRA_ROBUST_GL/INV_REF",""),ifelse(ref_inv & !(reg_res$extra_rob),";WARN=INV_REF",""),sep="",file=out_file,append=T)
# FORMAT field
- cat("\t","GT:QVAL:DP:RO:AO:AF:SB:SOR:RVSB",sep = "",file=out_file,append=T)
+ cat("\t",paste("GT:",ifelse(ref_inv,"QVAL_INV","QVAL"),":DP:RO:AO:AF:SB:SOR:RVSB:FS:QVAL_minAF:STATUS",sep=""),sep = "",file=out_file,append=T)
# all samples
- genotype=rep("0/0",l=nindiv)
- variants=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_indel)
- genotype[variants]="0/1"
+ if(ref_inv) { genotype=rep("1/1",l=nindiv) } else { genotype=rep("0/0",l=nindiv) }
+ heterozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage < 0.75)
+ genotype[heterozygotes]="0/1"
+ homozygotes=which(reg_res$GQ>=GQ_threshold & sbs<=SB_threshold_SNV & reg_res$ma_count/reg_res$coverage >= 0.75)
+ if(ref_inv) { genotype[homozygotes]="0/0" } else { genotype[homozygotes]="1/1" }
+ #no tumor variant but low power -> "./."
+ genotype[(reg_res$GQ < GQ_threshold)&(qval_minAF% .(cur_ins),")",sep="")),sbs=sbs, SB_threshold=SB_threshold_indel,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
+ if(!ref_inv & nchar(cur_ins)>50) cur_ins = paste(substr(cur_ins,1,5+match(cur_ins,uniq_ins)),substr(cur_ins,nchar(cur_ins)-(5+match(cur_ins,uniq_ins)),nchar(cur_ins)),sep="...")
+ if(ref_inv & nchar(ref)>50) ref = paste(substr(ref,1,5+match(ref,uniq_ins)),substr(ref,nchar(ref)-(5+match(ref,uniq_ins)),nchar(ref)),sep="...")
+ pdf(paste(pos_ref[i,"chr"],"_",pos_ref[i,"loc"],"_",pos_ref[i,"loc"],"_",prev_bp,"_",paste(prev_bp,cur_ins,sep=""),ifelse(ref_inv,"_inv_ref",""),ifelse(reg_res$extra_rob,"_extra_robust",""),".pdf",sep=""),7,6)
+ plot_rob_nb(reg_res, 10^-(GQ_threshold/10), plot_title=bquote(paste(.(pos_ref[i,"chr"]),":",.(pos_ref[i,"loc"])," (",.(prev_bp) %->% .(paste(prev_bp,cur_ins,sep="")),")",.(ifelse(ref_inv," INV REF","")),.(ifelse(reg_res$extra_rob," EXTRA ROBUST","")),sep="")),sbs=sbs, SB_threshold=SB_threshold_indel,plot_labels=plot_labels,add_contours=add_contours,names=indiv_run[,2])
dev.off()
}
}
diff --git a/bin/pileup2baseindel.pl b/bin/pileup2baseindel.pl
index 2cafdbc..f8a4ae3 100755
--- a/bin/pileup2baseindel.pl
+++ b/bin/pileup2baseindel.pl
@@ -65,6 +65,7 @@
my $line = ;
$line=~s/\r|\n//g;
my ($chr,$loc,$ref,@dp_bases_bq) = split /\s+/, $line;
+$ref = uc $ref;
my $n = int (scalar(@dp_bases_bq)/3); #determine how many samples, use int just in safe
my %files;
foreach my $i (1..$n){
@@ -85,6 +86,7 @@
while(){
s/\r|\n//g;
my ($chr,$loc,$ref,@dp_bases_bq) = split /\s+/;
+ $ref = uc $ref;
my $n = int (scalar(@dp_bases_bq)/3); #determine how many samples, use int just in safe
foreach my $i (1..$n){
my $fh = $files{$i};
diff --git a/bin/plot_rob_nb.r b/bin/plot_rob_nb.r
new file mode 100644
index 0000000..0a941b1
--- /dev/null
+++ b/bin/plot_rob_nb.r
@@ -0,0 +1,215 @@
+# Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see .
+
+plot_rob_nb <- function(rob_nb_res,qthreshold=0.01,plot_title=NULL,sbs,SB_threshold=Inf,names=NULL,plot_labels=FALSE,add_contours=FALSE){
+ n=sum(rob_nb_res$qvalue>qthreshold)
+ m=sum(rob_nb_res$qvalue<=qthreshold)
+
+ cut_max_qvals=100
+ cols=rep("black",length(rob_nb_res$coverage))
+ palette=rev(rainbow(cut_max_qvals+1,start=0, end=4/6))
+ col_indices = round(-10*log10(rob_nb_res$qvalues))+1
+ col_indices[which(col_indices>cut_max_qvals)]=cut_max_qvals+1
+ cols=palette[col_indices]
+ outliers_color=cols
+ outliers_color[which(sbs>SB_threshold)]="black"
+
+ temp_title = bquote(e==.(format(rob_nb_res$coef[[2]],digits = 2)) ~ "," ~ sigma==.(format(rob_nb_res$coef[[1]],digits = 2)))
+ plot(rob_nb_res$coverage, rob_nb_res$ma_count,
+ pch=21,bg=cols,col=outliers_color,xlab="Coverage (DP)",ylab="Number of ALT reads (AO)",
+ main=plot_title, xlim=c(0,max(rob_nb_res$coverage,na.rm = T)),ylim=c(0,max(rob_nb_res$ma_count, na.rm = T)))
+ mtext(temp_title)
+ #### labeling outliers
+ if(!is.null(names) & plot_labels & length(names[which(rob_nb_res$qvalues<=qthreshold)]) > 0) {
+ text(rob_nb_res$coverage[which(rob_nb_res$qvalues<=qthreshold)], rob_nb_res$ma_count[which(rob_nb_res$qvalues<=qthreshold)],
+ labels=names[which(rob_nb_res$qvalues<=qthreshold)], cex= 0.6, pos=1)
+ }
+
+ #### plot the color palette
+ plot_palette <- function(topright=FALSE) {
+ xmin <- par("usr")[1]
+ xmax <- par("usr")[2]
+ ymin <- par("usr")[3]
+ ymax <- par("usr")[4]
+ xright=xmin+(xmax-xmin)*(1-0.9)
+ xleft=xmin+(xmax-xmin)*(1-0.94)
+ if(topright){xright=xmin+(xmax-xmin)*0.9;xleft=xmin+(xmax-xmin)*0.94}
+ ybottom=ymin+(ymax-ymin)*0.72
+ ytop=ymin+(ymax-ymin)*0.94
+
+ rasterImage(as.raster(matrix(rev(palette), ncol=1)),xright ,ybottom ,xleft,ytop )
+ rect(xright ,ybottom ,xleft,ytop )
+ text(x=(xright+xleft)/2, y = ytop+(ytop-ybottom)*0.1, labels = "QVAL", cex=0.8)
+ keep_labels=seq(0,cut_max_qvals,by=20)
+ keep_labels_pos=seq(ybottom,ytop,l=length(keep_labels))
+ tick_width=-(xleft-xright)/5
+ for (i in 1:length(keep_labels)) {
+ if (topright) {
+ lines(c(xright,xright+tick_width),c(keep_labels_pos[i],keep_labels_pos[i]))
+ } else {
+ lines(c(xleft,xleft-tick_width),c(keep_labels_pos[i],keep_labels_pos[i]))
+ }
+ }
+ if(topright) {
+ text(x=xright+1.5*tick_width, y = keep_labels_pos, labels = keep_labels,adj = c(1,0.5), cex = 0.8)
+ } else {
+ text(x=(xright-(xright-xleft))*0.3, y = keep_labels_pos, labels = keep_labels,adj = c(1,0.5), cex = 0.8)
+ }
+ }
+ plot_palette()
+ ####
+
+ if (!is.na(rob_nb_res$coef["slope"])) {
+ ################### ADD CONTOURS ##################
+ max_nb_grid_pts = 2500
+ max_qvalue=100
+ qlevels = c(10,30,50,70,100)
+ # following function returns a qvalue for a new point by adding it in the set of observations, recomputing all qvalues and taking its corresponding value.
+ toQvalue <- function(x,y){
+ if(y>x) return(0)
+ unlist(-10*log10(p.adjust((dnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x)) +
+ pnbinom(c(rob_nb_res$ma_count,y),size=1/rob_nb_res$coef[[1]],mu=rob_nb_res$coef[[2]]*c(rob_nb_res$coverage,x),lower.tail = F)))
+ [length(rob_nb_res$coverage)+1]))
+ }
+ #here we compute the dimension of the qvalue grid (ylength*xlength), with min(ylength)=5 (this avoids a too "flat" grid)
+ #if needed to sampling (too large grid if dimension=max(AO)*max(DP)), we verify two equations: equality of ratios ylength/xlength before and after sampling and ylength*xlength=max_nb_grid_pts
+ #### compute zoom y limit
+ if(add_contours){
+ nb_pts_zoom_computation=1000
+ if(max(rob_nb_res$coverage,na.rm=T) > nb_pts_zoom_computation){
+ maxDP_AO = unique(sort(c(round(max(rob_nb_res$coverage,na.rm=T)*rbeta(nb_pts_zoom_computation,1,100)),runif(100,1,max(rob_nb_res$coverage,na.rm=T)))))
+ } else {
+ maxDP_AO = seq(1,max(rob_nb_res$coverage,na.rm=T),by=1)
+ }
+ maxDP_qvals = unlist(lapply(maxDP_AO,function(AO){ toQvalue(x=max(rob_nb_res$coverage,na.rm=T),y=AO) }))
+ af_min_lim = log10(maxDP_AO[which(maxDP_qvals>=qlevels[1])[1]]/max(rob_nb_res$coverage,na.rm=T))
+ if(is.na(af_min_lim)) af_min_lim=0 #if maxDP_qvals does not contain qlevels[1]
+ ylim_zoom = maxDP_AO[which(maxDP_qvals>=max_qvalue)[1]]
+ ylim_zoom_cor=ifelse(is.na(ylim_zoom),max(rob_nb_res$coverage,na.rm=T),ylim_zoom)
+ if(!is.na(ylim_zoom_cor)){ #ylim_zoom is na iff we found at least one qvalue >= max_qvalue (if error rate closed to 1, only qvalues closed to 0)
+ #### compute dim of the qvalue grid
+ if(ylim_zoom_cor*max(rob_nb_res$coverage,na.rm=T) <= max_nb_grid_pts){
+ xgrid = seq(0,max(rob_nb_res$coverage,na.rm=T), by=1)
+ ygrid = seq(0,ylim_zoom_cor,by=1) #use by=1 to have integer, if not dnbinom not happy
+ } else {
+ if(ylim_zoom_cor<=50){
+ ygrid=seq(0,ylim_zoom_cor,by=1)
+ xlength = round(max_nb_grid_pts/length(ygrid))
+ xgrid = round(seq(0,max(rob_nb_res$coverage,na.rm=T),length=xlength))
+ } else {
+ ylength = 50
+ xlength = round(max_nb_grid_pts/ylength)
+ xgrid = round(seq(0,max(rob_nb_res$coverage,na.rm=T),length=xlength))
+ ygrid = round(seq(0,ylim_zoom_cor,length=ylength))
+ }
+ }
+ #here we initiate the grid with each case containing list=(DP,AO) from xgrid, ygrid
+ matgrid = array(as.list(as.data.frame(t(expand.grid(xgrid,ygrid)))),dim=c(length(xgrid),length(ygrid)))
+ #then we fill in the grid with qvalues for each pair of AO,DP taken from ygrid,xgrid vectors (we use initiated values to identify corresponding AO,DP). Finally we plot the contours.
+ matgrid=matrix(sapply(matgrid,function(case) toQvalue(unlist(case)[1],unlist(case)[2])), length(xgrid),length(ygrid))
+ #### plot the contour "by hands"
+ for(qvalue in qlevels) {
+ dat = na.omit(cbind(xgrid,unlist(lapply(xgrid,function(DP,ygrid,xgrid){
+ if(sum(matgrid[match(DP,xgrid),]>=qvalue)==0) {
+ qval = NA } else {
+ qval = min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
+ } #NA iff no case>=qvalue in matgrid at DP
+ if (is.na(qval)) {
+ if( DP == 0 ) { AO = 0 } else { AO = NA }
+ } else {
+ AO = min(ygrid[which(matgrid[match(DP,xgrid),]==qval)])
+ }
+ AO },ygrid,xgrid))))
+ lines(dat[,1],dat[,2],col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)],lwd=1.3,lty=3)
+ }
+ }
+ }
+ #### plot confidence interval + error rate
+ xi=max(rob_nb_res$coverage,na.rm=T)
+ yi1=qnbinom(p=0.99, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*xi)
+ yi2=qnbinom(p=0.01, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*xi)
+ if(max(rob_nb_res$coverage,na.rm=T)>1000) { DP_for_IC = round(seq(0,max(rob_nb_res$coverage,na.rm=T),length=1000)) } else { DP_for_IC = seq(0,max(rob_nb_res$coverage,na.rm=T),by=1) }
+ lines(DP_for_IC,qnbinom(p=0.99, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*DP_for_IC),col="black",lty=3,lwd=2)
+ lines(DP_for_IC,qnbinom(p=0.01, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*DP_for_IC),col="black",lty=3,lwd=2)
+ abline(a=0, b=rob_nb_res$coef[[2]], col="black")
+ #### plot zoom on max qvalue if add_contours, otherwise on 2*IC
+ if(!add_contours) ylim_zoom_cor = 2*yi1
+ plot(rob_nb_res$coverage, rob_nb_res$ma_count,
+ pch=21,bg=cols,col=outliers_color,xlab="Coverage (DP)",ylab="Number of ALT reads (AO)",
+ main=plot_title, ylim=c(0,ylim_zoom_cor), xlim=c(0,max(rob_nb_res$coverage,na.rm=T)))
+ #### plot the contour "by hands"
+ if(add_contours){
+ if(!is.na(ylim_zoom_cor)){
+ for(qvalue in qlevels) {
+ dat = na.omit(cbind(xgrid,unlist(lapply(xgrid,function(DP,ygrid,xgrid){
+ if(sum(matgrid[match(DP,xgrid),]>=qvalue)==0) {
+ qval = NA } else {
+ qval = min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
+ } #NA iff no case>=qvalue in matgrid at DP
+ if (is.na(qval)) {
+ if( DP == 0 ) { AO = 0 } else { AO = NA }
+ } else {
+ AO = min(ygrid[which(matgrid[match(DP,xgrid),]==qval)])
+ }
+ AO },ygrid,xgrid))))
+ lines(dat[,1],dat[,2],col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)],lwd=1.3,lty=3) }
+ }
+ }
+ #contour(xgrid, ygrid, matgrid, levels=qlevels , col = rev(rainbow(length(qlevels),start=0, end=4/6)), add=T, lwd = 1.3, labcex = 0.8, lty=3)
+ mtext(paste("zoom on maximum q-value =",max_qvalue))
+ #### labeling outliers and plot confidence interval + error rate
+ if(!is.null(names) & plot_labels & length(names[which(rob_nb_res$qvalues<=qthreshold)]) > 0) {
+ text(rob_nb_res$coverage[which(rob_nb_res$qvalues<=qthreshold)], rob_nb_res$ma_count[which(rob_nb_res$qvalues<=qthreshold)],
+ labels=names[which(rob_nb_res$qvalues<=qthreshold)], cex= 0.6, pos=1)
+ }
+ lines(DP_for_IC,qnbinom(p=0.99, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*DP_for_IC),col="black",lty=3,lwd=2)
+ lines(DP_for_IC,qnbinom(p=0.01, size=1/rob_nb_res$coef[[1]], mu=rob_nb_res$coef[[2]]*DP_for_IC),col="black",lty=3,lwd=2)
+ abline(a=0, b=rob_nb_res$coef[[2]], col="black")
+ plot_palette()
+
+ plot(rob_nb_res$GQ,log10(rob_nb_res$ma_count/rob_nb_res$coverage),pch=21,bg=cols,col=outliers_color,ylab=bquote("log"[10] ~ "[Allelic Fraction (AF)]"),xlab="QVAL",main="Allelic fraction effect")
+ abline(v=-10*log10(qthreshold),col="red",lwd=2)
+ plot_palette()
+ ylim_zoom_af = ifelse(add_contours, ylim_zoom_cor/max(rob_nb_res$coverage,na.rm=T), (2*yi1)/xi)
+ plot(rob_nb_res$GQ,log10(rob_nb_res$ma_count/rob_nb_res$coverage),pch=21,bg=cols,col=outliers_color,ylab=bquote("log"[10] ~ "[Allelic Fraction (AF)]"),xlab="QVAL",main="Allelic fraction effect",
+ ylim=c(min(log10(rob_nb_res$ma_count/rob_nb_res$coverage)[is.finite(log10(rob_nb_res$ma_count/rob_nb_res$coverage))]),log10(ylim_zoom_af)), xlim=c(0,100))
+ if(add_contours) { mtext(paste("zoom on maximum q-value =",max_qvalue)) } else { mtext("zoom on 99% confidence interval") }
+ abline(v=-10*log10(qthreshold),col="red",lwd=2)
+ plot_palette()
+ if(add_contours){
+ if(!is.na(ylim_zoom_cor)){
+ #### plot min(af) ~ DP
+ plot(1,type='n', ylim=c(1.1*af_min_lim,0), xlim=range(xgrid), xlab="DP", ylab=bquote("log"[10] ~ "[min(AF)]"))
+ for(qvalue in qlevels) {
+ lines(xgrid, unlist(lapply(xgrid,function(DP,ygrid,xgrid){
+ if(sum(matgrid[match(DP,xgrid),]>=qvalue)==0) {
+ qval = NA } else {
+ qval = min(matgrid[match(DP,xgrid),which(matgrid[match(DP,xgrid),]>=qvalue)])
+ } #NA iff no case>=qvalue in matgrid at DP
+ if (is.na(qval)) {
+ af = 0
+ } else {
+ af = min(ygrid[which(matgrid[match(DP,xgrid),]==qval)])/DP
+ }
+ if(DP==0 || af>1) { af=1 } #af>1 if min(...)>DP
+ log10(af) },ygrid,xgrid)),col=rev(rainbow(length(qlevels),start=0, end=4/6))[match(qvalue,qlevels)])
+ }
+ plot_palette(topright = TRUE)
+ #hist(rob_nb_res$pvalues,main="p-values distribution",ylab="Density",xlab="p-value",col="grey",freq=T,br=20,xlim=c(0,1))
+ #hist(rob_nb_res$qvalues,main="q-values distribution",breaks=20,xlab="q-value",col="grey",freq=T,xlim=c(0,1))
+ }
+ }
+ }
+}
diff --git a/circle.yml b/circle.yml
index a382945..54ae83c 100644
--- a/circle.yml
+++ b/circle.yml
@@ -12,47 +12,17 @@ test:
cd ~/needlestack/docker
docker build -t iarcbioinfo/needlestack .
- | #use bed TP53_all.bed
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_all.bed --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test1.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test1
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test1/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_all.bed --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/ && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test1.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test1 && cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test1/ && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
- | #use bed TP53_exon2_11.bed
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_exon2_11.bed --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test2.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test2
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test2/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_exon2_11.bed --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/ && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test2.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test2 && cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test2/ && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
- | #test option --nsplit 1
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_exon2_11.bed --nsplit 1 --fasta_ref 17.fasta.gz --bam_folder BAM/
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test3.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test3
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test3/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --bed TP53_exon2_11.bed --nsplit 1 --fasta_ref 17.fasta.gz --bam_folder BAM/ && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test3.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test3 && cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test3/ && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
- | #test --min_AO 0 and --all_SNVs on a null AO position
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7572816-7572816 --fasta_ref 17.fasta.gz --bam_folder BAM/ --min_AO 0 --all_SNVs
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test4.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test4
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test4/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7572816-7572816 --fasta_ref 17.fasta.gz --bam_folder BAM/ --min_AO 0 --all_SNVs && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test4.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test4 && cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test4/ && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
- | #test --no_plots --use_file_name --no_indels on TP53 exon 6
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7578176-7578288 --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/ --no_plots --use_file_name --no_indels
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test5.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test5
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test5/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7578176-7578288 --nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/ --no_plots --use_file_name --no_indels && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test5.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test5 && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf
- | #test on a null coverage region
- cd ~/NGS_data_test/1000G_CEU_TP53/
- nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7580000-7580004--nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/
- cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test6.vcf
- mkdir -p results/$CIRCLE_BRANCH/PDF/test6
- cp BAM/PDF/*.pdf results/$CIRCLE_BRANCH/PDF/test6/
- rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf BAM/PDF/
+ cd ~/NGS_data_test/1000G_CEU_TP53/ && nextflow run ~/needlestack/needlestack.nf -with-docker iarcbioinfo/needlestack --region 17:7580000-7580004--nsplit 4 --fasta_ref 17.fasta.gz --bam_folder BAM/ && cp BAM/all_variants.vcf results/$CIRCLE_BRANCH/test6.vcf && mkdir -p results/$CIRCLE_BRANCH/PDF/test6 && sudo rm -rf work/ .nextflow.* trace.txt* timeline.html* BAM/all_variants.vcf
deployment:
git:
branch: [master, dev]
diff --git a/deploy.sh b/deploy.sh
index 83b5e67..b42c1e8 100755
--- a/deploy.sh
+++ b/deploy.sh
@@ -9,12 +9,12 @@ git push origin master
gem install github_changelog_generator
cd ~/needlestack
-github_changelog_generator IARCbioinfo/needlestack --bug-labels bug,"minor bug" --no-pull-requests --future-release v0.3
+github_changelog_generator IARCbioinfo/needlestack --bug-labels bug,"minor bug" --no-pull-requests --future-release v1.0
git config --global user.email "follm@iarc.fr"
git config --global user.name "Circle CI_$CIRCLE_PROJECT_REPONAME_$CIRCLE_BRANCH"
git add .
git status
-git commit -m "Added CHANGELOG [ci skip]"
+git commit -m "Added CHANGELOG"
git push origin $CIRCLE_BRANCH
curl -H "Content-Type: application/json" --data "{\"source_type\": \"Branch\", \"source_name\": \"$CIRCLE_BRANCH\"}" -X POST https://registry.hub.docker.com/u/iarcbioinfo/needlestack/trigger/fcd71f24-0f1e-40b8-b76a-09eb5a39e924/
diff --git a/needlestack.nf b/needlestack.nf
index 8d26a9d..74dfda4 100644
--- a/needlestack.nf
+++ b/needlestack.nf
@@ -1,7 +1,7 @@
#! /usr/bin/env nextflow
// needlestack: a multi-sample somatic variant caller
-// Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme
+// Copyright (C) 2017 Matthieu Foll, Tiffany Delhomme and Nicolas Alcala
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU General Public License as published by
@@ -24,38 +24,53 @@
// - needlestack.r (in bin folder)
// - pileup2baseindel.pl (in bin folder) (+ perl)
-params.min_dp = 50 // minimum coverage in at least one sample to consider a site
-params.min_ao = 5 // minimum number of non-ref reads in at least one sample to consider a site
+params.help = null
+params.input_vcf = null
+params.out_vcf = null
+params.region = null
+params.bed = null
+params.out_annotated_vcf = null
+params.min_dp = 30 // minimum median coverage to consider a site
+params.min_ao = 3 // minimum number of non-ref reads in at least one sample to consider a site
params.nsplit = 1 // split the positions for calling in nsplit pieces and run in parallel
params.min_qval = 50 // qvalue in Phred scale to consider a variant
-// http://gatkforums.broadinstitute.org/discussion/5533/strandoddsratio-computation filter out SOR > 4 for SNVs and > 10 for indels
-// filter out RVSB > 0.85 (maybe less stringent for SNVs)
params.sb_type = "SOR" // strand bias measure to be used: "SOR" or "RVSB"
-params.sb_snv = 100 // strand bias threshold for snv
-params.sb_indel = 100 // strand bias threshold for indels
-params.map_qual = 20 // min mapping quality (passed to samtools)
-params.base_qual = 20 // min base quality (passed to samtools)
-params.max_DP = 30000 // downsample coverage per sample (passed to samtools)
+if(params.sb_type in ["SOR", "RVSB"] ) {
+ params.sb_snv = 100 // strand bias threshold for snv
+ params.sb_indel = 100 // strand bias threshold for indels
+} else {
+ params.sb_snv = 1000 // strand bias threshold for snv
+ params.sb_indel = 1000 // strand bias threshold for indels
+}
+params.power_min_af = -1 // minimum allelic fraction for power computations
+params.sigma_normal = 0.1 // sigma parameter for negative binomial modeling germline mutations
+params.map_qual = 0 // min mapping quality (passed to samtools)
+params.base_qual = 13 // min base quality (passed to samtools)
+params.max_DP = 50000 // downsample coverage per sample (passed to samtools)
params.use_file_name = false //put these argument to use the bam file names as sample names and do not to use the sample name filed from the bam files (SM tag)
params.all_SNVs = false // output all sites, even when no variant is detected
+params.extra_robust_gl = false // perform an extra robust regression basically for germline variants
params.no_plots = false // do not produce pdf plots of regressions
-params.out_folder = params.bam_folder // if not provided, outputs will be held on the input bam folder
params.no_indels = false // do not skip indels
params.no_labels = false // label outliers
params.no_contours = false // add contours to the plots and plot min(AF)~DP
+params.pairs_file = "FALSE" // by default R will get a false boolean value for pairs_file option
+assert (params.pairs_file != true) : "please enter a file name when using --pairs_file option"
+if (params.pairs_file != "FALSE") { try { assert file(params.pairs_file).exists() : "\n WARNING : input tumor-normal pairs file not located in execution directory" } catch (AssertionError e) { println e.getMessage() } }
+pairs_file = file(params.pairs_file)
/* If --help in parameters, print software usage */
if (params.help) {
log.info ''
- log.info '-------------------------------------------------------'
- log.info 'NEEDLESTACK v0.3: A MULTI-SAMPLE SOMATIC VARIANT CALLER'
- log.info '-------------------------------------------------------'
- log.info 'Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme'
+ log.info '--------------------------------------------------------'
+ log.info 'NEEDLESTACK v1.0: A MULTI-SAMPLE SOMATIC VARIANT CALLER'
+ log.info '--------------------------------------------------------'
+ log.info 'Copyright (C) 2017 Matthieu Foll, Tiffany Delhomme and Nicolas Alcala'
log.info 'This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.txt'
log.info 'This is free software, and you are welcome to redistribute it'
log.info 'under certain conditions; see LICENSE.txt for details.'
- log.info '-------------------------------------------------------'
+ log.info '--------------------------------------------------------'
log.info ''
log.info 'Usage: '
log.info ' nextflow run iarcbioinfo/needlestack [-with-docker] --bed bedfile.bed --bam_folder BAM/ --fasta_ref reference.fasta [other options]'
@@ -63,302 +78,431 @@ if (params.help) {
log.info 'Mandatory arguments:'
log.info ' --bam_folder BAM_DIR BAM files directory.'
log.info ' --fasta_ref REF_IN_FASTA Reference genome in fasta format.'
+ log.info ' OR '
+ log.info ' --input_vcf VCF FILE VCF file (basically from GATK pipeline) to annotate.'
log.info 'Options:'
log.info ' --nsplit INTEGER Split the region for calling in nsplit pieces and run in parallel.'
- log.info ' --min_dp INTEGER Minimum coverage in at least one sample to consider a site.'
+ log.info ' --min_dp INTEGER Minimum median coverage (in addition, min_dp in at least 10 samples).'
log.info ' --min_ao INTEGER Minimum number of non-ref reads in at least one sample to consider a site.'
log.info ' --min_qval VALUE Qvalue in Phred scale to consider a variant.'
log.info ' --sb_type SOR or RVSB Strand bias measure.'
log.info ' --sb_snv VALUE Strand bias threshold for SNVs.'
log.info ' --sb_indel VALUE Strand bias threshold for indels.'
+ log.info ' --power_min_af VALUE Minimum allelic fraction for power computations.'
+ log.info ' --sigma_normal VALUE Sigma parameter for negative binomial modeling of expected germline allelic fraction.'
log.info ' --map_qual VALUE Samtools minimum mapping quality.'
log.info ' --base_qual VALUE Samtools minimum base quality.'
log.info ' --max_DP INTEGER Samtools maximum coverage before downsampling.'
log.info ' --use_file_name Sample names are taken from file names, otherwise extracted from the bam file SM tag.'
log.info ' --all_SNVs Output all SNVs, even when no variant found.'
+ log.info ' --extra_robust_gl Perform an extra robust regression, basically for germline variants'
log.info ' --no_plots Do not output PDF regression plots.'
log.info ' --no_labels Do not add labels to outliers in regression plots.'
log.info ' --no_indels Do not call indels.'
log.info ' --no_contours Do not add contours to plots and do not plot min(AF)~DP.'
log.info ' --out_folder OUTPUT FOLDER Output directory, by default input bam folder.'
+ log.info ' --out_vcf OUTPUT VCF NAME Output VCF name, by default all_variants.vcf.'
log.info ' --bed BED FILE A BED file for calling.'
log.info ' --region CHR:START-END A region for calling.'
+ log.info ' --pairs_file TEXT FILE A tab-delimited file containing two columns (normal and tumor sample name) for each sample in line.'
log.info ''
exit 1
}
-assert (params.fasta_ref != true) && (params.fasta_ref != null) : "please specify --fasta_ref option (--fasta_ref reference.fasta(.gz))"
-assert (params.bam_folder != true) && (params.bam_folder != null) : "please specify --bam_folder option (--bam_folder bamfolder)"
-
-fasta_ref = file( params.fasta_ref )
-fasta_ref_fai = file( params.fasta_ref+'.fai' )
-fasta_ref_gzi = file( params.fasta_ref+'.gzi' )
-
-/* Verify user inputs are correct */
-
-assert params.sb_type in ["SOR","RVSB"] : "--sb_type must be SOR or RVSB "
-assert params.all_SNVs in [true,false] : "do not assign a value to --all_SNVs"
-assert params.no_plots in [true,false] : "do not assign a value to --no_plots"
-assert params.no_indels in [true,false] : "do not assign a value to --no_indels"
-assert params.use_file_name in [true,false] : "do not assign a value to --use_file_name"
-if (params.bed) { try { assert file(params.bed).exists() : "\n WARNING : input bed file not located in execution directory" } catch (AssertionError e) { println e.getMessage() } }
-try { assert fasta_ref.exists() : "\n WARNING : fasta reference not located in execution directory. Make sure reference index is in the same folder as fasta reference" } catch (AssertionError e) { println e.getMessage() }
-if (fasta_ref.exists()) {assert fasta_ref_fai.exists() : "input fasta reference does not seem to have a .fai index (use samtools faidx)"}
-if (fasta_ref.exists() && params.fasta_ref.tokenize('.')[-1] == 'gz') {assert fasta_ref_gzi.exists() : "input gz fasta reference does not seem to have a .gzi index (use samtools faidx)"}
-try { assert file(params.bam_folder).exists() : "\n WARNING : input BAM folder not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
-assert file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() > 0 : "BAM folder contains no BAM"
-if (file(params.bam_folder).exists()) {
- if (file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() < 10) {println "\n ERROR : BAM folder contains less than 10 BAM, exit."; System.exit(0)}
- else if (file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() < 20) {println "\n WARNING : BAM folder contains less than 20 BAM, method accuracy not warranted."}
- bamID = file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.collect { it.getName() }.collect { it.replace('.bam','') }
- baiID = file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam.bai/ }.collect { it.getName() }.collect { it.replace('.bam.bai','') }
- assert baiID.containsAll(bamID) : "check that every bam file has an index (.bam.bai)"
-}
-assert (params.min_dp > 0) : "minimum coverage must be higher than 0 (--min_dp)"
-assert (params.max_DP > 1) : "maximum coverage before downsampling must be higher than 1 (--max_DP)"
-assert (params.min_ao >= 0) : "minimum alternative reads must be higher or equals to 0 (--min_ao)"
-assert (params.nsplit > 0) : "number of regions to split must be higher than 0 (--nsplit)"
-assert (params.min_qval > 0) : "minimum Phred-scale qvalue must be higher than 0 (--min_qval)"
-assert (params.sb_snv > 0 && params.sb_snv < 101) : "strand bias for SNVs must be in [0,100]"
-assert (params.sb_indel > 0 && params.sb_indel < 101) : "strand bias for indels must be in [0,100]"
-assert (params.map_qual >= 0) : "minimum mapping quality (samtools) must be higher than or equals to 0"
-assert (params.base_qual >= 0) : "minimum base quality (samtools) must be higher than or equals to 0"
-
-sample_names = params.use_file_name ? "FILE" : "BAM"
-out_vcf = params.out_vcf ? params.out_vcf : "all_variants.vcf"
-
-/* manage input positions to call (bed or region or whole-genome) */
-if(params.region){
- input_region = 'region'
- } else if (params.bed){
- input_region = 'bed'
- bed = file(params.bed)
- } else {
- input_region = 'whole_genome'
- }
/* Software information */
-
log.info ''
-log.info '-------------------------------------------------------'
-log.info 'NEEDLESTACK v0.3: A MULTI-SAMPLE SOMATIC VARIANT CALLER'
-log.info '-------------------------------------------------------'
-log.info 'Copyright (C) 2015 Matthieu Foll and Tiffany Delhomme'
+log.info '--------------------------------------------------------'
+log.info 'NEEDLESTACK v1.0: A MULTI-SAMPLE SOMATIC VARIANT CALLER'
+log.info '--------------------------------------------------------'
+log.info 'Copyright (C) 2017 Matthieu Foll, Tiffany Delhomme and Nicolas Alcala'
log.info 'This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.txt'
log.info 'This is free software, and you are welcome to redistribute it'
log.info 'under certain conditions; see LICENSE.txt for details.'
-log.info '-------------------------------------------------------'
-log.info "Input BAM folder (--bam_folder) : ${params.bam_folder}"
-log.info "Reference in fasta format (--fasta_ref) : ${params.fasta_ref}"
-log.info "Intervals for calling (--bed) : ${input_region}"
-log.info "Number of regions to split (--nsplit) : ${params.nsplit}"
+log.info '--------------------------------------------------------'
+log.info(params.pairs_file == "FALSE" ? "Perform a tumor-normal somatic variant calling (--pairs_file) : no" : "Perform a tumor-normal somatic variant calling (--pairs_file) : yes (file ${params.pairs_file})" )
log.info "To consider a site for calling:"
-log.info " minimum coverage (--min_dp) : ${params.min_dp}"
+log.info " minimum median coverage (--min_dp) : ${params.min_dp}"
log.info " minimum of alternative reads (--min_ao) : ${params.min_ao}"
log.info "Phred-scale qvalue threshold (--min_qval) : ${params.min_qval}"
-log.info "Strand bias measure (--sb_type) : ${params.sb_type}"
-log.info "Strand bias threshold for SNVs (--sb_snv) : ${params.sb_snv}"
-log.info "Strand bias threshold for indels (--sb_indel) : ${params.sb_indel}"
-log.info "Samtools minimum mapping quality (--map_qual) : ${params.map_qual}"
-log.info "Samtools minimum base quality (--base_qual) : ${params.base_qual}"
-log.info "Samtools maximum coverage before downsampling (--max_DP) : ${params.max_DP}"
-log.info "Sample names definition (--use_file_name) : ${sample_names}"
-log.info(params.all_SNVs == true ? "Output all SNVs (--all_SNVs) : yes" : "Output all SNVs (--all_SNVs) : no" )
log.info(params.no_plots == true ? "PDF regression plots (--no_plots) : no" : "PDF regression plots (--no_plots) : yes" )
log.info(params.no_labels == true ? "Labeling outliers in regression plots (--no_labels) : no" : "Labeling outliers in regression plots (--no_labels) : yes" )
log.info(params.no_contours == true ? "Add contours in plots and plot min(AF)~DP (--no_contours) : no" : "Add contours in plots and plot min(AF)~DP (--no_contours) : yes" )
-log.info(params.no_indels == true ? "Skip indels (--no_indels) : yes" : "Skip indels (--no_indels) : no" )
-log.info "output folder (--out_folder) : ${params.out_folder}"
-log.info "\n"
-
-bam = Channel.fromPath( params.bam_folder+'/*.bam' ).toList()
-bai = Channel.fromPath( params.bam_folder+'/*.bam.bai' ).toList()
-
-/* Building the bed file where calling would be done */
-process bed {
- output:
- file "temp.bed" into outbed
-
- script:
- if (input_region == 'region')
- """
- echo $params.region | sed -e 's/[:|-]/\t/g' > temp.bed
- """
-
- else if (input_region == 'bed')
- """
- ln -s $bed temp.bed
- """
-
- else if (input_region == 'whole_genome')
- """
- cat $fasta_ref_fai | awk '{print \$1"\t"0"\t"\$2 }' > temp.bed
- """
-}
+if(params.input_vcf) {
+
+ params.out_folder = "annotated_vcf"
+ params.chunk_size = 10000
+ input_vcf = file(params.input_vcf)
+ params.out_annotated_vcf = null
+ out_annotated_vcf = params.out_annotated_vcf ? params.out_annotated_vcf : "annotated.vcf"
+ assert params.extra_robust_gl in [true,false] : "do not assign a value to --extra_robust_gl"
+
+ log.info "Number of vcf chunks for parallel computing (--nsplit) : ${params.nsplit}"
+ log.info "Size of read chunks by VariantAnnotation (--chunk_size) : ${params.chunk_size}"
+ log.info "Input vcf for annotation by needlestack (--input_vcf) : ${params.input_vcf}"
+ log.info "Output annotated file (--out_annotated_vcf) : ${out_annotated_vcf}"
+ log.info(params.extra_robust_gl == true ? "Perform an extra-robust regression (--extra_robust_gl) : yes" : "Perform an extra-robust regression (--extra_robust_gl) : no" )
+ log.info "output folder (--out_folder) : ${params.out_folder}"
+
+ log.info "\n"
+
+ process split_vcf {
+
+ input:
+ file input_vcf
+
+ output:
+ file 'split*' into splitted_vcf mode flatten
+
+ shell:
+ '''
+ zcat !{input_vcf} | sed '/^#CHROM/q' | grep -v "" > header
+ ((nb_total_lines= $((`zcat !{input_vcf} | wc -l`)) ))
+ ((core_lines = $nb_total_lines - $((`cat header | wc -l`)) ))
+ ((lines_per_file = ( $core_lines + !{params.nsplit} - 1) / !{params.nsplit}))
+ ((start=( $((`cat header | wc -l`)) +1 ) ))
+
+ # this works only with split (coreutils) version > 8.13 but is much faster
+ zcat !{input_vcf} | tail -n+$start | split -l $lines_per_file -a 10 --filter='{ cat header; cat; } | bgzip > $FILE.gz' - split_
+
+ ## slower but does not require any specific version of split
+ #for i in `seq 1 !{params.nsplit}`;
+ # do
+ # if ((start < nb_total_lines)); then
+ # { cat header && zcat !{input_vcf} | tail -n+$start | head -n$lines_per_file ; } | bgzip > split${i}.vcf.gz
+ # ((start=start+lines_per_file))
+ # fi
+ #done
+ '''
+ }
-/* split bed file into nsplit regions */
-process split_bed {
+ process annotate_vcf {
- input:
- file bed from outbed
+ if(!params.no_plots) {
+ publishDir params.out_folder+'/PDF/', mode: 'move', pattern: '*.pdf'
+ }
- output:
- file '*_regions' into split_bed mode flatten
+ input:
+ file svcf from splitted_vcf
- shell:
- '''
- grep -v '^track' !{bed} | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk '{print $1" "$2" "$3}' | bed_cut.r !{params.nsplit}
- '''
-}
+ output:
+ file '*.pdf' optional true into PDF
+ file '*.vcf' into annotated
-// create mpileup file + sed to have "*" when there is no coverage (otherwise pileup2baseindel.pl is unhappy)
-process samtools_mpileup {
-
- tag { region_tag }
-
- input:
- file split_bed
- file bam
- file bai
- file fasta_ref
- file fasta_ref_fai
- file fasta_ref_gzi
-
- output:
- set val(region_tag), file("${region_tag}.pileup") into pileup
-
- shell:
- region_tag = split_bed.baseName
- '''
- while read bed_line; do
- samtools mpileup --fasta-ref !{fasta_ref} --region $bed_line --ignore-RG --min-BQ !{params.base_qual} --min-MQ !{params.map_qual} --max-idepth 1000000 --max-depth !{params.max_DP} !{bam} | sed 's/ / * */g' >> !{region_tag}.pileup
- done < !{split_bed}
- '''
-}
+ shell:
+ '''
+ tabix -p vcf !{svcf}
+ Rscript !{baseDir}/bin/annotate_vcf.r --source_path=!{baseDir}/bin/ --input_vcf=!{svcf} --chunk_size=!{params.chunk_size} --do_plots=!{!params.no_plots} --plot_labels=!{!params.no_labels} --add_contours=!{!params.no_contours} --min_coverage=!{params.min_dp} --min_reads=!{params.min_ao} --GQ_threshold=!{params.min_qval} --extra_rob=!{params.extra_robust_gl}
+ '''
+ }
-// split mpileup file and convert to table
-process mpileup2table {
-
- tag { region_tag }
-
- input:
- set val(region_tag), file("${region_tag}.pileup") from pileup.filter { tag, file -> !file.isEmpty() }
- file bam
- val sample_names
-
- output:
- set val(region_tag), file('sample*.txt'), file('names.txt') into table
-
- shell:
- if ( params.no_indels ) {
- indel_par = "-no-indels"
- } else {
- indel_par = " "
- }
- '''
- nb_pos=$(wc -l < !{region_tag}.pileup)
- if [ $nb_pos -gt 0 ]; then
- # split and convert pileup file
- pileup2baseindel.pl -i !{region_tag}.pileup !{indel_par}
- # rename the output (the converter call files sample{i}.txt)
- i=1
- for cur_bam in !{bam}
- do
- if [ "!{sample_names}" == "FILE" ]; then
- # use bam file name as sample name
- bam_file_name="${cur_bam%.*}"
- # remove whitespaces from name
- SM="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
- else
- # extract sample name from bam file read group info field
- SM=$(samtools view -H $cur_bam | grep @RG | head -1 | sed "s/.*SM:\\([^\\t]*\\).*/\\1/" | tr -d '[:space:]')
- fi
- printf "sample$i\\t$SM\\n" >> names.txt
- i=$((i+1))
- done
- fi
- '''
-}
+ process merge_vcf {
+
+ publishDir params.out_folder, mode: 'move'
+
+ input:
+ val out_annotated_vcf
+ file all_vcf from annotated.toList()
+
+ output:
+ file "$out_annotated_vcf" into merged_vcf
+
+ shell:
+ '''
+ # Extract the header from the first VCF
+ grep '^#' !{all_vcf[0]} > !{out_annotated_vcf}
+
+ # Add version numbers in the VCF header just after fileformat line
+ echo '##NeedlestackCommand=!{workflow.commandLine}' > versions.txt
+ echo '##NeedlestackRepository=!{workflow.repository}' >> versions.txt
+ echo '##NeedlestackCommitId=!{workflow.commitId}' >> versions.txt
+ echo '##NeedlestackRevision=!{workflow.revision}' >> versions.txt
+ echo '##NeedlestackContainer=!{workflow.container}' >> versions.txt
+ echo '##nextflow=v!{workflow.nextflow.version}' >> versions.txt
+ echo '##Rscript='$(Rscript --version 2>&1) >> versions.txt
+ sed -i '/##fileformat=.*/ r versions.txt' !{out_annotated_vcf}
+
+ # this is only for the split_vcf process when using the split linux command that ensures files are in the right order
+ grep -h -v '^#' split_*.vcf >> !{out_annotated_vcf}
+ # this is for the slow version of the split_vcf process
+ #for i in `seq 1 !{params.nsplit}`;
+ # do
+ # grep -v '^#' split${i}_annotated_needlestack.vcf >> !{out_annotated_vcf}
+ # done
+ '''
+ }
-// perform regression in R
-process R_regression {
+} else {
+
+ params.out_folder = params.bam_folder // if not provided, outputs will be held on the input bam folder
+ assert (params.fasta_ref != true) && (params.fasta_ref != null) : "please specify --fasta_ref option (--fasta_ref reference.fasta(.gz))"
+ assert (params.bam_folder != true) && (params.bam_folder != null) : "please specify --bam_folder option (--bam_folder bamfolder)"
+
+ fasta_ref = file( params.fasta_ref )
+ fasta_ref_fai = file( params.fasta_ref+'.fai' )
+ fasta_ref_gzi = file( params.fasta_ref+'.gzi' )
+
+ /* Verify user inputs are correct */
+
+ assert params.sb_type in ["SOR","RVSB","FS"] : "--sb_type must be SOR, RVSB or FS "
+ assert params.all_SNVs in [true,false] : "do not assign a value to --all_SNVs"
+ assert params.extra_robust_gl in [true,false] : "do not assign a value to --extra_robust_gl"
+ assert params.no_plots in [true,false] : "do not assign a value to --no_plots"
+ assert params.no_indels in [true,false] : "do not assign a value to --no_indels"
+ assert params.use_file_name in [true,false] : "do not assign a value to --use_file_name"
+ if (params.bed) { try { assert file(params.bed).exists() : "\n WARNING : input bed file not located in execution directory" } catch (AssertionError e) { println e.getMessage() } }
+ try { assert fasta_ref.exists() : "\n WARNING : fasta reference not located in execution directory. Make sure reference index is in the same folder as fasta reference" } catch (AssertionError e) { println e.getMessage() }
+ if (fasta_ref.exists()) {assert fasta_ref_fai.exists() : "input fasta reference does not seem to have a .fai index (use samtools faidx)"}
+ if (fasta_ref.exists() && params.fasta_ref.tokenize('.')[-1] == 'gz') {assert fasta_ref_gzi.exists() : "input gz fasta reference does not seem to have a .gzi index (use samtools faidx)"}
+ try { assert file(params.bam_folder).exists() : "\n WARNING : input BAM folder not located in execution directory" } catch (AssertionError e) { println e.getMessage() }
+ assert file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() > 0 : "BAM folder contains no BAM"
+ if (file(params.bam_folder).exists()) {
+ if (file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() < 10) { println "\n ERROR : BAM folder contains less than 10 BAM, exit."; System.exit(0) }
+ else if (file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.size() < 20) { println "\n WARNING : BAM folder contains less than 20 BAM, method accuracy not warranted." }
+ bamID = file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam/ }.collect { it.getName() }.collect { it.replace('.bam','') }
+ baiID = file(params.bam_folder).listFiles().findAll { it.name ==~ /.*bam.bai/ }.collect { it.getName() }.collect { it.replace('.bam.bai','') }
+ assert baiID.containsAll(bamID) : "check that every bam file has an index (.bam.bai)"
+ }
+ assert (params.min_dp >= 0) : "minimum coverage must be higher than or equal to 0 (--min_dp)"
+ assert (params.max_DP > 1) : "maximum coverage before downsampling must be higher than 1 (--max_DP)"
+ assert (params.min_ao >= 0) : "minimum alternative reads must be higher than or equal to 0 (--min_ao)"
+ assert (params.nsplit > 0) : "number of regions to split must be higher than 0 (--nsplit)"
+ assert (params.min_qval >= 0) : "minimum Phred-scale qvalue must be higher than or equal to 0 (--min_qval)"
+ assert ( (params.power_min_af > 0 && params.power_min_af <= 1) || params.power_min_af == -1 ) : "minimum allelic fraction for power computations must be in [0,1] (--power_min_af)"
+ assert (params.sigma_normal >= 0) : " sigma parameter for negative binomial must be positive (--sigma_normal)"
+ if(params.sb_type in ["SOR", "RVSB"] ) {
+ assert (params.sb_snv > 0 && params.sb_snv < 101) : "strand bias (SOR or RVSB) for SNVs must be in [0,100]"
+ assert (params.sb_indel > 0 && params.sb_indel < 101) : "strand bias (SOR or RVSB) for indels must be in [0,100]"
+ } else {
+ assert (params.sb_snv > 0 && params.sb_snv < 1001) : "strand bias (FS) for SNVs must be in [0,1000]"
+ assert (params.sb_indel > 0 && params.sb_indel < 1001) : "strand bias (FS) for indels must be in [0,1000]"
+ }
- publishDir params.out_folder+'/PDF/', mode: 'move', pattern: "*[ATCG-].pdf"
+ assert (params.map_qual >= 0) : "minimum mapping quality (samtools) must be higher than or equal to 0"
+ assert (params.base_qual >= 0) : "minimum base quality (samtools) must be higher than or equal to 0"
- tag { region_tag }
+ sample_names = params.use_file_name ? "FILE" : "BAM"
+ out_vcf = params.out_vcf ? params.out_vcf : "all_variants.vcf"
- input:
- set val(region_tag), file(table_file), file('names.txt') from table
- file fasta_ref
- file fasta_ref_fai
- file fasta_ref_gzi
+ /* manage input positions to call (bed or region or whole-genome) */
+ if (params.region) {
+ input_region = 'region'
+ } else if (params.bed) {
+ input_region = 'bed'
+ bed = file(params.bed)
+ } else {
+ input_region = 'whole_genome'
+ }
- output:
- file "${region_tag}.vcf" into vcf
- file '*.pdf' into PDF
+ log.info "Input BAM folder (--bam_folder) : ${params.bam_folder}"
+ log.info "output folder (--out_folder) : ${params.out_folder}"
+ log.info "output VCF (--out_vcf) : ${params.out_vcf}"
+ log.info "Reference in fasta format (--fasta_ref) : ${params.fasta_ref}"
+ log.info "Intervals for calling (--bed) : ${input_region}"
+ log.info "Number of regions to split (--nsplit) : ${params.nsplit}"
+ log.info "Strand bias measure (--sb_type) : ${params.sb_type}"
+ log.info "Strand bias threshold for SNVs (--sb_snv) : ${params.sb_snv}"
+ log.info "Strand bias threshold for indels (--sb_indel) : ${params.sb_indel}"
+ log.info "Minimum allelic fraction for power computations (--power_min_af): ${params.power_min_af}"
+ log.info "Sigma parameter for germline (--sigma) : ${params.sigma_normal}"
+ log.info "Samtools minimum mapping quality (--map_qual) : ${params.map_qual}"
+ log.info "Samtools minimum base quality (--base_qual) : ${params.base_qual}"
+ log.info "Samtools maximum coverage before downsampling (--max_DP) : ${params.max_DP}"
+ log.info "Sample names definition (--use_file_name) : ${sample_names}"
+ log.info(params.all_SNVs == true ? "Output all SNVs (--all_SNVs) : yes" : "Output all SNVs (--all_SNVs) : no" )
+ log.info(params.extra_robust_gl == true ? "Perform an extra-robust regression (--extra_robust_gl) : yes" : "Perform an extra-robust regression (--extra_robust_gl) : no" )
+ log.info(params.no_indels == true ? "Skip indels (--no_indels) : yes" : "Skip indels (--no_indels) : no" )
+ log.info "\n"
+
+ bam = Channel.fromPath( params.bam_folder+'/*.bam' ).toList()
+ bai = Channel.fromPath( params.bam_folder+'/*.bam.bai' ).toList()
+
+ /* Building the bed file where calling would be done */
+ process bed {
+ output:
+ file "temp.bed" into outbed
+
+ shell:
+ if (input_region == 'region')
+ '''
+ echo !{params.region} | sed -e 's/[:|-]/ /g' > temp.bed
+ '''
+
+ else if (input_region == 'bed')
+ '''
+ ln -s !{bed} temp.bed
+ '''
+
+ else if (input_region == 'whole_genome')
+ '''
+ cat !{fasta_ref_fai} | awk '{print $1" "0" "$2 }' > temp.bed
+ '''
+ }
- shell:
- '''
- # create a dummy empty pdf to avoid an error in the process when no variant is found
- touch !{region_tag}_empty.pdf
- needlestack.r --out_file=!{region_tag}.vcf --fasta_ref=!{fasta_ref} --GQ_threshold=!{params.min_qval} --min_coverage=!{params.min_dp} --min_reads=!{params.min_ao} --SB_type=!{params.sb_type} --SB_threshold_SNV=!{params.sb_snv} --SB_threshold_indel=!{params.sb_indel} --output_all_SNVs=!{params.all_SNVs} --do_plots=!{!params.no_plots} --plot_labels=!{!params.no_labels} --add_contours=!{!params.no_contours}
- '''
-}
-//PDF.flatten().filter { it.size() == 0 }.subscribe { it.delete() }
-
-// merge all vcf files in one big file
-vcf_list = vcf.toList()
-process collect_vcf_result {
-
- publishDir params.out_folder, mode: 'move'
-
- input:
- val out_vcf
- file '*.vcf' from vcf_list
- file fasta_ref_fai
-
- when:
- vcf_list.val.size()>0
-
- output:
- file "$out_vcf" into big_vcf
-
- shell:
- '''
- shopt -s dotglob
- shopt -s extglob
-
- # deal with the case of a single vcf (named .vcf instead of 1.vcf when multiple vcf are present)
- grep '^#' @(|1).vcf > header.txt
-
- # Add contigs in the VCF header
- cat !{fasta_ref_fai} | cut -f1,2 | sed -e 's/^/##contig=/' > contigs.txt
- sed -i '/##reference=.*/ r contigs.txt' header.txt
-
- # Add version numbers in the VCF header
- echo '##command=!{workflow.commandLine}' > versions.txt
- echo '##repository=!{workflow.repository}' >> versions.txt
- echo '##commitId=!{workflow.commitId}' >> versions.txt
- echo '##revision=!{workflow.revision}' >> versions.txt
- echo '##container=!{workflow.container}' >> versions.txt
- echo '##nextflow=v!{workflow.nextflow.version}' >> versions.txt
- echo '##samtools='$(samtools --version | tr '\n' ' ') >> versions.txt
- echo '##bedtools='$(bedtools --version) >> versions.txt
- echo '##Rscript='$(Rscript --version 2>&1) >> versions.txt
- echo '##perl=v'$(perl -e 'print substr($^V, 1)') >> versions.txt
- sed -i '/##source=.*/ r versions.txt' header.txt
-
- # Check if sort command allows sorting in natural order (chr1 chr2 chr10 instead of chr1 chr10 chr2)
- good_sort_version=$(sort --help | grep 'version-sort' | wc -l)
- if [ `sort --help | grep -c 'version-sort' ` == 0 ]
- then
- sort_ops="-k1,1d"
- else
- sort_ops="-k1,1V"
- fi
- # Add all VCF contents and sort
- grep --no-filename -v '^#' *.vcf | LC_ALL=C sort -t ' ' $sort_ops -k2,2n >> header.txt
- mv header.txt !{out_vcf}
- '''
+
+ /* split bed file into nsplit regions */
+ process split_bed {
+
+ input:
+ file bed from outbed
+
+ output:
+ file '*_regions' into split_bed mode flatten
+
+ shell:
+ '''
+ grep -v '^track' !{bed} | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk '{print $1" "$2" "$3}' | bed_cut.r !{params.nsplit}
+ '''
+ }
+
+ // create mpileup file + sed to have "*" when there is no coverage (otherwise pileup2baseindel.pl is unhappy)
+ process samtools_mpileup {
+
+ tag { region_tag }
+
+ input:
+ file split_bed
+ file 'BAM/*' from bam
+ file 'BAM/*' from bai
+ file fasta_ref
+ file fasta_ref_fai
+ file fasta_ref_gzi
+
+ output:
+ set val(region_tag), file("${region_tag}.pileup") into pileup
+
+ shell:
+ region_tag = split_bed.baseName
+ '''
+ set -o pipefail
+ while read bed_line; do
+ samtools mpileup --fasta-ref !{fasta_ref} --region $bed_line --ignore-RG --min-BQ !{params.base_qual} --min-MQ !{params.map_qual} --max-idepth 1000000 --max-depth !{params.max_DP} BAM/*.bam | sed 's/ / * */g' >> !{region_tag}.pileup
+ done < !{split_bed}
+ '''
+ }
+
+ // split mpileup file and convert to table
+ process mpileup2table {
+
+ tag { region_tag }
+
+ input:
+ set val(region_tag), file("${region_tag}.pileup") from pileup.filter { tag, file -> !file.isEmpty() }
+ file 'BAM/*' from bam
+ val sample_names
+
+ output:
+ set val(region_tag), file('TABLE/sample*.txt'), file('names.txt') into table
+
+ shell:
+ if ( params.no_indels ) {
+ indel_par = "-no-indels"
+ } else {
+ indel_par = " "
+ }
+ '''
+ nb_pos=$(wc -l < !{region_tag}.pileup)
+ if [ $nb_pos -gt 0 ]; then
+ # split and convert pileup file
+ pileup2baseindel.pl -i !{region_tag}.pileup !{indel_par}
+ mkdir TABLE
+ mv sample*.txt TABLE
+ i=1
+ for cur_bam in BAM/*.bam
+ do
+ if [ "!{sample_names}" == "FILE" ]; then
+ # use bam file name as sample name
+ bam_file_name=$(basename "${cur_bam%.*}")
+ # remove whitespaces from name
+ SM="$(echo -e "${bam_file_name}" | tr -d '[[:space:]]')"
+ else
+ # extract sample name from bam file read group info field
+ SM=$(samtools view -H $cur_bam | grep "^@RG" | tail -n1 | sed "s/.*SM:\\([^ ]*\\).*/\\1/" | tr -d '[:space:]')
+ fi
+ printf "sample$i $SM\\n" >> names.txt
+ i=$((i+1))
+ done
+ fi
+ '''
+ }
+
+ // perform regression in R
+ process R_regression {
+
+ if(!params.no_plots) {
+ publishDir params.out_folder+'/PDF/', mode: 'move', pattern: '*.pdf'
+ }
+
+ tag { region_tag }
+
+ input:
+ set val(region_tag), file('TABLE/*'), file('names.txt') from table
+ file fasta_ref
+ file fasta_ref_fai
+ file fasta_ref_gzi
+ file pairs_file
+
+ output:
+ file "${region_tag}.vcf" into vcf
+ file '*.pdf' optional true into PDF
+
+ shell:
+ '''
+ needlestack.r --pairs_file=!{params.pairs_file} --source_path=!{baseDir}/bin/ --out_file=!{region_tag}.vcf --fasta_ref=!{fasta_ref} --GQ_threshold=!{params.min_qval} --min_coverage=!{params.min_dp} --min_reads=!{params.min_ao} --SB_type=!{params.sb_type} --SB_threshold_SNV=!{params.sb_snv} --SB_threshold_indel=!{params.sb_indel} --output_all_SNVs=!{params.all_SNVs} --do_plots=!{!params.no_plots} --plot_labels=!{!params.no_labels} --add_contours=!{!params.no_contours} --extra_rob=!{params.extra_robust_gl} --afmin_power=!{params.power_min_af} --sigma=!{params.sigma_normal}
+ '''
+ }
+
+ // merge all vcf files in one big file
+ process collect_vcf_result {
+
+ publishDir params.out_folder, mode: 'move'
+
+ input:
+ val out_vcf
+ file all_vcf from vcf.toList()
+ file fasta_ref_fai
+
+ output:
+ file "$out_vcf" into big_vcf
+
+ when:
+ !all_vcf.empty
+
+ shell:
+ '''
+ mkdir VCF
+ mv *.vcf VCF
+ # Extract the header from the first VCF
+ sed '/^#CHROM/q' VCF/!{all_vcf[0]} > header.txt
+
+ # Add contigs in the VCF header
+ cat !{fasta_ref_fai} | cut -f1,2 | sed -e 's/^/##contig=/' > contigs.txt
+ sed -i '/##reference=.*/ r contigs.txt' header.txt
+
+ # Add version numbers in the VCF header
+ echo '##command=!{workflow.commandLine}' > versions.txt
+ echo '##repository=!{workflow.repository}' >> versions.txt
+ echo '##commitId=!{workflow.commitId}' >> versions.txt
+ echo '##revision=!{workflow.revision}' >> versions.txt
+ echo '##container=!{workflow.container}' >> versions.txt
+ echo '##nextflow=v!{workflow.nextflow.version}' >> versions.txt
+ echo '##samtools='$(samtools --version | tr '\n' ' ') >> versions.txt
+ echo '##bedtools='$(bedtools --version) >> versions.txt
+ echo '##Rscript='$(Rscript --version 2>&1) >> versions.txt
+ echo '##perl=v'$(perl -e 'print substr($^V, 1)') >> versions.txt
+ sed -i '/##source=.*/ r versions.txt' header.txt
+
+ # Check if sort command allows sorting in natural order (chr1 chr2 chr10 instead of chr1 chr10 chr2)
+ if [ `sort --help | grep -c 'version-sort' ` == 0 ]
+ then
+ sort_ops="-k1,1d"
+ else
+ sort_ops="-k1,1V"
+ fi
+ # Add all VCF contents and sort
+ grep --no-filename -v '^#' VCF/*.vcf | LC_ALL=C sort -t ' ' $sort_ops -k2,2n >> header.txt
+ mv header.txt !{out_vcf}
+ '''
+ }
}
diff --git a/nextflow.config b/nextflow.config
index a004bc7..1e861cf 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -4,7 +4,7 @@ manifest {
mainScript = 'needlestack.nf'
}
-process.container = 'iarcbioinfo/needlestack:v0.3'
+process.container = 'iarcbioinfo/needlestack:v1.0'
timeline.enabled = true
trace.enabled = true