diff --git a/CMakeLists.txt b/CMakeLists.txt index 99ccc545..8b08f3a8 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -40,7 +40,6 @@ SET(PROJ_URL "https://github.com/denovogear/denovogear") #SET(PROJ_VERSION_REV 0) ## Detect Project Version Information - FIND_PACKAGE(Git) IF(GIT_FOUND) EXECUTE_PROCESS(COMMAND ${GIT_EXECUTABLE} describe --tags diff --git a/README.md b/README.md index 20ae5905..fe30ba97 100644 --- a/README.md +++ b/README.md @@ -3,16 +3,20 @@ Authors: Don Conrad, Avinash Ramu ( aramu at genetics dot wustl dot edu ) and Reed A. Cartwright. ## RELEASE NOTES +v1.0 +made changes to indel_mrate parameter +better indenting +mu_scale scales indel mutation rate linearly + v0.5.4 added GPL v3 -updated output fields for indels, snps to be the same +updated output fields for indels, snps to be the same v0.5.3 removed 'X' allele in VCF op. VCF can be indexed by Tabix, IGVTools and used in Annovar. added region based denovo calling on BCF files, invoked with --region flag added vcf parser for denovo calling, invoked with --vcf flag - v0.5.2 Added read-depth, posterior-probability filters. Output number of sites in the BCF and number of sites passing filters. @@ -21,13 +25,13 @@ Modified paired caller output. v0.5.1 Fixed bug in triallelic configuration. Some trialleic denovo configurations were being called incorrectly. -## COMPILING +## COMPILING Compilation of DeNovoGear requires CMake. You can download CMake installers from the CMake website . Most Linux distributions allow you to install CMake using their package software. Compiling and Installing on Linux: - `tar xvzf denovogear*.tar.gz` + `tar -xvzf denovogear*.tar.gz` `cd denovogear*/build` `cmake ..` `make` @@ -38,7 +42,7 @@ Creating Packages: `make package_source` ## BINARIES -Binaries are available for MacOSX and for Linux. +Binaries are available for MacOSX and for Linux. Please see http://sourceforge.net/projects/denovogear/files/Executables/ ## RUNNING THE CODE @@ -47,49 +51,51 @@ Binaries are available for MacOSX and for Linux. DeNovoGear takes in a PED file and a BCF file as input. The PED file describes the relationship between the samples and the BCF file contains the sequencing information for every locus. #### usage: - `./denovogear auto dnm --ped sample.ped --bcf sample.bcf` + `./denovogear dnm auto --ped sample.ped --bcf sample.bcf` + + If you would rather start with the BAM files and have samtools installed, this would work, + `samtools mpileup -gDf hg19.fa s1.bam s2.bam s3.bam | ./denovogear dnm auto --ped sample.ped --bcf -` ##### about sample.bcf: -BCF files can be generated from the alignment using the samtools mpileup +BCF files can be generated from the alignment using the samtools mpileup command. The command to generate a bcf file from sample.bam is: `samtools mpileup -gDf reference.fa sample.bam > sample.bcf` - -The -D option of the samtools mpileup command retains the per-sample read depth +The -D option of the samtools mpileup command retains the per-sample read depth which is preferred by denovogear as it helps to filter out sites without a minimum number of reads(but note that DNG will work without per-sample RD information, in which case the RD tag encodes the average read depth information). The -g option computes genotype likelihoods and produces a compressed bcf output and the -f option is used to indicate the reference fasta file against which the alignment was built. A sample BCF file 'sample_CEU.bcf' is included in the distribution. -##### about sample.ped: -The PED file contains information about the trios present in the BCF file. -Please make sure that all the members of the trios specified in the PED file -are present in the BCF file. The PED file can be used to specify a subset of -individuals for analysis from the BCF (in other words not every sample in the BCF needs to +##### about sample.ped: +The PED file contains information about the trios present in the BCF file. +Please make sure that all the members of the trios specified in the PED file +are present in the BCF file. The PED file can be used to specify a subset of +individuals for analysis from the BCF (in other words not every sample in the BCF needs to be represented in the PED file). -The PED file is a tab delimited file. The first six columns of the PED file are -mandatory, these are Family ID, Individual ID, Paternal ID, Maternal ID, -Sex (1 = male; 2 = female; other = unknown) and Phenotype. Denovogear makes use of the first four columns. The sample ID's in the PED file need to be exactly as they appear in the BCF file header. Sample order within the PED file does not matter, as family relationships are completely specified by the value of the child/mother/father fields in each row. - +The PED file is a tab delimited file. The first six columns of the PED file are +mandatory, these are Family ID, Individual ID, Paternal ID, Maternal ID, +Sex (1 = male; 2 = female; other = unknown) and Phenotype. Denovogear makes use of the first four columns. The sample IDs in the PED file need to be exactly as they appear in the BCF file header. Sample order within the PED file does not matter, as family relationships are completely specified by the value of the child/mother/father fields in each row. + For example, a single line in the PED file that specifies a trio looks like: CEU NA12878_vald-sorted.bam.bam NA12891_vald-sorted.bam.bam NA12892_vald-sorted.bam.bam 2 2 -An example PED file, sample_CEU.ped, is included in the distribution directory. +An example PED file, sample_CEU.ped, is included in the distribution directory. -##### about "snp_lookup.txt" and "indel_lookup.txt": -These are tables with precomputed priors (and other useful numbers) for all possible -trio configurations, under the null (no mutation present) and alternative (true de novo). -The default tables are generated during each program run using a prior of -1 x 10 ^-8 /bp/generation on the haploid germline point mutation rate, -and 1 x 10 ^-9 /bp/generation on the haploid germline indel mutation rate. +##### about "snp_lookup.txt" and "indel_lookup.txt": +These are tables with precomputed priors (and other useful numbers) for all possible +trio configurations, under the null (no mutation present) and alternative (true de novo). +The default tables are generated during each program run using a prior of +1 x 10 ^-8 /bp/generation on the haploid germline point mutation rate, +and 1 x 10 ^-9 /bp/generation on the haploid germline indel mutation rate. -If you wish to change the default point or indel mutation rates use the --snp_mrate -or --indel_mrate switches respectively. +If you wish to change the default point or indel mutation rates use the --snp_mrate +or --indel_mrate switches respectively. For example `./denovogear dnm auto --ped sample.ped --bcf sample.bcf --snp_mrate 2e-10 --indel_mrate 1e-11` [OR] - `./denovogear dnm auto --ped sample.ped --vcf sample.vcf --snp_mrate 2e-10 --indel_mrate 1e-11` + `./denovogear dnm auto --ped sample.ped --vcf sample.vcf --snp_mrate 2e-10 --indel_mrate 1e-11` -The indel mutation rate prior is calculated based on the length of the insertion or deletion event, +The indel mutation rate prior is calculated based on the length of the insertion or deletion event, separate models are used for insertions and deletions. The two models are based on the indel observations from the 1000Genomes phase 1 data. The insertion mutation rate is modeled using the function @@ -98,13 +104,13 @@ The insertion mutation rate is modeled using the function The deletion mutation rate is modeled using the function log (mrate) = mu_scale * (-21.9313 - (0.2856 * deletionLength)) -Note that a constant factor is used to scale the mutation rate, it is set to 1.0 +Note that a constant factor is used to scale the mutation rate, it is set to 1.0 by default and can be set using the switch --mu_scale. This provides the users to scale the mutation rate prior according to their data-set. -For example, - `./denovogear dnm auto --ped sample.ped --bcf sample.bcf --mu_scale 3` +For example, + `./denovogear dnm auto --ped sample.ped --bcf sample.bcf --mu_scale 3` [OR] - `./denovogear dnm auto --ped sample.ped --vcf sample.vcf --mu_scale 3` + `./denovogear dnm auto --ped sample.ped --vcf sample.vcf --mu_scale 3` #### OUTPUT FORMAT for TRIOS @@ -113,12 +119,12 @@ The output format is a single row for each putative de novo mutation (DNM), with 1. Event type (POINT MUTATION or INDEL). 2. CHILD_ID - sample ID of trio-offspring with the DNM. - 3. chr - chromosome. - 4. pos - physical Position. + 3. chr - chromosome. + 4. pos - physical Position. 5. ref - base present in reference sequence at this position (from BCF). 6. alt - comma separated list of alternate non-reference alleles called on at-least one sample (from BCF). 7. maxlike_null - likelihood of the most likely mendelian-compatible genotype configuration. - 8. pp_null - posterior probability of the most likely mendelian-compatible genotype configuration. + 8. pp_null - posterior probability of the most likely mendelian-compatible genotype configuration. 9. tgt_null(child/mom/dad) - genotypes of the most likely mendelian-compatible configuration. 10. snpcode - code that indicates whether the configuration shown in field 6 is monomorphic (1) or contains variation (2)(internal filters, can be ignored). 11. code - This field seems to be redundant to field 7, except the codes are (6) and (9).(internal filters, can be ignored). @@ -127,10 +133,10 @@ The output format is a single row for each putative de novo mutation (DNM), with 14. tgt_dnm(child/mom/dad) - genotypes of the most likely mendelian-compatible configuration. 15. lookup - Code that indicates if the most likely DNM is a transition (4) or transversion (5) (for development use only). 16. flag - Flag that indicates whether the data for the site passed internal QC thresholds (for development use only). - 17-19. Read depth of child, parent 1 and parent 2. + 17-19. Read depth of child, parent 1 and parent 2. 20-22. Root mean square of the mapping qualities of all reads mapping to the site for child, parent 1 and parent 2. Currently these values are the same for all samples when using BCF as the input format. -Fields 17-22 are meant for filtering out low quality sites. +Fields 17-22 are meant for filtering out low quality sites. ### Separate models for the X chromosome @@ -139,34 +145,34 @@ Denovogear has separate models for autosomes, X chromosome in male offspring and #### Autosomes model usage `./denovogear dnm auto --ped sample.ped --bcf sample.bcf` - [OR] + [OR] `./denovogear dnm auto --ped sample.ped --vcf sample.vcf` -#### X in male offspring model usage +#### X in male offspring model usage `./denovogear dnm XS --ped sample.ped --bcf sample.bcf --region X` - [OR] + [OR] `./denovogear dnm XS --ped sample.ped --vcf sample.X.vcf` (only BCF allows for random access) -#### X in female offspring model usage +#### X in female offspring model usage `./denovogear dnm XD --ped sample.ped --bcf sample.bcf --region X` - [OR] - `./denovogear dnm XD --ped sample.ped --vcf sample.X.vcf` + [OR] + `./denovogear dnm XD --ped sample.ped --vcf sample.X.vcf` -### PAIRED SAMPLE ANALYSIS -DNG can be used to analyze paired samples, it is run the same way as for trios the only difference being the way samples are specified in the PED file, +### PAIRED SAMPLE ANALYSIS +DNG can be used to analyze paired samples for example to call somatic mutations between tumor/matched-normal pairs, the main difference in how to run the program is the way samples are specified in the PED file(see below), #### Usage: - + `./denovogear dnm auto --ped paired.ped --bcf sample.bcf` - [OR] + [OR] `./denovogear dnm auto --ped paired.ped --vcf sample.vcf` -About the arguments, - +About the arguments, + 1. paired.ped is a ped file containing the family-name and the name of the two samples in the bcf file. The last three columns are mandated by the PED format but are ignored by the program. An example line looks like YRI NA19240_blood_vald-sorted.bam.bam NA19240_vald-sorted.bam.bam 0 0 0 - 2. sample.bcf[vcf] is a BCF[VCF] file containing both the samples. + 2. sample.bcf[vcf] is a BCF[VCF] file containing both the samples. A sample PED file sample_paired.ped for paired sample analysis is provided with the package. @@ -176,36 +182,40 @@ The output format is a single row for each putative paired denovo mutation(DNM), 1. Event type (POINT MUTATION or INDEL). 2. TUMOR_ID - sample ID of the 'TUMOR' sample. 3. NORMAL_ID - sample ID of the 'NORMAL' sample. - 4. chr - chromosome. - 5. pos - physical Position. + 4. chr - chromosome. + 5. pos - physical Position. 6. ref - base present in reference sequence at this position (from BCF). 7. alt - comma separated list of alternate non-reference alleles called on at-least one sample (from BCF). 8. maxlike_null - likelihood of the most likely compatible genotype configuration. - 9. pp_null - posterior probability of the most likely compatible genotype configuration. + 9. pp_null - posterior probability of the most likely compatible genotype configuration. 10. tgt_null(normal/tumor) - genotypes of the most likely compatible configuration. 11. maxlike_dnm - likelihood of the most likely denovo genotype configuration. - 12. pp_dnm - posterior probability of the most likely denovo genotype configuration, a value closer to 1 indicates strong evidence of a denovo event. This is the field that is used to rank the calls. + 12. pp_dnm - posterior probability of the most likely denovo genotype configuration, a value closer to 1 indicates strong evidence of a denovo event. This is the field that is used to rank the calls. 13. tgt_dnm(normal/tumor) - genotypes of the most likely denovo configuration. - 14-15. Read depth of tumor, normal samples. + 14-15. Read depth of tumor, normal samples. 16-17. Root mean square of the mapping qualities of all reads mapping to the site for tumor, normal samples. 18-19. null_snpcode, dnm_snpcode - snpcode is a field used to classify the genotype configurations for the null and alternate case. 1 stands for hom/hom, 2 stands for het/hom, 3 stands for hom/het and 4 stands for het/het. -### PHASER -DNG can be used to obtain parental phasing information for Denovo Mutations where phase informative sites are present. This is done by looking at reads which cover both the denovo base and a phase informative positions. Phase informative positions lie within a certain window from the denovo site, the default value is 1000 but it can be set by the user. The phaser is under development, please do let us know if you find any bugs. +### PHASER +DNG can be used to obtain parental phasing information for Denovo Mutations where phase informative sites are present. This is done by looking at reads which cover both the denovo base and a phase informative positions. Phase informative positions are SNP positions that lie within a certain window from the denovo site, the default window size is 1000 bp but the window-size can be set by the user. + For each DNM, a list of phase informative sites from the genotypes file is obtained, these are the loci which lie within the phasing window. Also, these sites do not have a het/het GT configuration for the parents and where the child is het. The number of DNM and phasing allele combinations seen in the read level data is output by the program. For the phasing sites, the inferred parental origin is displayed. +For example if the base at the phasing site is T and the parental genotypes are P1:CC and p2:TC at this site, then the parent of origin of this base is p2. By looking at the base in the denovo position on this read it is possible to infer the parent of origin of the denovo mutation. + +A sample list of dnms file, sample_phasing_dnm_f is provided. Also a sample genotypes file sample_phasing_GTs_f is provided for reference. #### Usage: - `./denovogear phaser --dnm dnm_f --pgt gts_file --bam alignment --window 1000` + `./denovogear phaser --dnm dnm_f --pgt gts_file --bam alignment.bam --window 1000` -About the arguments, +About the arguments, 1. "dnm" is the list of DNMs whose parental origin is to be determined. It is a tab delimited file of the format chr pos inherited_base mutant_base 2. "pgt" contains the genotypes of the child and the parents at SNP sites. It is a tab delimited file of the format chr pos child_GT parent1_GT parent2_GT - 3. "bam" is the alignment file (.bam) of the child. - 4. "window" is an optional argument which is the maximum distance between the DNM and a phasing site. The default value is 1000. + 3. "bam" is the alignment file (.bam) of the child. + 4. "window" is an optional argument which is the maximum distance between the DNM and a phasing site. The default value is 1000. #### Output DNM_pos 1:182974758 INHERITED G VARIANT A @@ -226,14 +236,7 @@ About the arguments, Base at DNM position: A Base at phasing position: A INFERRED PARENT OF ORIGIN for DNM: p1 SUPPORTING READ COUNT: 45 Base at DNM position: G Base at phasing position: T INFERRED PARENT OF ORIGIN for DNM: p1 SUPPORTING READ COUNT: 47 - - For each DNM, a list of of eligible phasing sites from the genotypes file is obtained, these are the loci which lie within the phasing window and which do not have a het/het GT configuration for the parents and where the child is het. The number of DNM and phasing allele combinations seen in the read level data is output by the program. For the phasing sites, the inferred parental origin is displayed. - -For example if the base at the phasing site is T and the parental genotypes are P1:CC and p2:TC at this site, then the parent of origin of this base is p2. By looking at the base in the denovo position on this read it is possible to infer the parent of origin of the denovo mutation. - -A sample list of dnms file, sample_phasing_dnm_f is provided. Also a sample genotypes file sample_phasing_GTs_f is provided for reference. - -Please feel free to contact the authors about any concerns/comments. +Please feel free to contact the authors about any concerns/comments. ###General options for TRIOs and Paired Sample Calling --snp_mrate: Mutation rate prior for SNPs. [1e-8]