
3. Parameters and Options

Inês Mendes edited this page Oct 23, 2019 · 2 revisions

To make the execution of the DEN-IM workflow as simple as possible, a set of default parameters and directives is provided, but these can be easily altered by either editing the params.config file, or by passing the new value when executing the workflow with nextflow.
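As a rough illustration of the second route, the sketch below assembles a command line that overrides two defaults. The helper name build_command and the override values are hypothetical; the only assumption about Nextflow is that pipeline parameters are passed as --name value pairs.

```python
import shlex


def build_command(overrides, pipeline="DEN-IM.nf"):
    """Return the nextflow invocation as a list of arguments (hypothetical helper)."""
    args = ["nextflow", "run", pipeline]
    for name, value in overrides.items():
        # Each workflow parameter is passed as --<name> <value>.
        args += [f"--{name}", str(value)]
    return args


# Illustrative values only; any parameter listed in params.config could be used.
cmd = build_command({"minCoverage": 15, "trimMinLength": 60})
print(shlex.join(cmd))
# nextflow run DEN-IM.nf --minCoverage 15 --trimMinLength 60
```

Values passed this way apply to a single run, whereas editing params.config changes the defaults for every run.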

The list of editable parameters can also be checked with the command nextflow run DEN-IM.nf --help (or nextflow run B-UMMI/DEN-IM --help if run remotely).

nextflow run B-UMMI/DEN-IM --help 
N E X T F L O W  ~  version 0.32.0
Pulling B-UMMI/DEN-IM ...
 downloaded from https://github.com/B-UMMI/DEN-IM.git
Launching `B-UMMI/DEN-IM` [evil_euler] - revision: 66bd5d266c [master]

============================================================
                 D E N - I M
============================================================


Usage: 
    nextflow run DEN-IM.nf

   --fastq                     Path expression to paired-end fastq files. (default:fastq/*_{1,2}.*) 
   --genomeSize                Genome size estimate for the samples in Mb. It is used to estimate the coverage and other assembly parameters andchecks (integrity_coverage;check_coverage;assembly_mapping)
   --minCoverage               Minimum coverage for a sample to proceed. By default it's setto 0 to allow any coverage (integrity_coverage;check_coverage)
   --adapters                  Path to adapters files, if any. (fastqc_trimmomatic)
   --trimSlidingWindow         Perform sliding window trimming, cutting once the average quality within the window falls below a threshold. (fastqc_trimmomatic)
   --trimLeading               Cut bases off the start of a read, if below a threshold quality. (fastqc_trimmomatic)
   --trimTrailing              Cut bases of the end of a read, if below a threshold quality. (fastqc_trimmomatic)
   --trimMinLength             Drop the read if it is below a specified length. (fastqc_trimmomatic)
   --clearInput                Permanently removes temporary input files. This option is only useful to remove temporary files in large workflows and prevents nextflow's resume functionality. Use with caution. (fastqc_trimmomatic;filter_poly;bowtie;retrieve_mapped;viral_assembly;pilon)
   --pattern                   Pattern to filter the reads. Please separate parametervalues with a space and separate new parameter sets with semicolon (;). Parameters are defined by two values: the pattern (any combination of the letters ATCGN), and the number of repeats or percentage of occurrence. (filter_poly)
   --reference                 Specifies the reference genome to be provided to bowtie2-build. (bowtie)
   --index                     Specifies the reference indexes to be provided to bowtie2. (bowtie)
   --minimumContigSize         Expected genome size in bases (viral_assembly)
   --spadesMinCoverage         The minimum number of reads to consider an edge in the de Bruijn graph during the assembly (viral_assembly)
   --spadesMinKmerCoverage     Minimum contigs K-mer coverage. After assembly only keep contigs with reported k-mer coverage equal or above this value (viral_assembly)
   --spadesKmers               If 'auto' the SPAdes k-mer lengths will be determined from the maximum read length of each assembly. If 'default', SPAdes will use the default k-mer lengths.  (viral_assembly)
   --megahitKmers              If 'auto' the megahit k-mer lengths will be determined from the maximum read length of each assembly. If 'default', megahit will use the default k-mer lengths. (default: auto) (viral_assembly)
   --minAssemblyCoverage       In auto, the default minimum coverage for each assembled contig is 1/3 of the assembly mean coverage or 10x, if the mean coverage is below 10x (assembly_mapping)
   --AMaxContigs               A warning is issued if the number of contigs is overthis threshold. (assembly_mapping)
   --splitSize                 Minimum contig size (split_assembly)
   --typingReference           Typing database. (dengue_typing)
   --includeNCBI               Include NCBI DENV references in alignment. (mafft))
   --getGenome                 Retrieves the sequence of the closest reference. (dengue_typing)
   --substitutionModel         Substitution model. Option: GTRCAT, GTRCATI, ASC_GTRCAT, GTRGAMMA, ASC_GTRGAMMA etc  (raxml)
   --seedNumber                Specify an integer number (random seed) and turn on rapid bootstrapping (raxml)
   --bootstrap                 Specify the number of alternative runs on distinct starting trees (raxml)
   --simpleLabel               Simplify the labels in the newick tree (for interactive report only) (raxml)

When opening the params.config file we see the following:

params {

fastq = 'fastq/*_{1,2}.*'
genomeSize = 0.01
minCoverage = 10
adapters = 'None'
trimSlidingWindow = '5:20'
trimLeading = 3
trimTrailing = 3
trimMinLength = 55
clearInput = false
pattern = 'A 50%; T 50%; N 50%'
reference = 'ref/DENV_MAPPING_V2.fasta'
index = null
minimumContigSize = 10000
spadesMinCoverage = 2
spadesMinKmerCoverage = 2
spadesKmers = 'auto'
megahitKmers = 'auto'
minAssemblyCoverage = 'auto'
AMaxContigs = 1000
splitSize = 10000
typingReference = 'ref/DENV_TYPING_V2.fasta'
includeNCBI = true
getGenome = true
substitutionModel = 'GTRGAMMA'
seedNumber = 12345
bootstrap = 500
simpleLabel = true

}

Exhaustive parameter description

Data input

The short-read paired-end or single-end data is passed as input through the --fastq parameter. The type of input depends on the glob pattern defined, which must contain at least one star wildcard character (*). By default, it matches all files in the fastq/ folder that fit the pattern *_{1,2}.* (the value of fastq in params.config).
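To see which files the default pattern fastq/*_{1,2}.* would pick up, the following sketch groups file names into pairs. It only approximates the glob with a regular expression and is not how Nextflow itself resolves the pattern; the function name and the example file names are hypothetical.

```python
import re
from collections import defaultdict

# Approximates the glob *_{1,2}.* : a sample name, "_1" or "_2", then an extension.
PAIR_RE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\..+$")


def group_pairs(filenames):
    """Group file names by sample, ordering mates as 1 then 2."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = PAIR_RE.match(name)
        if match:  # names that do not fit the pattern are ignored
            pairs[match["sample"]][match["mate"]] = name
    return {sample: [mates[k] for k in sorted(mates)] for sample, mates in pairs.items()}


files = ["sampleA_2.fastq.gz", "sampleA_1.fastq.gz", "notes.txt"]
print(group_pairs(files))
```

A file such as sampleA_R1.fastq.gz would also fit this pattern, but one named sampleA_R1_001.fastq.gz would not, so a different --fastq value would be needed for that naming scheme.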

Quality control and Trimming

The first process verifies the integrity of the raw sequencing data by attempting to decompress and read each input file. The depth of coverage is also estimated. By default, the genome size (--genomeSize) is set to 0.01 Mb and the minimum coverage depth (--minCoverage) is set to 10, matching the values in params.config. If any input file is found to be corrupt, its progression in the workflow is aborted.
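The coverage check can be pictured with the usual estimate of total sequenced bases divided by genome size. This is a sketch under that assumption, not the workflow's own code; the function names are hypothetical and the defaults mirror the params.config listing above.

```python
def estimate_coverage(total_bases, genome_size_mb):
    """Estimated depth of coverage: sequenced bases / genome size in bases."""
    return total_bases / (genome_size_mb * 1_000_000)


def passes_min_coverage(total_bases, genome_size_mb=0.01, min_coverage=10):
    """Assumes a sample proceeds when its estimate reaches the minimum."""
    return estimate_coverage(total_bases, genome_size_mb) >= min_coverage


# 500,000 sequenced bases against a 0.01 Mb (10,000 bp) genome.
print(estimate_coverage(500_000, 0.01))
print(passes_min_coverage(500_000))
```

With these defaults a sample needs roughly 100,000 sequenced bases to reach the 10x threshold.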

In the FastQC and Trimmomatic module, FastQC is run with the parameters --extract --nogroup --format fastq. The FastQC results tell Trimmomatic how many bases to trim from the 3’ and 5’ ends of the raw reads. By default, Trimmomatic uses the set of Illumina adapters provided with the workflow, but this behavior can be overridden with the --adapters parameter. The additional Trimmomatic parameters --trimSlidingWindow, --trimLeading, --trimTrailing and --trimMinLength can all be set to different values.

The removal of low complexity sequences is done with PrinSeq using a custom parameter (--pattern), which by default is set to the value "A 50%; T 50%; N 50%", removing sequences whose content is at least half composed of a polymeric sequence (A, T or N).
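The --pattern syntax (values separated by a space, parameter sets separated by a semicolon) can be read as in the sketch below. The parser is hypothetical and only illustrates the format; it does not show how the workflow hands the values to PrinSeq.

```python
def parse_pattern(value):
    """Split a --pattern string into (motif, threshold, is_percentage) tuples."""
    sets = []
    for chunk in value.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        motif, threshold = chunk.split()
        # A trailing % marks a percentage of occurrence; otherwise a repeat count.
        is_percentage = threshold.endswith("%")
        sets.append((motif, int(threshold.rstrip("%")), is_percentage))
    return sets


print(parse_pattern("A 50%; T 50%; N 50%"))
# [('A', 50, True), ('T', 50, True), ('N', 50, True)]
```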

Retrieval of DENV sequences

To retrieve the reads that map to the DENV reference database, Bowtie2 is run with default parameters with the DENV mapping database as a reference. The reads and their mates that map to the reference are retrieved with the samtools view -buh -F 12 and samtools fastq commands. The DENV mapping database can be altered with the --reference parameter, or alternatively, a Bowtie2 index can be provided with the --index parameter. This allows the workflow to work with other databases built from public or in-house DENV genomes.
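The -F 12 filter excludes any alignment with SAM flag bit 0x4 (read unmapped) or 0x8 (mate unmapped) set, so only pairs in which both reads mapped are kept. The following sketch reproduces that test on a few example flag values; the function name is hypothetical.

```python
READ_UNMAPPED = 0x4  # SAM flag bit: the read itself is unmapped
MATE_UNMAPPED = 0x8  # SAM flag bit: the read's mate is unmapped


def kept_by_F12(flag):
    """True if an alignment with this flag survives samtools view -F 12."""
    return (flag & (READ_UNMAPPED | MATE_UNMAPPED)) == 0


for flag in (99, 73, 69, 77):
    print(flag, kept_by_F12(flag))
```

Flag 99 (a properly paired first read) is kept, while 73, 69 and 77 are dropped because the mate, the read, or both are unmapped.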

The coverage estimation step is performed on the retrieved DENV reads with the same parameters as the first estimation (--genomeSize=0.01 and --minCoverage=10).

Assembly

In the assembly process, the retrieved DENV reads are first assembled with the SPAdes Genome Assembler with the options --careful --only-assembler --cov-cutoff. The coverage cutoff is dictated by the --spadesMinCoverage and --spadesMinKmerCoverage parameters, both set to 2 by default. If the assembly with SPAdes fails to produce a contig equal to or greater than the value defined in the --minimumContigSize parameter (default of 10000), the data is re-assembled with the MEGAHIT assembler with default parameters.
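The fallback rule can be sketched as a check on the contig lengths from the SPAdes run. This is an illustration of the rule as described here, with a hypothetical function name, and not the workflow's own code.

```python
def needs_megahit(contig_lengths, minimum_contig_size=10000):
    """True when no SPAdes contig reaches --minimumContigSize."""
    return not any(length >= minimum_contig_size for length in contig_lengths)


print(needs_megahit([10650, 320]))  # a contig reaches the threshold
print(needs_megahit([9999, 500]))   # no contig is long enough
print(needs_megahit([]))            # SPAdes produced no contigs
```

Only the first case keeps the SPAdes assembly; the other two would trigger the MEGAHIT re-assembly.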

By default the k-mers to be used in the assembly in both tools (--spadesKmers and --megahitKmers) are automatically determined depending on the read size. If the maximum read length is equal or greater than 175 nucleotides, the assembly is done with the k-mers 55, 77, 99, 113, 127, otherwise the k-mers 21, 33, 55, 67, 77 are used.
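The automatic k-mer choice reduces to a single comparison on the maximum read length, as in this sketch (the function name is hypothetical):

```python
def choose_kmers(max_read_length):
    """k-mer lengths used when --spadesKmers / --megahitKmers are 'auto'."""
    if max_read_length >= 175:
        return [55, 77, 99, 113, 127]
    return [21, 33, 55, 67, 77]


print(choose_kmers(250))  # [55, 77, 99, 113, 127]
print(choose_kmers(150))  # [21, 33, 55, 67, 77]
```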

To correct the assemblies produced, the Pilon tool is run after mapping the QC’ed reads back to the assembly with Bowtie2 and samtools sort. This process also verifies the coverage and the number of contigs produced in the assembly. This behaviour can be altered with the parameters --minAssemblyCoverage, --AMaxContigs and --genomeSize, set to "auto", 1000 and 0.01 Mb by default. When --minAssemblyCoverage is set to "auto", the minimum coverage required for each contig is 1/3 of the assembly mean coverage, with a floor of 10x. The ratio of contigs per genome Mb is calculated from the genome size estimate for the samples. Contigs larger than the value defined in the --splitSize parameter (default of 10000 nucleotides) are considered to be complete CDSs and follow the rest of the workflow independently. If no complete CDS is recovered, the QC’ed read data is passed to the module that maps the reads to the DENV typing database and generates a consensus sequence.
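One reading of the "auto" rule for --minAssemblyCoverage is a third of the mean coverage with a lower bound of 10x. The sketch below follows that reading, which is an assumption; the function name is hypothetical.

```python
def resolve_min_assembly_coverage(setting, mean_coverage):
    """Minimum per-contig coverage for a given --minAssemblyCoverage setting."""
    if setting == "auto":
        # Assumed interpretation: 1/3 of the mean coverage, never below 10x.
        return max(mean_coverage / 3, 10)
    return float(setting)


print(resolve_min_assembly_coverage("auto", 90))  # 30.0
print(resolve_min_assembly_coverage("auto", 15))  # 10
print(resolve_min_assembly_coverage(25, 90))      # 25.0
```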

Typing

The serotyping and genotyping is performed with the Seq_Typing tool with the command seq_typing.py assembly or seq_typing.py reads, using as reference the provided curated DENV typing database.

By default, the genomes of the closest references are retrieved and included in the downstream analysis. This behavior can be altered by changing the --getGenome option to "false".

Phylogeny

The CDSs, and the reference sequences if requested, are aligned with the MAFFT tool with the options --adjustdirection --auto. By default, four representative sequences for each DENV serotype (1 to 4) from NCBI are also included in the alignment. This option can be turned off by changing the value of --includeNCBI to "false". If the alignment contains fewer than 4 sequences, the NCBI references are added automatically.

NCBI references included:

A maximum likelihood phylogenetic tree is obtained with the RAxML tool with the options -p 12345 -f a. Additionally, and by default, the substitution model (--substitutionModel) is set to "GTRGAMMA", the number of bootstrap replicates (--bootstrap) to 500 and the seed (--seedNumber) to 12345.
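For orientation, the sketch below shows how these settings could map onto a RAxML command line. It is an assumption based on RAxML's documented options (-f a for rapid bootstrap analysis, -x for the rapid bootstrap seed, -N for the number of runs, -m for the model) and not the workflow's actual command; the executable name raxmlHPC, the helper name and the file names are hypothetical.

```python
def build_raxml_args(alignment, run_name, model="GTRGAMMA", seed=12345, bootstrap=500):
    """Assemble a RAxML argument list from the workflow defaults (illustrative)."""
    return [
        "raxmlHPC",             # assumed executable name
        "-s", alignment,        # input alignment
        "-n", run_name,         # suffix for the output files
        "-m", model,            # --substitutionModel
        "-p", "12345",          # parsimony random seed
        "-f", "a",              # rapid bootstrap analysis plus best-tree search
        "-x", str(seed),        # --seedNumber (turns on rapid bootstrapping)
        "-N", str(bootstrap),   # --bootstrap
    ]


print(" ".join(build_raxml_args("alignment.fasta", "denim")))
```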