Refactor pipeline filtering, create params flags for tools #148

Closed
wants to merge 40 commits into from
40 commits
320cb64
feat: flag for busco
slsevilla Mar 28, 2024
f0876a8
refactor: change filtering for failed samples #147
slsevilla Mar 28, 2024
4e8df4c
refactor: ica handling #149
slsevilla Mar 31, 2024
5ad96ba
refactor: updating new ica handling #149
slsevilla Mar 31, 2024
f131ab6
refactor: handling errors, ica handling, filtering #147 #149
slsevilla Mar 31, 2024
401ec10
refactor: check ica param is as expected 149
slsevilla Mar 31, 2024
47a0c5c
refactor: create ncbi_excel_creation flag #150
slsevilla Mar 31, 2024
5221309
refactor: filtering strategy bbduk #147
slsevilla Mar 31, 2024
f33a04a
chore: fix meta tuple called
slsevilla Mar 31, 2024
e9abaa2
refactor: move fastp variables to config #150
slsevilla Mar 31, 2024
3c87d90
refactor: ica handling, filtering #147 #149
slsevilla Mar 31, 2024
fccbc89
refactor: ica handling #147
slsevilla Mar 31, 2024
25a2ca6
refactor: ica handling, filtering #147 #149
slsevilla Mar 31, 2024
1a6fa53
refactor: fastqc ica handling, filtering #147 #149
slsevilla Mar 31, 2024
18132a9
refactor: move kraken params to config #150
slsevilla Apr 1, 2024
10c7bdc
refactor: krakenbh handle ica and terra #149
slsevilla Apr 1, 2024
606e6e5
refactor: kraken subwf and modules for ica, reorg wf calls #149
slsevilla Apr 1, 2024
5896dcd
refactor: add check for terra parms #151
slsevilla Apr 1, 2024
c75a6a9
refactor: add param for extended_qc #151
slsevilla Apr 1, 2024
e2c3145
refactor: spades for ica #149
slsevilla Apr 1, 2024
05343da
refactor: spades wf modules for ica #149
slsevilla Apr 1, 2024
66e8f46
refactor: rename_headers for ica #149
slsevilla Apr 1, 2024
f72f42a
refactor: ica, output file handling #149
slsevilla Apr 1, 2024
2fe0069
refactor: create extended_qc variable #151
slsevilla Apr 3, 2024
3059e03
refactor: filtering #147
slsevilla Apr 3, 2024
7720ec4
refactor: move filtering to workflow level #147
slsevilla Apr 3, 2024
c22897c
refactor: scaffolds samplesshet ica #149
slsevilla Apr 3, 2024
ef6339b
refactor: kraken2 makereport, top mash hits ica #149
slsevilla Apr 3, 2024
5ab72ca
refactor: phoenix wf filtering, ica #147 #149
slsevilla Apr 3, 2024
fd2463a
refactor: determine taxaID, fast ani ica #149
slsevilla Apr 3, 2024
b056478
refactor: mlst ica #149
slsevilla Apr 3, 2024
8c76843
refactor: mlst, amrfinder for terra, ica #149
slsevilla Apr 3, 2024
2700945
refactor: summary lines ica #149
slsevilla Apr 3, 2024
9437d8d
chore: fix missed ica flag
slsevilla Apr 4, 2024
bb780f3
refactor: summary lines ica #149
slsevilla Apr 4, 2024
a338622
feat: flags for execution #153
slsevilla Apr 4, 2024
28f9029
refactor: griphin ica #149
slsevilla Apr 4, 2024
2168e14
chore: unblock expected outputs
slsevilla Apr 9, 2024
4556847
chore: unblock griphin, outputs
slsevilla Apr 9, 2024
46f43a4
docs: changes added to log
slsevilla Apr 9, 2024
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -243,3 +243,23 @@ Below are the list of changes to phx since is initial release. As fixes can take
- [ARG-ANNOT](http://backup.mediterranee-infection.com/arkotheque/client/ihumed/_depot_arko/articles/2041/arg-annot-v4-aa-may2018_doc.fasta) hasn't changed since the last time the database was created and contains updates since version [NT v6 July 2019](https://www.mediterranee-infection.com/acces-ressources/base-de-donnees/arg-annot-2/)
- [ResFinder](https://bitbucket.org/genomicepidemiology/resfinder_db/src/master/)
- Includes until 2024-01-28 [commit 97d1fe0cd0a119172037f6bdb29f8a1c7c6e6019](https://bitbucket.org/genomicepidemiology/resfinder_db/commits/branch/master)

## [v3.1.0](https://github.com/CDCgov/phoenix/releases/tag/v3.1.0) (04/08/2024)
**Implemented Enhancements**
- refactors filtering failed samples for fairy
- refactors ICA handling, terra handling
- adds param flags in nextflow.config
  - execution-based
    - run_busco
    - ncbi_excel_creation
    - extended_qc
    - run_srst2_mlst
    - run_griphin
  - feature-based
    - save_trimmed_fail
    - save_merged
    - save_output_fastqs
    - save_reads_assignment
- moves parameter checks upstream to main.nf
  - ICA
  - TERRA
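
As a rough illustration of the flags listed above, a user-side config override might look like the sketch below. The flag names come from the changelog entry and from the `params.ica` / `params.terra` checks in main.nf; every value shown is an assumption chosen for illustration, not a pipeline default, and the file name `custom.config` is hypothetical.

```groovy
// custom.config (hypothetical file name): overrides for the flags named in the changelog.
// All values are illustrative assumptions, not the defaults shipped with the pipeline.
params {
    // execution-based flags: switch optional steps on or off
    run_busco           = true
    ncbi_excel_creation = true
    extended_qc         = false
    run_srst2_mlst      = true
    run_griphin         = true

    // feature-based flags: control which intermediate files are published
    save_trimmed_fail     = false
    save_merged           = false
    save_output_fastqs    = false
    save_reads_assignment = false

    // platform flags: main.nf exits unless each is a boolean true or false
    ica   = false
    terra = false
}
```

A file like this would typically be passed with Nextflow's `-c` option. Because main.nf compares `params.ica` and `params.terra` against `true` and `false` directly, quoted strings such as `"false"` would presumably fail that check.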
4 changes: 2 additions & 2 deletions conf/modules.config
@@ -43,7 +43,7 @@ process {
[
path: { "${params.outdir}/${meta.id}/file_integrity" },
mode: 'copy',
pattern: "*{_summary.txt}"
pattern: "*{_summary_fairy.txt}"
]
]
}
@@ -67,7 +67,7 @@ process {
[
path: { "${params.outdir}/${meta.id}/file_integrity" },
mode: 'copy',
pattern: "*{_summary.txt}"
pattern: "*{_summary_rawstats.txt}"
]
]
}
17 changes: 10 additions & 7 deletions main.nf
@@ -43,15 +43,15 @@ workflow PHOENIX {
// Check input path parameters to see if they exist
def checkPathParamList = [ params.input, params.multiqc_config, params.kraken2db] //removed , params.fasta to stop issue w/connecting to aws and igenomes not used
for (param in checkPathParamList) { if (param) { file(param, checkIfExists: true) } }

// Check mandatory parameters
if (params.ica != true && params.ica != false) {exit 1, "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods."}
if (params.terra != true && params.terra != false) {exit 1, "Please set params.terra to either \"true\" if running on terra or \"false\" for all other methods."}

//input on command line
if (params.input) { ch_input = file(params.input) } else { exit 1, 'For -entry PHOENIX: Input samplesheet not specified!' }
ch_versions = Channel.empty() // Used to collect the software versions

main:
PHOENIX_EXTERNAL ( ch_input, ch_versions, true )
PHOENIX_EXTERNAL ( ch_input, ch_versions, params.ncbi_excel_creation )
emit:
scaffolds = PHOENIX_EXTERNAL.out.scaffolds
trimmed_reads = PHOENIX_EXTERNAL.out.trimmed_reads
@@ -60,9 +60,9 @@ workflow PHOENIX {
gamma_ar = PHOENIX_EXTERNAL.out.gamma_ar
phx_summary = PHOENIX_EXTERNAL.out.phx_summary
//output for phylophoenix
griphin_tsv = PHOENIX_EXTERNAL.out.griphin_tsv
griphin_excel = PHOENIX_EXTERNAL.out.griphin_excel
dir_samplesheet = PHOENIX_EXTERNAL.out.dir_samplesheet
griphin_tsv = params.run_griphin ? PHOENIX_EXTERNAL.out.griphin_tsv : null
griphin_excel = params.run_griphin ? PHOENIX_EXTERNAL.out.griphin_excel : null
dir_samplesheet = params.run_griphin ? PHOENIX_EXTERNAL.out.dir_samplesheet : null
//output for ncbi upload
ncbi_sra_sheet = params.create_ncbi_sheet ? PHOENIX_EXTERNAL.out.ncbi_sra_sheet : null
ncbi_biosample_sheet = params.create_ncbi_sheet ? PHOENIX_EXTERNAL.out.ncbi_biosample_sheet : null
@@ -83,6 +83,9 @@ workflow CDC_PHOENIX {
if (params.input) { ch_input = file(params.input) } else { exit 1, 'For -entry CDC_PHOENIX: Input samplesheet not specified!' }
ch_versions = Channel.empty() // Used to collect the software versions

// true is for -entry CDC_PHOENIX and CDC_SCAFFOLDS - used in SPADES
extended_qc=false

main:
PHOENIX_EXQC ( ch_input, ch_versions, true )

4 changes: 0 additions & 4 deletions modules/local/bbduk.nf
@@ -13,10 +13,6 @@ process BBDUK {
tuple val(meta), path('*.log') , emit: log
path "versions.yml" , emit: versions

when:
//if the files are not corrupt and there are equal number of reads in each file then run bbduk
"${fairy_outcome[0]}" == "PASSED: File ${meta.id}_R1 is not corrupt." && "${fairy_outcome[1]}" == "PASSED: File ${meta.id}_R2 is not corrupt." && "${fairy_outcome[2]}" == "PASSED: Read pairs for ${meta.id} are equal."

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
11 changes: 3 additions & 8 deletions modules/local/check_mlst.nf
@@ -12,23 +12,18 @@ process CHECK_MLST {
tuple val(meta), path("*_status.txt"), emit: status
path("versions.yml") , emit: versions

when:
task.ext.when == null || task.ext.when

script:
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "python ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
def script = params.ica ? "python ${params.ica_path}/fix_MLST2.py" : "fix_MLST2.py"
"""
${ica}fix_MLST2.py --input $mlst_file --taxonomy $taxonomy_file --mlst_database ${local_dbases}
${script} --input $mlst_file --taxonomy $taxonomy_file --mlst_database ${local_dbases}

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version | sed 's/Python //g')
fix_MLST2.py: \$(${ica}fix_MLST2.py --version )
fix_MLST2.py: \$(${script} --version )
phoenix_base_container_tag: ${container_version}
phoenix_base_container: ${container}
END_VERSIONS
11 changes: 4 additions & 7 deletions modules/local/check_mlst_with_srst2.nf
@@ -17,26 +17,23 @@ process CHECK_MLST_WITH_SRST2 {
task.ext.when == null || task.ext.when

script:
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "python ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
// define variables
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
def script = params.ica ? "python ${params.ica_path}/fix_MLST2.py" : "fix_MLST2.py"
"""
if [[ "${status[0]}" == "True" ]]; then
${ica}fix_MLST2.py --input $mlst_file --srst2 $srst2_file --taxonomy $taxonomy_file --mlst_database $local_dbases
${script} --input $mlst_file --srst2 $srst2_file --taxonomy $taxonomy_file --mlst_database $local_dbases
elif [[ "${status[0]}" == "False" ]]; then
${ica}fix_MLST2.py --input $mlst_file --taxonomy $taxonomy_file --mlst_database $local_dbases
${script} --input $mlst_file --taxonomy $taxonomy_file --mlst_database $local_dbases
else
echo "Something went very wrong, please open an issue on Github for the PHoeNIx developers to address."
fi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
python: \$(python --version | sed 's/Python //g')
fix_MLST2.py: \$(${ica}fix_MLST2.py --version )
fix_MLST2.py: \$(${script} --version )
phoenix_base_container_tag: ${container_version}
phoenix_base_container: ${container}
END_VERSIONS
9 changes: 3 additions & 6 deletions modules/local/determine_taxa_id.nf
@@ -14,20 +14,17 @@ process DETERMINE_TAXA_ID {
path("versions.yml") , emit: versions

script: // This script is bundled with the pipeline, in cdcgov/phoenix/bin/
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "bash ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
// define variables
def prefix = task.ext.prefix ?: "${meta.id}"
// -r needs to be last as in -entry SCAFFOLDS/CDC_SCAFFOLDS k2_bh_summary is not passed so its a blank argument
def k2_bh_file = k2_bh_summary ? "-r $k2_bh_summary" : ""
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
def script = params.ica ? "${params.ica_path}/determine_taxID.sh" : "determine_taxID.sh"
"""
${ica}determine_taxID.sh -k $kraken_weighted -s $meta.id -f $formatted_ani_file -d $nodes_file -m $names_file $k2_bh_file
${script} -k $kraken_weighted -s $meta.id -f $formatted_ani_file -d $nodes_file -m $names_file $k2_bh_file

script_version=\$(${ica}determine_taxID.sh -V)
script_version=\$(${script} -V)

cat <<-END_VERSIONS > versions.yml
"${task.process}":
9 changes: 3 additions & 6 deletions modules/local/determine_taxa_id_failure.nf
@@ -17,18 +17,15 @@ process DETERMINE_TAXA_ID_FAILURE {
"${spades_outcome[0]}" == "run_failure" || "${spades_outcome[1]}" == "no_scaffolds" || "${spades_outcome[2]}" == "no_contigs"

script: // This script is bundled with the pipeline, in cdcgov/phoenix/bin/
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "bash ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
// define variables
def prefix = task.ext.prefix ?: "${meta.id}"
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
def script = params.ica ? "bash ${params.ica_path}/determine_taxID.sh" : "determine_taxID.sh"
"""
${ica}determine_taxID.sh -r $k2_bh_summary -s $meta.id -d $nodes_file -m $names_file
${script} -r $k2_bh_summary -s $meta.id -d $nodes_file -m $names_file

script_version=\$(${ica}determine_taxID.sh -V)
script_version=\$(${script} -V)

cat <<-END_VERSIONS > versions.yml
"${task.process}":
14 changes: 4 additions & 10 deletions modules/local/determine_top_mash_hits.nf
@@ -17,25 +17,19 @@ process DETERMINE_TOP_MASH_HITS {
"${fairy_outcome[4]}" == "PASSED: More than 0 scaffolds in ${meta.id} after filtering."

script: // This script is bundled with the pipeline, in cdcgov/phoenix/bin/
// terra=true sets paths for bc/wget for terra container paths
if (params.terra==false) { terra = ""}
else if (params.terra==true) { terra = "-t terra" }
else { error "Please set params.terra to either \"true\" or \"false\"" }
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "bash ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
// define variables
def prefix = task.ext.prefix ?: "${meta.id}"
def sample_name = "${mash_dists}" - ".txt" //get full sample name with REFSEQ_DATE
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
def script = params.ica ? "${params.ica_path}/sort_and_prep_dist.sh" : "sort_and_prep_dist.sh"
def terra = params.terra ? "-t terra" : ""
"""
mkdir reference_dir

${ica}sort_and_prep_dist.sh -a $assembly_scaffolds -x $mash_dists -o reference_dir $terra
${script} -a $assembly_scaffolds -x $mash_dists -o reference_dir $terra

script_version=\$(${ica}sort_and_prep_dist.sh -V)
script_version=\$(${script} -V)

if [[ ! -f ${sample_name}_best_MASH_hits.txt ]]; then
echo "No MASH hit found" > ${sample_name}_best_MASH_hits.txt
18 changes: 7 additions & 11 deletions modules/local/fairy_corruption_check.nf
@@ -9,35 +9,31 @@ process CORRUPTION_CHECK {
val(busco_val)

output:
tuple val(meta), path('*_summary.txt'), emit: outcome
tuple val(meta), path('*_summary_old.txt'), emit: outcome_to_edit
tuple val(meta), path('*_summary_fairy.txt'), emit: outcome
path('*_summaryline.tsv'), optional:true, emit: summary_line
tuple val(meta), path('*.synopsis'), optional:true, emit: synopsis
path("versions.yml"), emit: versions

script:
// Adding if/else for if running on ICA it is a requirement to state where the script is, however, this causes CLI users to not run the pipeline from any directory.
if (params.ica==false) { ica = "" }
else if (params.ica==true) { ica = "bash ${workflow.launchDir}/bin/" }
else { error "Please set params.ica to either \"true\" if running on ICA or \"false\" for all other methods." }
// define variables
def prefix = task.ext.prefix ?: "${meta.id}"
def num1 = "${reads[0]}".minus(".fastq.gz")
def num2 = "${reads[1]}".minus(".fastq.gz")
def busco_parameter = busco_val ? "-b" : ""
def container_version = "base_v2.1.0"
def container = task.container.toString() - "quay.io/jvhagey/phoenix@"
"""
def script = params.ica ? "python ${params.ica_path}/fairy_proc.sh" : "fairy_proc.sh"
"""
#set +e
#check for file integrity and log errors
#if there is a corruption problem the script will create a *_summaryline.tsv and *.synopsis file for the sample.
${ica}fairy_proc.sh -r ${reads[0]} -p ${prefix} ${busco_parameter}
${ica}fairy_proc.sh -r ${reads[1]} -p ${prefix} ${busco_parameter}
${script} -r ${reads[0]} -p ${prefix} ${busco_parameter}
${script} -r ${reads[1]} -p ${prefix} ${busco_parameter}

script_version=\$(${ica}fairy_proc.sh -V)
script_version=\$(${script} -V)

#making a copy of the summary file to pass to READ_COUNT_CHECKS to handle file names being the same
cp ${prefix}_summary.txt ${prefix}_summary_old.txt
mv ${prefix}_summary.txt ${prefix}_summary_fairy.txt

cat <<-END_VERSIONS > versions.yml
"${task.process}":