Commit
Add workflow for producing the Nextclade dengue dataset #25
Showing 75 changed files with 59,183 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Nextclade | ||
|
||
Previously, all "official" Nextclade workflows lived in a [central GitHub repository](https://github.com/neherlab/nextclade_data_workflows). | ||
The new standard is to include the Nextclade workflow within the pathogen repo. | ||
|
||
This workflow is used to create the Nextclade datasets for this pathogen. | ||
All official Nextclade datasets are available at https://github.com/nextstrain/nextclade_data. | ||
|
||
## Workflow Usage | ||
|
||
The workflow can be run from the top level pathogen repo directory: | ||
``` | ||
nextstrain build nextclade | ||
``` | ||
|
||
Alternatively, the workflow can also be run from within the nextclade directory: | ||
``` | ||
cd nextclade | ||
nextstrain build . | ||
``` | ||
|
||
This produces the default outputs of the nextclade workflow: | ||
|
||
- nextclade_dataset(s) = datasets/<build_name>/* | ||
|
||
## Defaults | ||
|
||
The defaults directory contains all of the default configurations for the Nextclade workflow. | ||
|
||
[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters | ||
used for the Nextclade workflow. Use Snakemake's `--configfile`/`--config` | ||
options to override these default values. | ||
|
||
## Snakefile and rules | ||
|
||
The rules directory contains separate Snakefiles (`*.smk`) as modules of the core Nextclade workflow. | ||
The modules of the workflow are in separate files to keep the main nextclade [Snakefile](Snakefile) succinct and organized. | ||
|
||
The `workdir` is hardcoded to be the nextclade directory so all filepaths for | ||
inputs/outputs should be relative to the nextclade directory. | ||
|
||
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes) | ||
in the main Snakefile in the order that they are expected to run. | ||
|
||
## Build configs | ||
|
||
The build-configs directory contains custom configs and rules that override and/or | ||
extend the default workflow. | ||
|
||
- [test-dataset](build-configs/test-dataset/) - build to test new Nextclade dataset |
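
The "Defaults" section of the README says that values in `defaults/config.yaml` can be overridden with `--configfile`/`--config`. As a rough illustration of what layering an override on top of defaults looks like, here is a minimal Python sketch. It assumes nested keys are merged recursively rather than replaced wholesale; check the Snakemake documentation for the exact semantics. The helper name `merge_config` and the override value `'20'` are hypothetical; the default values are taken from the config file in this commit.

```python
import copy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` layered on top.

    Nested dictionaries are merged key by key; any other value is replaced.
    This is an illustrative helper, not Snakemake's own implementation.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A few default values copied from the workflow config in this commit.
defaults = {
    "strain_id_field": "genbank_accession",
    "filter": {
        "group_by": "year region",
        "sequences_per_group": {"all": "10", "denv1": "36"},
    },
}

# Hypothetical override, e.g. the contents of a file passed via --configfile.
overrides = {"filter": {"sequences_per_group": {"all": "20"}}}

config = merge_config(defaults, overrides)
```

Under this assumption only `filter.sequences_per_group.all` changes; every other default is left in place.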
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
configfile: "config/config_dengue.yaml" | ||
|
||
serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4'] | ||
genes = ['genome'] | ||
|
||
wildcard_constraints: | ||
    serotype = "|".join(serotypes), | ||
    gene = "|".join(genes) | ||
|
||
rule all: | ||
    input: | ||
        auspice_json = expand("auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene=genes), | ||
        nextclade_dataset = expand("datasets/{serotype}/tree.json", serotype=serotypes), | ||
        test_dataset = expand("test_output/{serotype}", serotype=serotypes), | ||
|
||
include: "rules/prepare_sequences.smk" | ||
include: "rules/construct_phylogeny.smk" | ||
include: "rules/annotate_phylogeny.smk" | ||
include: "rules/export.smk" | ||
include: "rules/assemble_dataset.smk" | ||
|
||
# Include custom rules defined in the config. | ||
if "custom_rules" in config: | ||
    for rule_file in config["custom_rules"]: | ||
|
||
        include: rule_file | ||
|
||
rule clean: | ||
    """Removing directories: {params}""" | ||
    params: | ||
        "results ", | ||
        "auspice" | ||
    shell: | ||
        "rm -rfv {params}" |
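
To make the targets of `rule all` in the Snakefile above concrete, the following Python sketch enumerates the file paths that the `expand(...)` calls request. It does not import Snakemake: `expand_targets` is a hypothetical stand-in that emulates the default behaviour of `expand` (a Cartesian product over the supplied wildcard values) using only the standard library. The regular expression mirrors the `wildcard_constraints` entry for `serotype`.

```python
import re
from itertools import product

serotypes = ["all", "denv1", "denv2", "denv3", "denv4"]
genes = ["genome"]


def expand_targets(pattern: str, **wildcards) -> list:
    """Fill `pattern` with every combination of the given wildcard values.

    Hypothetical helper that emulates Snakemake's expand() for illustration.
    """
    names = list(wildcards)
    combos = product(*(wildcards[name] for name in names))
    return [pattern.format(**dict(zip(names, combo))) for combo in combos]


auspice_json = expand_targets(
    "auspice/dengue_{serotype}_{gene}.json", serotype=serotypes, gene=genes
)
nextclade_dataset = expand_targets("datasets/{serotype}/tree.json", serotype=serotypes)
test_dataset = expand_targets("test_output/{serotype}", serotype=serotypes)

# Same alternation the Snakefile builds for the `serotype` wildcard constraint.
serotype_constraint = re.compile("|".join(serotypes))

for path in auspice_json + nextclade_dataset + test_dataset:
    print(path)
```

With five serotypes and one gene, each of the three inputs resolves to five paths, so a full run requests fifteen targets.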
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
# Sequences must be FASTA and metadata must be TSV | ||
# Both files must be zstd compressed | ||
# Both files must have a {serotype} expandable field to be replaced by all, denv1-denv4 | ||
sequences_url: "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst" | ||
metadata_url: "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst" | ||
|
||
strain_id_field: "genbank_accession" | ||
display_strain_field: "strain" | ||
|
||
filter: | ||
  exclude: "../phylogenetic/config/exclude.txt" | ||
  include: "../phylogenetic/config/include_{serotype}.txt" | ||
  group_by: "year region" | ||
  min_length: | ||
    genome: 5000 | ||
    E: 1000 | ||
  sequences_per_group: | ||
    all: '10' | ||
    denv1: '36' | ||
    denv2: '36' | ||
    denv3: '36' | ||
    denv4: '36' | ||
|
||
traits: | ||
  sampling_bias_correction: '3' | ||
  traits_columns: | ||
    all: 'region serotype_genbank genotype_nextclade' | ||
    denv1: 'country region serotype_genbank genotype_nextclade' | ||
    denv2: 'country region serotype_genbank genotype_nextclade' | ||
    denv3: 'country region serotype_genbank genotype_nextclade' | ||
    denv4: 'country region serotype_genbank genotype_nextclade' | ||
|
||
clades: | ||
  clade_definitions: | ||
    all: '../phylogenetic/config/clades_serotypes.tsv' | ||
    denv1: '../phylogenetic/config/clades_genotypes.tsv' | ||
    denv2: '../phylogenetic/config/clades_genotypes.tsv' | ||
    denv3: '../phylogenetic/config/clades_genotypes.tsv' | ||
    denv4: '../phylogenetic/config/clades_genotypes.tsv' |
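
The config above mixes two kinds of per-serotype settings: strings with a `{serotype}` placeholder, and mappings keyed by serotype (or by gene for `min_length`). The Python sketch below shows one way a rule could resolve both for a given serotype. It is illustrative only: the nesting of the dictionary is inferred from the key names, the function `resolve_settings` is hypothetical, and the workflow's own rules may read these values differently.

```python
# Subset of the config from this commit, written as a Python dict.
# The nesting is an assumption inferred from the key names.
config = {
    "sequences_url": "https://data.nextstrain.org/files/workflows/dengue/sequences_{serotype}.fasta.zst",
    "metadata_url": "https://data.nextstrain.org/files/workflows/dengue/metadata_{serotype}.tsv.zst",
    "filter": {
        "include": "../phylogenetic/config/include_{serotype}.txt",
        "min_length": {"genome": 5000, "E": 1000},
        "sequences_per_group": {
            "all": "10", "denv1": "36", "denv2": "36", "denv3": "36", "denv4": "36",
        },
    },
    "clades": {
        "clade_definitions": {
            "all": "../phylogenetic/config/clades_serotypes.tsv",
            "denv1": "../phylogenetic/config/clades_genotypes.tsv",
            "denv2": "../phylogenetic/config/clades_genotypes.tsv",
            "denv3": "../phylogenetic/config/clades_genotypes.tsv",
            "denv4": "../phylogenetic/config/clades_genotypes.tsv",
        },
    },
}


def resolve_settings(config: dict, serotype: str, gene: str = "genome") -> dict:
    """Collect the settings that apply to one serotype/gene pair (hypothetical helper)."""
    return {
        # Placeholder strings are filled with str.format.
        "sequences_url": config["sequences_url"].format(serotype=serotype),
        "metadata_url": config["metadata_url"].format(serotype=serotype),
        "include": config["filter"]["include"].format(serotype=serotype),
        # Mappings are indexed by serotype or gene.
        "min_length": config["filter"]["min_length"][gene],
        "sequences_per_group": config["filter"]["sequences_per_group"][serotype],
        "clade_definitions": config["clades"]["clade_definitions"][serotype],
    }


settings = resolve_settings(config, "denv2")
```

Note that the `all` build uses the serotype-level clade definitions, while each single-serotype build uses the genotype-level file.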
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
## Unreleased | ||
|
||
Initial release for Nextclade v3! | ||
|
||
Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# Nextclade dataset for "Dengue Virus" | ||
|
||
## Dataset attributes | ||
|
||
Nextclade dataset | ||
|
||
Read more about Nextclade datasets in Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
##gff-version 3 | ||
##sequence-region NC_002640.1 1 10649 | ||
NC_002640.1 feature gene 102 440 . + . codon_start=1;gene=C;gene_name=C; | ||
NC_002640.1 feature gene 441 713 . + . codon_start=1;gene=pr;gene_name=pr; | ||
NC_002640.1 feature gene 441 938 . + . codon_start=1;gene=M;gene_name=M; | ||
NC_002640.1 feature gene 939 2423 . + . codon_start=1;gene=E;gene_name=E; | ||
NC_002640.1 feature gene 2424 3479 . + . codon_start=1;gene=NS1;gene_name=NS1; | ||
NC_002640.1 feature gene 3480 4133 . + . codon_start=1;gene=NS2A;gene_name=NS2A; | ||
NC_002640.1 feature gene 4134 4523 . + . codon_start=1;gene=NS2B;gene_name=NS2B; | ||
NC_002640.1 feature gene 4524 6377 . + . codon_start=1;gene=NS3;gene_name=NS3; | ||
NC_002640.1 feature gene 6378 6758 . + . codon_start=1;gene=NS4A;gene_name=NS4A; | ||
NC_002640.1 feature gene 6759 6827 . + . codon_start=1;gene=2K;gene_name=2K; | ||
NC_002640.1 feature gene 6828 7562 . + . codon_start=1;gene=NS4B;gene_name=NS4B; | ||
NC_002640.1 feature gene 7563 10262 . + . codon_start=1;gene=NS5;gene_name=NS5; |
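
A quick sanity check on an annotation like the one above is that every gene has a length divisible by three and lies inside the declared sequence region. The Python sketch below does this for the rows in this file. It is a minimal check, not a GFF3 validator: rows are split on any whitespace because this page shows them space-separated (GFF3 itself is tab-delimited), and the function name `parse_genes` is hypothetical.

```python
# Annotation rows copied from the GFF3 file in this commit.
GFF_ROWS = """\
NC_002640.1 feature gene 102 440 . + . codon_start=1;gene=C;gene_name=C;
NC_002640.1 feature gene 441 713 . + . codon_start=1;gene=pr;gene_name=pr;
NC_002640.1 feature gene 441 938 . + . codon_start=1;gene=M;gene_name=M;
NC_002640.1 feature gene 939 2423 . + . codon_start=1;gene=E;gene_name=E;
NC_002640.1 feature gene 2424 3479 . + . codon_start=1;gene=NS1;gene_name=NS1;
NC_002640.1 feature gene 3480 4133 . + . codon_start=1;gene=NS2A;gene_name=NS2A;
NC_002640.1 feature gene 4134 4523 . + . codon_start=1;gene=NS2B;gene_name=NS2B;
NC_002640.1 feature gene 4524 6377 . + . codon_start=1;gene=NS3;gene_name=NS3;
NC_002640.1 feature gene 6378 6758 . + . codon_start=1;gene=NS4A;gene_name=NS4A;
NC_002640.1 feature gene 6759 6827 . + . codon_start=1;gene=2K;gene_name=2K;
NC_002640.1 feature gene 6828 7562 . + . codon_start=1;gene=NS4B;gene_name=NS4B;
NC_002640.1 feature gene 7563 10262 . + . codon_start=1;gene=NS5;gene_name=NS5;
"""

SEQUENCE_LENGTH = 10649  # end coordinate from the ##sequence-region line


def parse_genes(text: str) -> dict:
    """Map gene name -> (start, end), using 1-based inclusive GFF coordinates."""
    genes = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Nine GFF3 columns; only start, end, and attributes are used here.
        _seqid, _source, _type, start, end, _score, _strand, _phase, attrs = line.split()
        fields = dict(item.split("=", 1) for item in attrs.split(";") if item)
        genes[fields["gene"]] = (int(start), int(end))
    return genes


genes = parse_genes(GFF_ROWS)
lengths = {name: end - start + 1 for name, (start, end) in genes.items()}

not_multiple_of_three = [name for name, length in lengths.items() if length % 3]
out_of_bounds = [
    name for name, (start, end) in genes.items()
    if start < 1 or end > SEQUENCE_LENGTH
]

for name, length in lengths.items():
    print(f"{name}\t{length} nt\t{length // 3} codons")
```

Both check lists come back empty for these rows. The `pr` and `M` entries share the start coordinate 441, so the overlap between them is expected rather than an error.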
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
{ | ||
  "alignmentParams": { | ||
    "minSeedCover": 0.01, | ||
    "minLength": 1000 | ||
  }, | ||
  "attributes": { | ||
    "name": "Dengue virus All serotypes", | ||
    "reference accession": "NC_002640", | ||
    "reference name": "dengue virus type 4" | ||
  }, | ||
  "compatibility": { | ||
    "cli": "3.0.0-alpha.0", | ||
    "web": "3.0.0-alpha.0" | ||
  }, | ||
  "deprecated": false, | ||
  "enabled": true, | ||
  "experimental": true, | ||
  "files": { | ||
    "changelog": "CHANGELOG.md", | ||
    "examples": "sequences.fasta", | ||
    "genomeAnnotation": "genome_annotation.gff3", | ||
    "pathogenJson": "pathogen.json", | ||
    "readme": "README.md", | ||
    "reference": "reference.fasta", | ||
    "treeJson": "tree.json" | ||
  }, | ||
  "meta": { | ||
    "bugs": "https://github.com/nextstrain/nextclade_data/issues", | ||
    "source code": "https://github.com/nextstrain/nextclade_data" | ||
  }, | ||
  "qc": { | ||
    "frameShifts": { | ||
      "enabled": false | ||
    }, | ||
    "missingData": { | ||
      "enabled": false, | ||
      "missingDataThreshold": 2700, | ||
      "scoreBias": 300 | ||
    }, | ||
    "mixedSites": { | ||
      "enabled": false, | ||
      "mixedSitesThreshold": 10 | ||
    }, | ||
    "privateMutations": { | ||
      "cutoff": 24, | ||
      "enabled": false, | ||
      "typical": 8, | ||
      "weightLabeledSubstitutions": 2, | ||
      "weightReversionSubstitutions": 1, | ||
      "weightUnlabeledSubstitutions": 1 | ||
    }, | ||
    "snpClusters": { | ||
      "clusterCutOff": 5, | ||
      "enabled": false, | ||
      "scoreWeight": 50, | ||
      "windowSize": 100 | ||
    }, | ||
    "stopCodons": { | ||
      "enabled": false | ||
    } | ||
  }, | ||
  "schemaVersion": "3.0.0", | ||
  "version": { | ||
    "tag": "unreleased" | ||
  }, | ||
  "defaultCds": "E" | ||
} |
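
Two properties of this `pathogen.json` are easy to miss when reading it: every QC rule is switched off, and `defaultCds` must name a gene that exists in the genome annotation. The Python sketch below checks both against an excerpt of the file. The excerpt keeps only the `enabled` flag of each rule, and the gene list is copied from the GFF3 file in this commit; the variable names are illustrative.

```python
import json

# Excerpt of the pathogen.json from this commit (only the fields used below).
PATHOGEN_JSON = """
{
  "files": {"genomeAnnotation": "genome_annotation.gff3", "treeJson": "tree.json"},
  "qc": {
    "frameShifts": {"enabled": false},
    "missingData": {"enabled": false},
    "mixedSites": {"enabled": false},
    "privateMutations": {"enabled": false},
    "snpClusters": {"enabled": false},
    "stopCodons": {"enabled": false}
  },
  "defaultCds": "E"
}
"""

# Gene names from the genome annotation in this commit.
annotated_genes = {
    "C", "pr", "M", "E", "NS1", "NS2A", "NS2B", "NS3", "NS4A", "2K", "NS4B", "NS5",
}

pathogen = json.loads(PATHOGEN_JSON)

# Names of QC rules whose "enabled" flag is true.
enabled_rules = sorted(
    name for name, rule in pathogen["qc"].items() if rule.get("enabled")
)
default_cds_is_annotated = pathogen["defaultCds"] in annotated_genes

print("Enabled QC rules:", enabled_rules or "none")
print("defaultCds present in annotation:", default_cds_is_annotated)
```

Since no rule is enabled, the thresholds listed under `qc` in the full file have no effect until the corresponding `enabled` flags are turned on.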