Add v3 mpxv datasets, still some wip
corneliusroemer committed Nov 16, 2023
1 parent bd6d9d1 commit 09ca178
Showing 36 changed files with 1,345 additions and 351,393 deletions.
@@ -0,0 +1,14 @@
## Unreleased

Initial release of this dataset. This dataset is similar to the v2 dataset [`MPXV/ancestral`](https://github.com/nextstrain/nextclade_data/tree/2023-08-17--15-51-24--UTC/data/datasets/MPXV/references/ancestral/versions/2023-08-01T12%3A00%3A00Z/files) with some differences.

Read more about Nextclade datasets in the documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html

### Important changes compared to v2 dataset

- Some genes have been renamed and one has been added. The new annotation is based on the NCBI RefSeq annotation released in November 2022. The v2 dataset predates this RefSeq.
- The 4 genes in the inverted terminal repeat segment (ITR) on both ends of the genome (OPG001, OPG002, OPG003, OPG015) are now all included. The genes on the 3' end (~positions 190000-197000) have the suffix `_dup` appended to their names to distinguish them from the 5' copies.
- The gene previously named `NBT03_gp052` is now called `OPG073`
- The gene previously named `NBT03_gp174` is now called `OPG016`
- The gene previously named `NBT03_gp175` is now called `OPG015_dup`
- Gene `OPG166` has been added
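
Results produced with the v2 dataset can be mapped onto the new gene names with a small lookup table. The following is a minimal sketch, not part of the dataset itself: it covers only the three renames listed above, the function names are hypothetical, and it assumes amino acid mutations are written as `GENE:CHANGE` strings.

```python
# v2 gene names mapped to their v3 names (only the renames listed in this changelog).
V2_TO_V3_GENE_NAMES = {
    "NBT03_gp052": "OPG073",
    "NBT03_gp174": "OPG016",
    "NBT03_gp175": "OPG015_dup",
}


def to_v3_gene_name(name: str) -> str:
    """Return the v3 name for a v2 gene name; unlisted names pass through unchanged."""
    return V2_TO_V3_GENE_NAMES.get(name, name)


def rename_aa_mutation(mutation: str) -> str:
    """Rename the gene prefix of a 'GENE:CHANGE' string (assumed format)."""
    gene, sep, change = mutation.partition(":")
    return to_v3_gene_name(gene) + sep + change
```

Names that are not in the table, including the newly added `OPG166`, are returned as they are.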
@@ -0,0 +1,27 @@
# Nextclade dataset for "Mpox virus (All Clades)"

| property | value |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| authors | [Cornelius Roemer](https://neherlab.org), [Richard Neher](https://neherlab.org), [Nextstrain](https://nextstrain.org) |
| data source | Genbank |
| workflow | [github.com/nextstrain/mpox/nextclade](https://github.com/nextstrain/mpox/nextclade) |
| issues | github.com/nextstrain/mpox/issues |
| nextclade data path | nextstrain/mpox/all-clades/rivers-with-ancestral-snps |
| title | Mpox virus (All Clades) |
| taxon | [NCBI:txid10244](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=10244) |
| annotation | [NC_063383.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_063383) |
| clade definitions | github.com/mpxv-lineages/lineage-designation |
| references | [Urgent need for a non-discriminatory and non-stigmatizing nomenclature for monkeypox virus](https://doi.org/10.1371/journal.pbio.3001769) |
| related datasets    | Mpox virus (Clade IIb): `nextstrain/mpox/lineage-iib/rivers`<br>Mpox virus (Lineage B.1): `nextstrain/mpox/lineage-iib/rivers-with-usa-2022-ma001-snps` |

This Nextclade dataset is intended for use with Mpox viruses of all clades (I, IIa and IIb). If your sequences are all from clade IIb, you may want to use the more specific dataset for that clade instead: `nextstrain/mpox/lineage-iib/rivers`. If your sequences are not only all from clade IIb but also specifically from the 2022 outbreak lineage B.1 (and sublineages), you may want to use the even more specific dataset for that lineage instead: `nextstrain/mpox/lineage-iib/rivers-with-usa-2022-ma001-snps`. The more specific a dataset is, the faster it will run, the less overwhelming the results will be (fewer SNPs), and the more relevant the reference sequences included for tree placement will be.
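
The selection rule described above can be sketched as a small helper. This is an illustrative sketch only: the function names and the `(clade, lineage)` input shape are assumptions, and the lineage check is a plain prefix match on `B.1` that does not resolve aliased lineage names.

```python
# Dataset paths as given in this README.
ALL_CLADES = "nextstrain/mpox/all-clades/rivers-with-ancestral-snps"
CLADE_IIB = "nextstrain/mpox/lineage-iib/rivers"
LINEAGE_B1 = "nextstrain/mpox/lineage-iib/rivers-with-usa-2022-ma001-snps"


def is_b1_or_sublineage(lineage):
    """True for 'B.1' and dotted sublineages such as 'B.1.1' (aliases are not resolved)."""
    return lineage == "B.1" or lineage.startswith("B.1.")


def pick_dataset(samples):
    """Pick the most specific dataset for an iterable of (clade, lineage) pairs.

    `lineage` may be None when it is unknown.
    """
    samples = list(samples)
    # Any sequence outside clade IIb (or no information at all) needs the all-clades dataset.
    if not samples or any(clade != "IIb" for clade, _ in samples):
        return ALL_CLADES
    # Only if every sequence is B.1 or a sublineage is the B.1 dataset appropriate.
    if all(lin is not None and is_b1_or_sublineage(lin) for _, lin in samples):
        return LINEAGE_B1
    return CLADE_IIB
```

For example, a batch containing one clade I sequence falls back to the all-clades dataset, while a batch of only B.1 and B.1.x sequences selects the B.1 dataset.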

The dataset supports calling the broad Mpox virus clades (I, IIa, IIb) and, for sequences within clade IIb, the finer-grained lineages (A, A.1, A.2, A.3, A.1.1, B.1, etc.). The clade and lineage nomenclature is outlined in [Urgent need for a non-discriminatory and non-stigmatizing nomenclature for monkeypox virus](https://doi.org/10.1371/journal.pbio.3001769). The ground truth for lineage definitions is available at [github.com/mpxv-lineages/lineage-designation](https://github.com/mpxv-lineages/lineage-designation). This dataset will be updated as new lineages are designated.

The reference used in this dataset is based on the mpox virus NCBI RefSeq NC_063383.1 (`MPXV-M5312_HM12_Rivers`), modified to carry the SNPs (but not indels) that were inferred, using a suitable orthopox outgroup, to be present in the common ancestor of clades I and II. Because no indels were introduced, the nucleotide and amino acid coordinates are identical to those of NC_063383.1.

The sequences used to construct the reference tree are obtained from the `/ingest` workflow of the `nextstrain/monkeypox` repo. This workflow downloads and processes sequences from NCBI/Genbank. Sequences are sampled from all clades and lineages, across time and countries, to give a representative sample of Mpox virus diversity. The sequences are then aligned with Nextclade and a maximum likelihood tree is inferred with IQ-TREE. The tree is then rooted with the reconstructed ancestral sequence, and ancestral states are inferred with TreeTime.

## Further reading

Read more about Nextclade datasets in the Nextclade documentation: https://docs.nextstrain.org/projects/nextclade/en/stable/user/datasets.html