Merge pull request #1497 from nextstrain/docs/rel-muts

nextstrain · Jul 2, 2024 · d90bd66
2 parents 3de1907 + d54a0f1
commit d90bd66

Showing 12 changed files with 259 additions and 142 deletions.
diff --git a/docs/Dockerfile b/docs/Dockerfile
@@ -56,6 +56,7 @@ WORKDIR /workdir
 COPY environment.yml /workdir
 
 RUN set -euxo pipefail >/dev/null \
+  && conda config --add channels conda-forge \
   && conda env create -n "docs.clades.nextstrain.org"
 
 USER ${USER}

diff --git a/docs/user/algorithm/03-mutation-calling.md b/docs/user/algorithm/03-mutation-calling.md
diff --git a/...er/algorithm/05-phylogenetic-placement.md → ...er/algorithm/03-phylogenetic-placement.md b/...er/algorithm/05-phylogenetic-placement.md → ...er/algorithm/03-phylogenetic-placement.md
@@ -1,4 +1,4 @@
-# 5. Phylogenetic placement
+# 3. Phylogenetic placement
 
 After reference alignment and mutation calling, Nextclade places each query sequence on the reference phylogenetic tree.
 

diff --git a/docs/user/algorithm/06-clade-assignment.md → docs/user/algorithm/04-clade-assignment.md b/docs/user/algorithm/06-clade-assignment.md → docs/user/algorithm/04-clade-assignment.md
@@ -1,4 +1,4 @@
-# 6. Clade assignment
+# 4. Clade assignment
 
 To simplify discussion of co-circulating virus variants, viral diversity is often broken down into [Clades](../terminology.html#clade) or lineages which are defined by specific combinations of signature mutations. Clades are groups of related sequences that share a common ancestor. For SARS-CoV-2, Nextclade can assign both broad clades defined by the Nextstrain team as well as more fine-grained lineages defined by the PANGO consortium.
 

diff --git a/docs/user/algorithm/05-mutation-calling.md b/docs/user/algorithm/05-mutation-calling.md
@@ -0,0 +1,83 @@
+# 5. Mutation calling
+
+Nextclade calls nucleotide and amino acid mutations relative to multiple targets.
+
+### Mutations relative to reference sequence
+
+In order to detect nucleotide mutations, aligned nucleotide sequences are compared with the reference nucleotide sequence, one nucleotide at a time. Mismatches between the query and reference sequences are then noted and reported differently, depending on their nature:
+
+- Nucleotide substitutions: a change from one character to another, for example a change from `A` in the reference sequence to `G` in the query sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as colored markers, where the color signifies the resulting character (in the query sequence).
+
+- Nucleotide deletions ("gaps"): a nucleotide was present in the reference sequence, but is not present in the query sequence. These are indicated by the "`-`" character in the aligned sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as dark-grey markers. In the output files, deletions are represented as numeric ranges, signifying the start and end of the deleted fragment (for example: `21765-21770`).
+
+- Nucleotide insertions: additional nucleotides in the query sequence that were not present in the reference sequence. They are stripped from the alignment and reported separately, showing the position in the reference after which the insertion occurred and the fragment that was inserted. For example, `22030:ACT` indicates that the query sequence has the three bases `ACT` inserted between positions `22030` and `22031` in the reference sequence (the indices are 1-based).
+
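As a rough illustration of the comparison described above, the following Python sketch walks an aligned query against the reference and collects substitutions and deletion ranges. The function name `call_nucleotide_mutations` and the tuple layouts are hypothetical and are not Nextclade's actual API. The sketch assumes that insertions were already stripped from the alignment and that `N` and ambiguous characters are not reported as substitutions.

```python
def call_nucleotide_mutations(ref: str, query: str):
    """Compare an aligned query to the reference, one nucleotide at a time.

    Returns (substitutions, deletions), where substitutions are
    (ref_char, position, query_char) tuples and deletions are
    (start, end) ranges. Positions are 1-based and ranges are inclusive.
    """
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have equal length")

    substitutions = []
    deletions = []
    gap_start = None  # start of the deletion range currently being extended

    for pos, (r, q) in enumerate(zip(ref, query), start=1):
        if q == "-":
            # Extend the current gap or open a new one.
            if gap_start is None:
                gap_start = pos
            continue
        if gap_start is not None:
            # The gap ended at the previous position.
            deletions.append((gap_start, pos - 1))
            gap_start = None
        # Assumption: only unambiguous A/C/G/T count as substitutions.
        if q != r and q in "ACGT":
            substitutions.append((r, pos, q))

    if gap_start is not None:
        deletions.append((gap_start, len(ref)))

    return substitutions, deletions


subs, dels = call_nucleotide_mutations("ACGTACGT", "AGG--CGT")
print(subs, dels)
```

Grouping consecutive gap characters into ranges mirrors how deletions are described in the output files (for example `21765-21770`).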
+Nextclade also gathers and reports other useful statistics, such as the number of contiguous ranges of `N` (missing) and non-ACGTN (ambiguous) nucleotides, as well as the total counts of substituted, deleted, missing and ambiguous nucleotides. You can find this information in the results table of [Nextclade Web](../nextclade-web) and in the output files of [Nextclade CLI](../nextclade-cli).
+
+Similarly, amino acid mutations and statistics are gathered from the aligned peptides obtained after [translation](./02-translation). This step only runs if a [genome annotation](../input-files/03-genome-annotation) is provided.
+
+### Private mutations
+
+Following the [tree placement](03-phylogenetic-placement.md), Nextclade identifies "private mutations" - the mutations between the query sequence and the sequence corresponding to the nearest neighbor (parent) on the tree.
+
+In the figure, the query sequence (dashed) is compared to all sequences (including internal nodes) of the reference tree to identify the nearest neighbor. The yellow and dark green mutations are private mutations, as they occur in addition to the 3 mutations of the attachment node.
+
+![Identification of private mutations](../assets/algo_private-muts.png)
+
+Many sequence quality problems are identifiable by the presence of private mutations. Sequences with unusually many private mutations are unlikely to be biological and are thus [flagged as bad](06-quality-control.md#private-mutations-p).
+
+Nextclade classifies private mutations further into 3 categories to be more sensitive to potential contamination, co-infection and recombination:
+
+1. Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence.
+2. Labeled mutations: Private mutations to a genotype that is known to be common in a clade.
+3. Unlabeled mutations: Private mutations that are neither reversions nor labeled.
+
+For an illustration of these 3 types, see the figure below.
+
+![Classification of private mutations](../assets/algo_private-muts-classification.png)
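The three categories above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Nextclade's implementation: sequences are plain aligned strings, `labeled_genotypes` is a hypothetical set of `(position, nucleotide)` pairs standing in for the dataset's label definitions, and the function name `classify_private_mutations` is invented for this example.

```python
def classify_private_mutations(ref, parent, query, labeled_genotypes):
    """Classify differences between query and parent (attachment node).

    ref, parent and query are aligned strings of equal length.
    labeled_genotypes is a set of (1-based position, nucleotide) pairs
    (hypothetical stand-in for the dataset's labeled mutations).
    """
    result = {"reversion": [], "labeled": [], "unlabeled": []}
    for pos, (r, p, q) in enumerate(zip(ref, parent, query), start=1):
        if q == p:
            continue  # same as parent: not a private mutation
        name = f"{p}{pos}{q}"
        if q == r:
            # Parent is mutated relative to reference, query is back at reference.
            result["reversion"].append(name)
        elif (pos, q) in labeled_genotypes:
            result["labeled"].append(name)
        else:
            result["unlabeled"].append(name)
    return result


classes = classify_private_mutations(
    ref="ACGTA",
    parent="ATGTC",
    query="ACGGT",
    labeled_genotypes={(4, "G")},
)
print(classes)
```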
+
+Reversions are common artefacts in some bioinformatic pipelines when there is amplicon dropout and the missing sequence is "filled in" with the reference.
+They are also a sign of contamination, co-infection or recombination. Labeled mutations are also a common sign of contamination, co-infection or recombination and deserve special attention.
+
+For some datasets, reversions and labeled mutations are therefore weighted several times higher than unlabeled mutations due to their higher sensitivity and specificity for quality problems (and recombination).
+As of February 2022, the SARS-CoV-2 dataset weighted every reversion 6 times (`weightReversionSubstitutions`) and every labeled mutation 4 times (`weightLabeledSubstitutions`). Unlabeled mutations get weight 1 (`weightUnlabeledSubstitutions`).
+
+From the weighted sum, 8 (`typical`) is subtracted. The score is then a linear interpolation between 0 and 100 (and above), where 100 corresponds to 24 (`cutoff`).
+
+Private deletion ranges (including reversion) are currently counted as a single unlabeled substitution, but this could change in the future.
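The scoring described above can be sketched as a small Python function. The parameter names follow the config keys quoted in the text, with defaults set to the February 2022 SARS-CoV-2 values. One point is an assumption on my part: the sketch reads "100 corresponds to 24" as meaning that the score reaches 100 when the excess over `typical` equals `cutoff`. The function name `private_mutations_score` is hypothetical.

```python
def private_mutations_score(
    reversions: int,
    labeled: int,
    unlabeled: int,
    weight_reversion: float = 6.0,   # weightReversionSubstitutions
    weight_labeled: float = 4.0,     # weightLabeledSubstitutions
    weight_unlabeled: float = 1.0,   # weightUnlabeledSubstitutions
    typical: float = 8.0,
    cutoff: float = 24.0,
) -> float:
    """Weighted private-mutation QC score (sketch, see assumptions above)."""
    weighted_total = (
        weight_reversion * reversions
        + weight_labeled * labeled
        + weight_unlabeled * unlabeled
    )
    # Only the excess over the typical number of mutations contributes.
    excess = max(0.0, weighted_total - typical)
    # Linear scale: 0 at `typical`, 100 when the excess equals `cutoff`
    # (assumed interpretation), and above 100 beyond that.
    return excess * 100.0 / cutoff


# 2 reversions, 1 labeled and 4 unlabeled mutations:
# weighted total = 12 + 4 + 4 = 20, excess = 12, score = 50
print(private_mutations_score(2, 1, 4))
```

Under this reading, a sequence with only unlabeled private mutations would need 32 of them to reach a score of 100, while six reversions alone would give a weighted total of 36 and already exceed it.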
+
+### Clade founder search and mutations relative to clade founder
+
+For each query sample possessing a clade, Nextclade finds a corresponding "clade founder" node in the reference tree - the most ancestral node having the same clade. It starts with the parent node (nearest node) obtained during [tree placement](03-phylogenetic-placement.md) and traverses the tree towards the root, until it finds the last node with the same clade as the parent node.
+
+After that, Nextclade calls nucleotide and amino acid mutations relative to the clade founder.
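The traversal towards the root can be sketched in Python on a toy tree. The data layout (a dict mapping a node id to its parent id and clade) and the function name `find_clade_founder` are hypothetical and chosen only for illustration. Nextclade's real tree structures differ.

```python
# Hypothetical toy tree: node id -> (parent id, or None for the root; clade)
TREE = {
    "root": (None, "A"),
    "n1": ("root", "A"),
    "n2": ("n1", "B"),
    "n3": ("n2", "B"),
    "n4": ("n3", "B"),
}


def find_clade_founder(tree, start):
    """Walk from `start` towards the root while the clade stays the same.

    Returns the most ancestral node on that path that still has the same
    clade as `start`. This relies on clades being monophyletic: the walk
    stops at the first ancestor with a different clade.
    """
    clade = tree[start][1]
    founder = start
    parent = tree[founder][0]
    while parent is not None and tree[parent][1] == clade:
        founder = parent
        parent = tree[founder][0]
    return founder


# "n4" is the nearest (parent) node from tree placement in this example.
print(find_clade_founder(TREE, "n4"))
```

Mutations relative to the founder would then be called by comparing the query with the sequence of the returned node, in the same way as for the reference sequence.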
+
+The search and mutation calling happen separately for clades as well as for each custom clade-like attribute (unless excluded in the [pathogen config](../input-files/05-pathogen-config.md)).
+
+Clade founder search is a built-in convenience wrapper for a [node search and relative mutations](#arbitrary-node-search-and-relative-mutations) with pre-agreed search criteria (matching clades).
+
+> ⚠️ Nextclade assumes that all clades and clade-like attributes defined on the [input reference tree](../input-files/04-reference-tree.md) are [monophyletic](https://en.wikipedia.org/wiki/Monophyly). In this context it means that all nodes belonging to one clade form a single connected component on the tree. Moreover, the tree should be sufficiently large and diverse, such that early samples of each of the clades are well represented. Official Nextclade datasets enforce these requirements, but third-party dataset authors and users of their datasets need to take additional care.
+
+### Arbitrary node search and relative mutations
+
+In addition to the built-in search for clade founder nodes (see above), [dataset](../datasets.md) authors may define criteria for arbitrary nodes of interest on the [reference tree](../input-files/04-reference-tree.md). Nextclade will then search for these nodes, similarly to how it finds clade founder nodes, and will calculate mutations relative to each of them.
+
+This could be useful, for example, for comparing sequences to the vaccine strains.
+
+### Results
+
+The mutation calling step results in a set of mutations and various practical metrics for each sequence.
+
+Mutations can be viewed in the last column of the results table in [Nextclade Web](../nextclade-web).
+
+The "Genetic feature" dropdown allows switching between the nucleotide sequence and CDSes (if a genome annotation is provided). The "Relative to" dropdown allows selecting the target for comparison:
+
+- "Reference" - shows mutations relative to the [reference sequence](../input-files/02-reference-sequence.md)
+- "Parent" - shows private mutations, i.e. mutations relative to the parent (nearest) node
+- "Clade founder" - shows mutations relative to clade founder
+- "<attribute> founder" - shows mutations relative to clade-like attribute founder (if any defined)
+- any additional entries show mutations relative to the node(s) found according to the custom search criteria (if any defined)
+
+The "Mut" column shows the total number of nucleotide mutations, and its mouseover tooltip lists the mutations.
+
+All results are emitted into the output [JSON](../output-files/05-results-json), [CSV and TSV files](../output-files/04-results-tsv) in [Nextclade CLI](../nextclade-cli) and in the "Export" dialog of [Nextclade Web](../nextclade-web).
diff --git a/docs/user/algorithm/07-quality-control.md → docs/user/algorithm/06-quality-control.md b/docs/user/algorithm/07-quality-control.md → docs/user/algorithm/06-quality-control.md
@@ -1,4 +1,4 @@
-# 7. Quality Control (QC)
+# 6. Quality Control (QC)
 
 [Whole-genome sequencing](https://en.wikipedia.org/wiki/Whole_genome_sequencing) of viruses is a complex biotechnological process. Results can vary significantly in their quality, in particular, from scarce or degraded input material. Some parts of the sequence might be missing and the bioinformatic analysis pipelines that turn raw data into a consensus genome sometimes produce artefacts. Such artefacts typically manifest in spurious differences of the sequence from the reference.
 
@@ -11,7 +11,7 @@ Nextclade scans each query sequence for issues which may indicate problems occur
 For each query sequence each individual QC rule produces a quality score. These **individual QC scores** are empirically calibrated to fit the following thresholds:
 
 | Score         | Meaning            | Color designation |
-| ------------- | ------------------ | ----------------- |
+|---------------|--------------------|-------------------|
 | 0 to 29       | "good" quality     | green             |
 | 30 to 99      | "mediocre" quality | yellow            |
 | 100 and above | "bad" quality      | red               |
@@ -43,36 +43,7 @@ Ambiguous nucleotides (such as `R`, `Y`, etc) are often indicative of contaminat
 
 ### Private mutations (P)
 
-In order to assign clades, Nextclade places sequences on a reference tree that is representative of the global phylogeny (see figure below). The query sequence (dashed) is compared to all sequences (including internal nodes) of the reference tree to identify the nearest neighbor.
-
-As a by-product of this placement, Nextclade identifies the mutations, called "private mutations", that differ between the query sequence and the nearest neighbor sequence. In the figure, the yellow and dark green mutations are private mutations, as they occur in addition to the 3 mutations of the attachment node.
-
-![Identification of private mutations](../assets/algo_private-muts.png)
-
-Many sequence quality problems are identifiable by the presence of private mutations. Sequences with unusually many private mutations are unlikely to be biological and are thus flagged as bad.
-
-Since web version 1.13.0 (CLI 1.10.0), Nextclade classifies private mutations further into 3 categories to be more sensitive to potential contamination, co-infection and recombination:
-
-1. Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence.
-2. Labeled mutations: Private mutations to a genotype that is known to be common in a clade.
-3. Unlabeled mutations: Private mutations that are neither reversions nor labeled.
-
-For an illustration of these 3 types, see the figure below.
-
-![Classification of private mutations](../assets/algo_private-muts-classification.png)
-
-Reversions are common artefacts in some bioinformatic pipelines when there is amplicon dropout.
-They are also a sign of contamination, co-infection or recombination. Labeled mutations also contain commonly when there's contamination, co-infection or recombination.
-
-Reversions and labeled mutations are weighted several times higher than unlabeled mutations due to their higher sensitivity and specificity for quality problems (and recombination).
-In February 2022, every reversion was counted 6 times (`weightReversionSubstitutions`) while every labeled mutation was counted 4 times (`weightLabeledSubstitutions`). Unlabeled mutations get weight 1 (`weightUnlabeledSubstitutions`).
-
-From the weighted sum, 8 (`typical`) is subtracted. The score is then a linear interpolation between 0 and 100 (and above), where 100 corresponds to 24 (`cutoff`).
-
-Private deletion ranges (including reversion) are currently counted as a single unlabeled substitution, but this could change in the future.
-
-Which genotypes get "labeled" is determined in the dataset config file `virus_properties.json` which can also be found in the [Github repo](https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-02-07T12:00:00Z/files/virus_properties.json).
-Currently, all mutations that appear in at least 30% of the sequences of a clade or in at least 100k sequences in a clade get that clade's label.
+[Private mutations](05-mutation-calling.md#private-mutations) may indicate sequencing errors or unusual variants.
 
 ### Mutation clusters (C)
 

diff --git a/...orithm/04-pcr-primer-changes-detection.md → ...orithm/07-pcr-primer-changes-detection.md b/...orithm/04-pcr-primer-changes-detection.md → ...orithm/07-pcr-primer-changes-detection.md
@@ -1,4 +1,4 @@
-# 4. Detection of PCR primer changes
+# 7. Detection of PCR primer changes
 
 [Polymerase chain reaction (PCR)](https://en.wikipedia.org/wiki/Polymerase_chain_reaction) uses small nucleotide sequence snippets called "primers" that are [complementary](<https://en.wikipedia.org/wiki/Complementarity_(molecular_biology)>) to a specific region of the virus genome. High similarity between primers and the genome region they are supposed to bind to is required for PCR to work. Changes in the virus genome can interfere with this requirement. If Nextclade is provided with a table of PCR primers in the pathogen metadata file, it can analyze these regions in query sequences and report changes that may indicate reduced primer binding.
 

diff --git a/docs/user/algorithm/index.rst b/docs/user/algorithm/index.rst
@@ -10,8 +10,8 @@ Internally, Nextclade is implemented as a parallel pipeline which consists of se
 
     01-sequence-alignment.md
     02-translation.md
-    03-mutation-calling.md
-    04-pcr-primer-changes-detection.md
-    05-phylogenetic-placement.md
-    06-clade-assignment.md
-    07-quality-control.md
+    03-phylogenetic-placement.md
+    04-clade-assignment.md
+    05-mutation-calling.md
+    06-quality-control.md
+    07-pcr-primer-changes-detection.md
diff --git a/docs/user/input-files/03-genome-annotation.md b/docs/user/input-files/03-genome-annotation.md
@@ -6,7 +6,7 @@ The annotation is required for codon-aware alignment, for translation of CDS (Co
 
 Accepted formats: [GFF3](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3%2Emd).
 
-Since version 3, Nextclade supports multi-fragment CDSs which enable the correct translation of complex features including programmed ribosomal slippage (e.g. ORF1ab in SARS-CoV-2), genes crossing the origin of a circular genome (e.g. Hepatitis B virus) and CDS that require splicing (e.g. HIV).
+Nextclade supports multi-fragment CDSs which enable the correct translation of complex features including programmed ribosomal slippage (e.g. ORF1ab in SARS-CoV-2), genes crossing the origin of a circular genome (e.g. Hepatitis B virus) and CDS that require splicing (e.g. HIV).
 
 Almost any syntactically correct [spec](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3%2Emd)-compliant GFF3 annotation (e.g. downloaded from Genbank) should work. In practice, because the GFF3 format allows great freedom in how to express features as well as how to interpret them, some processing may be required to make it work satisfactorily in Nextclade.