ancestral / translate improvements & tests #1348

jameshadfield · 2023-12-04T20:43:19Z

Description of proposed changes

A number of small commits which both leave the code in a better place and set the stage for a number of forthcoming PRs which will solve (bigger) bugs. Tests using a small, hand-crafted genome have been added for both ancestral & translate. Command arguments have been improved in both the --help messages and the code which checks for incompatible arg combinations. Additionally two small bugs have been fixed - #1345 and #591. See commit messages for more.

Re: added tests, I have a summary PDF of the genome I created which I can add to this PR if that would be helpful?

Related issue(s)

#1345
#591

Checklist

Checks pass
If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

codecov · 2023-12-11T01:11:20Z

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (657cd7b) 65.94% compared to head (03c1965) 66.04%.
Report is 4 commits behind head on master.

Files	Patch %	Lines
augur/translate.py	76.92%	5 Missing and 7 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1348      +/-   ##
==========================================
+ Coverage   65.94%   66.04%   +0.10%     
==========================================
  Files          68       68              
  Lines        7156     7172      +16     
  Branches     1758     1761       +3     
==========================================
+ Hits         4719     4737      +18     
+ Misses       2173     2171       -2     
  Partials      264      264

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

jameshadfield · 2023-12-11T01:13:01Z

Rebased onto master (which now includes #1344) and updated test formats as per #1176

victorlin

Skimmed through this quickly - seems good! Just a couple small things.

victorlin · 2023-12-19T00:28:59Z

tests/functional/ancestral/cram/general.t

+- Ambiguous nucleotides (there are 3 Ns in sample_C) are inferred (default)
+
+  $ ${AUGUR} ancestral \
+  >  --tree $TESTDIR/../data/simple-genome/tree.nwk \


(non-blocking, more about consistency)

Suggested change

> --tree $TESTDIR/../data/simple-genome/tree.nwk \

> --tree "$TESTDIR"/../data/simple-genome/tree.nwk \

This isn't a dealbreaker because $TESTDIR is known to not have spaces, but it's used across other tests probably just as good practice. Recommendation applies generally to variables in this test file.

Yeah good shout - I think I've fixed these across all the modified tests (well, once the force push comes through I will have)

For my understanding, if we have something like DATA="$TESTDIR/../data" - do we then need to quote every string which includes $DATA? For instance, must --tree "${DATA}/tb/tree.nwk" be quoted like so, or is --tree ${DATA}/tb/tree.nwk going to be ok? (Assuming of course that $DATA has a space in it.)

For instance, across the board we do

export AUGUR="${AUGUR:-$TESTDIR/../../../../bin/augur}" # appropriate quoting $ {AUGUR} translate ... # no quoting

For the DATA example, yes that requires quoting. Here's an example: 4281ed8

For $AUGUR, if I were writing all the tests from scratch, I would add quotes. I think it's easier to always quote as a rule of thumb, even if the value is known to not have spaces, just so I (and readers) don't have to think about whether it might contain spaces now or in the future.

Following up for anyone revisiting this issue

We cannot use quoting for $AUGUR because our CI does not want it treated as a single word:

augur/.github/workflows/ci.yaml

Lines 67 to 68 in 9cd915a

env:

AUGUR: coverage run -a ${{ github.workspace }}/bin/augur

Oh that's a good point, thanks for noting! Unquoted should be fine as long as we don't use spaces in our test paths.

CHANGES.md

While we had tests that (probably?) covered the code here, using a simple genome is much easier to reason with and to validate every mutation. In my opinion it's also important to see tests using "big" datasets (e.g. those bigger than we can manually verify each and every mutation) as not necessarily testing the accuracy of the command, but rather testing for regressions / changes in behaviour. VCF-input tests using the same data will be added in a future commit because (spoiler!) there are bugs which need fixing.

The move to "Creating files in the initial working directory" is motivated by <#1344 (comment)> and <#1176>. Additionally, I remove the pushd commands which were confusing (there were multiple!) and use variables to refer to common directories to improve readability.

Uses the same approach as an earlier commit introduced for `augur ancestral`, but this time for `augur translate`. Tests the `--genes` argument as well for good measure. The exact format of GFF/GenBank files here is extremely important, but future commits will hopefully relax this.

This helps readability of the --help content. We enforce this behaviour by adding a check in the code following the example in `augur ancestral`

Simplifies the reading of the main function, and follows the established pattern in this file of using a pair of functions (one for VCF, one for JSON/FASTA). Added docstrings. The only code change is to throw an AugurError rather than `return 1`, but the result is unchanged.

There should be no functional changes. Co-locating the sequence reading & feature extraction is easier to read, and were Python's scoping to be different it'd be even nicer as we wouldn't leave around variables which are never re-used. As part of this `translate_vcf_feature` has changed from using an undocumented `return None` to raising a (documented) error which (IMO) is easier to reason with.

jameshadfield · 2023-12-20T00:43:06Z

Thanks for the review @victorlin

Merging despite CI failure, as that's unrelated to this PR - it's failing while running the pathogen-repo-rsv workflow:

 /home/runner/micromamba/envs/augur/bin/bash: line 2: nextclade3: command not found

jameshadfield · 2023-12-20T01:54:41Z

Unfortunately a few commits were squashed together in this PR. It's now been merged so I'm not going to change anything.

03c1965 contains what was meant to be three commits - a small refactor + the following commits

[translate] require nodes to have sequences

This is the implicit expectation, and is true for all of our pipelines.
In theory it's possible to translate without having the full sequence
attached to the node by reconstructing mutations on the root, but that
is not the approach taken by the current code.

As we explicitly check the tree nodes against the sequences in the
node-data JSON we can skip the automatic tests optionally performed when
reading the node-data JSON.

Closes #1345

diff --git a/CHANGES.md b/CHANGES.md
index 64f40df5..1cfe73a2 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -8,7 +8,7 @@
 * ancestral: Improvements to command line arguments. [#1344][] (@jameshadfield)
      * Incompatible arguments are now checked, especially related to VCF vs FASTA inputs. 
      * `--vcf-reference` and `--root-sequence` are now mutually exclusive.
-
+* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
 * translate: Improvements to command line arguments.  [#1348][] (@jameshadfield)
     * `--tree` and `--ancestral-sequences` are now required arguments.
     * separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 461738b2..1fc73040 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -20,6 +20,7 @@ from .io.vcf import write_VCF_translation
 from .utils import read_node_data, load_features, write_json, get_json_name
 from treetime.vcf_utils import read_vcf
 from augur.errors import AugurError
+from textwrap import dedent
 
 class MissingNodeError(Exception):
     pass
@@ -339,7 +340,7 @@ def sequences_json(node_data_json, tree):
     Extract the full nuc sequence for each node in the provided node-data JSON.
     Returns a dict, keys are node names and values are a string of the genome sequence (nuc)
     """
-    node_data = read_node_data(node_data_json, tree)
+    node_data = read_node_data(node_data_json)
     if node_data is None:
         raise AugurError("could not read node data (incl sequences)")
     # extract sequences from node meta data
@@ -347,6 +348,14 @@ def sequences_json(node_data_json, tree):
     for k,v in node_data['nodes'].items():
         if 'sequence' in v:
             sequences[k] = v['sequence']
+    tree_nodes = {c.name for c in tree.find_clades()}
+    tree_nodes_missing_sequences = tree_nodes - set(sequences.keys())
+    if len(tree_nodes_missing_sequences):
+        raise AugurError(dedent(f"""\
+            {len(tree_nodes_missing_sequences)} nodes on the tree are missing nucleotide sequences in the node-data JSON.
+            These must be present under 'nodes' → <node_name> → 'sequence'.
+            This error may originate from using 'augur ancestral' with VCF input; in this case try using VCF output from that command here.
+            """))
     return sequences
 
 def register_parser(parent_subparsers):
@@ -412,7 +421,7 @@ def run(args):
         if len(features_without_variation):
             print("{} genes had no mutations and so have been be excluded.".format(len(features_without_variation)))  
     else:
-        sequences = sequences_json(args.ancestral_sequences, args.tree)
+        sequences = sequences_json(args.ancestral_sequences, tree)
         translations = {fname: translate_feature(sequences, feat) for fname, feat in features.items() if feat.type != 'source'}
 
     ## glob the annotations for later auspice export

[translate] skip parsing source (nuc) as a feature

This check is already in place for non-VCF inputs, and my guess is it
was omitted here as the TB pipeline's GFF file didn't include a 'source'
annotation. I don't think 'source' is actually a valid GFF ID and I
suspect we've just been applying the INSDC/GenBank term to GFF files,
but it is one of the two fields parsed by `load_features` and there are
GFF files in Nextstrain build pipelines which use it. Modifying the
underlying `load_features` would be a better solution, but that's a
bigger project for another day.

We additionally update the error message to use the same feature name we
export.

Closes #591
Supersedes #1109

diff --git a/CHANGES.md b/CHANGES.md
index 1cfe73a2..54b14f8d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -9,6 +9,7 @@
      * Incompatible arguments are now checked, especially related to VCF vs FASTA inputs. 
      * `--vcf-reference` and `--root-sequence` are now mutually exclusive.
 * translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
+* translate: The 'source' ID for GFF files is now ignored as a potential gene feature. [#1348][] (@jameshadfield)
 * translate: Improvements to command line arguments.  [#1348][] (@jameshadfield)
     * `--tree` and `--ancestral-sequences` are now required arguments.
     * separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 1fc73040..8c7253c3 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -129,7 +129,7 @@ def translate_feature(aln, feature):
     return translations
 
 
-def translate_vcf_feature(sequences, ref, feature):
+def translate_vcf_feature(sequences, ref, feature, feature_name):
     '''Translates a subsequence of input nucleotide sequences.
 
     Parameters
@@ -168,7 +168,7 @@ def translate_vcf_feature(sequences, ref, feature):
     # Need to get ref translation to store. check if multiple of 3 for sanity.
     # will be padded in safe_translate if not
     if len(refNuc)%3:
-        print("Gene length of {} is not a multiple of 3. will pad with N".format(feature.qualifiers['Name'][0]), file=sys.stderr)
+        print(f"Gene length of {feature_name!r} is not a multiple of 3. will pad with N", file=sys.stderr)
 
     ref_aa_seq = safe_translate(refNuc)
     prot['reference'] = ref_aa_seq
@@ -409,13 +409,16 @@ def run(args):
     print("Read in {} features from reference sequence file".format(len(features)))
 
     ## Read in sequences & for each sequence translate each feature _except for_ the source (nuc) feature
+    ## Note that `load_features` _only_ extracts {'gene', 'source'} for GFF files, {'CDS', 'source'} for GenBank.
     translations = {}
     if is_vcf:
         (sequences, ref) = sequences_vcf(args.vcf_reference, args.ancestral_sequences)
         features_without_variation = []
         for fname, feat in features.items():
+            if feat.type=='source':
+                continue
             try:
-                translations[fname] = translate_vcf_feature(sequences, ref, feat)
+                translations[fname] = translate_vcf_feature(sequences, ref, feat, fname)
             except NoVariationError:
                 features_without_variation.append(fname)
         if len(features_without_variation):
diff --git a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
index f0d25bee..1de2856e 100644
--- a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
+++ b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
@@ -15,7 +15,7 @@ Translate amino acids for genes using a GFF3 file where the gene names are store
   >   --output-node-data aa_muts.json \
   >   --alignment-output translations.vcf \
   >   --vcf-reference-output translations_reference.fasta
-  Gene length of rrs_Rvnr01 is not a multiple of 3. will pad with N
+  Gene length of 'rrs' is not a multiple of 3. will pad with N
   Read in 187 specified genes to translate.
   Read in 187 features from reference sequence file
   162 genes had no mutations and so have been be excluded.

victorlin · 2023-12-20T18:40:57Z

@jameshadfield thanks for writing that! I've copied it into a comment on 03c1965.

jameshadfield force-pushed the james/ancestral-arg-improvements branch 2 times, most recently from 9566edd to a529731 Compare December 4, 2023 21:00

jameshadfield force-pushed the james/ancestral-translate-fixes branch 2 times, most recently from 2c201bf to 357345f Compare December 5, 2023 00:10

jameshadfield force-pushed the james/ancestral-arg-improvements branch from a529731 to 5b0545f Compare December 7, 2023 00:24

jameshadfield mentioned this pull request Dec 7, 2023

[augur ancestral] arg parsing improvements #1344

Merged

Base automatically changed from james/ancestral-arg-improvements to master December 10, 2023 20:33

jameshadfield force-pushed the james/ancestral-translate-fixes branch from 357345f to 84343d8 Compare December 11, 2023 01:11

victorlin approved these changes Dec 19, 2023

View reviewed changes

jameshadfield added 7 commits December 20, 2023 12:10

[translate] make required arguments required

112e73a

[translate] separate out VCF-specific arguments

efb43b0

This helps readability of the --help content. We enforce this behaviour by adding a check in the code following the example in `augur ancestral`

jameshadfield force-pushed the james/ancestral-translate-fixes branch from 84343d8 to 03c1965 Compare December 19, 2023 23:24

jameshadfield merged commit fc900aa into master Dec 20, 2023
23 of 24 checks passed

jameshadfield deleted the james/ancestral-translate-fixes branch December 20, 2023 00:43

This was referenced Dec 20, 2023

CI failure: nextclade3 command not found nextstrain/rsv#50

Closed

Improve parsing of GenBank / GFF files #1351

Merged

This was referenced Dec 20, 2023

Error in translate.py from vcf #591

Closed

Fix translate #1109

Closed

ancestral, translate incompatible under certain invocations #1345

Closed

jameshadfield mentioned this pull request Jan 25, 2024

ENH(translate): Better input arg validation for #1181

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ancestral / translate improvements & tests #1348

ancestral / translate improvements & tests #1348

jameshadfield commented Dec 4, 2023 •

edited

Loading

codecov bot commented Dec 11, 2023 •

edited

Loading

jameshadfield commented Dec 11, 2023

victorlin left a comment

victorlin Dec 19, 2023

jameshadfield Dec 19, 2023 •

edited

Loading

victorlin Dec 20, 2023

jameshadfield Jan 22, 2024

victorlin Jan 22, 2024

jameshadfield commented Dec 20, 2023

jameshadfield commented Dec 20, 2023

victorlin commented Dec 20, 2023

	> --tree $TESTDIR/../data/simple-genome/tree.nwk \
	> --tree "$TESTDIR"/../data/simple-genome/tree.nwk \

	env:
	AUGUR: coverage run -a ${{ github.workspace }}/bin/augur

ancestral / translate improvements & tests #1348

ancestral / translate improvements & tests #1348

Conversation

jameshadfield commented Dec 4, 2023 • edited Loading

Description of proposed changes

Related issue(s)

Checklist

codecov bot commented Dec 11, 2023 • edited Loading

Codecov Report

jameshadfield commented Dec 11, 2023

victorlin left a comment

Choose a reason for hiding this comment

victorlin Dec 19, 2023

Choose a reason for hiding this comment

jameshadfield Dec 19, 2023 • edited Loading

Choose a reason for hiding this comment

victorlin Dec 20, 2023

Choose a reason for hiding this comment

jameshadfield Jan 22, 2024

Choose a reason for hiding this comment

victorlin Jan 22, 2024

Choose a reason for hiding this comment

jameshadfield commented Dec 20, 2023

jameshadfield commented Dec 20, 2023

victorlin commented Dec 20, 2023

jameshadfield commented Dec 4, 2023 •

edited

Loading

codecov bot commented Dec 11, 2023 •

edited

Loading

jameshadfield Dec 19, 2023 •

edited

Loading