Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ancestral / translate improvements & tests #1348

Merged
merged 7 commits into from
Dec 20, 2023

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Dec 4, 2023

Description of proposed changes

A number of small commits which both leave the code in a better place and set the stage for a number of forthcoming PRs which will solve (bigger) bugs. Tests using a small, hand-crafted genome have been added for both ancestral & translate. Command arguments have been improved in both the --help messages and the code which checks for incompatible arg combinations. Additionally two small bugs have been fixed - #1345 and #591. See commit messages for more.

Re: added tests, I have a summary PDF of the genome I created which I can add to this PR if that would be helpful?

image

Related issue(s)

#1345
#591

Checklist

  • Checks pass
  • If making user-facing changes, add a message in CHANGES.md summarizing the changes in this PR

@jameshadfield jameshadfield force-pushed the james/ancestral-arg-improvements branch 2 times, most recently from 9566edd to a529731 Compare December 4, 2023 21:00
@jameshadfield jameshadfield force-pushed the james/ancestral-translate-fixes branch 2 times, most recently from 2c201bf to 357345f Compare December 5, 2023 00:10
@jameshadfield jameshadfield force-pushed the james/ancestral-arg-improvements branch from a529731 to 5b0545f Compare December 7, 2023 00:24
Base automatically changed from james/ancestral-arg-improvements to master December 10, 2023 20:33
Copy link

codecov bot commented Dec 11, 2023

Codecov Report

Attention: 12 lines in your changes are missing coverage. Please review.

Comparison is base (657cd7b) 65.94% compared to head (03c1965) 66.04%.
Report is 4 commits behind head on master.

Files Patch % Lines
augur/translate.py 76.92% 5 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1348      +/-   ##
==========================================
+ Coverage   65.94%   66.04%   +0.10%     
==========================================
  Files          68       68              
  Lines        7156     7172      +16     
  Branches     1758     1761       +3     
==========================================
+ Hits         4719     4737      +18     
+ Misses       2173     2171       -2     
  Partials      264      264              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jameshadfield
Copy link
Member Author

Rebased onto master (which now includes #1344) and updated test formats as per #1176

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skimmed through this quickly - seems good! Just a couple small things.

- Ambiguous nucleotides (there are 3 Ns in sample_C) are inferred (default)

$ ${AUGUR} ancestral \
> --tree $TESTDIR/../data/simple-genome/tree.nwk \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(non-blocking, more about consistency)

Suggested change
> --tree $TESTDIR/../data/simple-genome/tree.nwk \
> --tree "$TESTDIR"/../data/simple-genome/tree.nwk \

This isn't a dealbreaker because $TESTDIR is known to not have spaces, but it's used across other tests probably just as good practice. Recommendation applies generally to variables in this test file.

Copy link
Member Author

@jameshadfield jameshadfield Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good shout - I think I've fixed these across all the modified tests (well, once the force push comes through I will have)

For my understanding, if we have something like DATA="$TESTDIR/../data" - do we then need to quote every string which includes $DATA? For instance, must --tree "${DATA}/tb/tree.nwk" be quoted like so, or is --tree ${DATA}/tb/tree.nwk going to be ok? (Assuming of course that $DATA has a space in it.)

For instance, across the board we do

export AUGUR="${AUGUR:-$TESTDIR/../../../../bin/augur}" # appropriate quoting
$ {AUGUR} translate ... # no quoting

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the DATA example, yes that requires quoting. Here's an example: 4281ed8

For $AUGUR, if I were writing all the tests from scratch, I would add quotes. I think it's easier to always quote as a rule of thumb, even if the value is known to not have spaces, just so I (and readers) don't have to think about whether it might contain spaces now or in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up for anyone revisiting this issue

We cannot use quoting for $AUGUR because our CI does not want it treated as a single word:

env:
AUGUR: coverage run -a ${{ github.workspace }}/bin/augur

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good point, thanks for noting! Unquoted should be fine as long as we don't use spaces in our test paths.

CHANGES.md Show resolved Hide resolved
While we had tests that (probably?) covered the code here, using a
simple genome is much easier to reason with and to validate every
mutation. In my opinion it's also important to see tests using "big"
datasets (e.g. those bigger than we can manually verify each and every
mutation) as not necessarily testing the accuracy of the command, but
rather testing for regressions / changes in behaviour.

VCF-input tests using the same data will be added in a future commit
because (spoiler!) there are bugs which need fixing.
The move to "Creating files in the initial working directory" is
motivated by
<#1344 (comment)>
and <#1176>.

Additionally, I remove the pushd commands which were confusing (there
were multiple!) and use variables to refer to common directories to
improve readability.
Uses the same approach as an earlier commit introduced for `augur
ancestral`, but this time for `augur translate`. Tests the `--genes`
argument as well for good measure. The exact format of GFF/GenBank files
here is extremely important, but future commits will hopefully relax
this.
This helps readability of the --help content. We enforce this behaviour
by adding a check in the code following the example in `augur ancestral`
Simplifies the reading of the main function, and follows the established
pattern in this file of using a pair of functions (one for VCF, one for
JSON/FASTA). Added docstrings. The only code change is to throw an
AugurError rather than `return 1`, but the result is unchanged.
There should be no functional changes. Co-locating the sequence reading
& feature extraction is easier to read, and were Python's scoping to be
different it'd be even nicer as we wouldn't leave around variables which
are never re-used.

As part of this `translate_vcf_feature` has changed from using an
undocumented `return None` to raising a (documented) error which (IMO)
is easier to reason with.
@jameshadfield
Copy link
Member Author

Thanks for the review @victorlin

Merging despite CI failure, as that's unrelated to this PR - it's failing while running the pathogen-repo-rsv workflow:

 /home/runner/micromamba/envs/augur/bin/bash: line 2: nextclade3: command not found

@jameshadfield jameshadfield merged commit fc900aa into master Dec 20, 2023
23 of 24 checks passed
@jameshadfield jameshadfield deleted the james/ancestral-translate-fixes branch December 20, 2023 00:43
@jameshadfield
Copy link
Member Author

Unfortunately a few commits were squashed together in this PR. It's now been merged so I'm not going to change anything.

03c1965 contains what was meant to be three commits - a small refactor + the following commits

[translate] require nodes to have sequences

This is the implicit expectation, and is true for all of our pipelines.
In theory it's possible to translate without having the full sequence
attached to the node by reconstructing mutations on the root, but that
is not the approach taken by the current code.

As we explicitly check the tree nodes against the sequences in the
node-data JSON we can skip the automatic tests optionally performed when
reading the node-data JSON.

Closes #1345
diff --git a/CHANGES.md b/CHANGES.md
index 64f40df5..1cfe73a2 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -8,7 +8,7 @@
 * ancestral: Improvements to command line arguments. [#1344][] (@jameshadfield)
      * Incompatible arguments are now checked, especially related to VCF vs FASTA inputs. 
      * `--vcf-reference` and `--root-sequence` are now mutually exclusive.
-
+* translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
 * translate: Improvements to command line arguments.  [#1348][] (@jameshadfield)
     * `--tree` and `--ancestral-sequences` are now required arguments.
     * separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 461738b2..1fc73040 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -20,6 +20,7 @@ from .io.vcf import write_VCF_translation
 from .utils import read_node_data, load_features, write_json, get_json_name
 from treetime.vcf_utils import read_vcf
 from augur.errors import AugurError
+from textwrap import dedent
 
 class MissingNodeError(Exception):
     pass
@@ -339,7 +340,7 @@ def sequences_json(node_data_json, tree):
     Extract the full nuc sequence for each node in the provided node-data JSON.
     Returns a dict, keys are node names and values are a string of the genome sequence (nuc)
     """
-    node_data = read_node_data(node_data_json, tree)
+    node_data = read_node_data(node_data_json)
     if node_data is None:
         raise AugurError("could not read node data (incl sequences)")
     # extract sequences from node meta data
@@ -347,6 +348,14 @@ def sequences_json(node_data_json, tree):
     for k,v in node_data['nodes'].items():
         if 'sequence' in v:
             sequences[k] = v['sequence']
+    tree_nodes = {c.name for c in tree.find_clades()}
+    tree_nodes_missing_sequences = tree_nodes - set(sequences.keys())
+    if len(tree_nodes_missing_sequences):
+        raise AugurError(dedent(f"""\
+            {len(tree_nodes_missing_sequences)} nodes on the tree are missing nucleotide sequences in the node-data JSON.
+            These must be present under 'nodes' → <node_name> → 'sequence'.
+            This error may originate from using 'augur ancestral' with VCF input; in this case try using VCF output from that command here.
+            """))
     return sequences
 
 def register_parser(parent_subparsers):
@@ -412,7 +421,7 @@ def run(args):
         if len(features_without_variation):
             print("{} genes had no mutations and so have been be excluded.".format(len(features_without_variation)))  
     else:
-        sequences = sequences_json(args.ancestral_sequences, args.tree)
+        sequences = sequences_json(args.ancestral_sequences, tree)
         translations = {fname: translate_feature(sequences, feat) for fname, feat in features.items() if feat.type != 'source'}
 
     ## glob the annotations for later auspice export
[translate] skip parsing source (nuc) as a feature

This check is already in place for non-VCF inputs, and my guess is it
was omitted here as the TB pipeline's GFF file didn't include a 'source'
annotation. I don't think 'source' is actually a valid GFF ID and I
suspect we've just been applying the INSDC/GenBank term to GFF files,
but it is one of the two fields parsed by `load_features` and there are
GFF files in Nextstrain build pipelines which use it. Modifying the
underlying `load_features` would be a better solution, but that's a
bigger project for another day.

We additionally update the error message to use the same feature name we
export.

Closes #591
Supersedes #1109
diff --git a/CHANGES.md b/CHANGES.md
index 1cfe73a2..54b14f8d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -9,6 +9,7 @@
      * Incompatible arguments are now checked, especially related to VCF vs FASTA inputs. 
      * `--vcf-reference` and `--root-sequence` are now mutually exclusive.
 * translate: Tree nodes are checked against the node-data JSON input to ensure sequences are present. [#1348][] (@jameshadfield)
+* translate: The 'source' ID for GFF files is now ignored as a potential gene feature. [#1348][] (@jameshadfield)
 * translate: Improvements to command line arguments.  [#1348][] (@jameshadfield)
     * `--tree` and `--ancestral-sequences` are now required arguments.
     * separate VCF-only arguments into their own group
diff --git a/augur/translate.py b/augur/translate.py
index 1fc73040..8c7253c3 100644
--- a/augur/translate.py
+++ b/augur/translate.py
@@ -129,7 +129,7 @@ def translate_feature(aln, feature):
     return translations
 
 
-def translate_vcf_feature(sequences, ref, feature):
+def translate_vcf_feature(sequences, ref, feature, feature_name):
     '''Translates a subsequence of input nucleotide sequences.
 
     Parameters
@@ -168,7 +168,7 @@ def translate_vcf_feature(sequences, ref, feature):
     # Need to get ref translation to store. check if multiple of 3 for sanity.
     # will be padded in safe_translate if not
     if len(refNuc)%3:
-        print("Gene length of {} is not a multiple of 3. will pad with N".format(feature.qualifiers['Name'][0]), file=sys.stderr)
+        print(f"Gene length of {feature_name!r} is not a multiple of 3. will pad with N", file=sys.stderr)
 
     ref_aa_seq = safe_translate(refNuc)
     prot['reference'] = ref_aa_seq
@@ -409,13 +409,16 @@ def run(args):
     print("Read in {} features from reference sequence file".format(len(features)))
 
     ## Read in sequences & for each sequence translate each feature _except for_ the source (nuc) feature
+    ## Note that `load_features` _only_ extracts {'gene', 'source'} for GFF files, {'CDS', 'source'} for GenBank.
     translations = {}
     if is_vcf:
         (sequences, ref) = sequences_vcf(args.vcf_reference, args.ancestral_sequences)
         features_without_variation = []
         for fname, feat in features.items():
+            if feat.type=='source':
+                continue
             try:
-                translations[fname] = translate_vcf_feature(sequences, ref, feat)
+                translations[fname] = translate_vcf_feature(sequences, ref, feat, fname)
             except NoVariationError:
                 features_without_variation.append(fname)
         if len(features_without_variation):
diff --git a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
index f0d25bee..1de2856e 100644
--- a/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
+++ b/tests/functional/translate/cram/translate-with-gff-and-locus-tag.t
@@ -15,7 +15,7 @@ Translate amino acids for genes using a GFF3 file where the gene names are store
   >   --output-node-data aa_muts.json \
   >   --alignment-output translations.vcf \
   >   --vcf-reference-output translations_reference.fasta
-  Gene length of rrs_Rvnr01 is not a multiple of 3. will pad with N
+  Gene length of 'rrs' is not a multiple of 3. will pad with N
   Read in 187 specified genes to translate.
   Read in 187 features from reference sequence file
   162 genes had no mutations and so have been be excluded.

@victorlin
Copy link
Member

@jameshadfield thanks for writing that! I've copied it into a comment on 03c1965.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants