Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue a warning if the number of alt alleles exceeds the maximum specified #620

Merged
merged 3 commits into from
Jul 29, 2021

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Jul 6, 2021

Fixes #487. The maximum number of alleles found while parsing is stored in ds.attrs["max_alt_alleles_seen"] (in all cases).

@tomwhite tomwhite force-pushed the vcf-max-alt-allele-warning branch from 7cf8b97 to 439ed89 Compare July 6, 2021 11:01
@codecov-commenter
Copy link

codecov-commenter commented Jul 6, 2021

Codecov Report

Merging #620 (45aafb4) into main (5f43ecc) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #620   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           35        35           
  Lines         2815      2829   +14     
=========================================
+ Hits          2815      2829   +14     
Impacted Files Coverage Δ
sgkit/io/utils.py 100.00% <100.00%> (ø)
sgkit/io/vcf/__init__.py 100.00% <100.00%> (ø)
sgkit/io/vcf/vcf_reader.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f43ecc...45aafb4. Read the comment docs.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment about marking subsequent alleles as missing data - I think this would be easier to document and lead to fewer bugs in the long run.

@@ -640,7 +652,8 @@ def vcf_to_zarr(
specified ploidy will raise an exception.
max_alt_alleles
The (maximum) number of alternate alleles in the VCF file. Any records with more than
this number of alternate alleles will have the extra alleles dropped.
this number of alternate alleles will have the extra alleles dropped (the `variant_allele`
variable will be truncated). Call genotype fields will however be unaffected.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in this case the user may get an index error if they use alleles[genotypes[..]], right? It might not be obvious to people that this just means they won't be able to find out the allelic value that corresponds to the genotype value.

Code may break downstream in unexpected ways because of this as we violating a basic invariant of the data model. Maybe it would be better to mark subsequent alleles as missing data (-1)? At least well-written code would deal with this in a predictable way, whereas having to deal with arbitrary missing alleles at every step could be quite tedious.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on this @eric-czech?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 handling call data. I wonder if there should be a flag to handle the alt allele overflow - with options: error, turncate, freq_truncate?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought I had was that we could store the entire ALT field as another variable (of type string), which could be used for reconstructing more alleles without having to regenerate the whole dataset. This might be orthogonal to what we are discussing here though.

I also wondered if we could use Zarr ragged arrays, but I don't think you can combine them with variable length strings - and even if you could they don't play well with the general array data model, which assumes fixed array dimensions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ragged arrays would probably be the best solution in the long run, but that's a big change, as you say. Maybe we should break that discussion off into an issue, so we can explore the pros and cons?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, if we stored the ALT field as another variable (just a single comma-delimited string, not a ragged array), then I think there is more of a case for keeping all the allele values in the genotype field, since it would be possible to find the allele string if needed. (This would also allow the fixed length alleles field to be regenerated if needed, without having to read all the data in from scratch.)

We could document the contract here (i.e. the possibility of indexing out of the alleles array), and also a pattern of how to use xr.where to ensure the genotypes don't index out of the alleles array. This would effectively do the truncation when accessing the data.

I'm also a bit wary of marking some genotypes as missing with value -1, since then it's not possible to distinguish if they were missing in the original file, or because they were truncated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the overflow issue @tomwhite? #640

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! That's a strong argument for truncating.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed - clinches it in my view. We don't want two different behaviours when having different flavours of "allele out of bounds", do we?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now fixed in the latest commit.

@mergify
Copy link
Contributor

mergify bot commented Jul 8, 2021

This PR has conflicts, @tomwhite please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Jul 8, 2021
@tomwhite
Copy link
Collaborator Author

Rebased, with no other changes for the time being.

@mergify mergify bot removed the conflict PR conflict label Jul 26, 2021
@mergify
Copy link
Contributor

mergify bot commented Jul 29, 2021

This PR has conflicts, @tomwhite please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Jul 29, 2021
@mergify mergify bot removed the conflict PR conflict label Jul 29, 2021
Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Jul 29, 2021
@mergify mergify bot merged commit f487994 into sgkit-dev:main Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue warning if more alternate alleles than max_alt_alleles when reading VCF
5 participants