Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove unused max length calculations #932

Merged
merged 3 commits into from
Oct 18, 2022

Conversation

benjeffery
Copy link
Collaborator

@benjeffery benjeffery commented Oct 14, 2022

I'm working with TB VCF files, some of which have ~2000 alt alleles, requiring a high max_alt_alleles argument to vcf_to_zarr.
It seems VCF performance scales very poorly with this parameter.
Profiling showed that 40% of this time was spent running max in vcf_to_zarr_sequential, as max is called with a max_alt_alleles length array for each variant. However the result of this operation isn't used (as far as I could tell) so this PR removes those lines.

@benjeffery benjeffery mentioned this pull request Oct 15, 2022
@jeromekelleher
Copy link
Collaborator

Hmm, suspicious these aren't being used. Maybe we're just not hitting them in the test cases? Defer to @tomwhite here.

@benjeffery
Copy link
Collaborator Author

Hmm, suspicious these aren't being used. Maybe we're just not hitting them in the test cases? Defer to @tomwhite here.

The lines removed here are the only references to function-local variables, so unless there is some funky introspection code elsewhere, I don't see how they could be used.

@tomwhite
Copy link
Collaborator

The lines removed here are the only references to function-local variables, so unless there is some funky introspection code elsewhere, I don't see how they could be used.

There is some introspection code - but it's not working as intended:

https://github.com/pystatgen/sgkit/blob/c7be7534f4db7fbef4e499f5acd4df10562928f8/sgkit/io/vcfzarr_reader.py#L346-L348

Note that it's expecting max_length_variant_id, while the code removed in this PR is max_variant_id_length.

The basic idea behind the code is to find the longest string (for variant IDs and alleles) so that when the zarr files are concatenated, the correct "S" dtype can be set.

It would be great to improve this - ideally so none of this is necessary - but finding a path through that works with strings in NumPy, Zarr, and Xarray is tricky!

I think the first thing to do is to work out why the max_length_variant_id/max_variant_id_length mismatch didn't call a failed test, and add one that does fail to expose the problem.

@benjeffery
Copy link
Collaborator Author

benjeffery commented Oct 17, 2022

max_variant_id_length was never placed in the zarr attributes by this code. But max_length_variant_id is added here:

https://github.com/pystatgen/sgkit/blob/c7be7534f4db7fbef4e499f5acd4df10562928f8/sgkit/io/vcfzarr_reader.py#L260-L272

So maybe that code is working correctly?

@benjeffery
Copy link
Collaborator Author

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Oct 17, 2022

rebase

✅ Branch has been successfully rebased

@tomwhite
Copy link
Collaborator

So maybe that code is working correctly?

Another complicating factor is that there are two code paths - one for scikit-allel Zarr files ("vcfzarr", code in vcfzarr_reader.py) and one for converting regular VCF files to Zarr files (code in sgkit/io/vcf). So it looks like the first one may be working, but the second hasn't been working.

@tomwhite
Copy link
Collaborator

There's quite a bit of history here - the key PR which explains it is #665, and the simplification carried out later in #741 (to fix #678) was to remove the special cases needed to set the fixed string lengths. However, the code removed in this PR was missed there, so I think it's correct to remove it now.

As far as testing is concerned, I'd like to add a test to check that some strings are as expected (i.e. they haven't been inadvertently truncated).

Also, as I mentioned above, we still need the fixed-string-length code for the scikit-allel import case.

@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2022

Codecov Report

Merging #932 (95f88e1) into main (ee9ccfb) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##              main      #932   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           39        39           
  Lines         3275      3268    -7     
=========================================
- Hits          3275      3268    -7     
Impacted Files Coverage Δ
sgkit/io/vcf/vcf_reader.py 100.00% <ø> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@tomwhite
Copy link
Collaborator

@benjeffery if you are happy with the new test I think this is ready to merge.

@tomwhite tomwhite mentioned this pull request Oct 17, 2022
@jeromekelleher jeromekelleher added auto-merge Auto merge label for mergify test flight and removed auto-merge Auto merge label for mergify test flight labels Oct 17, 2022
@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Oct 18, 2022
@mergify mergify bot merged commit b0f70cd into sgkit-dev:main Oct 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants