Remove unused max length calculations #932

benjeffery · 2022-10-14T15:22:23Z

I'm working with TB VCF files, some of which have ~2000 alt alleles, requiring a high max_alt_alleles argument to vcf_to_zarr.
It seems VCF performance scales very poorly with this parameter.
Profiling showed that 40% of this time was spent running max in vcf_to_zarr_sequential, as max is called with a max_alt_alleles length array for each variant. However the result of this operation isn't used (as far as I could tell) so this PR removes those lines.

jeromekelleher · 2022-10-17T08:12:13Z

Hmm, suspicious these aren't being used. Maybe we're just not hitting them in the test cases? Defer to @tomwhite here.

benjeffery · 2022-10-17T09:08:14Z

Hmm, suspicious these aren't being used. Maybe we're just not hitting them in the test cases? Defer to @tomwhite here.

The lines removed here are the only references to function-local variables, so unless there is some funky introspection code elsewhere, I don't see how they could be used.

tomwhite · 2022-10-17T10:43:24Z

The lines removed here are the only references to function-local variables, so unless there is some funky introspection code elsewhere, I don't see how they could be used.

There is some introspection code - but it's not working as intended:

https://github.com/pystatgen/sgkit/blob/c7be7534f4db7fbef4e499f5acd4df10562928f8/sgkit/io/vcfzarr_reader.py#L346-L348

Note that it's expecting max_length_variant_id, while the code removed in this PR is max_variant_id_length.

The basic idea behind the code is to find the longest string (for variant IDs and alleles) so that when the zarr files are concatenated, the correct "S" dtype can be set.

It would be great to improve this - ideally so none of this is necessary - but finding a path through that works with strings in NumPy, Zarr, and Xarray is tricky!

I think the first thing to do is to work out why the max_length_variant_id/max_variant_id_length mismatch didn't call a failed test, and add one that does fail to expose the problem.

benjeffery · 2022-10-17T10:54:17Z

max_variant_id_length was never placed in the zarr attributes by this code. But max_length_variant_id is added here:

https://github.com/pystatgen/sgkit/blob/c7be7534f4db7fbef4e499f5acd4df10562928f8/sgkit/io/vcfzarr_reader.py#L260-L272

So maybe that code is working correctly?

benjeffery · 2022-10-17T10:58:14Z

@Mergifyio rebase

mergify · 2022-10-17T10:58:50Z

rebase

✅ Branch has been successfully rebased

tomwhite · 2022-10-17T11:01:14Z

So maybe that code is working correctly?

Another complicating factor is that there are two code paths - one for scikit-allel Zarr files ("vcfzarr", code in vcfzarr_reader.py) and one for converting regular VCF files to Zarr files (code in sgkit/io/vcf). So it looks like the first one may be working, but the second hasn't been working.

tomwhite · 2022-10-17T11:42:00Z

There's quite a bit of history here - the key PR which explains it is #665, and the simplification carried out later in #741 (to fix #678) was to remove the special cases needed to set the fixed string lengths. However, the code removed in this PR was missed there, so I think it's correct to remove it now.

As far as testing is concerned, I'd like to add a test to check that some strings are as expected (i.e. they haven't been inadvertently truncated).

Also, as I mentioned above, we still need the fixed-string-length code for the scikit-allel import case.

codecov-commenter · 2022-10-17T11:53:16Z

Codecov Report

Merging #932 (95f88e1) into main (ee9ccfb) will not change coverage.
The diff coverage is n/a.

@@            Coverage Diff            @@
##              main      #932   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           39        39           
  Lines         3275      3268    -7     
=========================================
- Hits          3275      3268    -7

Impacted Files	Coverage Δ
sgkit/io/vcf/vcf_reader.py	`100.00% <ø> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

tomwhite · 2022-10-17T11:58:13Z

@benjeffery if you are happy with the new test I think this is ready to merge.

benjeffery mentioned this pull request Oct 15, 2022

Vcf perf tweaks #933

Merged

jeromekelleher requested a review from tomwhite October 17, 2022 08:12

Remove unused max length calculations

0455682

hammer force-pushed the remove_unused_max_length branch from e28f34c to 0455682 Compare October 17, 2022 10:58

Add a test to check strings have not been truncated after concat_zarrs

95f88e1

tomwhite mentioned this pull request Oct 17, 2022

Optimise VCF import #936

Open

tomwhite approved these changes Oct 17, 2022

View reviewed changes

jeromekelleher added auto-merge Auto merge label for mergify test flight and removed auto-merge Auto merge label for mergify test flight labels Oct 17, 2022

tomwhite added the auto-merge Auto merge label for mergify test flight label Oct 18, 2022

Merge branch 'main' into remove_unused_max_length

95cd53c

mergify bot merged commit b0f70cd into sgkit-dev:main Oct 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove unused max length calculations #932

Remove unused max length calculations #932

benjeffery commented Oct 14, 2022 •

edited

Loading

jeromekelleher commented Oct 17, 2022

benjeffery commented Oct 17, 2022

tomwhite commented Oct 17, 2022

benjeffery commented Oct 17, 2022 •

edited

Loading

benjeffery commented Oct 17, 2022

mergify bot commented Oct 17, 2022

tomwhite commented Oct 17, 2022

tomwhite commented Oct 17, 2022

codecov-commenter commented Oct 17, 2022 •

edited

Loading

tomwhite commented Oct 17, 2022

Remove unused max length calculations #932

Remove unused max length calculations #932

Conversation

benjeffery commented Oct 14, 2022 • edited Loading

jeromekelleher commented Oct 17, 2022

benjeffery commented Oct 17, 2022

tomwhite commented Oct 17, 2022

benjeffery commented Oct 17, 2022 • edited Loading

benjeffery commented Oct 17, 2022

mergify bot commented Oct 17, 2022

✅ Branch has been successfully rebased

tomwhite commented Oct 17, 2022

tomwhite commented Oct 17, 2022

codecov-commenter commented Oct 17, 2022 • edited Loading

Codecov Report

tomwhite commented Oct 17, 2022

benjeffery commented Oct 14, 2022 •

edited

Loading

benjeffery commented Oct 17, 2022 •

edited

Loading

codecov-commenter commented Oct 17, 2022 •

edited

Loading