Remove unused max length calculations #932
Hmm, suspicious these aren't being used. Maybe we're just not hitting them in the test cases? Defer to @tomwhite here.
The lines removed here are the only references to function-local variables, so unless there is some funky introspection code elsewhere, I don't see how they could be used.
There is some introspection code - but it's not working as intended: Note that it's expecting
The basic idea behind the code is to find the longest string (for variant IDs and alleles) so that when the zarr files are concatenated, the correct "S" dtype can be set. It would be great to improve this - ideally so none of this is necessary - but finding a path through that works with strings in NumPy, Zarr, and Xarray is tricky! I think the first thing to do is to work out why the
So maybe that code is working correctly?
@Mergifyio rebase
✅ Branch has been successfully rebased
e28f34c to 0455682
Another complicating factor is that there are two code paths - one for scikit-allel Zarr files ("vcfzarr", code in vcfzarr_reader.py) and one for converting regular VCF files to Zarr files (code in sgkit/io/vcf). So it looks like the first one may be working, but the second hasn't been.
There's quite a bit of history here - the key PR which explains it is #665, and the simplification carried out later in #741 (to fix #678) removed the special cases needed to set the fixed string lengths. However, the code removed in this PR was missed there, so I think it's correct to remove it now. As far as testing is concerned, I'd like to add a test to check that some strings are as expected (i.e. they haven't been inadvertently truncated). Also, as I mentioned above, we still need the fixed-string-length code for the scikit-allel import case.
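For readers following along, the truncation risk with fixed-length strings can be sketched in plain NumPy. This is only an illustration, not the sgkit or scikit-allel code: `max_str_len` and `to_fixed_length` are hypothetical helpers, and the variant IDs are made up.

```python
import numpy as np


def max_str_len(parts):
    """Longest encoded length across several lists of strings (hypothetical helper)."""
    return max(len(s.encode("utf-8")) for part in parts for s in part)


def to_fixed_length(parts, length):
    """Convert each part to a fixed-length bytes ("S") array of the given length."""
    return [np.array(part, dtype=f"S{length}") for part in parts]


# Two made-up partitions of variant IDs, as if read from separate Zarr files.
part_a = ["rs1", "rs22"]
part_b = ["rs333333"]

# A length chosen from the first partition only silently truncates the second.
truncated = np.concatenate(to_fixed_length([part_a, part_b], 4))

# A length taken over all partitions keeps every string intact.
length = max_str_len([part_a, part_b])
full = np.concatenate(to_fixed_length([part_a, part_b], length))
```

A test along the lines suggested above would compare the stored values against the known input strings, which catches the silent truncation in the first case.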
Codecov Report
@@            Coverage Diff            @@
##              main      #932   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           39        39
  Lines         3275      3268     -7
=========================================
- Hits          3275      3268     -7
@benjeffery, if you are happy with the new test, I think this is ready to merge.
I'm working with TB VCF files, some of which have ~2000 alt alleles, requiring a high `max_alt_alleles` argument to `vcf_to_zarr`. It seems VCF performance scales very poorly with this parameter.
Profiling showed that 40% of this time was spent running `max` in `vcf_to_zarr_sequential`, as `max` is called with a `max_alt_alleles`-length array for each variant. However, the result of this operation isn't used (as far as I could tell), so this PR removes those lines.
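The pattern described above can be sketched as follows. This is a simplified stand-in, not the actual `vcf_to_zarr_sequential` code: `fill_alleles` and its `track_max_len` flag are hypothetical, and the variants are made up. It shows a per-variant `max` over a row of `max_alt_alleles + 1` slots whose result does not affect the stored alleles.

```python
import numpy as np


def fill_alleles(variants, max_alt_alleles, track_max_len):
    """Store alleles in a padded object array (hypothetical stand-in).

    When track_max_len is True, the longest allele string is also tracked by
    scanning the full padded row for every variant.
    """
    width = max_alt_alleles + 1  # one slot for REF plus the alt alleles
    out = np.full((len(variants), width), "", dtype="O")
    longest = 0
    for i, alleles in enumerate(variants):
        alleles = alleles[:width]
        out[i, : len(alleles)] = alleles
        if track_max_len:
            # Scans all `width` slots, even when only a few are filled.
            longest = max(longest, max(len(a) for a in out[i]))
    return out, longest


variants = [["A", "T"], ["G", "C", "AT"], ["T"]]
with_max, longest = fill_alleles(variants, 2000, track_max_len=True)
without_max, _ = fill_alleles(variants, 2000, track_max_len=False)
```

In this sketch the stored array is identical either way, so the scan only adds work proportional to `max_alt_alleles` for each variant.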