Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add concat algorithm parameter to vcf_to_zarr #365 #665

Merged
merged 4 commits into from
Sep 16, 2021

Conversation

tomwhite
Copy link
Collaborator

This is an implementation of #365, which will hopefully fix #661. It switches the concat algorithm to use an optimized memory-efficient version (using the same code as #324). Users don't have to change anything to get reduced memory use.

I've tested it on a smaller dataset (1kg chr22), and it reduces memory use significantly, but it still needs more testing.

As a side effect of the way this is implemented, it also makes addressesing #643 straightforward (and in fact more natural than the special cases needed to set the fixed string lengths). Xarray is effectively bypassed, so it is easy to use Zarr's variable-length strings directly.

I wonder if we should remove the xarray_internal option entirely for that reason. (It's not the default, the optimized concat and rechunk algorithm is, but it's another code path to maintain.)

There are still potentially some race conditions lurking - I think I saw the problem in #486 again while testing locally, however it's very hard to reproduce.

Use variable-length strings for storing alleles in Zarr sgkit-dev#643
@jeromekelleher
Copy link
Collaborator

Haven't thought hard about this @tomwhite, but I'd be happy to get rid of the xarray_internal code path if there isn't a compelling performance reason for keeping it. What's to be gained from keeping it?

@tomwhite
Copy link
Collaborator Author

I'd be happy to get rid of the xarray_internal code path if there isn't a compelling performance reason for keeping it. What's to be gained from keeping it?

There's no performance reason for keeping it (it's slower and causes memory errors). If/when it's fixed upstream it would be straightforward to reintroduce it, so I think we can remove it.

@codecov-commenter
Copy link

codecov-commenter commented Sep 14, 2021

Codecov Report

Merging #665 (c37c77f) into main (19f0b1b) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #665   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           36        36           
  Lines         2886      2913   +27     
=========================================
+ Hits          2886      2913   +27     
Impacted Files Coverage Δ
sgkit/io/vcf/__init__.py 100.00% <100.00%> (ø)
sgkit/io/vcf/vcf_reader.py 100.00% <100.00%> (ø)
sgkit/io/vcfzarr_reader.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19f0b1b...c37c77f. Read the comment docs.

@tomwhite
Copy link
Collaborator Author

I think this can be merged as a first step, with the work to remove the xarray_internal code path done in a follow on PR.

I spent a bit of time running this on a 1GB VCF file (1kg chr2) and it worked fine.

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Merge away

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants