Add concat algorithm parameter to vcf_to_zarr #365 #665
This is an implementation of #365, which will hopefully fix #661. It switches the concat algorithm to use an optimized memory-efficient version (using the same code as #324). Users don't have to change anything to get reduced memory use.
I've tested it on a smaller dataset (1kg chr22), and it reduces memory use significantly, but it still needs more testing.
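For readers unfamiliar with the idea, here is a minimal sketch of why a streaming concat-and-rechunk uses less memory. It is not the code from this PR or #324; the function name `concat_and_rechunk` and the use of plain Python lists are assumptions made purely for illustration. The point is that only one output chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the total length of the concatenated data.

```python
from typing import Iterable, Iterator, List, Sequence


def concat_and_rechunk(parts: Iterable[Sequence], chunk_size: int) -> Iterator[List]:
    """Yield fixed-size chunks from a stream of variable-size parts.

    Hypothetical helper for illustration only. At most one output buffer
    (up to chunk_size items) is held at a time, so peak memory does not
    grow with the total length of the input.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    buffer: List = []
    for part in parts:
        for item in part:
            buffer.append(item)
            if len(buffer) == chunk_size:
                yield buffer      # hand off a full chunk
                buffer = []       # start a fresh buffer
    if buffer:
        yield buffer              # final, possibly short, chunk


# Three unevenly sized parts are rewritten as chunks of four items.
parts = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
chunks = list(concat_and_rechunk(parts, chunk_size=4))
```

In a real pipeline each yielded chunk would be written to the output store straight away rather than collected into a list as done here.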
As a side effect of the way this is implemented, it also makes addressing #643 straightforward (and in fact more natural than the special cases needed to set the fixed string lengths). Xarray is effectively bypassed, so it is easy to use Zarr's variable-length strings directly.
I wonder if we should remove the `xarray_internal` option entirely for that reason. (It's not the default; the optimized concat and rechunk algorithm is. But it's another code path to maintain.)

There may still be some race conditions lurking: I think I saw the problem in #486 again while testing locally, but it's very hard to reproduce.