Use variable-length strings for storing alleles in Zarr #643

Open
tomwhite opened this issue Jul 26, 2021 · 3 comments
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc

Comments

@tomwhite
Collaborator

Currently we use fixed-length strings for storing alleles, but this is inefficient since the length is the size of the longest allele in the whole dataset.

For example, in some 1000 Genomes data (chr22) I noticed that the longest allele is 414 base pairs, which means the data type is "S414", unnecessarily large for the vast majority of variants. As a result, the variant_allele data took up 15 MB (compressed), rather than roughly one tenth of that if a variable-length encoding were used (as scikit-allel does).
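
To make the size difference concrete, here is a minimal NumPy sketch. The allele list is made up for illustration (it is not the 1000 Genomes data), and the numbers are uncompressed sizes, so they will not match the compressed figures quoted above.

```python
import numpy as np

# Hypothetical alleles: mostly single bases plus one 414 bp insertion.
alleles = [b"A", b"C", b"G", b"T"] * 250 + [b"A" * 414]

# NumPy infers a fixed-width bytes dtype sized to the longest element,
# so every entry is padded to 414 bytes.
fixed = np.array(alleles)
fixed_nbytes = fixed.nbytes  # itemsize * number of alleles

# An object array keeps the original bytes objects, so the payload is
# the sum of the actual allele lengths (ignoring per-object overhead).
variable = np.array(alleles, dtype=object)
variable_payload = sum(len(a) for a in variable)

print(fixed.dtype, fixed_nbytes, variable_payload)
```

Padding bytes compress well, which is presumably why the compressed gap reported above is about 10x rather than the much larger uncompressed ratio seen here.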

It would be worth investigating how we could use Zarr's variable-length strings (the number of alt alleles would remain fixed though, in contrast to #634). In particular, there may be some work to get this representation to work nicely with xarray.

tomwhite added the data representation label Jul 26, 2021
@timothymillar
Collaborator

@tomwhite the freebayes variant caller is a good example of a tool that produces a range of allele lengths. The VCFs available here (see releases) were called with freebayes and have allele lengths of up to ~60 bp. Note that these were called with --haplotype-length -1 in an attempt to minimize variant length.

@tomwhite
Collaborator Author

> It would be worth investigating how we could use Zarr's variable-length strings

In the case of writing Zarr sequentially from VCF, we already do this (it can be enforced by passing target_part_size=None to vcf_to_zarr). The parallel case concatenates a collection of Zarr files, and as a byproduct of that the variable-length strings are converted to fixed-length strings. We should change that.
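
As a rough illustration of the kind of dtype change described here, the NumPy sketch below shows how variable-length (object) string arrays turn into fixed-length ones once something casts them to a NumPy string dtype. This is only an assumption about the general mechanism; it is not taken from the sgkit or xarray code paths, and the arrays are hypothetical.

```python
import numpy as np

# Two hypothetical parts, each holding alleles as variable-length
# (object dtype) strings.
part1 = np.array(["A", "C"], dtype=object)
part2 = np.array(["G", "ACGTACGT"], dtype=object)

# Concatenating object arrays keeps the object dtype.
combined = np.concatenate([part1, part2])

# Casting to a NumPy string dtype pads every element to the width of
# the longest one, giving a fixed-length array.
as_fixed = combined.astype(str)

print(combined.dtype, as_fixed.dtype)
```

Checking whether `dtype` is `object` or a fixed-width kind such as `S` or `U` is a quick way to see which representation an array ended up with.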

@tomwhite
Collaborator Author

tomwhite commented Sep 7, 2021

I'm not sure how to fix this, so I opened pydata/xarray#5769 with a minimal example.

tomwhite added a commit to tomwhite/sgkit that referenced this issue Sep 13, 2021
Use variable-length strings for storing alleles in Zarr sgkit-dev#643
mergify bot pushed a commit that referenced this issue Sep 16, 2021
Use variable-length strings for storing alleles in Zarr #643