
sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations #1059

Open
tnguyengel opened this issue Apr 5, 2023 · 11 comments

@tnguyengel

tnguyengel commented Apr 5, 2023

sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ annotations with error:

ValueError: INFO field 'CSQ' is defined as Number '.', which is not supported.

as tested on sgkit v0.6.0.

Presumably, the method will also fail for any VCFs containing annotations with unbounded size. INFO/CSQ contains variant effect predictions from VEP. There can be multiple predictions for each allele, one for every transcript that an allele overlaps. Each prediction is separated by a comma. The number of predictions per allele is not known in advance, and so the INFO/CSQ field is defined with unbounded size in the header, or "Number=.":

For example:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|...>

It would be very useful to be able to filter a zarr for variants that are deemed clinically relevant according to annotation, such as loss of function variants.

Do you suggest any workarounds in the meantime?

@tnguyengel tnguyengel changed the title sgkit.io.vcf.vcf_reader.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations sgkit.io.vcf.vcf_to_zarr fails to convert VCFs with INFO/CSQ and other unbounded annotations Apr 5, 2023
@tnguyengel tnguyengel changed the title sgkit.io.vcf.vcf_to_zarr fails to convert VCFs with INFO/CSQ and other unbounded annotations sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations Apr 5, 2023
@timothymillar
Collaborator

Hi @tnguyengel, I'm not familiar with INFO/CSQ annotations, but you can specify a maximum bound for a VCF field using the field_defs parameter:

from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "example.vcf.gz",
    "example.zarr",
    fields=["INFO/CSQ"],
    field_defs={"INFO/CSQ": {"Number": 100}},  # or whatever a reasonable maximum is
)

Essentially, the vcf_to_zarr function needs to know the maximum size of each field so that it can allocate space. If the field is unbounded ('.'), then you will need to manually provide an upper bound.

@tnguyengel
Author

Thanks! That sounds like a quick workaround!

From brief googling, a human gene has ~4 transcripts on average, but can have up to ~1000 transcripts. If this is correct, then to ensure that the INFO/CSQ field is not truncated in the zarr, I would need to set:

field_defs={"INFO/CSQ": {"Number": 1000}}

Assuming that a typical variant allele does indeed have ~4 INFO/CSQ entries, but can have up to ~1000 INFO/CSQ entries, would I run the risk of bloating the zarr file size due to the INFO/CSQ field? Or does sgkit use sparse representation to handle fields with many zero/null/empty elements?
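To put rough numbers on that concern: a fixed-size layout reserves Number slots for every variant whether or not they are used, so the uncompressed footprint scales with the chosen bound rather than with the data. The figures below are illustrative assumptions, not measurements of sgkit:

```python
# Rough uncompressed footprint of a fixed-size INFO/CSQ array.
# All figures here are illustrative assumptions, not measurements.
n_variants = 1_000_000    # variants in the VCF
number_bound = 1000       # field_defs Number for INFO/CSQ
avg_entries = 4           # entries actually present per variant, on average
bytes_per_entry = 8       # assumed cost per slot (e.g. one pointer per string)

allocated = n_variants * number_bound * bytes_per_entry
used = n_variants * avg_entries * bytes_per_entry

print(f"allocated:     {allocated / 1e9:.1f} GB")
print(f"used:          {used / 1e9:.3f} GB")
print(f"fill fraction: {1 - used / allocated:.1%}")
```

Note that this is the in-memory shape: Zarr compresses chunks on disk, and long runs of identical fill values compress extremely well, so the stored size is usually far smaller than this arithmetic suggests.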

@jeromekelleher
Collaborator

Good questions @tnguyengel! Any thoughts @tomwhite?

@tomwhite
Collaborator

tomwhite commented Apr 6, 2023

This is a case where fixed-size arrays struggle, I'm afraid. The problem is that in Zarr you can have ragged arrays, or variable-length strings, but not both.

There's more discussion about this in #634 and the linked xarray issue at pydata/xarray#4285 (which seems to have some newer discussion). The good news is that there are lots of people who want to solve this issue.

In the meantime, you can either just store the first n entries (which is lossy), or store all of them (which wastes space).

Another trick is to use the undocumented zarr_array_sizes function to find the maximum number of entries in a VCF. It will parse the VCF twice, but this might be acceptable to find the max number of transcripts in a particular VCF:

from sgkit.io.vcf.vcf_reader import zarr_array_sizes
kwargs = zarr_array_sizes(path)
vcf_to_zarr(path, ..., **kwargs)

@benjeffery
Collaborator

Looks like we could improve the error message here?

@tomwhite
Collaborator

Looks like we could improve the error message here?

I've opened #1064 for this.

Also, @benjeffery reminded me that the zarr_array_sizes function does not run in parallel, so it will be slow on large datasets. There is an open issue to remedy that here: #734

@tomwhite
Collaborator

@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?
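As a sketch of that idea, assuming VEP's JSON output is one JSON object per line with an "id" field and a "most_severe_consequence" field (field names and the set of LoF terms are assumptions to check against your VEP version), a stdlib-only pass could collect the IDs of loss-of-function variants:

```python
import json

# Hypothetical two-record sample of line-delimited VEP JSON output;
# real files have one object per line with many more fields.
lines = [
    '{"id": "rs6054257", "most_severe_consequence": "stop_gained"}',
    '{"id": "rs6040355", "most_severe_consequence": "synonymous_variant"}',
]

# Consequence terms treated as loss-of-function (an assumption;
# adjust to whichever terms your analysis considers LoF).
LOF_TERMS = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}

variant_ids = [
    record["id"]
    for record in map(json.loads, lines)
    if record.get("most_severe_consequence") in LOF_TERMS
]
print(variant_ids)  # → ['rs6054257']
```

The resulting list can then be passed to the isin-based subsetting shown in the next comment.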

@tomwhite
Collaborator

BTW to filter the dataset, you can do something like:

variant_ids = ["rs6054257", "rs6040355"]
ds_filtered = ds.isel(variants=(ds.variant_id.isin(variant_ids)))

@tnguyengel
Author

@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?

Thanks! The annotations were run a while ago and only output to VCF. JSON is admittedly a much easier format to query than VCF. At the moment, we're playing around with split-vep output. If it's not too computationally intensive, re-annotating to JSON might be more attractive.

We'll give the zarr_array_sizes approach a shot too!

@tomwhite
Collaborator

We'll give the zarr_array_sizes approach a shot too!

The parallel version of zarr_array_sizes has just been merged so you might like to try that by installing from the GitHub main branch.

@benjeffery
Collaborator

@tnguyengel Adding a username ping here in case the last message didn't reach your notifications.
