
sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations #1059

Open
tnguyengel opened this issue Apr 5, 2023 · 11 comments

@tnguyengel

tnguyengel commented Apr 5, 2023

sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ annotations with error:

ValueError: INFO field 'CSQ' is defined as Number '.', which is not supported.

as tested on sgkit v0.6.0.

Presumably, the method will also fail for any VCFs containing annotations with unbounded size. INFO/CSQ contains variant effect predictions from VEP. There can be multiple predictions for each allele, one for every transcript that an allele overlaps. Each prediction is separated by a comma. The number of predictions per allele is not known in advance, and so the INFO/CSQ field is defined with unbounded size in the header, or "Number=.":

For example:

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|...>

It would be very useful to be able to filter a zarr for variants that are deemed clinically relevant according to annotation, such as loss of function variants.

Do you suggest any workarounds in the meantime?

@tnguyengel tnguyengel changed the title sgkit.io.vcf.vcf_reader.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations sgkit.io.vcf.vcf_to_zarr fails to convert VCFs with INFO/CSQ and other unbounded annotations Apr 5, 2023
@tnguyengel tnguyengel changed the title sgkit.io.vcf.vcf_to_zarr fails to convert VCFs with INFO/CSQ and other unbounded annotations sgkit.io.vcf.vcf_to_zarr() fails to convert VCFs with INFO/CSQ and other unbounded annotations Apr 5, 2023
@timothymillar
Collaborator

Hi @tnguyengel, I'm not familiar with INFO/CSQ annotations, but you can specify a maximum bound for a VCF field using the field_defs parameter:

from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "example.vcf.gz",
    "example.zarr",
    fields=["INFO/CSQ"],
    field_defs={"INFO/CSQ": {"Number": 100}},  # or whatever a reasonable maximum is
)

Essentially, the vcf_to_zarr function needs to know the maximum size of each field so that it can allocate space. If the field is unbounded ('.'), then you will need to manually provide an upper bound.

@tnguyengel
Author

Thanks! That sounds like a quick workaround!

From brief googling, a human gene has ~4 transcripts on average, but can have up to ~1000 transcripts. If this is correct, then to ensure that the INFO/CSQ field is not truncated in the zarr, I would need to set:

field_defs={"INFO/CSQ": {"Number": 1000}}

Assuming that a typical variant allele does indeed have ~4 INFO/CSQ entries, but can have up to ~1000 INFO/CSQ entries, would I run the risk of bloating the zarr file size due to the INFO/CSQ field? Or does sgkit use sparse representation to handle fields with many zero/null/empty elements?
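To put rough numbers on that concern: a fixed-size layout reserves Number slots for every variant whether or not they are used, so the uncompressed footprint scales with the chosen bound rather than with the data. The figures below are illustrative assumptions, not measurements of sgkit:

```python
# Rough uncompressed footprint of a fixed-size INFO/CSQ array.
# All figures here are illustrative assumptions, not measurements.
n_variants = 1_000_000    # variants in the VCF
number_bound = 1000       # field_defs Number for INFO/CSQ
avg_entries = 4           # entries actually present per variant, on average
bytes_per_entry = 8       # assumed cost per slot (e.g. one pointer per string)

allocated = n_variants * number_bound * bytes_per_entry
used = n_variants * avg_entries * bytes_per_entry

print(f"allocated:     {allocated / 1e9:.1f} GB")
print(f"used:          {used / 1e9:.3f} GB")
print(f"fill fraction: {1 - used / allocated:.1%}")
```

Note that this is the in-memory shape: Zarr compresses chunks on disk, and long runs of identical fill values compress extremely well, so the stored size is usually far smaller than this arithmetic suggests.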

@jeromekelleher
Collaborator

Good questions @tnguyengel! Any thoughts @tomwhite?

@tomwhite
Collaborator

tomwhite commented Apr 6, 2023

This is a case where fixed-size arrays struggle, I'm afraid. The problem is that in Zarr you can have ragged arrays, or variable-length strings, but not both.

There's more discussion about this in #634 and the linked xarray issue at pydata/xarray#4285 (which seems to have some newer discussion). The good news is that there are lots of people who want to solve this issue.

In the meantime, you can either just store the first n entries (which is lossy), or store all of them (which wastes space).

Another trick is to use the undocumented zarr_array_sizes function to find the maximum number of entries in a VCF. It will parse the VCF twice, but this might be acceptable to find the max number of transcripts in a particular VCF:

from sgkit.io.vcf.vcf_reader import zarr_array_sizes
kwargs = zarr_array_sizes(path)
vcf_to_zarr(path, ..., **kwargs)

@benjeffery
Collaborator

Looks like we could improve the error message here?

@tomwhite
Collaborator

Looks like we could improve the error message here?

I've opened #1064 for this.

Also, @benjeffery reminded me that the zarr_array_sizes function does not run in parallel, so it will be slow on large datasets. There is an open issue to remedy that here: #734

@tomwhite
Collaborator

@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?
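As a sketch of that idea, assuming VEP's JSON output is one JSON object per line with an "id" field and a "most_severe_consequence" field (field names and the set of LoF terms are assumptions to check against your VEP version), a stdlib-only pass could collect the IDs of loss-of-function variants:

```python
import json

# Hypothetical two-record sample of line-delimited VEP JSON output;
# real files have one object per line with many more fields.
lines = [
    '{"id": "rs6054257", "most_severe_consequence": "stop_gained"}',
    '{"id": "rs6040355", "most_severe_consequence": "synonymous_variant"}',
]

# Consequence terms treated as loss-of-function (an assumption;
# adjust to whichever terms your analysis considers LoF).
LOF_TERMS = {
    "stop_gained",
    "frameshift_variant",
    "splice_donor_variant",
    "splice_acceptor_variant",
}

variant_ids = [
    record["id"]
    for record in map(json.loads, lines)
    if record.get("most_severe_consequence") in LOF_TERMS
]
print(variant_ids)  # → ['rs6054257']
```

The resulting list can then be passed to the isin-based subsetting shown in the next comment.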

@tomwhite
Collaborator

BTW to filter the dataset, you can do something like:

variant_ids = ["rs6054257", "rs6040355"]
ds_filtered = ds.isel(variants=(ds.variant_id.isin(variant_ids)))

@tnguyengel
Author

@tnguyengel I noticed that VEP allows you to export annotations as a JSON file, so as another way to approach the problem I wondered if you could use that to generate a list of variant IDs (using Pandas, or DuckDB, or ...), then use that list to subset the sgkit dataset?

Thanks! The annotations were run a while ago and only output to VCF. JSON is admittedly a much easier format to query than VCF. At the moment, we're playing around with split-vep output. If it's not too computationally intensive, re-annotating to JSON might be more attractive.

We'll give the zarr_array_sizes approach a shot too!

@tomwhite
Collaborator

We'll give the zarr_array_sizes approach a shot too!

The parallel version of zarr_array_sizes has just been merged so you might like to try that by installing from the GitHub main branch.

@benjeffery
Collaborator

@tnguyengel Adding a username ping here in case the last message didn't reach your notifications.
