Investigate ragged arrays to represent alleles #634

tomwhite opened this issue Jul 19, 2021 · 6 comments

@tomwhite
Collaborator

Alleles are a challenge to represent efficiently in fixed-length arrays. There are a couple of problems:

  1. the number of alleles is not known until the whole VCF file has been processed
  2. there can be very wide variation in the number of alt alleles (most variants will have one, but a few could have thousands)

Both these problems could be solved by using ragged arrays.

Zarr has support for ragged arrays, but these don't currently work with variable length strings (needed for alleles), and they don't fit the Xarray data model, which assumes fixed sized dimensions. There is a good discussion of the problem in pydata/xarray#4285, in the context of Awkward Array.
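
For concreteness, a minimal sketch of what Zarr's ragged-array support looks like today via numcodecs object codecs (zarr v2 API): VLenArray handles variable-length numeric lists and VLenUTF8 handles variable-length strings, but there is no codec for ragged arrays of strings, which is what alleles would need.

import numpy as np
import zarr
import numcodecs

# Ragged array of variable-length *numeric* lists: supported via VLenArray.
genotypes = np.empty(3, dtype=object)
genotypes[:] = [[0, 1], [0], [0, 1, 2, 3]]
z = zarr.array(genotypes, dtype=object, object_codec=numcodecs.VLenArray("i4"))

# Variable-length strings (one string per element): supported via VLenUTF8.
s = zarr.array(np.array(["A", "TTG", "C"], dtype=object), dtype=object,
               object_codec=numcodecs.VLenUTF8())

# But a ragged array of *strings* (a variable-length list of allele strings
# per variant) has no corresponding codec, and object-dtype arrays don't fit
# the Xarray model of fixed-size dimensions anyway.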

@tomwhite tomwhite added the data representation label on Jul 19, 2021
@jeromekelleher
Collaborator

Tricky one. I guess an option we should also consider is a home-grown ragged array encoding for alleles (although it would be pretty icky to expose users to this). Could we (in principle) do an encoding ourselves and expose the alleles dataset variable as a numpy object array (or similar)?

@tomwhite
Collaborator Author

Could we (in principle) do an encoding ourselves and expose the alleles dataset variable as a numpy object array (or similar)?

Yes, I think it would be worth trying this out to see if it's possible, and how it works in Xarray.
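
As a quick feasibility check, a minimal sketch (the names here are just illustrative, not the actual sgkit representation) of exposing ragged alleles as a numpy object array inside an Xarray dataset; the ragged dimension disappears and only a single variants dimension remains:

import numpy as np
import xarray as xr

# one variable-length tuple of allele strings per variant
alleles = np.empty(3, dtype=object)
alleles[:] = [("A", "T"), ("G",), ("C", "CA", "CAA")]

ds = xr.Dataset({"variant_allele": ("variants", alleles)})
print(ds["variant_allele"].values[2])  # ('C', 'CA', 'CAA')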

@jeromekelleher
Collaborator

FWIW we've been doing this in tskit like this. I think the underlying encoding (a data and offset array) is basically the same as what Arrow uses, and it works well in practice.
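
For reference, a minimal sketch of that data-plus-offsets idea (my own illustration, not tskit's actual helpers): string i lives at data[offsets[i]:offsets[i+1]], so the whole ragged column is just two fixed-dtype numpy arrays.

import numpy as np

def pack_strings(strings):
    # flat byte buffer plus offsets (Arrow-style); assumes ASCII allele strings
    offsets = np.zeros(len(strings) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(s) for s in strings])
    data = np.frombuffer("".join(strings).encode("ascii"), dtype=np.uint8)
    return data, offsets

def unpack_string(data, offsets, i):
    return data[offsets[i]:offsets[i + 1]].tobytes().decode("ascii")

data, offsets = pack_strings(["A", "TTG", "", "CAA"])
print(unpack_string(data, offsets, 1))  # TTG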

@ravwojdyla
Collaborator

FYI @tomwhite. Following today's "standup", "open sourcing" some of our internal comments about awkward array. These are likely outdated (they were made about a year ago) and might not fully apply in the sgkit context, but they might still be useful.

From @eric-czech:

There aren't any docs for awkward-1.0 yet, just empty stubs at https://awkward-array.org, so I looked at the original project instead. Here are a few questions I wanted to answer:

  • How do we parallelize with awkward?

It looks like dask.Bag (or even just delayed) wrapping awkward would make the most sense. I was hoping initially that it would be possible for awkward to act on dask arrays, but it just converts everything to numpy afaict:

import pyarrow, awkward
import dask.array as da
arrow_buffer = pyarrow.ipc.open_file(open("tests/samples/exoplanets.arrow", "rb")).get_batch(0)
stars = awkward.fromarrow(arrow_buffer)
# assigning a dask array as a new column: awkward materializes it as numpy
stars['id'] = da.arange(len(stars), chunks=3)
type(stars['id']) # -> numpy.ndarray

That would then mean that we'd have to use the awkward API within partitions, which would be like using Pandas without Dask.DataFrame. Have you found a deeper way to integrate the two @ravwojdyla? This looks promising in 1.0: https://awkward-array.org/how-to-create-lazy-dask.html but I'm not sure what to expect once they add anything there.
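
To make the per-partition idea concrete, a minimal sketch (using the awkward 1.x API, i.e. ak.Array / ak.num; the 0.x names differ) where the ragged work happens inside each delayed task and only plain lists cross task boundaries:

import awkward as ak  # awkward 1.x naming; 0.x uses JaggedArray etc.
import dask

@dask.delayed
def count_alt_alleles(records):
    # build the ragged array inside the partition and reduce it there
    alleles = ak.Array(records)
    return ak.to_list(ak.num(alleles) - 1)

partitions = [
    [["A", "T"], ["G", "C", "CA"]],
    [["T"], ["C", "G"]],
]
print(dask.compute(*[count_alt_alleles(p) for p in partitions]))
# ([1, 2], [0, 1])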

  • How do we infer the schema for a table stored as json objects in partitioned files?

I'm not sure on this one -- it looks like we can get a layout for one file, but I don't know how we'd merge this across partitions:

import pyarrow, awkward
arrow_buffer = pyarrow.ipc.open_file(open("tests/samples/exoplanets.arrow", "rb")).get_batch(0)
stars = awkward.fromarrow(arrow_buffer)
stars.layout
 layout 
[           ()] Table(dec=layout[0], dist=layout[1], mass=layout[2], name=layout[3], planets=layout[4], ra=layout[5], radius=layout[6])
[            0]   ndarray(shape=2935, dtype=dtype('float64'))
[            1]   ndarray(shape=2935, dtype=dtype('float64'))
[            2]   ndarray(shape=2935, dtype=dtype('float64'))
[            3]   StringArray(content=layout[3, 0], generator=<function tostring at 0x11510bd30>, args=(<function decode at 0x1011ae9d0>,), kwargs={})
[         3, 0]     JaggedArray(starts=layout[3, 0, 0], stops=layout[3, 0, 1], content=layout[3, 0, 2])
[      3, 0, 0]       ndarray(shape=2935, dtype=dtype('int32'))
[      3, 0, 1]       ndarray(shape=2935, dtype=dtype('int32'))
...

I imagine there's a function somewhere in the API that would let us get the schema like the one that is saved in HDF5 metadata, but I can't find it.

import h5py, json
# write the awkward array to HDF5 (plain numpy arrays plus reconstruction metadata)
f = h5py.File("stars.hdf5", "w")
g = awkward.hdf5(f)
g['stars'] = stars
f.close()
# read back the schema that awkward stores alongside the arrays
f = h5py.File("stars.hdf5", "r")
json.loads(f["stars"]["schema.json"][:].tostring())['schema']
{'call': ['awkward', 'Table', 'frompairs'],
 'args': [{'pairs': [['dec',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '1'}, {'dtype': 'float64'}, {'json': 2935, 'id': 2}],
      'id': 1}],
    ['dist',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '3'}, {'dtype': 'float64'}, {'json': 2935, 'id': 4}],
      'id': 3}],
    ['mass',
     {'call': ['awkward', 'numpy', 'frombuffer'],
      'args': [{'read': '5'}, {'dtype': 'float64'}, {'json': 2935, 'id': 6}],
      'id': 5}],
...
  • How can we save nested tables?

    • You can't do it with parquet
    • You can with awkd, HDF5 or arrow buffers
      • awkd is a zip archive of columns as binary blobs (zlib compressed, though it looks like there are ways to use other compressors)
      • HDF5 serialization uses the same protocol as pickling
      • Note on how HDF5 is stored:

      The HDF5 format does not include columnar representations of arbitrary nested data, as awkward-array does, so what we're actually storing are plain Numpy arrays and the metadata necessary to reconstruct the awkward array

      • Arrow buffers can't be compressed since they're only meant for ephemeral communication / memory mapping
  • How do we join two tables?

Dask.Bag.join looks like a possibility, except that Bag is intended to operate on individual records, so I'm not sure how awkward wrapping batches of records would fit in. One possibility could be to call .tolist() on the awkward arrays, then Bag.flatten, and then use the Bag API functions as normal. A rough sketch of that is below.
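
Something along these lines might work once the records are plain dicts (field names are hypothetical, and I haven't checked how well Bag.join scales):

import dask.bag as db

# records already converted with .tolist() and flattened into plain dicts
stars = db.from_sequence([{"star_id": 1, "mass": 1.0}, {"star_id": 2, "mass": 0.8}])
planets = [{"star_id": 1, "name": "b"}, {"star_id": 1, "name": "c"}]

# join on a key function; yields pairs of matching records
joined = stars.join(planets, lambda r: r["star_id"], lambda r: r["star_id"])
print(joined.compute())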

  • How do we do something like explode an inner table, group by something in that table, and produce a count?

I think this too (like joins) only appears to be possible by using awkward within partitions, either to directly create lists of Python objects or to use Pandas as an intermediary, generating individual dict objects or pandas dataframes that we then process with dask.Bag or dask.DataFrame.
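
A minimal sketch of that explode/group/count pattern with plain dicts and dask.Bag (field names are hypothetical):

import dask.bag as db

# nested records as plain dicts, e.g. produced by .tolist() inside partitions
records = db.from_sequence([
    {"star": "s1", "planets": [{"method": "transit"}, {"method": "rv"}]},
    {"star": "s2", "planets": [{"method": "transit"}]},
])

counts = (
    records
    .map(lambda r: r["planets"])   # explode the inner table...
    .flatten()
    .map(lambda p: p["method"])    # ...group by a field...
    .frequencies()                 # ...and count
)
print(sorted(counts.compute()))  # [('rv', 1), ('transit', 2)]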


From me:

I found a bit more documentation in the jupyter notebooks here.

Regarding lazy arrays and Dask integration, I believe they are referring to the VirtualArray and PartitionedArray, which I believe you can see used with Dask here, but even then, afaiu, we still need to operate on the awkward array at the partition level (unless we further integrate it into Dask).

Regarding groupby/join: looking at how reducers were implemented in awkward, I am not sure we would be able to use awkward as a first-class citizen in Dask without, as you are saying, converting down to list/dict/etc.

@tomwhite
Collaborator Author

A different way to approach this problem is to export the fields that need ragged string arrays to a different storage backend, such as Parquet, then use tools to query that dataset to produce a list of IDs that can be used to restrict the array data in Zarr. (I suggested this in #1059.)

I've explored this a bit more in this notebook: https://github.com/pystatgen/sgkit/blob/17a24cf8ad755bb499b3a4a0caeca15390ef1a7e/ragged.ipynb

And I've also written a small prototype to export a couple of VCF fields to Parquet in pystatgen/sgkit@517a67e. This could be easily extended to export all VCF fixed and INFO fields, which are the ones that would be useful for filtering.
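
To sketch the filtering workflow (the paths, column name, and quality threshold below are hypothetical): query the Parquet export for the variants of interest, then use the resulting positional indices to restrict the Zarr-backed dataset.

import pandas as pd
import xarray as xr

# query the exported fixed/INFO fields in Parquet; assumes rows are written
# in the same order as the variants dimension of the Zarr store
fields = pd.read_parquet("variants.parquet", columns=["variant_quality"])
idx = fields.index[fields["variant_quality"] > 30].to_numpy()

# restrict the (large) array data in Zarr to just those variants
ds = xr.open_zarr("calls.zarr")
subset = ds.isel(variants=idx)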

Not sure what to do with this yet, just sharing as it might be useful.

@tomwhite
Collaborator Author

tomwhite commented May 4, 2023

I've generalised the Parquet export code in #1083
