extended cooler-like API for arbitrary data transformations (e.g. O/E) #486

sergpolly · 2024-01-21T04:24:19Z

sergpolly
Jan 21, 2024
Maintainer

Wouldn't it be great if the following were possible?

from cooltools import cooler_ext

# extended cooler-like object that can normalize HiC data seamlessly on the fly, it carries
# underlying cooler and expected_full calculated with the particular genome partitioning all together
clre = cooler_ext( clr, expected_full_df, ...)

region1 = ("chr1", 24_000_000, 45_000_000)
oe_mat = clre.matrix(normalize="balance.avg", sparse=False, as_pixels=False, ...).fetch(region1)
# oe_mat is an on-diagonal observed/expected matrix

region1 = ("chr2", 24_000_000, 45_000_000)
region2 = ("chr1", 124_000_000, 145_000_000)
oe_mat = clre.matrix(normalize="balance.avg", sparse=False, as_pixels=False, ...).fetch(region1, region2)
# oe_mat now is an off-diagonal, potentially inter-arm or even still cis-arm, or both observed/expected matrix

# ... trans, as_pixels, etc etc - should work and behave like a regular cooler - just normalized now !
# regular indexing - crossing chromosome boundaries should be also supported
oe_mat = clre.matrix(normalize="balance.avg")[3000:4000, 6000:8000]
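For the on-diagonal case, the normalization itself could be as small as the sketch below. It assumes a dense square matrix for a single cis region and a per-diagonal expected stored as a 1D array indexed by bin separation; `oe_from_dense` and that array layout are hypothetical, not part of cooler or cooltools.

```python
import numpy as np

def oe_from_dense(obs, expected_by_diag):
    """Divide a square on-diagonal observed matrix by a per-diagonal expected.

    obs: (n, n) array for a single cis region.
    expected_by_diag: 1D array where entry d is the expected value at a
        separation of d bins (assumed layout, at least n entries).
    """
    n = obs.shape[0]
    i, j = np.indices((n, n))
    exp = expected_by_diag[np.abs(i - j)]  # expected value for every pixel
    # bins with zero or missing expected become inf/nan instead of raising
    with np.errstate(divide="ignore", invalid="ignore"):
        return obs / exp

# toy data: observed is exactly twice the expected on every diagonal
expected = np.array([8.0, 4.0, 2.0, 1.0])
separation = np.abs(np.subtract.outer(np.arange(4), np.arange(4)))
obs = 2 * expected[separation]
oe = oe_from_dense(obs, expected)
```

Off-diagonal and trans queries would need the expected looked up per (region1, region2) pair instead, which is where bundling the view with the cooler would help.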

Another potentially nice API related to the one above - either a pileup replacement or addition - pyBBI inspired:

import pandas as pd
from cooltools import cooler_ext

clre = cooler_ext( clr, expected_full_df, ...)

# whatever features - trans, cis, inter-arm - all 100kb by 100kb though
features_df = pd.DataFrame({
    "chrom1": ["chr1", "chr1", "chr4"],
    "start1": [25_000_000, 56_000_000, 10_000_000],
    "end1": [25_100_000, 56_100_000, 10_100_000],
    "chrom2": ["chr1", "chr1", "chr9"],
    "start2": [78_000_000, 176_000_000, 14_000_000],
    "end2": [78_100_000, 176_100_000, 14_100_000],
})

pileup_stack = clre.matrix(normalize=False).fetch_features(features_df)
# pileup_stack is what it should be - a 3D stack of snippets for provided regions !

oe_pileup_stack = clre.matrix(normalize="balance.avg").fetch_features(features_df)
# oe_pileup_stack - 3D stack of observed/expected snippets for provided regions !

# this is inspired by the pybbi API for bigwigs
# why can't cooler pileups be as easy and intuitive as that ?
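To make the idea concrete, here is a toy sketch of what a `fetch_features` could do internally. It assumes a genome-wide dense matrix, equally sized features whose coordinates are multiples of the bin size, and bins that never cross chromosome boundaries; the toy genome and the `to_bins` helper are made up for illustration.

```python
import numpy as np
import pandas as pd

binsize = 100
chromsizes = pd.Series({"chrA": 1000, "chrB": 500})  # toy genome
nbins = np.ceil(chromsizes / binsize).astype(int)
offsets = nbins.cumsum() - nbins  # index of the first bin of each chromosome

def to_bins(chrom, pos):
    """Convert (chrom, position) columns to genome-wide bin indices."""
    return chrom.map(offsets).to_numpy() + pos.to_numpy() // binsize

def fetch_features(mat, features_df):
    """Return a 3D stack with one snippet per bedpe-like row of features_df."""
    lo1 = to_bins(features_df["chrom1"], features_df["start1"])
    hi1 = to_bins(features_df["chrom1"], features_df["end1"])
    lo2 = to_bins(features_df["chrom2"], features_df["start2"])
    hi2 = to_bins(features_df["chrom2"], features_df["end2"])
    return np.stack([mat[a:b, c:d] for a, b, c, d in zip(lo1, hi1, lo2, hi2)])

total = int(nbins.sum())
mat = np.arange(total * total, dtype=float).reshape(total, total)
features_df = pd.DataFrame({
    "chrom1": ["chrA", "chrA"], "start1": [0, 200], "end1": [200, 400],
    "chrom2": ["chrA", "chrB"], "start2": [500, 100], "end2": [700, 300],
})
stack = fetch_features(mat, features_df)  # one cis and one trans snippet
```

A real implementation would fetch only the needed blocks from the sparse cooler rather than slicing a dense genome-wide matrix.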

golobor · 2024-01-21T14:59:23Z

golobor
Jan 21, 2024
Maintainer

Yes, the 1st part, a generic cooler-like API for data transformation, does look nice and makes sense to me!
@nvictus, didn't you have some kind of API proposal for generic transformations of coolers?

Re: 2nd part, cooler_ext.fetch_features: to me, it can be split into two parts:

- make a snipping function that accepts cooler.matrix() or cooler_ext.matrix() object instead of an underlying cooler. This way, we could move the arguments related to balancing and O/E outside of the pileup function and into the matrix transformation part. Personally, I like this idea a lot!
- move pileup() into cooler_ext.fetch_features(features_df, flank, min_diag, nproc). To me, this feels more arguable. As you've formulated, cooler_ext is a generic API for data transformation; snipping feels like completely unrelated functionality, so I'd vote for keeping it out of the hypothetical cooler_ext namespace. But I remain open to further arguments!
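The first part can be sketched roughly as follows: the snipping function only needs something with a `.fetch(region1, region2)` method, so balancing and O/E are decided when that object is built. `ToySelector` is a made-up stand-in for a cooler.matrix()-like selector over a single chromosome, and `snip` is a hypothetical name.

```python
import numpy as np

class ToySelector:
    """Stand-in for a matrix selector; `transform` plays the role of normalization."""

    def __init__(self, mat, binsize, transform=None):
        self.mat = mat
        self.binsize = binsize
        self.transform = transform

    def fetch(self, region1, region2=None):
        region2 = region1 if region2 is None else region2
        (_, s1, e1), (_, s2, e2) = region1, region2
        b = self.binsize
        out = self.mat[s1 // b : e1 // b, s2 // b : e2 // b]
        return out if self.transform is None else self.transform(out)

def snip(selector, features):
    """Cut one snippet per (region1, region2) pair; knows nothing about O/E."""
    return np.stack([selector.fetch(r1, r2) for r1, r2 in features])

mat = np.arange(36, dtype=float).reshape(6, 6)
features = [
    (("chr1", 0, 200), ("chr1", 300, 500)),
    (("chr1", 100, 300), ("chr1", 400, 600)),
]
raw = snip(ToySelector(mat, binsize=100), features)
transformed = snip(ToySelector(mat, binsize=100, transform=np.log1p), features)
```

The same `snip` call serves both cases, which is the point of moving the normalization arguments out of the pileup function.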


sergpolly · 2024-01-21T18:37:14Z

sergpolly
Jan 21, 2024
Maintainer Author

> make a snipping function that accepts cooler.matrix() or cooler_ext.matrix() object instead of an underlying cooler. This way, we could move the arguments related to balancing and O/E outside of the pileup function and into the matrix transformation part. Personally, I like this idea a lot!

Yes, and ultimately we could pass such an extended cooler object around internally in cooltools, because most analysis algorithms require O/E anyway.

This would potentially decouple two things: view_df_1, the genome partitioning used for the expected calculation, and view_df_2, the view used in, for example, saddle, where it intuitively means "I want my saddle done on this subset of chroms or arms", regardless of the genome partitioning that was used for expected. Right now view_df_1 and view_df_2 are forced to be the same thing, which is confusing. @golobor brought it up a long time ago, when we started introducing view_df into everything in cooltools ...

So in other words, this idea supersedes the idea of constructing a separate object/class for expected (that would be view_df aware) with its own API or whatever - scrap that - right? Instead we simply extend the cooler itself by tying together a cooler (matrix data), a view_df (a tiling genome partitioning for expected), and expected_full_df itself. This enables normalized querying and makes it easy to pass such an object around internally in cooltools wherever O/E is needed.

This would work beautifully during a single jupyter-notebook session, but it is less clear how to extend this to the CLI:

- write expected and genome partitioning to the cooler itself, thus extending the cooler HDF5 as well
  - we could make it non-intrusive, such that plain cooler isn't even aware of the extra tables - only cooltools would know
- write expected to a special multi-resolution HDF5 together with its view_df genome partitioning (this would be a new format for storing expected instead of CSV)
- continue using CSV-based expected for storage and for passing around the CLI - we would simply be constructing the cooler_ext object inside the CLI before passing it to the API. However, here we face the issue of a view_df for expected vs. a view_df for restricting analysis ...


sergpolly · 2024-01-21T22:47:08Z

sergpolly
Jan 21, 2024
Maintainer Author

There is a related PR into sandbox, https://github.com/open2c/cooltools/pull/391/files, which enables the full expected calculation for the entire genome (given a genome partition in the provided view_df) - some of this is described in #280 (comment)

The notebook in that PR demonstrates how we can combine expected with that cooler and store both in a cooler-like file. I also just remembered that we did a tiny proof-of-principle demo for 4DN, where we show on-the-fly Obs/Exp using python-higlass: https://github.com/sergpolly/oe-tileset-example - taking bits and pieces from the clodius/higlass-engine behind cooler visualization and modifying them ...


gfudenberg · 2024-01-29T22:05:05Z

gfudenberg
Jan 29, 2024
Maintainer

I like the proposal !

In general, being able to fetch obs/exp matrices for a bioFrame of regions (in some smart way) would be great, as well as for a bedPE-like (bioPairFrame?) set of pixels.


sergpolly · 2024-01-31T21:48:31Z

sergpolly
Jan 31, 2024
Maintainer Author

All of a sudden a generic Observed/Expected fetcher (query engine) is needed for many things, including tests for the current implementation of pileups, some visualizations, etc.

So, I was able to get a very simple fetcher going, still using mostly the public cooler API, i.e. using clr.pixels() to extract data and clr._load_dset("/indexes/bin1_offset") to extract only the relevant rows of the matrix. It only does the upper part of the matrix, without filling the lower part yet, and the API for now looks like this:

# instantiate CoolerExt object - bundle cooler, expected and view together
cext = CoolerExt(
  "test.cool",  # regular cooler URI
  expected = expected_df,  # conforming https://github.com/open2c/cooltools/issues/280
  view_df = view_arms_df,  # non-overlapping genome partitioning to regions
)

# now we can start fetching
# dense matrices
cext.normalized_matrix()[50:100, 50:100]
cext.normalized_matrix().fetch("chr1", "chr2")

# sparse matrices
cext.normalized_matrix(sparse=True).fetch("chr1")

# as pixels (but observed over expected of course !)
cext.normalized_matrix(as_pixels=True).fetch("chr1")

# underlying cooler (observed data !) can be still accessed as
cext.matrix().fetch("chr3")

# all of the cooler's methods/properties are accessible via cext still

corresponding jupyter walkthrough is here https://gist.github.com/sergpolly/2cb54bf77497823d2b69e73b12f6526a
This "solution" is based on inheritance - i.e. class CoolerExt(Cooler) - which makes cext have all of the cooler's methods and properties (+ the normalized_matrix fetch on top) - but perhaps this is "too much" ...

Maybe we should try composition instead - i.e. have CoolerExt(object) that simply bundles expected and a cooler - this one would change API this way:

# same way of accessing normalized data (observed/expected)
cext.normalized_matrix()[50:100, 50:100]

# BUT underlying cooler (observed data !) is accessed like so
cext.clr.matrix().fetch("chr3")

# underlying cooler/clr methods are no longer accessible via cext
# but can be rewritten/modified

(!) Also - we need to take into account that cooler_ext is defined within the view_df only ! i.e. it is typically narrower than the cooler/observed itself - things like n_bins in the selector are affected
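A bare-bones sketch of the composition variant, to show what "bundles" means here. It is not the implementation from the gist: the whitelist-based forwarding in `__getattr__` is just one hypothetical way to re-expose selected cooler attributes, and `normalized_matrix` is left as a stub.

```python
from types import SimpleNamespace

class CoolerExt:
    """Bundle a cooler-like object with expected and a view (composition)."""

    # cooler attributes we choose to re-expose (assumed whitelist)
    _forwarded = frozenset({"binsize", "chromnames"})

    def __init__(self, clr, expected, view_df):
        self.clr = clr
        self.expected = expected
        self.view_df = view_df

    def normalized_matrix(self, **kwargs):
        raise NotImplementedError("O/E fetching is outside the scope of this sketch")

    def __getattr__(self, name):
        # only reached when normal attribute lookup fails
        if name in self._forwarded:
            return getattr(self.clr, name)
        raise AttributeError(name)

# SimpleNamespace stands in for a real cooler.Cooler here
fake_clr = SimpleNamespace(binsize=1000, chromnames=["chr1"], info={"nbins": 10})
cext = CoolerExt(fake_clr, expected=None, view_df=None)

binsize = cext.binsize           # forwarded explicitly
nbins = cext.clr.info["nbins"]   # everything else goes through .clr
hidden = hasattr(cext, "info")   # not forwarded
```

With this layout, anything that depends on the view (e.g. the number of bins seen by the selector) can be overridden on `CoolerExt` without fighting inherited cooler behavior.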


gfudenberg · 2024-03-04T20:25:44Z

gfudenberg
Mar 4, 2024
Maintainer

from March 4 discussion:

Should tools (snippers, dots) that accept an expected also allow an expected table without region columns to be passed, which would be understood as the same expected across all regions?
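One way this could work, as a sketch only: a small helper that broadcasts a region-less expected table to every cis region of the view. The helper name `expand_expected` is hypothetical; the `region1`/`region2` columns and the `name` column of the view are assumed to follow the usual cooltools/bioframe conventions.

```python
import pandas as pd

def expand_expected(expected_df, view_df):
    """Return an expected table that always has region1/region2 columns.

    If the region columns are missing, the same expected is assumed to
    apply to every cis region in view_df (region1 == region2).
    """
    if {"region1", "region2"}.issubset(expected_df.columns):
        return expected_df
    regions = pd.DataFrame({"region1": view_df["name"], "region2": view_df["name"]})
    # cross join: every region gets a full copy of the expected rows
    return regions.merge(expected_df, how="cross")

view_df = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [0, 500],
    "end": [500, 1000],
    "name": ["chr1_p", "chr1_q"],
})
expected_df = pd.DataFrame({"dist": [0, 1, 2], "balanced.avg": [8.0, 4.0, 2.0]})
expanded = expand_expected(expected_df, view_df)
```

Tools could call such a helper on input, so their internals only ever deal with the region-aware format.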


gfudenberg · 2024-05-10T18:18:59Z

gfudenberg
May 10, 2024
Maintainer

@WhittakerWave has some promising results that the sparse eigendecomposition from @sergpolly can be sped up by a factor of ~50x if the obsexp pixel values are computed once instead of on the fly.

Is https://github.com/open2c/cooltools/blob/master/cooltools/sandbox/observed_over_expected_example.ipynb still the best example of precomputing an obs/exp cooler?

cc @nvictus @golobor
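For reference, "computed once" could look like the sketch below: attach an O/E column to the pixel table in a single vectorized pass and reuse it afterwards. It assumes a cis pixel table with cooler-style `bin1_id`/`bin2_id` columns and a balanced value column, plus a per-diagonal expected array; `add_oe_column` is a hypothetical helper, and this toy says nothing about the actual speedup.

```python
import numpy as np
import pandas as pd

def add_oe_column(pixels, expected_by_diag, value_col="balanced"):
    """Return a copy of a cis pixel table with a precomputed 'oe' column.

    expected_by_diag[d] is the expected value at a separation of d bins
    (assumed layout for a single region).
    """
    dist = (pixels["bin2_id"] - pixels["bin1_id"]).to_numpy()
    out = pixels.copy()
    out["oe"] = out[value_col].to_numpy() / expected_by_diag[dist]
    return out

pixels = pd.DataFrame({
    "bin1_id": [0, 0, 1],
    "bin2_id": [0, 2, 2],
    "balanced": [4.0, 1.0, 6.0],
})
expected_by_diag = np.array([2.0, 3.0, 0.5])
oe_pixels = add_oe_column(pixels, expected_by_diag)
```

Iterative algorithms such as a sparse eigendecomposition can then read the `oe` column directly instead of redoing the division on every pass.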
