Duckdb query engine #10

Merged
merged 45 commits into from
Feb 22, 2024
8e6032f
added basic interface for query engine
Mittmich Nov 19, 2023
3e8c4f1
Added explanations
Mittmich Nov 19, 2023
942e946
added query step protocol to query engine interface
Mittmich Nov 21, 2023
28fad15
Update duckdb version to 0.9.1
Mittmich Nov 21, 2023
6992d26
Add support for DuckDBPyRelation in dataframe
Mittmich Nov 21, 2023
96cf7f7
added examples to query engine interface
Mittmich Nov 24, 2023
bf5c53e
Refactor RegionFilter to Snipper
Mittmich Nov 26, 2023
e965eeb
Add GenomicDataSchema protocol and implement
Mittmich Nov 26, 2023
d990363
added single region selection implementation
Mittmich Dec 8, 2023
c06a7f4
Add anchor validation in Snipper class
Mittmich Dec 8, 2023
a45b1e5
blackify
Mittmich Dec 8, 2023
5d5347f
Refactor code and fix imports
Mittmich Dec 8, 2023
7d9cfd6
pixels snipping tests passing
Mittmich Dec 8, 2023
f056b3a
Refactor code formatting
Mittmich Dec 8, 2023
bef26e6
pylint and black
Mittmich Dec 8, 2023
a16f6e6
added documentation for snipping
Mittmich Dec 27, 2023
dac123f
updated documentation
Mittmich Dec 27, 2023
9c335d2
renamed region columns
Mittmich Dec 28, 2023
17f8ad8
untested offset transformation
Mittmich Dec 28, 2023
59b9765
regionoffset working
Mittmich Dec 29, 2023
fb58f9b
Remove unused imports and fixtures
Mittmich Dec 29, 2023
8f3be96
Refactor imports and fix formatting issues
Mittmich Dec 29, 2023
e1376b5
Update imports and ignore additional flake8 warnings
Mittmich Dec 29, 2023
540fc43
Add tests for offset aggregation functions in query engine
Mittmich Jan 3, 2024
be90c4f
added additional test for column name
Mittmich Jan 3, 2024
fb543ed
added tests for aggregation on dense input
Mittmich Jan 3, 2024
8add40d
added addiitonal tests for offsetaggregation
Mittmich Jan 7, 2024
50e0a83
Add region_number attribute to QueryStepDataSchema and ContactSchema;…
Mittmich Jan 13, 2024
9249542
added documentation
Mittmich Jan 21, 2024
430630b
updated documenatation page
Mittmich Jan 25, 2024
37cd211
added use case for contact counting
Mittmich Jan 28, 2024
cf0672e
implemented PR changes
Mittmich Feb 4, 2024
a073331
added duckdb parquet reader function
Mittmich Feb 4, 2024
93aeb22
added check for data mode
Mittmich Feb 4, 2024
92b24f1
Merge branch 'duckdb-transform-aggregation' into duckdb-query-engine
Mittmich Feb 4, 2024
b738f8e
added case whne regions are nested
Mittmich Feb 4, 2024
c58a7b2
renamed BasicQuery, QueryResult and Snipper
Mittmich Feb 4, 2024
e679062
fixed typos in query_engine_interface
Mittmich Feb 11, 2024
da256e1
code review comments for uery engine usage
Mittmich Feb 11, 2024
0218326
added string parsing for enums
Mittmich Feb 11, 2024
555e64e
added capability to specify a subset of positions for offsetaggregation
Mittmich Feb 11, 2024
a5c60b1
added explicit rejection of contacts to offsetaggregation
Mittmich Feb 20, 2024
2582252
added expliict windowsize handling
Mittmich Feb 20, 2024
6b2ad66
renamed window size
Mittmich Feb 21, 2024
94c8825
Renamed offset to distance
Mittmich Feb 21, 2024
521 changes: 367 additions & 154 deletions docs/data_structures.md

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions docs/query_engine.md
@@ -0,0 +1,3 @@
# Query Engine

::: spoc.query_engine
199 changes: 199 additions & 0 deletions docs/query_engine_interface.md
@@ -0,0 +1,199 @@
# Interface query

This document defines the high-level interface of the spoc query engine.

## Class relationships

```mermaid
classDiagram

class gDataProtocol{
<<Protocol>>
+DataFrame data
+get_schema(): gDataSchema
}

class gDataSchema{
<<Protocol>>
+get_position_coordinates(): Mapping~int~:~str~
}

class QueryStep{
<<Protocol>>
+validate_schema(schema)
-__call__() gDataProtocol
}

class BasicQuery
BasicQuery: +List~QueryStep~ query_plan
BasicQuery: +compose_with(BasicQuery) BasicQuery
BasicQuery: +query(gDataProtocol)

BasicQuery --> QueryResult : query result
BasicQuery "1" --* "*" Snipper
BasicQuery "1" --* "*" Aggregation
BasicQuery "1" --* "*" Transformation
gDataProtocol --> BasicQuery : takes input
gDataProtocol ..|> Pixels : realization
gDataProtocol ..|> Contacts : realization
gDataProtocol ..|> QueryResult : realization
QueryStep ..|> Snipper
QueryStep ..|> Transformation
QueryStep ..|> Aggregation


class Snipper {
+pd.DataFrame~Regions~
+String anchor_mode
+validate_schema(schema)
-__call__(data) gDataProtocol
}


class Aggregation{
+validate_schema(schema)
-__call__() gDataProtocol
}

class Transformation{
+validate_schema(schema)
-__call__() gDataProtocol
}

class QueryResult
QueryResult: +DataFrame data
QueryResult: get_schema() pandera.Schema
QueryResult: compute() gDataProtocol

```

## Description

- __gDataProtocol__: Protocol class that defines the interface of genomic data accepted by `BasicQuery`. It exposes a `get_schema()` method that returns the schema and a `data` attribute that holds the underlying data.
- __BasicQuery__: Central query class that encapsulates querying an object that implements `gDataProtocol`. It holds a query plan, which is a list of filters, aggregations and transformations that are executed in order. It can be composed with other `BasicQuery` instances to express more complex queries, and it checks the requested operations against the schema returned by `get_schema()`.
- __QueryResult__: Result of a `BasicQuery`. It implements `gDataProtocol` and can either be computed, which materializes the result in memory, or be passed to another `BasicQuery`.
- __Filter__: Interface of a filter that is accepted by `BasicQuery` and encapsulates filtering along the rows of genomic data.
- __Snipper__: A filter that selects rows overlapping the genomic regions passed to its constructor. The anchor specifies how rows must overlap the regions (e.g. at least one contact, exactly one, all, or the first contact).
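
The relationships described above can be sketched in plain Python. This is a minimal illustration of the pattern, not the spoc implementation: the names `MinValueFilter`, `LazyResult` and `SimpleQuery` are hypothetical, rows are plain dictionaries instead of DataFrames, and the schema is reduced to a list of column names.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol

Row = Dict[str, int]


class QueryStep(Protocol):
    """Anything that can validate a schema and transform rows."""

    def validate_schema(self, schema: List[str]) -> None: ...

    def __call__(self, rows: List[Row]) -> List[Row]: ...


@dataclass
class MinValueFilter:
    """Hypothetical step: keep rows whose `column` is >= `minimum`."""

    column: str
    minimum: int

    def validate_schema(self, schema: List[str]) -> None:
        if self.column not in schema:
            raise ValueError(f"column {self.column!r} missing from schema")

    def __call__(self, rows: List[Row]) -> List[Row]:
        return [row for row in rows if row[self.column] >= self.minimum]


@dataclass
class LazyResult:
    """Holds the input and the pending steps; nothing runs until compute()."""

    rows: List[Row]
    steps: List[QueryStep]

    def compute(self) -> List[Row]:
        rows = self.rows
        for step in self.steps:
            rows = step(rows)
        return rows


@dataclass
class SimpleQuery:
    query_plan: List[QueryStep]

    def compose_with(self, other: "SimpleQuery") -> "SimpleQuery":
        # Composition concatenates the two plans in order.
        return SimpleQuery(self.query_plan + other.query_plan)

    def query(self, rows: List[Row]) -> LazyResult:
        # Validate every step against the input schema before doing any work.
        schema = sorted(rows[0]) if rows else []
        for step in self.query_plan:
            step.validate_schema(schema)
        return LazyResult(rows, list(self.query_plan))


rows = [{"start": 5, "count": 1}, {"start": 50, "count": 3}, {"start": 90, "count": 7}]
composed = SimpleQuery([MinValueFilter("start", 10)]).compose_with(
    SimpleQuery([MinValueFilter("count", 5)])
)
result = composed.query(rows)  # validated, but not executed yet
filtered = result.compute()    # executed
```

Validating in `query()` and deferring execution to `compute()` mirrors the split between `BasicQuery` and `QueryResult`: schema errors surface early, while the data is only touched when the result is requested.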

## Examples

Example pseudocode for selected use cases.

### Selecting a subset of contacts at a locus for display

```python
from spoc.query_engine import Snipper, Anchor, BasicQuery
from spoc.contacts import Contacts
import pandas as pd

# load input
target_region = pd.read_csv("single_test_region.bed")
contacts = Contacts.from_uri("test_contacts.spoc::2")

# specify query plan -> Select contacts where all contacts overlap the
# specified region
query_plan = [
    Snipper(target_region, anchor_mode=Anchor(mode="ALL"))
]

# instantiate query
query = BasicQuery(query_plan=query_plan)

# execute query

result = query.query(contacts)
result
#|> QueryResult

result.data
#|> duckdb.DuckDBPyRelation # not executed yet

result.compute()
#|> pd.DataFrame # executed
```
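
What an anchor mode decides can be sketched without spoc. The snippet below is an assumption-laden illustration: intervals are half-open `(chrom, start, end)` tuples, `"ANY"` stands in for the "at least one" case mentioned in the description, and the `anchors` argument (which positions of a contact are checked) is a hypothetical name.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True if both intervals are on the same chromosome and share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def matches_region(
    positions: Sequence[Interval],
    region: Interval,
    mode: str = "ALL",
    anchors: Optional[List[int]] = None,
) -> bool:
    """Decide whether a multi-way contact is selected for a region."""
    # Restrict the check to the anchored positions, if any were given.
    selected = positions if anchors is None else [positions[i] for i in anchors]
    hits = [overlaps(position, region) for position in selected]
    if mode == "ALL":
        return all(hits)
    if mode == "ANY":
        return any(hits)
    raise ValueError(f"unknown anchor mode: {mode!r}")


region = ("chr1", 100, 200)
inside = [("chr1", 110, 120), ("chr1", 150, 160)]
partial_hit = [("chr1", 110, 120), ("chr1", 500, 510)]

all_inside = matches_region(inside, region, mode="ALL")
all_partial = matches_region(partial_hit, region, mode="ALL")
any_partial = matches_region(partial_hit, region, mode="ANY")
first_only = matches_region(partial_hit, region, mode="ALL", anchors=[0])
```

With `mode="ALL"` the second contact is rejected because one of its positions lies outside the region; restricting the check to the first position, or switching to `"ANY"`, selects it.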

### Pileup of trans triplets

#### CC by T

Select 2D cis-pixels that are anchored by a trans contact.

```python
from functools import partial

from spoc.query_engine import (
    Snipper,
    PointFilter,
    Anchor,
    BasicQuery,
    RegionOffsetTransformation,
    Aggregation,
    AggregationMode,
    MappedRegionFilter,
)
from spoc.pixels import Pixels
from spoc.utils import get_center_bin
import pandas as pd

# load input
target_regions = pd.read_csv("multiple_test_regions.bed")
target_regions_mid_points = get_center_bin(target_regions, bin_size=10_000)
# triplet pixels of AAB where binary labels have been requested and symmetry has
# been flipped
pixels = Pixels.from_uri("test_pixels.spoc::10000::3::AAB")

############################################
#### option 1: specify entire query plan####
############################################

query_plan = [
    # select pixels where all bins in a triplet are contained in a region
    Snipper(target_regions, anchor_mode=Anchor(mode="ALL", contacs=[0,1])),
    # select pixels where the third contact overlaps a specific bin
    MappedRegionFilter(transform=partial(get_center_bin, bin_size=10_000), anchor_mode=Anchor(mode="ALL", contacs=[2])),
    # this transformation calculates the offset of a pixel to any containing target region
    RegionOffsetTransformation(target_regions),
    # this aggregation computes the sum of contacts per region and 2d coordinate
    Aggregation(function='sum', mode=AggregationMode(['Region', 'Contact1', 'Contact2'])),
    # this aggregation computes the average contacts per 2d coordinate (Contact1, Contact2) over all regions
    Aggregation(function='average', mode=AggregationMode(['Contact1', 'Contact2']))
]

# instantiate query
query = BasicQuery(query_plan=query_plan)

# execute query

result = query.query(pixels)
result
#|> QueryResult

############################################
#### option 2: compose pieces ####
############################################

query_plan_1 = [
    Snipper(target_regions, anchor_mode=Anchor(mode="ALL", contacs=[0,1])),
    MappedRegionFilter(transform=partial(get_center_bin, bin_size=10_000), anchor_mode=Anchor(mode="ALL", contacs=[2])),
    RegionOffsetTransformation(target_regions),
    Aggregation(function='sum', mode=AggregationMode(['Region', 'Contact1', 'Contact2'])),
]

query1 = BasicQuery(query_plan=query_plan_1)

query_plan_2 = [
    Aggregation(function='average', mode=AggregationMode(['Contact1', 'Contact2']))
]

query2 = BasicQuery(query_plan=query_plan_2)

# composing queries concatenates query plans

composed_query = query1.compose_with(query2)
composed_query.query_plan
#|> query_plan = [
#|> Snipper(target_regions, anchor_mode=Anchor(mode="ALL", contacs=[0,1])),
#|> MappedRegionFilter(transform=partial(get_center_bin, bin_size=10_000), anchor_mode=Anchor(mode="ALL", contacs=[2])),
#|> RegionOffsetTransformation(target_regions),
#|> Aggregation(function='sum', mode=AggregationMode(['Region', 'Contact1', 'Contact2'])),
#|> Aggregation(function='average', mode=AggregationMode(['Contact1', 'Contact2']))
#|> ]
```
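
The offset transformation followed by the two aggregations can be traced on toy data in plain Python. This is a sketch under stated assumptions, not the spoc implementation: the regions, the 2D pixels and their counts are made up, offsets are measured in bins relative to the center bin of the containing region, and the final average is taken only over regions that contribute a value for a given coordinate (a dense input that includes zero counts would behave differently).

```python
from collections import defaultdict

BIN_SIZE = 10_000

# Hypothetical regions: region id -> (chrom, start, end)
regions = {0: ("chr1", 100_000, 200_000), 1: ("chr1", 500_000, 600_000)}

# Hypothetical 2D pixels: (chrom, start_1, start_2, count)
pixels = [
    ("chr1", 140_000, 160_000, 4),
    ("chr1", 150_000, 150_000, 2),
    ("chr1", 540_000, 560_000, 6),
    ("chr1", 540_000, 560_000, 1),
]


def center_bin(start: int, end: int) -> int:
    """Start coordinate of the bin that contains the region midpoint."""
    return ((start + end) // 2 // BIN_SIZE) * BIN_SIZE


# Steps 1 and 2: keep pixels fully contained in a region and express their
# coordinates as bin offsets from the region center.
offset_rows = []
for chrom, start_1, start_2, count in pixels:
    for region_id, (r_chrom, r_start, r_end) in regions.items():
        contained = chrom == r_chrom and all(
            r_start <= pos < r_end for pos in (start_1, start_2)
        )
        if contained:
            center = center_bin(r_start, r_end)
            offsets = ((start_1 - center) // BIN_SIZE, (start_2 - center) // BIN_SIZE)
            offset_rows.append((region_id, offsets, count))

# Step 3: sum of counts per region and 2D offset.
per_region = defaultdict(int)
for region_id, offsets, count in offset_rows:
    per_region[(region_id, *offsets)] += count

# Step 4: average of the per-region sums per 2D offset across regions.
grouped = defaultdict(list)
for (region_id, offset_1, offset_2), total in per_region.items():
    grouped[(offset_1, offset_2)].append(total)
pileup = {key: sum(values) / len(values) for key, values in grouped.items()}
```

Summing within each region first and averaging afterwards keeps regions with many contacts from dominating the pileup, which is why the query plan uses two aggregation steps instead of a single one.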