Duckdb query engine #10
Conversation
models and query engine
position field and contact order methods
Many thanks @Mittmich for the pull request! I am looking at it now and will split my review into two parts: concepts and code.
Hey @dmitrymyl,
Naming suggestions
I totally agree with this! This also ties into the naming of
I agree with this and will rename.
Sounds good! I already started to implement some of this in the
Yes, I also think that snipping is jargon and we should find something else. I don't like
Yeah, this is meant to subset rows. We could also capture this by implementing a
Questions
This is a method that returns the position coordinates from a GenomicDataSchema. This allows validating schemas during the construction of queries.
Hmm, at least for the realization relationships, the arrow points towards the class that realizes a protocol (https://www.ibm.com/docs/en/dma?topic=diagrams-realization-relationships), so I think that should be fine?
Ah yes, maybe we can also rename the
The
Hmm, that is a good question! I am not sure whether this is the case... And the steps do have access to the schema that all the query steps produce, as building the query plan involves pre-building the schema.
Merging of branches
I think it's great that you already looked at the code in the
Regarding your comments: I can't really see them, unfortunately, did you add them here?
Okay, with this comment I try to make my review comments visible...
Cool! I added a couple of comments here as well.
Hey @Mittmich, thanks for the replies and quick fixes!
Naming suggestions
Sounds very good!
Great
Perfect
Questions
Got it
I think that in the link you sent the arrow points to the class that specifies the protocol...
That actually sounds cool!
Okay, I got it.
Okay, let's assume that this is not the case. But maybe we should write that down somewhere so we don't forget this possibility.
Merging of branches
Yeah, that'll be convenient
Nice!
Alright, I merged the other branch and renamed the classes as discussed :) It would be great if you could also have a look at the other functionality (described in the query_engine_usage.ipynb notebook)! Thanks for the help!
@Mittmich I checked the new code except for the tests and the implementations of RegionOffsetTransformation and OffsetAggregation (will do later). I added my comments to the code and some general notes below:
Now the query engine emerges as a powerful tool! I think we now have a good example of how we can use it (with pileups). I am curious how easy it will be to add new steps to it -- maybe some contribution docs in the (not so near) future?
- I rarely see enums in the public API of scientific packages, so they feel inconvenient to me. I would design their functionality as a str kwarg with nested ifs, but then there is a problem of repetitive code and unclear specifications.
- There are no performance settings, like setting the number of cores or a memory limit. These params could be kwargs for the Query class.
- Offset coordinates are added separately; an interesting implementation, but it makes sense for downstream steps.
- Do we have support for normalized data? Like "normalized_count" or smth.
- RegionOffsetTransformation and OffsetAggregation are nouns, should we switch to verbs? But it will be lengthy anyway...
- Should we maybe make "stub" classes for Transform and Aggregate and inherit from them to establish the nomenclature of steps in the codebase?
I feel like I should update the nomenclature of query steps in my doc in Teams so that we can check which steps we have already implemented and which are left.
Otherwise looks great, thanks as usual!
```python
from typing import List, Optional, Dict
import pandas as pd

from itertools import permutations
```
What's the reason for splitting imports? :)
It's done by reorder-python-imports in the pre-commit hooks. It separates builtin from third-party from imports in this repo. It's very handy because it reorders imports automatically :)
Ah, I meant that before there were multiple objects imported per line (List, Optional, Dict), and now they are imported in individual lines — but if that's an automatic behaviour, then it doesn't matter :)
In this example, the contact overlapping both regions is duplicated.
This is due to region IDs, right? I am curious whether we would need a type of output where there are no regions and we need only unique contacts. This can easily be done with pandas (drop columns, drop duplicates), so I don't think we have to implement that in the query engine for now.
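For the record, a minimal pandas sketch of that route; the column names here are made up for illustration:

```python
import pandas as pd

# Toy query output: contact 1 overlaps two regions and is reported twice
result = pd.DataFrame(
    {"contact_id": [1, 1, 2], "region_id": ["a", "b", "a"]}
)
# Drop the region annotation, then drop duplicates to keep unique contacts
unique_contacts = result.drop(columns=["region_id"]).drop_duplicates()
print(unique_contacts)  # contacts 1 and 2, each reported once
```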
Indeed, it is because there are overlapping regions, and we therefore report contacts that fall within both regions twice. I think it is a good idea to determine whether we want that behavior. One could imagine adding it as a parameter to the Overlap class, perhaps as we described in the concept for multioverlap:
```python
MultiOverlap(
    regions=[loop_bases_left, loop_bases_right],  # Regions to overlap
    add_overlap_columns=False,  # Whether to add the regions as additional columns
    anchor=MultiAnchor(
        fragment_mode="ANY",
        region_mode="ANY",
        positions=[0, 1],
    ),  # At least one fragment needs to overlap at least one region
)
```
This way we could incorporate it in query plans and wouldn't need to resort to pandas to get rid of duplicates.
Yeah, it might work this way :)
docs/query_engine_usage.md
```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    # ... (snippet truncated in the diff)
```
I would use mode="ALL" for pileups. Besides, it would be a nice opportunity to showcase this mode.
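For illustration, the docs' pileup plan with the suggested mode; this is a sketch only, the import path is assumed from the PR's file layout, and target_regions is the regions frame from the docs example:

```python
# Sketch: require that ALL fragments overlap a target region
# (import path assumed; target_regions as in the docs example)
from spoc.query_engine import Anchor, Overlap

query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
]
```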
docs/query_engine_usage.md
```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    RegionOffsetTransformation(),
    OffsetAggregation('count'),
    # ... (snippet truncated in the diff)
```
Also, it is unclear over which columns we aggregate. Well, based on the class name, I can deduce that we aggregate over ["offset_1", "offset_2", "offset_3"] for triplets, but this should maybe be clarified somewhere. Also, is it possible to aggregate over a subset of offset columns?
Good point! My reasoning was that OffsetAggregation would always aggregate over all offset columns, and we would have a separate aggregation for reducing along an offset dimension. But indeed, there is actually no difference other than the function we would use; I think a densified output would also work for that task. I would then implement subsets!
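To make the discussed default concrete, here is a toy duckdb aggregation over all offset columns; the table and column names are made up and are not spoc's actual schema:

```python
import duckdb
import pandas as pd

# Toy pixels annotated with two offset columns and a value column
pixels = pd.DataFrame(
    {
        "offset_1": [0, 0, 100],
        "offset_2": [100, 100, 200],
        "contact_count": [1, 2, 3],
    }
)
# Aggregate the value column grouped by every offset column;
# duckdb resolves `pixels` to the local DataFrame via replacement scans
result = duckdb.sql(
    "SELECT offset_1, offset_2, SUM(contact_count) AS contact_count "
    "FROM pixels GROUP BY offset_1, offset_2"
).df()
print(result)  # two rows: (0, 100) -> 3 and (100, 200) -> 3
```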
```python
class AggregationFunction(Enum):
    # ... (definition truncated in the diff)
```
Wouldn't it be too wordy to specify a non-default agg function in OffsetAggregation?
I would implement that as a separate type that can be passed as a function. For example, a basic implementation would be that we allow Union[AggregationFunction, str] for the function argument, and if it's a string, we call .format with the value column. So if you want something like MIN(column), you would pass MIN({}).
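A minimal sketch of that dispatch; the enum members and the helper name are my assumptions, not the PR's actual code:

```python
from enum import Enum
from typing import Union


class AggregationFunction(Enum):
    # Members are illustrative; the PR's actual enum may differ
    COUNT = "COUNT({})"
    SUM = "SUM({})"


def build_aggregation(function: Union[AggregationFunction, str], value_column: str) -> str:
    """Render the SQL aggregation expression for the value column."""
    template = function.value if isinstance(function, AggregationFunction) else function
    return template.format(value_column)


print(build_aggregation(AggregationFunction.SUM, "count"))  # SUM(count)
print(build_aggregation("MIN({})", "count"))                # MIN(count)
```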
Okay, that sounds reasonable for custom functions. I was a bit confused that if I want to specify a non-default function (like SUM), I will have to type AggregationFunction.SUM, which is too long. But then we can come up with an alias, like import torch.nn.functional as F.
For the case that you described, passing smth like MIN({}) seems to be a bit redundant -- we can attach ({}) under the hood.
Speaking of custom functions, we have two sorts of them: those defined in duckdb and UDFs. The former should be simple to use, while I am curious how UDFs are defined and used in duckdb (since one can write a duckdb UDF in Python, this seems to be a nice feature for the future).
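For future reference, registering a Python UDF in duckdb looks roughly like this (available since duckdb 0.8; a toy function, unrelated to spoc's API):

```python
import duckdb
from duckdb.typing import BIGINT

con = duckdb.connect()
# Register a plain Python callable as a SQL-callable scalar function
con.create_function("plus_one", lambda x: x + 1, [BIGINT], BIGINT)
print(con.sql("SELECT plus_one(41) AS answer").df())  # answer = 42
```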
Indeed, UDFs would be nice in the future, but I would leave that for now.
Hi @dmitrymyl, thanks for the review! I addressed your comments :) Regarding your points:
- I agree! I added the possibility to instantiate the classes with string arguments :)
- Good point, added a trello card: https://trello.com/c/gPAAy9Um/95-add-performance-setting-queries-to-query-class
- Yes, one just needs to pass it as
- Hmm, indeed, but I would wait with renaming until we have a few examples, because we may need to group things together again anyway.
- I thought about it, but I wouldn't make stub classes until we have a specific separation of behavior, as at the moment they all have the same interface, namely the
Hi @Mittmich, thanks for the new commits!
It looks good and is well designed, but I got lost in the naming and implementations of RegionOffsetTransformation and OffsetAggregation. They represent necessary blocks for the pileup procedure, so I tried to understand how the procedure should go.
- ROIs are transformed outside of the query engine:
  - midpoint = (start + end) // 2
  - start = midpoint - windowsize
  - end = midpoint + windowsize
- These transformed ROIs are submitted to Overlap and they are overlapped with Pixels.
- Offsets are distances from the midpoints of ROIs to the specified coordinate of each pixel bin (start, end, midpoint, both -- defined in the Offset enum); by default it's start. RegionOffsetTransformation calculates those offsets.
- OffsetAggregation binnifies offsets (implicitly), selects the window size (relying on the fact that the ROIs are windowed midpoints) and aggregates counts of pixels by binned offsets if they fall within the window.

Does this flow reflect the one implemented?
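For concreteness, a pandas sketch of step 1 as I understand it (assuming bioframe-style start/end columns):

```python
import pandas as pd


def regions_to_windows(regions: pd.DataFrame, windowsize: int) -> pd.DataFrame:
    """Center each ROI on its midpoint and expand it by +/- windowsize."""
    windows = regions.copy()
    midpoint = (windows["start"] + windows["end"]) // 2
    windows["start"] = midpoint - windowsize
    windows["end"] = midpoint + windowsize
    return windows


rois = pd.DataFrame({"chrom": ["chr1"], "start": [1000], "end": [2000]})
print(regions_to_windows(rois, windowsize=500))  # midpoint 1500 -> window 1000..2000
```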
I have some questions and suggestions:
- I got lost with the "Offset" term (thinking back to our previous implementation of triplet pileups). Maybe we could rename it to "Distance"?
- I couldn't pinpoint the binning of offsets: is it not present in the code and done implicitly, relying on the fact that pixels are binned already?
- Maybe we could instead introduce an operation OffsetBinningTransformation that converts calculated offsets in nts into bins relative to the midpoints of ROIs (see the sketch after this comment). This operation would be placed between RegionOffsetTransformation and OffsetAggregation, thus removing the implicit casting of real offsets into binned offsets (which relies on the binning of pixels for now). Then RegionOffsetTransformation just calculates distances from fragments to midpoints of ROIs (obtained after Overlap with windowed ROIs) and could be applied to both pixels and contacts, the logic of casting real offsets into binned offsets is implemented in OffsetBinningTransformation, and aggregation in OffsetAggregation.
- I haven't parsed how exactly RegionOffsetTransformation calculates offsets (= distances); but can it be implemented in Overlap as well, similar to bioframe and bedtools?
- I also think that it might be better to pass the window size explicitly somewhere. For example, it could be an OffsetWindowFilter or just an OffsetFilter right after the OffsetAggregation to filter out rows where offsets are outside of (-windowsize, +windowsize).
I am curious what you think about it 🙂
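Here is the sketch referenced above: a pandas rendering of the proposed OffsetBinningTransformation and OffsetFilter semantics (the names and the flooring logic are my suggestions, not existing code):

```python
import pandas as pd


def bin_offsets(offsets: pd.Series, bin_size: int) -> pd.Series:
    """Cast raw offsets (in nts) into bins of width bin_size."""
    return (offsets // bin_size) * bin_size


def filter_offsets(frame: pd.DataFrame, windowsize: int, offset_columns: list) -> pd.DataFrame:
    """Keep rows whose offsets all fall within (-windowsize, +windowsize)."""
    within = pd.concat(
        [frame[col].abs() < windowsize for col in offset_columns], axis=1
    ).all(axis=1)
    return frame[within]
```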
```python
# get offset columns
offset_columns = [f"offset_{i}" for i in position_fields.keys()]
# construct join and coalesce output
data_frame = (
    # ... (query construction truncated in the diff)
```
That's a bit of a confusing query (because I am not familiar with project and COALESCE); I will parse it later.
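For readers equally unfamiliar: project is the relational-API analogue of SELECT, and COALESCE returns its first non-NULL argument. A toy duckdb demonstration:

```python
import duckdb

# A two-column relation with a NULL in the first column
rel = duckdb.sql("SELECT * FROM (VALUES (NULL, 10), (5, 20)) AS t(a, b)")
# COALESCE picks a where present, otherwise falls back to b
print(rel.project("COALESCE(a, b) AS first_non_null").df())  # 10, then 5
```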
spoc/query_engine.py
```python
class OffsetMode(Enum):
    # ... (definition truncated in the diff)
```
If the regions submitted to RegionOffsetTransformation have already been extended by +/- window size, why would a user need to calculate offsets other than MIDPOINT?
Are there other potential use cases?
Aaah, the OffsetMode is for fragments, not for regions! Okay, this works.
Regarding your replies to my replies for the first portion of the review, @Mittmich, everything sounds great, I agree with your points! I have some other feature requests:
Hey @dmitrymyl,
Indeed, that flow is the one implemented, with one exception:
I think this will also answer a lot of the code comments that you made.
- Distance sounds good, will rename it.
- Indeed, see the comment above.
- Eventually, that would be great, but I think currently the flow would be to have clients of the code go through creating pixels.
- It could be, but I like the separation, and it's a nice example of a transformation, I think.
- Yes, implicit windowsize handling is a problem. How about one can optionally pass window sizes, and then regions are expanded to that size? Then it can be stored in the schema so we don't rely on implicit calculations.
Hey! I implemented all the suggestions above, except for the rename to Distance; I thought about it some more, and I think offset is more intuitive. We could name the thing we historically called offset also
Thanks for the quick changes @Mittmich! A couple of my comments are in the code and below:
- Let's fix OffsetAggregation and the whole Pileup flow for Pixels only for now; maybe we will change it later.
- Offset vs Distance: if we treat the query engine as a bioframe/bedtools analogue, then we'd rather stick to the word distance, since it is the one used by these packages, and offset is not found in their docs. I agree that distance is usually a non-negative metric, while here we have it otherwise, but offset might be even more confusing than that.
- Windows constructed on the fly from supplied regions: it works, let's leave it like that for now. I was thinking about outsourcing the creation of windows into a complex analytical procedure, i.e. we would have basic building blocks in the query engine and then a wrapper method/class that transforms Regions into windows. This still doesn't help with the fact that OffsetTransformation calculates distances to the midpoints of windows, which kinda leaks assumptions from an analytical procedure into the query engine, but this is okay for now.

Otherwise, besides some naming conventions, I would be happy to approve this PR to try the query engine in the wild and plan the next directions :)
Thanks for the input, @dmitrymyl! I renamed offset to distance everywhere :) Let's merge and start using spoc!
All the best! Thanks, @Mittmich :)
This PR implements this ticket. It suggests the class structure of the query engine (which can be found in docs/query_engine_interface.md) and provides an implementation using duckdb. A good place to start the review is the usage documentation, which can be found here: docs/query_engine_usage.md. Looking forward to your comments!