
Duckdb query engine #10

Merged: 45 commits merged into master on Feb 22, 2024

Conversation

@Mittmich (Contributor)

This PR implements this ticket. It suggests a class structure for the query engine (described in docs/query_engine_interface.md) and provides an implementation using duckdb. A good place to start the review is the usage documentation in docs/query_engine_usage.md. Looking forward to your comments!

@dmitrymyl (Contributor)

Many thanks @Mittmich for the pull request! I am looking at it now and will split my review into two parts: concepts and code.

@Mittmich (Contributor, Author) commented Feb 4, 2024

Hey @dmitrymyl,
thanks for the thorough review! Regarding your points:

Naming suggestions

  1. query = BasicQuery([...]); result = query.query(dataset). query.query sounds redundant. Maybe we should just implement __call__ to make it query(dataset)? Although, this is a somewhat advanced Python feature; less experienced users might not be familiar with it and might find it confusing.

I totally agree with this! This also ties into the naming of BasicQuery since it's a very fundamental object. I would be in favor of naming it Query and having a build method that is equivalent to the current query method?

  2. result.compute() sounds better and more in line with dask than result.load_result().

I agree with this and will rename.

  3. Snipper, Aggregation, Transformation — imo verbs would be shorter to type and they are also in line with SQL syntax (thus helping with adoption of our query engine). So I suggest renaming all QueryStep classes to verbs like Snip, Aggregate,

Sounds good! I already started to implement some of this in the duckdb-transform-aggregation branch, so this fits well.

  4. Snipping naming

Yes, I also think that snipping is jargon and we should find something else. I don't like Select that much since - as you say - it has another meaning in SQL. I like Overlap as a name. I would then also merge the single- and multi-region capabilities into one class and make the behavior dependent on the passed input.

  5. In the docs, Filter along the rows — I have always found this terminology confusing in pandas. Does it subset rows or subset columns? I suspect the former. It could also be rewritten from the perspective of contacts and individual fragments, but this is not crucial ATM.

Yeah, this is meant to subset rows. We could also capture this by implementing a RegionOffsetOverlap to allow overlapping with relative regions that have already been added. Then we wouldn't need to talk about rows or columns, but about concepts.

Questions

What are get_position_coordinates() of gDataSchema? It was not described in the UML diagram (maybe it will be clear after I check the code).

This is a method that returns the position coordinates from a GenomicDataSchema. This makes it possible to validate schemas during the construction of queries.
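To illustrate, a rough sketch of the kind of check this enables (not from the PR; everything except the method name get_position_coordinates is an assumption):

```python
from typing import Dict, List


class GenomicDataSchema:
    """Rough sketch; only get_position_coordinates is named in this thread."""

    def __init__(self, position_coordinates: Dict[int, List[str]]) -> None:
        self._position_coordinates = position_coordinates

    def get_position_coordinates(self) -> Dict[int, List[str]]:
        # The return type (position index -> coordinate columns) is an assumption.
        return self._position_coordinates


def validate_input_schema(schema: GenomicDataSchema, required_positions: int) -> None:
    """Toy check a query step could run while the query plan is being built."""
    if len(schema.get_position_coordinates()) < required_positions:
        raise ValueError("Input schema exposes too few position coordinates.")
```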

Mermaid diagram: should all arrows be in reverse? It's the case with UML diagrams, but maybe syntax is different here. Anyway, it doesn't matter for the project per se.

Hmm, at least for the realization relationships, the arrow points towards the class that realizes a protocol (https://www.ibm.com/docs/en/dma?topic=diagrams-realization-relationships), so I think that should be fine?

It was a bit confusing that QueryResult is an output of the BasicQuery and implements the gDataProtocol at the same time. Initially, I was thinking that the QueryResult is an actual materialization, like a 2D pileup matrix. But it is actually a lazily evaluated result, not the data. Then it makes sense to say it implements the gDataProtocol.

Ah yes, maybe we can also rename the QueryResult to something like QueryPlan? Then we can rename the argument query_plan of BasicQuery (which we will rename to Query) to something like query_steps. Then the Query object will get a list of query steps and its build method generates a QueryPlan, which has a compute method. What do you think?
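Just to make the proposal concrete, usage would then read roughly like this (a sketch only: the Query/QueryPlan names are the proposal above, the Overlap/Anchor step follows the docs snippets, and target_regions/contacts are placeholders):

```python
# Sketch of the proposed interface; assumes the renames discussed above.
from spoc.query_engine import Anchor, Overlap, Query

query = Query(query_steps=[
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),  # placeholder step
])
plan = query.build(contacts)   # builds a lazy QueryPlan; nothing is computed yet
result = plan.compute()        # materializes the result, dask-style
```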

However, I am still not sure that the QueryResult should implement the gDataProtocol. It makes sense since the BasicQuery can accept the QueryResult. But what if the QueryResult actually relates to something that is not Pixels or Contacts, like a 2D pileup matrix? What sense is there for it to follow the gDataProtocol? As an alternative, QueryResult might or might not implement the gDataProtocol to accommodate this ambiguity.

The GenomicDataProtocol is quite flexible, as it only contains methods to get aspects of the schema of the data. Whether that schema is compatible with other query steps depends on the specific query step and its validation method. So in theory we can have 2D pixels, but basically no query step could accept them as input, which is fine since they would be more of a result.

BasicQuery.query() builds the query plan for postponed execution. The query plan is specified in the class instance and is applied to a dataset. This implies that query steps are independent of the dataset. Can it be otherwise: that features of the dataset (shape, contact orders, etc.) dynamically dictate the specific choice of procedures (in a way that the user doesn't know beforehand and cannot write another query plan)?

Hmm, that is a good question! I am not sure whether this is the case... The steps do have access to the schemas that the previous query steps produce, since building the query plan involves pre-building the schema.

Merging of branches

I think it's great that you already looked at the code in the duckdb-transform-aggregation branch! What I would propose is that I merge that branch here and we discuss all your comments together. Otherwise we will have many discussions that need to refer to the other branch.

Regarding your comments: I can't really see them, unfortunately, did you add them here?

@dmitrymyl (Contributor) left a comment

Okay, with this comment I try to make my review comments visible...

Review threads on spoc/contacts.py, spoc/io.py, spoc/models/dataframe_models.py, and spoc/query_engine.py (resolved).
@dmitrymyl (Contributor) left a comment

Cool! I added a couple of comments here as well.

Review thread on spoc/io.py (resolved).
@dmitrymyl (Contributor)

Hey @Mittmich, thanks for the replies and quick fixes!

Naming suggestions

  1. query = BasicQuery([...]); result = query.query(dataset). query.query sounds redundant. Maybe we should just implement __call__ to make it query(dataset)? Although, this is a somewhat advanced Python feature; less experienced users might not be familiar with it and might find it confusing.

I totally agree with this! This also ties into the naming of BasicQuery since it's a very fundamental object. I would be in favor of naming it Query and having a build method that is equivalent to the current query method?

Sounds very good!

  2. result.compute() sounds better and more in line with dask than result.load_result().

I agree with this and will rename.

Great

  3. Snipper, Aggregation, Transformation — imo verbs would be shorter to type and they are also in line with SQL syntax (thus helping with adoption of our query engine). So I suggest renaming all QueryStep classes to verbs like Snip, Aggregate,

Sounds good! I already started to implement some of this in the duckdb-transform-aggregation branch, so this fits well.

Perfect

  4. Snipping naming

Yes, I also think that snipping is jargon and we should find something else. I don't like Select that much since - as you say - it has another meaning in SQL. I like Overlap as a name. I would then also merge the single- and multi-region capabilities into one class and make the behavior dependent on the passed input.

Overlap and MultiOverlap sound good, I also like them.

  5. In the docs, Filter along the rows — I have always found this terminology confusing in pandas. Does it subset rows or subset columns? I suspect the former. It could also be rewritten from the perspective of contacts and individual fragments, but this is not crucial ATM.

Yeah, this is meant to subset rows. We could also capture this by implementing a RegionOffsetOverlap to allow overlapping with relative regions that have already been added. Then we wouldn't need to talk about rows or columns, but about concepts.

RegionOffsetOverlap sounds interesting, this might be a nice feature, but let's think about it later :)

Questions

What are get_position_coordinates() of gDataSchema? It was not described in the UML diagram (maybe it will be clear after I check the code).

This is a method that returns the position coordinates from a GenomicDataSchema. This makes it possible to validate schemas during the construction of queries.

Got it

Mermaid diagram: should all arrows be in reverse? It's the case with UML diagrams, but maybe syntax is different here. Anyway, it doesn't matter for the project per se.

Hmm, at least for the realization relationships, the arrow points towards the class that realizes a protocol (https://www.ibm.com/docs/en/dma?topic=diagrams-realization-relationships), so I think that should be fine?

I think that in the link you sent the arrow points to the class that specifies the protocol...

It was a bit confusing that QueryResult is an output of the BasicQuery and implements the gDataProtocol at the same time. Initially, I was thinking that the QueryResult is an actual materialization, like a 2D pileup matrix. But it is actually a lazily evaluated result, not the data. Then it makes sense to say it implements the gDataProtocol.

Ah yes, maybe we can also rename the QueryResult to something like QueryPlan? Then we can rename the argument query_plan of BasicQuery (which we will rename to Query) to something like query_steps. Then the Query object will get a list of query steps and its build method generates a QueryPlan, which has a compute method. What do you think?

That actually sounds cool!

However, I am still not sure that the QueryResult should implement the gDataProtocol. It makes sense since the BasicQuery can accept the QueryResult. But what if the QueryResult actually relates to something that is not Pixels or Contacts, like a 2D pileup matrix? What sense is there for it to follow the gDataProtocol? As an alternative, QueryResult might or might not implement the gDataProtocol to accommodate this ambiguity.

The GenomicDataProtocol is quite flexible, as it only contains methods to get aspects of the schema of the data. Whether that schema is compatible with other query steps depends on the specific query step and its validation method. So in theory we can have 2D pixels, but basically no query step could accept them as input, which is fine since they would be more of a result.

Okay, I got it.

BasicQuery.query() builds the query plan for postponed execution. The query plan is specified in the class instance and is applied to a dataset. This implies that query steps are independent of the dataset. Can it be otherwise: that features of the dataset (shape, contact orders, etc.) dynamically dictate the specific choice of procedures (in a way that the user doesn't know beforehand and cannot write another query plan)?

Hmm, that is a good question! I am not sure whether this is the case... The steps do have access to the schemas that the previous query steps produce, since building the query plan involves pre-building the schema.

Okay, let's assume that this is not the case. But maybe we should write that down somewhere so we don't forget this possibility.

Merging of branches

I think it's great that you already looked at the code in the duckdb-transform-aggregation branch! What I would propose is that I merge that branch here and we discuss all your comments together. Otherwise we will have many discussions that need to refer to the other branch.

Yeah, that'll be convenient

@dmitrymyl (Contributor) left a comment

Nice!

@Mittmich (Contributor, Author) commented Feb 4, 2024

Alright, I merged the other branch and renamed the classes as discussed :) It would be great if you could also have a look at the other functionality (described in the query_engine_usage.ipynb notebook)! Thanks for the help!

@dmitrymyl (Contributor) left a comment

@Mittmich I checked the new code except for the tests and the implementations of RegionOffsetTransformation and OffsetAggregation (will do later). I added my comments to the code and some general notes below:

Now the query engine emerges as a powerful tool! I think now we have a good example of how we can use it (with pileups). I am curious how easy it will be to add new steps to it -- maybe some contribution docs in the (not so near) future?

  1. I rarely see enums in the public API of scientific packages, so they feel inconvenient to me. I would design their functionality as a str kwarg with nested ifs, but then there is a problem of repetitive code and unclear specifications.
  2. There are no performance queries, like setting the number of cores or the memory limit. These params could be kwargs for the Query class (see the sketch after this list).
  3. Offset coordinates are added separately; an interesting implementation, but it makes sense for downstream steps.
  4. Do we have support for normalized data? Like "normalized_count" or something.
  5. RegionOffsetTransformation and OffsetAggregation are nouns, should we switch to verbs? But it will be lengthy anyway...
  6. Should we maybe make "stub" classes for Transform and Aggregate and inherit from them to establish the nomenclature of steps in the codebase?
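
(For point 2: on the duckdb side these are plain SET statements, so hypothetical Query kwargs could simply forward to them. A sketch, not code from this PR:)

```python
import duckdb

# duckdb exposes performance settings as SET statements; hypothetical
# Query(threads=..., memory_limit=...) kwargs could forward to these.
con = duckdb.connect()
con.execute("SET threads TO 4")
con.execute("SET memory_limit = '8GB'")
```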

I feel like I should update the nomenclature of query steps in my doc in Teams so that we could check which steps we have already implemented and which are left.

Otherwise looks great, thanks as usual!

```python
from typing import List, Optional, Dict
import pandas as pd

from itertools import permutations
```
@dmitrymyl (Contributor):

What's the reason for splitting imports? :)

@Mittmich (Contributor, Author):

It's done by reorder-python-imports in the pre-commit hooks. It separates builtin imports from third-party imports from imports within this repo. It's very handy because it reorders imports automatically :)
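For illustration, this is roughly what the hook does (the spoc import is just an example of a repo-local import, not a line from the PR):

```python
# Before the hook runs:
from typing import List, Optional, Dict

# After reorder-python-imports: one name per import line, grouped into
# builtin / third-party / local sections separated by blank lines.
from typing import Dict
from typing import List
from typing import Optional

import pandas as pd

from spoc import contacts
```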

@dmitrymyl (Contributor):

Ah, I meant that before there were multiple objects imported per line (List, Optional, Dict), and now they are imported in individual lines — but if that's an automatic behaviour, then it doesn't matter :)

Review threads on spoc/contacts.py, spoc/models/dataframe_models.py, spoc/pixels.py, .pre-commit-config.yaml, and docs/query_engine_usage.md (resolved).
@@ -349,3 +349,386 @@ BasicQuery(query_plan=query_plan)\
In this example, the contact overlapping both regions is duplicated.
@dmitrymyl (Contributor):

This is due to region IDs, right? I am curious if we would need a type of output where there are no regions and we need only unique contacts. This can easily be done with pandas (drop columns, drop duplicates), so I don't think we have to implement that in the query engine for now.

@Mittmich (Contributor, Author):

Indeed, it is because there are overlapping regions, and we therefore report contacts that fall within both regions twice. I think it is a good idea to determine whether we want that behavior. One could imagine adding it as a parameter to the Overlap class, perhaps as we described in the concept for MultiOverlap:

```python
MultiOverlap(
    regions=[loop_bases_left, loop_bases_right],  # Regions to overlap
    add_overlap_columns=False,  # Whether to add the regions as additional columns
    anchor=MultiAnchor(
        fragment_mode="ANY",
        region_mode="ANY",
        positions=[0, 1],
    ),  # At least one fragment needs to overlap at least one region
),
```

This way we could incorporate it in query plans and wouldn't need to resort to pandas to get rid of duplicates.

@dmitrymyl (Contributor):

Yeah, it might work this way :)


```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
```
@dmitrymyl (Contributor):

I would use mode="ALL" for pileups. Besides, it would be a nice opportunity to showcase this mode.

```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    RegionOffsetTransformation(),
    OffsetAggregation('count'),
```
@dmitrymyl (Contributor):

Also, it is unclear over which columns we aggregate. Well, based on the class name, I can deduce that we aggregate over ["offset_1", "offset_2", "offset_3"] for triplets, but this should maybe be clarified somewhere. Also, is it possible to aggregate over a subset of the offset columns?

@Mittmich (Contributor, Author):

Good point! My reasoning was that OffsetAggregation would always aggregate over all offset columns and that we would have a separate aggregation for reducing along an offset dimension. But indeed, there is actually no difference other than the function we would use. I think a densified output would also work for that task. I would then implement subsets!
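For reference, "aggregate over all offset columns" boils down to a GROUP BY over those columns (a toy duckdb sketch, not code from this PR; the offset_*/count column names follow the snippets in this thread):

```python
import duckdb
import pandas as pd

# Toy pixel table after RegionOffsetTransformation: offsets per position plus a value column.
pixels = pd.DataFrame({
    "offset_1": [0, 0, 1000, 1000],
    "offset_2": [0, 1000, 0, 1000],
    "count": [3, 1, 2, 5],
})

# Aggregating over all offset columns, as OffsetAggregation('count') would do.
print(duckdb.sql(
    'SELECT offset_1, offset_2, SUM("count") AS "count" '
    "FROM pixels GROUP BY offset_1, offset_2"
).df())
```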


```python
class AggregationFunction(Enum):
```
@dmitrymyl (Contributor):

Wouldn't it be too wordy to specify a non-default agg function in OffsetAggregation?

@Mittmich (Contributor, Author) commented Feb 11, 2024

I would implement that as a separate type that can be passed as a function. For example, a basic implementation would be that we allow Union[AggregationFunction, str] for the function argument and, if it's a string, we call .format with the value column. So if you want something like MIN(column), you would pass MIN({}).
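A minimal sketch of what I mean (the enum members and the helper are hypothetical; only the AggregationFunction name and the MIN({}) idea are from this discussion):

```python
from enum import Enum
from typing import Union


class AggregationFunction(Enum):
    # Members are illustrative; only the class name appears in the PR.
    COUNT = "COUNT({})"
    SUM = "SUM({})"


def resolve_function(function: Union[AggregationFunction, str], value_column: str) -> str:
    """Turn the function argument into a SQL expression over the value column."""
    template = function.value if isinstance(function, AggregationFunction) else function
    return template.format(value_column)


print(resolve_function("MIN({})", "count"))                # MIN(count)
print(resolve_function(AggregationFunction.SUM, "count"))  # SUM(count)
```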

@dmitrymyl (Contributor):

Okay, that sounds reasonable for custom functions. I was a bit confused that if I want to specify a non-default function (like SUM), I will have to type AggregationFunction.SUM, which is too long. But then we can come up with an alias, like import torch.nn.functional as F.

For the case that you described, passing something like MIN({}) seems to be a bit redundant -- we can attach ({}) under the hood.

Speaking of custom functions, we have two sorts of them: those defined in duckdb and UDFs. The former should be simple to use, while I am curious how UDFs are defined and used in duckdb (since one can write a duckdb UDF in Python, this seems to be a nice feature for the future).

@Mittmich (Contributor, Author):

Indeed, UDFs would be nice in the future, but I would leave that for now.
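For future reference, a Python UDF in duckdb looks roughly like this (a sketch, not part of this PR; assumes a duckdb version that supports Python UDFs, 0.8+ if I remember correctly):

```python
import duckdb
from duckdb.typing import BIGINT

con = duckdb.connect()


def clip_count(value: int) -> int:
    """Toy UDF: cap a count at 10."""
    return min(value, 10)


con.create_function("clip_count", clip_count, [BIGINT], BIGINT)
print(con.sql("SELECT clip_count(42) AS capped").fetchall())  # [(10,)]
```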

@Mittmich (Contributor, Author)

Hi @dmitrymyl, thanks for the review! I addressed your comments :)

Regarding your points:

  1. I rarely see enums in the public API of scientific packages, so they feel inconvenient to me. I would design their functionality as a str kwarg with nested ifs, but then there is a problem of repetitive code and unclear specifications.

I agree! I added the possibility to instantiate the classes with string arguments :)

  2. There are no performance queries, like setting the number of cores or the memory limit. These params could be kwargs for the Query class.

Good point, added a trello card: https://trello.com/c/gPAAy9Um/95-add-performance-setting-queries-to-query-class

  4. Do we have support for normalized data? Like "normalized_count" or something.

Yes, one just needs to pass it as value_column for the OffsetAggregations.

  5. RegionOffsetTransformation and OffsetAggregation are nouns, should we switch to verbs? But it will be lengthy anyway...

Hmm, indeed, but I would wait with renaming until we have a few examples, because we may need to group things together again anyway.

  6. Should we maybe make "stub" classes for Transform and Aggregate and inherit from them to establish the nomenclature of steps in the codebase?

I thought about it, but I wouldn't make stub classes until we have a specific separation of behavior, as at the moment they all have the same interface, namely the QueryStepProtocol. I actually think it's good to have more conceptual separation but similar behavior in terms of code.

@dmitrymyl (Contributor) left a comment

Hi @Mittmich, thanks for the new commits!

It looks good and is well designed, but I got lost in the naming and implementations of RegionOffsetTransformation and OffsetAggregation. They represent necessary building blocks for the pileup procedure, so I tried to understand how the procedure should go.

  1. ROIs are transformed outside of the query engine (see the sketch below):
    1. midpoint = (start + end) // 2
    2. start = midpoint - windowsize
    3. end = midpoint + windowsize
  2. These transformed ROIs are submitted to Overlap and they are overlapped with Pixels.
  3. Offsets are distances from the midpoints of ROIs to the specified coordinate of each pixel bin (start, end, midpoint, both -- defined in the Offset enum); by default it's start.
  4. RegionOffsetTransformation calculates those offsets.
  5. OffsetAggregation bins offsets (implicitly), selects the window size (relying on the fact that the ROIs are windowed midpoints) and aggregates counts of pixels by binned offsets if they fall within the window.

Does this flow reflect the one implemented?
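
(For step 1, just to spell out my understanding in code -- a minimal pandas sketch; the chrom/start/end columns and the numbers are placeholders, not from the PR:)

```python
import pandas as pd

# ROI midpoints expanded to +/- windowsize, as in step 1 above.
rois = pd.DataFrame(
    {"chrom": ["chr1", "chr2"], "start": [10_000, 50_000], "end": [20_000, 60_000]}
)
windowsize = 100_000

midpoint = (rois["start"] + rois["end"]) // 2
target_regions = rois.assign(start=midpoint - windowsize, end=midpoint + windowsize)
print(target_regions)
```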

I have some questions and suggestions:

  1. I got lost with the "Offset" term (thinking back to our previous implementation of triplet pileups). Maybe we could rename it to "Distance"?
  2. I couldn't pinpoint the binning of offsets — is it not present in the code and done implicitly relying on the fact that pixels are binned already?
  3. Maybe we could instead introduce an operation OffsetBinningTransformation that converts calculated offsets in nts into bins relative to the midpoints of ROIs? This operation would be placed between RegionOffsetTransformation and OffsetAggregation, thus removing the implicit casting of real offsets into binned offsets (which relies on the binning of pixels for now). Then RegionOffsetTransformation would just calculate distances from fragments to midpoints of ROIs (obtained after Overlap with windowed ROIs) and could be applied to both pixels and contacts, the logic of casting real offsets into binned offsets would live in OffsetBinningTransformation, and aggregation in OffsetAggregation.
  4. I haven't parsed how exactly RegionOffsetTransformation calculates offsets (= distances); but can it be implemented in Overlap as well, similar to bioframe and bedtools?
  5. I also think that it might be better to pass the window size explicitly somewhere. For example, it could be an OffsetWindowFilter or just an OffsetFilter right after the OffsetAggregation to filter out rows where offsets are outside of (-windowsize, +windowsize).

I am curious what you think about it 🙂

Review threads on spoc/query_engine.py (resolved).
```python
# get offset columns
offset_columns = [f"offset_{i}" for i in position_fields.keys()]
# construct join and coalesce output
data_frame = (
```
@dmitrymyl (Contributor):

That's a bit of a confusing query (because I am not familiar with project and COALESCE) — I will parse it later.

Review thread on spoc/query_engine.py (resolved).
```python
)


class OffsetMode(Enum):
```
@dmitrymyl (Contributor):

If the regions submitted to RegionOffsetTransformation have already been extended by +/- window size, why would a user need to calculate offsets other than MIDPOINT?

@dmitrymyl (Contributor):

Are there other potential use cases?

@dmitrymyl (Contributor):

Aaah, the OffsetMode is for fragments, not for regions! Okay, this works.

Review thread on spoc/query_engine.py (resolved).
@dmitrymyl (Contributor)

Regarding your replies to my replies for the first portion of the review, @Mittmich, everything sounds great, I agree with your points!

I have some other feature requests:

  1. Can we implement a .to_sql method in QueryPlan? This would be very useful for development and debugging later.
  2. If we pursue implementing pileups, it would be very nice to have an option for strand-specific pileups, i.e. before aggregation flip the sign of the offset coordinates for those pixels whose region has a "-" strand (a rough SQL sketch below). This can be implemented as a separate basic block, but then RegionSchema should be able to take "strand" (or another user-defined field) as an optional column. We can implement it later anyway, not in this PR, since we have enough stuff to deal with for now.
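
(To make point 2 concrete, the sign flip could look roughly like this in SQL -- a sketch only; the offset_1, region_strand and count column names are assumptions:)

```python
import duckdb
import pandas as pd

# Toy input: one offset column plus the strand of the overlapped region.
pixels = pd.DataFrame({
    "offset_1": [500, -500, 1000],
    "region_strand": ["+", "-", "-"],
    "count": [3, 1, 2],
})

# Flip the sign of the offset for pixels whose region is on the "-" strand.
flipped = duckdb.sql("""
    SELECT
        CASE WHEN region_strand = '-' THEN -offset_1 ELSE offset_1 END AS offset_1,
        "count"
    FROM pixels
""").df()
print(flipped)
```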

@Mittmich (Contributor, Author) commented Feb 18, 2024

Hey @dmitrymyl,
Thanks for the thorough review! Regarding the more general points:

Does this flow reflect the one implemented?

Indeed, that flow is the one implemented, with one exception: OffsetAggregation does not perform binning unless the offset columns have already been binned. I agree, however, that this is confusing behavior and would propose the following:

  • OffsetAggregation only operates on pixels and explicitly checks whether the passed data are pixels (this is done implicitly at the moment, as contacts don't have a value column).
  • This would mean that to aggregate contacts by offset, we would first need to make pixels out of them, which to me seems fair enough.

I think this will also answer a lot of the code comments that you made.

  1. I got lost with the "Offset" term (thinking back to our previous implementation of triplet pileups). Maybe we could rename it to "Distance"?

Distance sounds good, will rename it.

  2. I couldn't pinpoint the binning of offsets — is it not present in the code and done implicitly relying on the fact that pixels are binned already?

Indeed, see the comment above.

  3. Maybe we could instead introduce an operation OffsetBinningTransformation that converts calculated offsets in nts into bins relative to the midpoints of ROIs? This operation would be placed between RegionOffsetTransformation and OffsetAggregation, thus removing the implicit casting of real offsets into binned offsets (which relies on the binning of pixels for now). Then RegionOffsetTransformation would just calculate distances from fragments to midpoints of ROIs (obtained after Overlap with windowed ROIs) and could be applied to both pixels and contacts, the logic of casting real offsets into binned offsets would live in OffsetBinningTransformation, and aggregation in OffsetAggregation.

Eventually, that would be great, but I think that currently the flow would be to have clients of the code go through creating pixels.

I haven't parsed how exactly RegionOffsetTransformation calculates offsets (= distances); but can it be implemented in Overlap as well, similar to bioframe and bedtools?

It could be, but I like the separation and it's a nice example of a transformation, I think.

I also think that it might be better to pass the window size explicitly somewhere. For example, it could be an OffsetWindowFilter or just an OffsetFilter right after the OffsetAggregation to filter out rows where offsets are outside of (-windowsize, +windowsize).

Yes, implicit windowsize handling is a problem. How about one can optionally pass window sizes, and then regions are expanded to that size? It could then be stored in the schema so we don't rely on implicit calculations.

@Mittmich (Contributor, Author)

Hey! I implemented all the suggestions above, except for the rename to Distance; I thought about it some more and I think offset is more intuitive. We could also name the thing we historically called offset Anchor? But I don't feel super strongly about it, happy to be convinced!

@dmitrymyl (Contributor) left a comment

Thanks for the quick changes @Mittmich! A couple of my comments are in the code and below:

  1. Let's fix OffsetAggregation and the whole Pileup flow for Pixels only for now, maybe we will change it later.

  2. Offset vs Distance: if we treat the query engine as a bioframe/bedtools analogue, then we'd rather stick to the word distance, since it is the one used by these packages and offset is not found in their docs. I agree that distance is usually a non-negative metric, while here we have it otherwise, but offset might be even more confusing than that.

  3. Windows constructed on the go from supplied regions — it works, let's leave it like that for now. I was thinking about outsourcing the creation of windows into a complex analytical procedure, i.e. we would have basic building blocks in the query engine and then a wrapper method/class that transforms Regions into windows. This still doesn't help with the fact that OffsetTransformation calculates distances to the midpoints of windows, which kinda leaks assumptions from an analytical procedure into the query engine, but this is okay for now.

Otherwise, besides some naming conventions, I would be happy to approve this PR to try the query engine in the wild and plan the next directions :)

Review threads on spoc/models/dataframe_models.py and spoc/query_engine.py (resolved).
@Mittmich (Contributor, Author)

Thanks for the input, @dmitrymyl! I renamed offset to distance everywhere :) Let's merge and start using spoc!

@dmitrymyl (Contributor) left a comment

All the best! Thanks, @Mittmich :)

@dmitrymyl dmitrymyl merged commit 1944acb into master Feb 22, 2024