Duckdb query engine #10
Conversation
models and query engine
position field and contact order methods
Many thanks @Mittmich for the pull request! I am looking at it now and will split my review into two parts: concepts and code.
Hey @dmitrymyl,
Naming suggestions
I totally agree with this! This also ties into the naming of
I agree with this and will rename.
Sounds good! I already started to implement some of this in the
Yes, I also think that snipping is jargon and we should find something else. I don't like
Yeah, this is meant to subset rows. We could also capture this by implementing a
Questions
This is a method that returns the position coordinates from a GenomicDataSchema. This allows validating schemas during the construction of queries.
Hmm, at least for the realization relationships, the arrow points towards the class that realizes a protocol (https://www.ibm.com/docs/en/dma?topic=diagrams-realization-relationships), so I think that should be fine?
Ah yes, maybe we can also rename the
The
Hmm, that is a good question! I am not sure whether this is the case... And the steps do have access to the schema that all the query steps produce, as building the query plan involves pre-building the schema.
Merging of branches
I think it's great that you already looked at the code in the
Regarding your comments: I can't really see them, unfortunately, did you add them here?
Okay, with this comment I try to make my review comments visible...
Cool! I added a couple of comments here as well.
Hey @Mittmich, thanks for the replies and quick fixes!
Naming suggestions
Sounds very good!
Great
Perfect
Questions
Got it
I think that in the link you sent the arrow points to the class that specifies the protocol...
That actually sounds cool!
Okay, I got it.
Okay, let's assume that this is not the case. But maybe we should write that down somewhere so we don't forget this possibility.
Merging of branches
Yeah, that'll be convenient
Nice!
Alright, I merged the other branch and renamed the classes as discussed :) It would be great if you could also have a look at the other functionality (described in the query_engine_usage.ipynb notebook)! Thanks for the help!
@Mittmich I checked the new code except for the tests and the implementations of RegionOffsetTransformation and OffsetAggregation (will do later). I added my comments to the code and some general notes below:
Now the query engine emerges as a powerful tool! I think we now have a good example of how we can use it (with pileups). I am curious how easy it will be to add new steps to it -- maybe some contribution docs in the (not so near) future?
- I rarely see enums in the public API of scientific packages, so they feel inconvenient to me. I would design their functionality as a str kwarg with nested ifs, but then there is a problem of repetitive code and unclear specifications.
- There are no performance settings, like setting the number of cores or a memory limit. These params could be kwargs for the Query class.
- Offset coordinates are added separately; an interesting implementation, but it makes sense for downstream steps.
- Do we have support for normalized data? Like "normalized_count" or smth.
- RegionOffsetTransformation and OffsetAggregation are nouns, should we switch to verbs? But it will be lengthy anyway...
- Should we maybe make "stub" classes for Transform and Aggregate and inherit from them to establish the nomenclature of steps in the codebase?
I feel like I should update the nomenclature of query steps in my doc in Teams so that we can check which steps we have already implemented and which are left.
Otherwise looks great, thanks as usual!
```python
from typing import List, Optional, Dict
import pandas as pd

from itertools import permutations
```
What's the reason for splitting imports? :)
It's done by reorder-python-imports in the pre-commit hooks. It separates builtin from third-party from imports in this repo. It's very handy because it reorders imports automatically :)
Ah, I meant that before there were multiple objects imported per line (List, Optional, Dict), and now they are imported in individual lines — but if that's an automatic behaviour, then it doesn't matter :)
In this example, the contact overlapping both regions is duplicated.
This is due to region IDs, right? I am curious whether we would need a type of output where there are no regions and we need only unique contacts. This can easily be done with pandas (drop columns, drop duplicates), so I don't think we have to implement that in the query engine for now.
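For the record, a minimal pandas sketch of that route; the column names here are made up for illustration:

```python
import pandas as pd

# Toy query output: contact 1 overlaps two regions and is reported twice
result = pd.DataFrame(
    {"contact_id": [1, 1, 2], "region_id": ["a", "b", "a"]}
)
# Drop the region annotation, then drop duplicates to keep unique contacts
unique_contacts = result.drop(columns=["region_id"]).drop_duplicates()
print(unique_contacts)  # contacts 1 and 2, each reported once
```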
Indeed, it is because there are overlapping regions, and we therefore report contacts that fall within both regions twice. I think it is a good idea to determine whether we want that behavior. One could imagine adding it as a parameter to the Overlap class, perhaps as we described in the concept for multioverlap:
```python
MultiOverlap(
    regions=[loop_bases_left, loop_bases_right],  # Regions to overlap
    add_overlap_columns=False,  # Whether to add the regions as additional columns
    anchor=MultiAnchor(
        fragment_mode="ANY",
        region_mode="ANY",
        positions=[0, 1],
    ),  # At least one fragment needs to overlap at least one region
)
```
This way we could incorporate it in query plans and wouldn't need to resort to pandas to get rid of duplicates.
Yeah, it might work this way :)
docs/query_engine_usage.md
```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    # ... (snippet truncated in the diff)
```
I would use mode="ALL" for pileups. Besides, it would be a nice opportunity to showcase this mode.
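For illustration, the docs' pileup plan with the suggested mode; this is a sketch only, the import path is assumed from the PR's file layout, and target_regions is the regions frame from the docs example:

```python
# Sketch: require that ALL fragments overlap a target region
# (import path assumed; target_regions as in the docs example)
from spoc.query_engine import Anchor, Overlap

query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ALL")),
]
```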
docs/query_engine_usage.md
```python
query_plan = [
    Overlap(target_regions, anchor_mode=Anchor(mode="ANY")),
    RegionOffsetTransformation(),
    OffsetAggregation('count'),
    # ... (snippet truncated in the diff)
```
Also, it is unclear over which columns we aggregate. Well, based on the class name, I can deduce that we aggregate over ["offset_1", "offset_2", "offset_3"] for triplets, but this should maybe be clarified somewhere. Also, is it possible to aggregate over a subset of offset columns?
Good point! My reasoning was that OffsetAggregation would always aggregate over all offset columns, and we would have a separate aggregation for reducing along an offset dimension. But indeed, there is actually no difference other than the function we would use; I think a densified output would also work for that task. I would then implement subsets!
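To make the discussed default concrete, here is a toy duckdb aggregation over all offset columns; the table and column names are made up and are not spoc's actual schema:

```python
import duckdb
import pandas as pd

# Toy pixels annotated with two offset columns and a value column
pixels = pd.DataFrame(
    {
        "offset_1": [0, 0, 100],
        "offset_2": [100, 100, 200],
        "contact_count": [1, 2, 3],
    }
)
# Aggregate the value column grouped by every offset column;
# duckdb resolves `pixels` to the local DataFrame via replacement scans
result = duckdb.sql(
    "SELECT offset_1, offset_2, SUM(contact_count) AS contact_count "
    "FROM pixels GROUP BY offset_1, offset_2"
).df()
print(result)  # two rows: (0, 100) -> 3 and (100, 200) -> 3
```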
```python
class AggregationFunction(Enum):
    # ... (definition truncated in the diff)
```
Wouldn't it be too wordy to specify a non-default agg function in OffsetAggregation?
I would implement that as a separate type that can be passed as a function. For example, a basic implementation would be that we allow Union[AggregationFunction, str] for the function argument, and if it's a string, we call .format with the value column. So if you want something like MIN(column), you would pass MIN({}).
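A minimal sketch of that dispatch; the enum members and the helper name are my assumptions, not the PR's actual code:

```python
from enum import Enum
from typing import Union


class AggregationFunction(Enum):
    # Members are illustrative; the PR's actual enum may differ
    COUNT = "COUNT({})"
    SUM = "SUM({})"


def build_aggregation(function: Union[AggregationFunction, str], value_column: str) -> str:
    """Render the SQL aggregation expression for the value column."""
    template = function.value if isinstance(function, AggregationFunction) else function
    return template.format(value_column)


print(build_aggregation(AggregationFunction.SUM, "count"))  # SUM(count)
print(build_aggregation("MIN({})", "count"))                # MIN(count)
```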
Okay, that sounds reasonable for custom functions. I was a bit confused that if I want to specify a non-default function (like SUM), I will have to type AggregationFunction.SUM, which is too long. But then we can come up with an alias, like import torch.nn.functional as F.
For the case that you described, passing smth like MIN({}) seems to be a bit redundant -- we can attach ({}) under the hood.
Speaking of custom functions, we have two sorts of them: those defined in duckdb and UDFs. The former should be simple to use, while I am curious how UDFs are defined and used in duckdb (since one can write a duckdb UDF in Python, this seems to be a nice feature for the future).
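For future reference, registering a Python UDF in duckdb looks roughly like this (available since duckdb 0.8; a toy function, unrelated to spoc's API):

```python
import duckdb
from duckdb.typing import BIGINT

con = duckdb.connect()
# Register a plain Python callable as a SQL-callable scalar function
con.create_function("plus_one", lambda x: x + 1, [BIGINT], BIGINT)
print(con.sql("SELECT plus_one(41) AS answer").df())  # answer = 42
```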
Indeed, UDFs would be nice in the future, but I would leave that for now.
Hi @dmitrymyl, thanks for the review! I addressed your comments :) Regarding your points:
- I agree! I added the possibility to instantiate the classes with string arguments :)
- Good point, added a trello card: https://trello.com/c/gPAAy9Um/95-add-performance-setting-queries-to-query-class
- Yes, one just needs to pass it as
- Hmm, indeed, but I would wait with renaming until we have a few examples, because we may need to group things together again anyway.
- I thought about it, but I wouldn't make stub classes until we have a specific separation of behavior, as at the moment they all have the same interface, namely the
Hi @Mittmich, thanks for the new commits!
It looks good and is well designed, but I got lost in the naming and implementations of RegionOffsetTransformation and OffsetAggregation. They represent necessary blocks for the pileup procedure, so I tried to understand how the procedure should go.
- ROIs are transformed outside of the query engine:
  - midpoint = (start + end) // 2
  - start = midpoint - windowsize
  - end = midpoint + windowsize
- These transformed ROIs are submitted to Overlap and they are overlapped with Pixels.
- Offsets are distances from the midpoints of ROIs to the specified coordinate of each pixel bin (start, end, midpoint, both -- defined in the Offset enum); by default it's start. RegionOffsetTransformation calculates those offsets.
- OffsetAggregation binnifies offsets (implicitly), selects the window size (relying on the fact that the ROIs are windowed midpoints) and aggregates counts of pixels by binned offsets if they fall within the window.

Does this flow reflect the one implemented?
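For concreteness, a pandas sketch of step 1 as I understand it (assuming bioframe-style start/end columns):

```python
import pandas as pd


def regions_to_windows(regions: pd.DataFrame, windowsize: int) -> pd.DataFrame:
    """Center each ROI on its midpoint and expand it by +/- windowsize."""
    windows = regions.copy()
    midpoint = (windows["start"] + windows["end"]) // 2
    windows["start"] = midpoint - windowsize
    windows["end"] = midpoint + windowsize
    return windows


rois = pd.DataFrame({"chrom": ["chr1"], "start": [1000], "end": [2000]})
print(regions_to_windows(rois, windowsize=500))  # midpoint 1500 -> window 1000..2000
```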
I have some questions and suggestions:
- I got lost with the "Offset" term (thinking back to our previous implementation of triplet pileups). Maybe we could rename it to "Distance"?
- I couldn't pinpoint the binning of offsets: is it not present in the code and done implicitly, relying on the fact that pixels are binned already?
- Maybe we could instead introduce an operation OffsetBinningTransformation that converts calculated offsets in nts into bins relative to the midpoints of ROIs (see the sketch after this comment). This operation would be placed between RegionOffsetTransformation and OffsetAggregation, thus removing the implicit casting of real offsets into binned offsets (which relies on the binning of pixels for now). Then RegionOffsetTransformation just calculates distances from fragments to midpoints of ROIs (obtained after Overlap with windowed ROIs) and could be applied to both pixels and contacts, the logic of casting real offsets into binned offsets is implemented in OffsetBinningTransformation, and aggregation in OffsetAggregation.
- I haven't parsed how exactly RegionOffsetTransformation calculates offsets (= distances); but can it be implemented in Overlap as well, similar to bioframe and bedtools?
- I also think that it might be better to pass the window size explicitly somewhere. For example, it could be an OffsetWindowFilter or just an OffsetFilter right after the OffsetAggregation to filter out rows where offsets are outside of (-windowsize, +windowsize).
I am curious what you think about it 🙂
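Here is the sketch referenced above: a pandas rendering of the proposed OffsetBinningTransformation and OffsetFilter semantics (the names and the flooring logic are my suggestions, not existing code):

```python
import pandas as pd


def bin_offsets(offsets: pd.Series, bin_size: int) -> pd.Series:
    """Cast raw offsets (in nts) into bins of width bin_size."""
    return (offsets // bin_size) * bin_size


def filter_offsets(frame: pd.DataFrame, windowsize: int, offset_columns: list) -> pd.DataFrame:
    """Keep rows whose offsets all fall within (-windowsize, +windowsize)."""
    within = pd.concat(
        [frame[col].abs() < windowsize for col in offset_columns], axis=1
    ).all(axis=1)
    return frame[within]
```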
```python
# get offset columns
offset_columns = [f"offset_{i}" for i in position_fields.keys()]
# construct join and coalesce output
data_frame = (
    # ... (query construction truncated in the diff)
```
That's a bit of a confusing query (because I am not familiar with project and COALESCE); I will parse it later.
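For readers equally unfamiliar: project is the relational-API analogue of SELECT, and COALESCE returns its first non-NULL argument. A toy duckdb demonstration:

```python
import duckdb

# A two-column relation with a NULL in the first column
rel = duckdb.sql("SELECT * FROM (VALUES (NULL, 10), (5, 20)) AS t(a, b)")
# COALESCE picks a where present, otherwise falls back to b
print(rel.project("COALESCE(a, b) AS first_non_null").df())  # 10, then 5
```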
spoc/query_engine.py
```python
class OffsetMode(Enum):
    # ... (definition truncated in the diff)
```
If the regions submitted to RegionOffsetTransformation have already been extended by +/- window size, why would a user need to calculate offsets other than MIDPOINT?
Are there other potential use cases?
Aaah, the OffsetMode is for fragments, not for regions! Okay, this works.
Regarding your replies to my replies for the first portion of the review, @Mittmich, everything sounds great, I agree with your points! I have some other feature requests:
Hey @dmitrymyl,
Indeed, that flow is the one implemented, with one exception:
I think this will also answer a lot of the code comments that you made.
- Distance sounds good, will rename it.
- Indeed, see the comment above.
- Eventually, that would be great, but I think currently the flow would be to have clients of the code go through creating pixels.
- It could be, but I like the separation, and it's a nice example of a transformation, I think.
- Yes, implicit windowsize handling is a problem. How about one can optionally pass window sizes, and then regions are expanded to that size? Then it can be stored in the schema so we don't rely on implicit calculations.
Hey! I implemented all the suggestions above, except for the rename to Distance; I thought about it some more, and I think offset is more intuitive. We could name the thing we historically called offset also
Thanks for the quick changes @Mittmich! A couple of my comments are in the code and below:
- Let's fix OffsetAggregation and the whole Pileup flow for Pixels only for now; maybe we will change it later.
- Offset vs Distance: if we treat the query engine as a bioframe/bedtools analogue, then we'd rather stick to the word distance, since it is the one used by these packages, and offset is not found in their docs. I agree that distance is usually a non-negative metric, while here we have it otherwise, but offset might be even more confusing than that.
- Windows constructed on the fly from supplied regions: it works, let's leave it like that for now. I was thinking about outsourcing the creation of windows into a complex analytical procedure, i.e. we would have basic building blocks in the query engine and then a wrapper method/class that transforms Regions into windows. This still doesn't help with the fact that OffsetTransformation calculates distances to the midpoints of windows, which kinda leaks assumptions from an analytical procedure into the query engine, but this is okay for now.

Otherwise, besides some naming conventions, I would be happy to approve this PR to try the query engine in the wild and plan the next directions :)
Thanks for the input, @dmitrymyl! I renamed offset to distance everywhere :) Let's merge and start using spoc!
All the best! Thanks, @Mittmich :)
This PR implements this ticket. It suggests the class structure of the query engine (which can be found in docs/query_engine_interface.md) and provides an implementation using duckdb. A good place to start the review is the usage documentation, which can be found here: docs/query_engine_usage.md. Looking forward to your comments!