feat: DH-18351: Add CumCountWhere() and RollingCountWhere() features to UpdateBy #6566

Open
wants to merge 11 commits into
base: main

Contributor

@lbooker42 lbooker42 commented Jan 15, 2025

Groovy Examples

table = emptyTable(1000).update("key=randomInt(0,10)", "intCol=randomInt(0,1000)")

// zero-key
t_summary = table.updateBy([
    CumCountWhere("running_gt_500", "intCol > 500"),
    RollingCountWhere(50, "windowed_gt_500", "intCol > 500"),
    ])

// bucketed
t_summary = table.updateBy([
    CumCountWhere("running_gt_500", "intCol > 500"),
    RollingCountWhere(50, "windowed_gt_500", "intCol > 500"),
    ], "key")

Python Examples

from deephaven import empty_table
from deephaven.updateby import cum_count_where, rolling_count_where_tick

table = empty_table(1000).update(["key=randomInt(0,10)", "intCol=randomInt(0,1000)"])

# zero-key
t_summary = table.update_by([
    cum_count_where(col="running_gt_500", filters="intCol > 500"),
    rolling_count_where_tick(rev_ticks=50, col="windowed_gt_500", filters="intCol > 500"),
    ])

# bucketed
t_summary_bucketed = table.update_by([
    cum_count_where(col="running_gt_500", filters="intCol > 500"),
    rolling_count_where_tick(rev_ticks=50, col="windowed_gt_500", filters="intCol > 500"),
    ], by="key")
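For readers without a Deephaven session handy, the semantics of the two operators can be modelled in plain Python. This is only a sketch and not the engine implementation: the function names are hypothetical, and it assumes that a tick window of `rev_ticks` covers the current row plus the preceding `rev_ticks - 1` rows.

```python
from collections import defaultdict, deque


def cum_count_where_model(values, pred, keys=None):
    """Running count of rows seen so far (per key, if keys is given) that satisfy pred."""
    running = defaultdict(int)
    out = []
    for i, v in enumerate(values):
        k = keys[i] if keys is not None else None  # None acts as the single zero-key bucket
        if pred(v):
            running[k] += 1
        out.append(running[k])
    return out


def rolling_count_where_tick_model(values, pred, rev_ticks, keys=None):
    """Count of matching rows among the last rev_ticks rows of the same key.

    ASSUMPTION: the window includes the current row, so rev_ticks must be >= 1.
    """
    windows = defaultdict(lambda: deque(maxlen=rev_ticks))
    out = []
    for i, v in enumerate(values):
        k = keys[i] if keys is not None else None
        w = windows[k]
        w.append(1 if pred(v) else 0)  # deque drops the oldest flag once full
        out.append(sum(w))  # O(window) per row; fine for a model, not for production
    return out


values = [600, 100, 700, 800, 200]
keys = ["a", "b", "a", "b", "a"]
gt_500 = lambda v: v > 500

zero_key_cum = cum_count_where_model(values, gt_500)
bucketed_cum = cum_count_where_model(values, gt_500, keys)
zero_key_roll = rolling_count_where_tick_model(values, gt_500, rev_ticks=2)
bucketed_roll = rolling_count_where_tick_model(values, gt_500, rev_ticks=2, keys=keys)
```

In the bucketed case each key keeps its own running count and its own window, which is why the bucketed results differ from the zero-key ones on the same input.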

Performance Notes

TL;DR: Performance compares very well.

RollingCountWhere() has near-identical performance to the comparison benchmarks (and can be faster, depending on the complexity of the filter). CumCountWhere() also compares well to Ema() but can't catch up to zero-key CumSum(), which is remarkably fast.

Comparing CumCountWhere to CumSum and Ema:

120000000
avg of 2

ZeroKey
CumSum	137.36250
Ema	449.5528125
CumCountWhereConstant	475.9980005
CumCountWhereMatch	649.9689995
CumCountWhereRange	654.322250
CumCountWhereMultiple	695.4477915
CumCountWhereMultipleOr	704.900583

Bucketed - 250 buckets
CumSum	2979.1730005
Ema	3024.152458
CumCountWhereConstant	2569.7280835
CumCountWhereMatch	3031.6534795
CumCountWhereRange	3030.5433335
CumCountWhereMultiple	3052.597625
CumCountWhereMultipleOr	3059.911729

Bucketed - 640 buckets
CumSum	3827.299833
Ema	3880.2538125
CumCountWhereConstant	3416.4387715
CumCountWhereMatch	3906.691333
CumCountWhereRange	3902.3064375
CumCountWhereMultiple	3967.1584795
CumCountWhereMultipleOr	3925.0775205

Comparing RollingCountWhere to RollingCount and RollingSum:

120000000
avg of 2

ZeroKey
RollingCount	1511.7957295
RollingSum	1513.6013545
RollingCountWhereConstant	1403.2817915
RollingCountWhereMatch	1453.9323125
RollingCountWhereRange	1764.2137915
RollingCountWhereMultiple	1576.4896255
RollingCountWhereMultipleOr	1541.5631455

Bucketed - 250 buckets
RollingCount	3468.7696665
RollingSum	3326.047792
RollingCountWhereConstant	2858.677771
RollingCountWhereMatch	3327.958604
RollingCountWhereRange	3347.961083
RollingCountWhereMultiple	3429.413562
RollingCountWhereMultipleOr	3364.244104

Bucketed - 640 buckets
RollingCount	4310.4265835
RollingSum	4286.427479
RollingCountWhereConstant	3869.1892705
RollingCountWhereMatch	4333.8479375
RollingCountWhereRange	4269.3454375
RollingCountWhereMultiple	4290.0618545
RollingCountWhereMultipleOr	4346.8478535
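To put the zero-key numbers in relative terms, a small sketch like the following divides each timing by a chosen baseline. The figures are copied from the tables above (units are as posted; the PR does not state them), and `relative_to` is a hypothetical helper, not part of the benchmark suite.

```python
# Zero-key timings copied verbatim from the tables above (units not stated in the PR).
cum_zero_key = {
    "CumSum": 137.36250,
    "Ema": 449.5528125,
    "CumCountWhereConstant": 475.9980005,
    "CumCountWhereMatch": 649.9689995,
    "CumCountWhereRange": 654.322250,
    "CumCountWhereMultiple": 695.4477915,
    "CumCountWhereMultipleOr": 704.900583,
}

rolling_zero_key = {
    "RollingCount": 1511.7957295,
    "RollingSum": 1513.6013545,
    "RollingCountWhereConstant": 1403.2817915,
    "RollingCountWhereMatch": 1453.9323125,
    "RollingCountWhereRange": 1764.2137915,
    "RollingCountWhereMultiple": 1576.4896255,
    "RollingCountWhereMultipleOr": 1541.5631455,
}


def relative_to(timings, baseline):
    """Return each timing divided by the baseline entry (1.0 means equal cost)."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items()}


cum_vs_ema = relative_to(cum_zero_key, "Ema")
cum_vs_cumsum = relative_to(cum_zero_key, "CumSum")
rolling_vs_count = relative_to(rolling_zero_key, "RollingCount")

for name, ratio in cum_vs_cumsum.items():
    print(f"{name:28s} {ratio:5.2f}x CumSum")
```

Values below 1.0 in `rolling_vs_count` mark the RollingCountWhere variants that ran faster than plain RollingCount in this sample.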

@lbooker42 lbooker42 self-assigned this Jan 15, 2025
@lbooker42 lbooker42 added this to the 0.38.0 milestone Jan 15, 2025
Contributor Author

@lbooker42 lbooker42 left a comment

Self-review

@@ -341,6 +341,21 @@ public void testAggCountWhere() {
assertEquals(6L, counts.get(0));
counts = ColumnVectors.ofLong(doubleCounted, "filter15");
assertEquals(6L, counts.get(0));

// Get a static set table for use in dynamic where filters (contains 0-3)
final QueryTable setTable = (QueryTable) TableTools.newTable(col("sym", 1, 2, 3));
Contributor Author

Noticed that AggCountWhere didn't test DynamicWhereFilter in CI, corrected here.

py/server/deephaven/updateby.py (outdated, resolved)
@@ -230,6 +230,7 @@ static Count AggCount(String resultColumn) {
* values that pass the supplied {@code filters}.
*
* @param resultColumn The {@link Count#column() output column} name
* @param filters The filters to apply to the input columns
Contributor Author

Corrected missing param

@lbooker42 lbooker42 marked this pull request as ready for review January 15, 2025 21:05
Contributor

@cpwright cpwright left a comment

What does code coverage look like?

@@ -665,7 +665,8 @@ def agg_all_by(self, agg: Aggregation, by: Union[str, List[str]]) -> Table:
"""
return super(Table, self).agg_all_by(agg, by)

def update_by(self, ops: Union[UpdateByOperation, List[UpdateByOperation]], by: Union[str, List[str]]) -> Table:
def update_by(self, ops: Union[UpdateByOperation, List[UpdateByOperation]],
Contributor

Do we have coverage of the None case?

Contributor

We might also need to update the doc string for by with something like defaults to None, meaning...

Contributor Author

@cpwright yes, tested in test_updateby.py for both client and server side.
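For reference, the `None` case discussed in this thread amounts to treating a missing `by` as "no grouping columns" (zero-key). A minimal sketch of that normalization, using a hypothetical `normalize_by` helper rather than whatever the PR actually uses internally:

```python
from typing import List, Optional, Sequence, Union


def normalize_by(by: Optional[Union[str, Sequence[str]]] = None) -> List[str]:
    """Hypothetical helper: turn the by argument into a list of column names.

    None  -> []        (zero-key: the operation runs over the whole table)
    "key" -> ["key"]   (a single grouping column)
    list  -> a copy of the list
    """
    if by is None:
        return []
    if isinstance(by, str):
        return [by]
    return list(by)


no_keys = normalize_by()
one_key = normalize_by("key")
two_keys = normalize_by(["key", "sym"])
```

A test of the `None` path would then only need to check that omitting `by` and passing an empty list produce the same result.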

py/client/pydeephaven/updateby.py (outdated, resolved)
filters (Union[str, Filter, List[str], List[Filter]], optional): the filter condition
expression(s) or Filter object(s)
rev_ticks (int): the look-behind window size (in rows/ticks)
Contributor

The default matches formula, but is not listed as a default in the doc.

What is the intention for rev_ticks = 0; fwd_ticks = 0 to do?

Contributor Author

Should not have a default for rev_ticks, corrected.

For timed windows, having rev==fwd==0 means that the window will contain all rows with exactly matching (to the nanosecond) timestamps. For ticks, rev==fwd==0 is undefined: it means a zero-size window and will probably break in interesting ways depending on the operator.
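The difference between the two cases can be illustrated with a small model. This is a sketch with hypothetical function names, assuming a timed window spans `[t - rev, t + fwd]` inclusive and a tick window spans row positions `[i - rev_ticks + 1, i + fwd_ticks]` inclusive; it is not the engine's windowing code.

```python
def time_window_count(timestamps_ns, flags, rev_ns, fwd_ns):
    """For each row, count flagged rows whose timestamp lies in [t - rev_ns, t + fwd_ns]."""
    out = []
    for t in timestamps_ns:
        lo, hi = t - rev_ns, t + fwd_ns
        out.append(sum(1 for u, f in zip(timestamps_ns, flags) if f and lo <= u <= hi))
    return out


def tick_window_bounds(i, rev_ticks, fwd_ticks):
    """Inclusive (first, last) row positions of the tick window for row i.

    ASSUMPTION: rev_ticks counts the current row, so rev_ticks=1, fwd_ticks=0 is
    exactly row i. When first > last the window is empty.
    """
    return i - rev_ticks + 1, i + fwd_ticks


# Timed window with rev == fwd == 0: only rows sharing the exact timestamp are included.
timestamps = [100, 100, 200]
flags = [True, True, False]
same_instant_counts = time_window_count(timestamps, flags, rev_ns=0, fwd_ns=0)

# Tick window with rev == fwd == 0: first > last, i.e. a zero-size window.
first, last = tick_window_bounds(5, rev_ticks=0, fwd_ticks=0)
window_is_empty = first > last
```

Under these assumptions the timed case is well defined even at zero width, while the tick case has no rows at all to feed the operator.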

TestHelper.assertWhereInt(actualIt, expectedIt, val -> val > 10 && val <= 50);
}

// Test on String column (representing all Object)
Contributor

Having simple tests for Instant and Boolean is worthwhile in my experience; there is stuff that can go wrong with reinterpretation. This is maybe not necessary here though, because we might be covered by where testing. I go back and forth on this, because we do need to create fake tables in some circumstances or could have problems with the ChunkFilter not matching what the aggs read.
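As an illustration of the kind of mismatch meant here, consider a nullable Boolean column that is reinterpreted as bytes before filtering. The encoding below (1 = true, 0 = false, -1 = null) is an assumption made for this sketch only, and all names are hypothetical; the point is just that a filter reading the reinterpreted values must agree with one reading the boxed values.

```python
# ASSUMED byte encoding for this sketch only; not taken from the PR.
ASSUMED_TRUE, ASSUMED_FALSE, ASSUMED_NULL = 1, 0, -1


def encode(values):
    """Reinterpret nullable booleans (True / False / None) as bytes."""
    return [
        ASSUMED_NULL if v is None else (ASSUMED_TRUE if v else ASSUMED_FALSE)
        for v in values
    ]


def count_true_boxed(values):
    """Reference answer: count rows that are exactly True, ignoring nulls."""
    return sum(1 for v in values if v is True)


def count_true_naive(encoded):
    """Buggy on purpose: treats any non-zero byte as true, so nulls are counted."""
    return sum(1 for b in encoded if b != 0)


def count_true_careful(encoded):
    """Matches the boxed answer by comparing against the true sentinel only."""
    return sum(1 for b in encoded if b == ASSUMED_TRUE)


column = [True, None, False, True]
encoded = encode(column)
boxed = count_true_boxed(column)
naive = count_true_naive(encoded)
careful = count_true_careful(encoded)
```

A simple Boolean test with at least one null row is enough to catch the naive variant, since it disagrees with the boxed count as soon as a null appears.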

Contributor Author

lbooker42 commented Jan 17, 2025

Tested Boolean column sources to see whether performance is affected by re-interpretation; the effect mostly appears in CumSum:

120000000
avg of 5

ZeroKey
CumSum	193.7803336
Ema	466.611625
CumCountWhereConstant	610.498708
CumCountWhereMatch	852.9890834
CumCountWhereRange	880.0267332
CumCountWhereMultiple	947.401850
CumCountWhereMultipleOr	982.6095334
CumCountWhereBool	2721.6762666

Bucketed - 250 buckets
CumSum	3239.204075
Ema	3104.3760168
CumCountWhereConstant	2576.669950
CumCountWhereMatch	3147.9543166
CumCountWhereRange	3149.5482498
CumCountWhereMultiple	3189.350925
CumCountWhereMultipleOr	3207.7662418
CumCountWhereBool	3115.197800

Bucketed - 640 buckets
CumSum	4116.442700
Ema	4112.2882168
CumCountWhereConstant	3649.7651334
CumCountWhereMatch	4208.6860582
CumCountWhereRange	4190.8844164
CumCountWhereMultiple	4168.2542168
CumCountWhereMultipleOr	4156.806350
CumCountWhereBool	4521.3730166

@cpwright cpwright changed the title feat: Add CumCountWhere() and RollingCountWhere() features to UpdateBy feat: DH-18351: Add CumCountWhere() and RollingCountWhere() features to UpdateBy Jan 20, 2025