Support Eventstats in PPL #800

Merged
merged 5 commits into from
Oct 25, 2024

@LantaoJin LantaoJin commented Oct 22, 2024

Description

PPL eventstats command

Description

The eventstats command enriches your event data with calculated summary statistics. It operates by analyzing specified fields within your events, computing various statistical measures, and then appending these results as new fields to each original event.

Key aspects of eventstats:

  1. It performs calculations across the entire result set or within defined groups.
  2. The original events remain intact, with new fields added to contain the statistical results.
  3. The command is particularly useful for comparative analysis, identifying outliers, or providing additional context to individual events.

Difference between stats and eventstats

Both stats and eventstats calculate statistics, but they differ in how they operate and in what they produce:

  • Output Format:
    • stats: Produces a summary table with only the calculated statistics.
    • eventstats: Adds the calculated statistics as new fields to the existing events, preserving the original data.
  • Event Retention:
    • stats: Reduces the result set to only the statistical summary, discarding individual events.
    • eventstats: Retains all original events and adds new fields with the calculated statistics.
  • Use Cases:
    • stats: Best for creating summary reports or dashboards. Often used as a final command to summarize results.
    • eventstats: Useful when you need to enrich events with statistical context for further analysis or filtering. Can be used mid-search to add statistics that can be used in subsequent commands.
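
The contrast above can be modeled in a few lines of plain Python. This is only a sketch of the semantics described in this PR, not its implementation (which targets Spark). The helper names `stats_avg` and `eventstats_avg`, the sample rows, and the `avg(age)` output column name are assumptions made for illustration.

```python
from collections import defaultdict


def _group_avg(rows, field, by):
    """Average of `field` for each distinct value of `by`."""
    acc = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for row in rows:
        acc[row[by]][0] += row[field]
        acc[row[by]][1] += 1
    return {key: total / count for key, (total, count) in acc.items()}


def stats_avg(rows, field, by):
    """Models `stats avg(field) by <by>`: one summary row per group."""
    return [{by: key, f"avg({field})": avg}
            for key, avg in _group_avg(rows, field, by).items()]


def eventstats_avg(rows, field, by):
    """Models `eventstats avg(field) by <by>`: every row kept, one column added."""
    averages = _group_avg(rows, field, by)
    return [{**row, f"avg({field})": averages[row[by]]} for row in rows]


# Hypothetical sample events.
rows = [
    {"name": "a", "country": "US", "age": 30},
    {"name": "b", "country": "US", "age": 40},
    {"name": "c", "country": "CA", "age": 20},
]
summary = stats_avg(rows, "age", "country")        # 2 rows, original events gone
enriched = eventstats_avg(rows, "age", "country")  # 3 rows, each with avg(age)
```

In this model `summary` has one row per country, while `enriched` has the same number of rows as the input, each carrying its group's average.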

Syntax

eventstats <aggregation>... [by-clause]

(See "docs/ppl-lang/ppl-eventstats-command.md" for details.)

Event Aggregations

See additional command details

  • source = table | eventstats avg(a)
  • source = table | where a < 50 | eventstats avg(c)
  • source = table | eventstats max(c) by b
  • source = table | eventstats count(c) by b | head 5
  • source = table | eventstats stddev_samp(c)
  • source = table | eventstats stddev_pop(c)
  • source = table | eventstats percentile(c, 90)
  • source = table | eventstats percentile_approx(c, 99)

Limitation: distinct aggregations cannot be used in eventstats:

  • source = table | eventstats distinct_count(c) (throws an exception)

Aggregations With Span

  • source = table | eventstats count(a) by span(a, 10) as a_span
  • source = table | eventstats sum(age) by span(age, 5) as age_span | head 2
  • source = table | eventstats avg(age) by span(age, 20) as age_span, country | sort - age_span | head 2
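
As a rough model of the span examples above, the Python sketch below buckets a numeric field into fixed-width ranges and attaches each bucket's sum to every row. It assumes that `span(field, width)` maps a value to the lower bound of its bucket and that the span alias is available as a column afterwards (as the `sort - age_span` example suggests). The helper names and sample data are hypothetical.

```python
from collections import defaultdict


def span(value, width):
    """Lower bound of the fixed-width bucket containing `value` (assumed semantics)."""
    return value // width * width


def eventstats_sum_by_span(rows, field, width, alias):
    """Models `eventstats sum(field) by span(field, width) as alias`."""
    totals = defaultdict(int)
    for row in rows:
        totals[span(row[field], width)] += row[field]
    # Every input row is kept; the bucket and its sum are appended.
    return [
        {**row, alias: span(row[field], width),
         f"sum({field})": totals[span(row[field], width)]}
        for row in rows
    ]


rows = [{"age": 21}, {"age": 24}, {"age": 27}]  # hypothetical events
result = eventstats_sum_by_span(rows, "age", 5, "age_span")
```

Here ages 21 and 24 share the bucket starting at 20, and 27 lands in the bucket starting at 25.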

Aggregations With TimeWindow Span (tumble windowing function)

  • source = table | eventstats sum(productsAmount) by span(transactionDate, 1d) as age_date | sort age_date
  • source = table | eventstats sum(productsAmount) by span(transactionDate, 1w) as age_date, productId
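
A tumbling time window can be modeled the same way. The sketch below assumes 1-day windows aligned to midnight and represents each window by its start time; how the actual implementation aligns and exposes windows is not stated in this PR, so treat those details, the helper names, and the sample data as assumptions.

```python
from collections import defaultdict
from datetime import datetime


def day_window_start(ts):
    """Start of the 1-day tumbling window containing `ts` (assumed midnight-aligned)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def eventstats_sum_by_day(rows, field, time_field, alias):
    """Models `eventstats sum(field) by span(time_field, 1d) as alias`."""
    totals = defaultdict(int)
    for row in rows:
        totals[day_window_start(row[time_field])] += row[field]
    # Each event keeps its own fields and gains the window start and window sum.
    return [
        {**row, alias: day_window_start(row[time_field]),
         f"sum({field})": totals[day_window_start(row[time_field])]}
        for row in rows
    ]


# Hypothetical events: two on one day, one on the next.
rows = [
    {"transactionDate": datetime(2024, 10, 22, 9, 30), "productsAmount": 2},
    {"transactionDate": datetime(2024, 10, 22, 18, 0), "productsAmount": 3},
    {"transactionDate": datetime(2024, 10, 23, 8, 15), "productsAmount": 7},
]
result = eventstats_sum_by_day(rows, "productsAmount", "transactionDate", "age_date")
```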

Aggregations Group by Multiple Times

  • source = table | eventstats avg(age) as avg_state_age by country, state | eventstats avg(avg_state_age) as avg_country_age by country
  • source = table | eventstats avg(age) as avg_city_age by country, state, city | eval new_avg_city_age = avg_city_age - 1 | eventstats avg(new_avg_city_age) as avg_state_age by country, state | where avg_state_age > 18 | eventstats avg(avg_state_age) as avg_adult_country_age by country
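
Because eventstats keeps every row, a second eventstats averages over rows, not over distinct groups. The Python sketch below models the first chained query above under that reading; the helper `eventstats_avg` and the sample rows are hypothetical.

```python
from collections import defaultdict


def eventstats_avg(rows, field, by, alias):
    """Append avg(field), computed per group of the `by` fields, to every row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in by)].append(row[field])
    averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return [{**row, alias: averages[tuple(row[k] for k in by)]} for row in rows]


# Hypothetical events: two in one state, one in another.
rows = [
    {"country": "US", "state": "CA", "age": 20},
    {"country": "US", "state": "CA", "age": 40},
    {"country": "US", "state": "NY", "age": 60},
]
step1 = eventstats_avg(rows, "age", ["country", "state"], "avg_state_age")
step2 = eventstats_avg(step1, "avg_state_age", ["country"], "avg_country_age")
```

In this model `avg_country_age` is (30 + 30 + 60) / 3 = 40, since each of the three rows contributes its state average. Averaging one row per state, as a stats-based first stage would produce, gives (30 + 60) / 2 = 45 instead.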

Related Issues

Resolves #660

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@LantaoJin LantaoJin added Lang:PPL Pipe Processing Language support 0.6 labels Oct 22, 2024
Signed-off-by: Lantao Jin <[email protected]>
@LantaoJin LantaoJin marked this pull request as ready for review October 22, 2024 13:36
- `source = table | eventstats percentile(c, 90)`
- `source = table | eventstats percentile_approx(c, 99)`

**Limitation: distinct aggregation could not used in `eventstats`:**_
Member

Please also add that it can't be used in conjunction with stats. It's probably obvious, but it still needs to be noted.

Member Author

This limitation applies only to the new eventstats command, not to stats. If we add a limitation note about using it in conjunction with stats, a similar note for every other command would have to be considered too. Wouldn't that be gilding the lily?

Member

OK, makes sense.

Member

This is an excellent document. Could you please create a similar doc for the stats command, in a new PR?
Thanks!

Member Author

@LantaoJin LantaoJin Oct 23, 2024

The stats command already has one: https://github.com/opensearch-project/opensearch-spark/blob/main/docs/ppl-lang/ppl-stats-command.md. The stats command is also more straightforward, and we even have a website doc that introduces it: https://opensearch.org/docs/latest/search-plugins/sql/ppl/functions/#stats

@LantaoJin LantaoJin requested a review from YANG-DB October 24, 2024 02:46
@LantaoJin LantaoJin enabled auto-merge (squash) October 24, 2024 06:58
Key aspects of `eventstats`:

1. It performs calculations across the entire result set or within defined groups.
2. The original events remain intact, with new fields added to contain the statistical results.
Member

in some cases I would only want to see a subset of fields with the enriched aggregation - should we allow adding the fields command after ?

Member Author

@LantaoJin LantaoJin Oct 25, 2024

in some cases I would only want to see a subset of fields with the enriched aggregation - should we allow adding the fields command after ?

Yes. This command only enriches each row with new columns (one per aggregation used in eventstats), so the fields command, or any other command, can follow it after a | symbol.
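
To illustrate the point, here is a small Python model of enriching rows and then keeping only a subset of columns, which is what piping eventstats into fields would amount to. The helpers `eventstats_max` and `fields`, the alias `max_c`, and the sample rows are hypothetical and only sketch the semantics.

```python
def eventstats_max(rows, field, alias):
    """Models `eventstats max(field) as alias` over the entire result set."""
    overall = max(row[field] for row in rows)
    return [{**row, alias: overall} for row in rows]


def fields(rows, *names):
    """Models the `fields` command: keep only the named columns."""
    return [{name: row[name] for name in names} for row in rows]


rows = [
    {"name": "a", "b": 1, "c": 10},
    {"name": "b", "b": 1, "c": 30},
]
enriched = eventstats_max(rows, "c", "max_c")   # all columns plus max_c
projected = fields(enriched, "name", "max_c")   # only the chosen subset
```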

@noCharger noCharger disabled auto-merge October 25, 2024 17:25
@seankao-az seankao-az merged commit 7bc0927 into opensearch-project:main Oct 25, 2024
4 checks passed
kenrickyap pushed a commit to Bit-Quill/opensearch-spark that referenced this pull request Dec 11, 2024
* Support Eventstats in PPL

Signed-off-by: Lantao Jin <[email protected]>

* add doc

Signed-off-by: Lantao Jin <[email protected]>

---------

Signed-off-by: Lantao Jin <[email protected]>
Co-authored-by: YANGDB <[email protected]>
Labels
0.6 Lang:PPL Pipe Processing Language support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE]Extend ppl stats command functionality
3 participants