Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Support Percentile in PPL #2670

Closed
LantaoJin opened this issue May 14, 2024 · 3 comments
Closed

[RFC] Support Percentile in PPL #2670

LantaoJin opened this issue May 14, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@LantaoJin
Copy link
Member

LantaoJin commented May 14, 2024

Is your feature request related to a problem?
This RFC is a part of this issue: #44
Currently, PPL can not answer question with percentile function. For example

"source=opensearch_dashboards_sample_data_ecommerce| stats percentile<50>(taxless_total_price)"

returns Unsupported aggregation function

"{\n  \"error\": {\n    \"reason\": \"Invalid Query\",\n    \"details\": \"Unsupported aggregation function percentile\",\n    \"type\": \"SemanticCheckException\"\n  },\n  \"status\": 400\n}"

SQL query with percentiles function works as below because that it fallbacks to legacy engine which is not support PPL, so does current JOIN syntax.

SELECT percentiles(taxful_total_price) FROM opensearch_dashboards_sample_data_ecommerce

PERCENTILE is a common aggregate function which has a lot of cases needs in Visualization with PPL. Same feature request from community: opendistro-for-elasticsearch/sql#1093

What solution would you like?

Syntax in PPL

Option 1 (defined in current OpenSearchPPLParser.g4, but not implemented in code)

stats percentile<quantile>(aggField)
stats percentile_approx<quantile>(aggField)

Option 2 (more readable and widely used in OLAP engines)

stats percentile(aggField, quantile)
stats percentile_approx(aggField, quantile)
  • quantile: The bounds of quantile must be in [0, 100]

Syntax in SQL

Basic

PERCENTILE(functionArg, quantile)
PERCENTILE_APPROX(functionArg, quantile)

ANSI SQL (experimental)

PERCENTILE_CONT(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
PERCENTILE_DISC(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
  • Window function: adding OVER ([PARTITION BY expression])

For example, mainstream database supports percentile_cont and percentile_disc
Postgresql
https://www.postgresql.org/docs/9.4/functions-aggregate.html
RedShift
https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html
Spark
https://issues.apache.org/jira/browse/SPARK-37691

Solution

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

What alternatives have you considered?
No, due to percentiles in legacy SQL engine couldn't work in PPL.

Do you have any additional context?
Add any other context or screenshots about the feature request here.

@LantaoJin LantaoJin added enhancement New feature or request untriaged labels May 14, 2024
@LantaoJin
Copy link
Member Author

@LantaoJin
Copy link
Member Author

cc @dai-chen, @penghuo

@penghuo
Copy link
Collaborator

penghuo commented May 15, 2024

Option 2 (more readable and widely used in OLAP engines)

I perfer option2

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

Do u proposal add another aggregator? Currently, SQL aggregation framework executed query plan on coordination node (if it can not be push down to OpenSearch). In future, we want to leverage Spark aggregation framework, instead of re-invent on it.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

Is it possible to implement it in OpenSearch Core? and SQL/PPL can leverage it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants