[RFC] Support Percentile in PPL #2670

LantaoJin · 2024-05-14T09:08:15Z

Is your feature request related to a problem?
This RFC is a part of this issue: #44
Currently, PPL can not answer question with percentile function. For example

"source=opensearch_dashboards_sample_data_ecommerce| stats percentile<50>(taxless_total_price)"

returns Unsupported aggregation function

"{\n  \"error\": {\n    \"reason\": \"Invalid Query\",\n    \"details\": \"Unsupported aggregation function percentile\",\n    \"type\": \"SemanticCheckException\"\n  },\n  \"status\": 400\n}"

SQL query with percentiles function works as below because that it fallbacks to legacy engine which is not support PPL, so does current JOIN syntax.

SELECT percentiles(taxful_total_price) FROM opensearch_dashboards_sample_data_ecommerce

PERCENTILE is a common aggregate function which has a lot of cases needs in Visualization with PPL. Same feature request from community: opendistro-for-elasticsearch/sql#1093

What solution would you like?

Syntax in PPL

Option 1 (defined in current OpenSearchPPLParser.g4, but not implemented in code)

stats percentile<quantile>(aggField)
stats percentile_approx<quantile>(aggField)

Option 2 (more readable and widely used in OLAP engines)

stats percentile(aggField, quantile)
stats percentile_approx(aggField, quantile)

quantile: The bounds of quantile must be in [0, 100]

Syntax in SQL

Basic

PERCENTILE(functionArg, quantile)
PERCENTILE_APPROX(functionArg, quantile)

ANSI SQL (experimental)

PERCENTILE_CONT(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]
PERCENTILE_DISC(quantile) WITHIN GROUP [ORDER BY expression [ASC | DESC]]

Window function: adding OVER ([PARTITION BY expression])

For example, mainstream database supports percentile_cont and percentile_disc
Postgresql
https://www.postgresql.org/docs/9.4/functions-aggregate.html
RedShift
https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
Snowflake
https://docs.snowflake.com/en/sql-reference/functions/percentile_disc.html
Spark
https://issues.apache.org/jira/browse/SPARK-37691

Solution

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

What alternatives have you considered?
No, due to percentiles in legacy SQL engine couldn't work in PPL.

Do you have any additional context?
Add any other context or screenshots about the feature request here.

The text was updated successfully, but these errors were encountered:

LantaoJin · 2024-05-14T09:57:22Z

PoC: https://github.com/LantaoJin/search-plugins-sql/tree/poc/percentile

LantaoJin · 2024-05-15T08:37:17Z

cc @dai-chen, @penghuo

penghuo · 2024-05-15T15:46:23Z

Option 2 (more readable and widely used in OLAP engines)

I perfer option2

To align with the current percentiles(aggField) implementation in legacy engine and percentiles agg in OpenSearch core, for percentile we will use t-digest construction algorithm which is an approximate calculation.

Do u proposal add another aggregator? Currently, SQL aggregation framework executed query plan on coordination node (if it can not be push down to OpenSearch). In future, we want to leverage Spark aggregation framework, instead of re-invent on it.

For percentile_cont and percentile_disc, we could use org.apache.commons.math3:Percentile instead. For example, setting EstimationType R_1 for percentile_disc and R_7 for percentile_cont. This two R_x quantile algorithms are very popular, which used in Spark, PostgreSQL and Excel, etc. Reference: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html

Is it possible to implement it in OpenSearch Core? and SQL/PPL can leverage it.

LantaoJin added enhancement New feature or request untriaged labels May 14, 2024

LantaoJin mentioned this issue May 15, 2024

[POC] Support percentile agg in PPL #2673

Closed

penghuo removed the untriaged label May 15, 2024

LantaoJin mentioned this issue May 21, 2024

Support Percentile in PPL #2688

Merged

6 tasks

penghuo closed this as completed Sep 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Support Percentile in PPL #2670

[RFC] Support Percentile in PPL #2670

LantaoJin commented May 14, 2024 •

edited

Loading

LantaoJin commented May 14, 2024

LantaoJin commented May 15, 2024

penghuo commented May 15, 2024 •

edited

Loading

[RFC] Support Percentile in PPL #2670

[RFC] Support Percentile in PPL #2670

Comments

LantaoJin commented May 14, 2024 • edited Loading

Syntax in PPL

Syntax in SQL

Solution

LantaoJin commented May 14, 2024

LantaoJin commented May 15, 2024

penghuo commented May 15, 2024 • edited Loading

LantaoJin commented May 14, 2024 •

edited

Loading

penghuo commented May 15, 2024 •

edited

Loading