[FEATURE] New `fieldsummary` PPL command #662
Given the above specifications, the field summary is translated into the following pushed-down query:
```sql
-- for each column, create a statement:
SELECT
  'column-1' AS Field,
  COUNT(column-1) AS Count,
  COUNT(DISTINCT column-1) AS Distinct,
  MIN(column-1) AS Min,
  MAX(column-1) AS Max,
  AVG(CAST(column-1 AS DOUBLE)) AS Avg,
  typeof(column-1) AS Type,
  (SELECT COLLECT_LIST(STRUCT(column-1, count_column-1))
   FROM (
     SELECT column-1, COUNT(*) AS count_column-1
     FROM $testTable
     GROUP BY column-1
     ORDER BY count_column-1 DESC
     LIMIT 5
   )) AS top_values,
  COUNT(*) - COUNT(column-1) AS Nulls
FROM $testTable
GROUP BY typeof(column-1)

-- union all the per-column queries
UNION ALL

SELECT
  'column-2' AS Field,
  COUNT(column-2) AS Count,
  COUNT(DISTINCT column-2) AS Distinct,
  MIN(column-2) AS Min,
  MAX(column-2) AS Max,
  AVG(CAST(column-2 AS DOUBLE)) AS Avg,
  typeof(column-2) AS Type,
  (SELECT COLLECT_LIST(STRUCT(column-2, count_column-2))
   FROM (
     SELECT column-2, COUNT(*) AS count_column-2
     FROM $testTable
     GROUP BY column-2
     ORDER BY count_column-2 DESC
     LIMIT 5
   )) AS top_values,
  COUNT(*) - COUNT(column-2) AS Nulls
FROM $testTable
GROUP BY typeof(column-2)
```

For each column name we produce such a summary query, and the per-column queries are combined with UNION ALL. The resulting logical plan:

```
'Union false, false
:- 'Aggregate ['typeof('status_code)], [status_code AS Field#20, 'COUNT('status_code) AS Count#21, 'COUNT(distinct 'status_code) AS Distinct#22, 'MIN('status_code) AS Min#23, 'MAX('status_code) AS Max#24, 'AVG(cast('status_code as double)) AS Avg#25, 'typeof('status_code) AS Type#26, scalar-subquery#28 [] AS top_values#29, ('COUNT(1) - 'COUNT('status_code)) AS Nulls#30]
:  :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(status_code, 'status_code, count_status, 'count_status)), None)]
:  :     +- 'SubqueryAlias __auto_generated_subquery_name
:  :        +- 'GlobalLimit 5
:  :           +- 'LocalLimit 5
:  :              +- 'Sort ['count_status DESC NULLS LAST], true
:  :                 +- 'Aggregate ['status_code], ['status_code, 'COUNT(1) AS count_status#27]
:  :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
+- 'Aggregate ['typeof('id)], [id AS Field#31, 'COUNT('id) AS Count#32, 'COUNT(distinct 'id) AS Distinct#33, 'MIN('id) AS Min#34, 'MAX('id) AS Max#35, 'AVG(cast('id as double)) AS Avg#36, 'typeof('id) AS Type#37, scalar-subquery#39 [] AS top_values#40, ('COUNT(1) - 'COUNT('id)) AS Nulls#41]
   :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(id, 'id, count_id, 'count_id)), None)]
   :     +- 'SubqueryAlias __auto_generated_subquery_name
   :        +- 'GlobalLimit 5
   :           +- 'LocalLimit 5
   :              +- 'Sort ['count_id DESC NULLS LAST], true
   :                 +- 'Aggregate ['id], ['id, 'COUNT(1) AS count_id#38]
   :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
```
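The rewrite above is essentially a string template applied once per field and joined with UNION ALL. A minimal sketch of that generation step (this is an illustration, not the actual Flint/PPL planner code; table and column names are placeholders):

```python
# Sketch: build the pushed-down field-summary SQL for a list of columns.
# The template mirrors the per-column SELECT shown above.
SUMMARY_TEMPLATE = """SELECT
  '{col}' AS Field,
  COUNT({col}) AS Count,
  COUNT(DISTINCT {col}) AS Distinct,
  MIN({col}) AS Min,
  MAX({col}) AS Max,
  AVG(CAST({col} AS DOUBLE)) AS Avg,
  typeof({col}) AS Type,
  (SELECT COLLECT_LIST(STRUCT({col}, count_{col}))
   FROM (
     SELECT {col}, COUNT(*) AS count_{col}
     FROM {table}
     GROUP BY {col}
     ORDER BY count_{col} DESC
     LIMIT {top}
   )) AS top_values,
  COUNT(*) - COUNT({col}) AS Nulls
FROM {table}
GROUP BY typeof({col})"""

def field_summary_sql(table, columns, top=5):
    """Join one summary SELECT per column with UNION ALL."""
    parts = [SUMMARY_TEMPLATE.format(col=c, table=table, top=top)
             for c in columns]
    return "\nUNION ALL\n".join(parts)

sql = field_summary_sql("flint_ppl_test", ["status_code", "id"])
```

Each additional field adds one full scan-and-aggregate branch to the union, which is why the compute concern below is raised.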
@LantaoJin I would appreciate your feedback here, since I suspect this type of query may incur a significant compute cost - are there any recommended best practices?
@YANG-DB I think we don't need to do this rewriting. Spark itself provides a similar API call. Here is an example:

```scala
// assumes spark.implicits._ is in scope (e.g. in spark-shell)
val sourceData = Seq(
  (1, "1066", "Failed password", "Engineering"),
  (2, "1815", "Failed password", "IT"),
  (3, "1916", "Session closed", null),
  (4, null, "Failed password", null),
  (5, "1690", "Session closed", "Engineering"),
  (6, "1090", "Session closed", "Engineering")
).toDF("id", "uid", "action", "department")

sourceData.describe("id", "uid", "action", "department").show()
```

The output is a single table. That display is more reasonable than the rewritten query's output, since the rewritten approach needs a UNION ALL per field.
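Whichever execution path is chosen, the statistics themselves are simple per-column aggregates. A plain-Python sketch over the sample rows above illustrates what a field summary computes (column names follow the Seq example; this is an illustration, not engine code):

```python
from collections import Counter

# Sample rows mirroring the Scala Seq example above.
rows = [
    (1, "1066", "Failed password", "Engineering"),
    (2, "1815", "Failed password", "IT"),
    (3, "1916", "Session closed", None),
    (4, None, "Failed password", None),
    (5, "1690", "Session closed", "Engineering"),
    (6, "1090", "Session closed", "Engineering"),
]
columns = ["id", "uid", "action", "department"]

def summarize(rows, columns, top=5):
    """Compute count/distinct/min/max/nulls/top-values per column."""
    summary = {}
    for i, name in enumerate(columns):
        values = [r[i] for r in rows]
        non_null = [v for v in values if v is not None]
        summary[name] = {
            "count": len(non_null),
            "distinct": len(set(non_null)),
            "min": min(non_null),
            "max": max(non_null),
            "nulls": len(values) - len(non_null),
            "top_values": Counter(non_null).most_common(top),
        }
    return summary

s = summarize(rows, columns)
```

Note this single-pass shape is what makes `describe`-style execution cheaper than one scan per field.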
Describe the solution you'd like
We propose adding a new `fieldsummary` command to OpenSearch PPL that would provide summary statistics for all fields in the current result set. This command should:
Additionally, the command should support the following key optional parameters:

- `includefields` - specify which fields to include in the summary (e.g., `| fieldsummary includefields="status_code,user_id,response_time"`)
- `excludefields` - specify which fields to exclude from the summary (e.g., `| fieldsummary excludefields="internal_id,debug_info"`)
- `topvalues` - set the number of top values to display for each field (e.g., `| fieldsummary topvalues=5`)
- `maxfields` - limit the number of fields to display (e.g., `| fieldsummary maxfields=20`)
- `nulls` - include null/empty value counts (e.g., `| fieldsummary nulls=true`)
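The field-selection parameters compose in an obvious order: apply the include filter (or take all fields), remove exclusions, then cap the count. A sketch of how that resolution might look (parameter names follow the proposal; the helper itself is hypothetical, not part of any existing PPL implementation):

```python
def select_fields(all_fields, includefields=None, excludefields=None,
                  maxfields=None):
    """Resolve which fields fieldsummary would report on.

    includefields / excludefields are comma-separated strings as in the
    proposed syntax; maxfields caps the result. Hypothetical helper.
    """
    if includefields:
        wanted = [f.strip() for f in includefields.split(",")]
        fields = [f for f in wanted if f in all_fields]
    else:
        fields = list(all_fields)
    if excludefields:
        excluded = {f.strip() for f in excludefields.split(",")}
        fields = [f for f in fields if f not in excluded]
    if maxfields is not None:
        fields = fields[:maxfields]
    return fields

cols = ["status_code", "user_id", "response_time",
        "internal_id", "debug_info"]
```

Applying include before exclude lets `excludefields` act as a veto even over an explicit include list.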
)Example usage:
This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.
Example output:

(output table not recoverable from extraction; surviving cell values include top values `404`, `500`, `user456`, `user789` and ratios `0.75`, `1.0`)