
[FEATURE] New fieldsummary PPL command #662

Closed · YANG-DB opened this issue Sep 14, 2024 · 3 comments
Labels: 0.6, enhancement (New feature or request), Lang:PPL (Pipe Processing Language support)

YANG-DB (Member) commented Sep 14, 2024

Describe the solution you'd like
We propose adding a new fieldsummary command to OpenSearch PPL that would provide summary statistics for all fields in the current result set.

This command should:

  1. Calculate basic statistics for each field (count, distinct count, min, max, avg for numeric fields)
  2. Determine the data type of each field
  3. Show the most frequent values and their counts for each field
  4. Calculate the percentage of events that contain each field

Additionally, the command should support the following key optional parameters:

  1. includefields:
    Specify which fields to include in the summary (e.g., | fieldsummary includefields="status_code,user_id,response_time")
  2. excludefields:
    Specify which fields to exclude from the summary (e.g., | fieldsummary excludefields="internal_id,debug_info")
  3. topvalues:
    Set the number of top values to display for each field (e.g., | fieldsummary topvalues=5)
  4. maxfields:
    Limit the number of fields to display (e.g., | fieldsummary maxfields=20)
  5. nulls:
    Include null/empty value counts (e.g., | fieldsummary nulls=true)

Example usage:

source = t
| where timestamp >= "2023-01-01" and timestamp < "2023-02-01"
| fieldsummary includefields="status_code,user_id,response_time" topvalues=3 nulls=true

This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.

Example output:

| Field         | Count | Distinct | Min  | Max  | Avg  | Type   | Top Values                | Nulls |
|---------------|-------|----------|------|------|------|--------|---------------------------|-------|
| status_code   | 10000 | 4        | 200  | 503  | -    | short  | 200, 404, 500             | 0     |
| user_id       | 9500  | 1200     | -    | -    | -    | string | user123, user456, user789 | 500   |
| response_time | 10000 | 986      | 0.01 | 10.5 | 0.75 | float  | 0.5, 0.75, 1.0            | 0     |
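
Spec item 4 above (the percentage of events that contain each field) is not shown in this example output. As a minimal Spark Scala sketch of how that coverage could be computed, assuming df stands in for the current result set and coverage_pct is an illustrative column name:

import org.apache.spark.sql.functions.{col, count, lit}

// Illustrative only: share of events in which status_code is non-null,
// i.e. COUNT(status_code) * 100.0 / COUNT(*). df and coverage_pct are
// placeholders, not part of the proposed command.
val coverage = df.select(
  (count(col("status_code")) * 100.0 / count(lit(1))).as("coverage_pct")
)
coverage.show()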
YANG-DB (Member, Author) commented Oct 7, 2024

Given the above specifications, the pushed-down query that generates a fieldsummary should take the following form:

Query:

source = $testTable | fieldsummary includefields= status_code, id, response_time topvalues=5 nulls=true

Logical Plan:

-- Translate the field summary into the following query.
-- For each column, create a statement:
SELECT
    'column-1' AS Field,
    COUNT(column-1) AS Count,
    COUNT(DISTINCT column-1) AS Distinct,
    MIN(column-1) AS Min,
    MAX(column-1) AS Max,
    AVG(CAST(column-1 AS DOUBLE)) AS Avg,
    typeof(column-1) AS Type,
    (SELECT COLLECT_LIST(STRUCT(column-1, count_status))
     FROM (
        SELECT column-1, COUNT(*) AS count_status
        FROM $testTable
        GROUP BY column-1
        ORDER BY count_status DESC
        LIMIT 5
    )) AS top_values,
    COUNT(*) - COUNT(column-1) AS Nulls
FROM $testTable
GROUP BY typeof(column-1)

-- union all the per-column queries:
UNION ALL

SELECT
    'column-2' AS Field,
    COUNT(column-2) AS Count,
    COUNT(DISTINCT column-2) AS Distinct,
    MIN(column-2) AS Min,
    MAX(column-2) AS Max,
    AVG(CAST(column-2 AS DOUBLE)) AS Avg,
    typeof(column-2) AS Type,
    (SELECT COLLECT_LIST(STRUCT(column-2, count_column-2))
     FROM (
        SELECT column-2, COUNT(*) AS count_column-2
        FROM $testTable
        GROUP BY column-2
        ORDER BY count_column-2 DESC
        LIMIT 5
    )) AS top_values,
    COUNT(*) - COUNT(column-2) AS Nulls
FROM $testTable
GROUP BY typeof(column-2)

For each column name we produce one such summary query; the per-column queries are then combined with UNION ALL in the final response, as sketched below.
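
A minimal Scala sketch of that generation step; the helper names here are illustrative, not the actual implementation:

// Hypothetical sketch: build one summary SELECT per column and combine the
// per-column queries with UNION ALL. buildColumnSummarySql and
// buildFieldSummarySql are illustrative names, not the real code.
def buildColumnSummarySql(table: String, col: String, topValues: Int): String =
  s"""|SELECT
      |    '$col' AS Field,
      |    COUNT($col) AS Count,
      |    COUNT(DISTINCT $col) AS Distinct,
      |    MIN($col) AS Min,
      |    MAX($col) AS Max,
      |    AVG(CAST($col AS DOUBLE)) AS Avg,
      |    typeof($col) AS Type,
      |    (SELECT COLLECT_LIST(STRUCT($col, cnt))
      |     FROM (SELECT $col, COUNT(*) AS cnt FROM $table
      |           GROUP BY $col ORDER BY cnt DESC LIMIT $topValues)) AS top_values,
      |    COUNT(*) - COUNT($col) AS Nulls
      |FROM $table
      |GROUP BY typeof($col)""".stripMargin

def buildFieldSummarySql(table: String, cols: Seq[String], topValues: Int): String =
  cols.map(col => buildColumnSummarySql(table, col, topValues)).mkString("\nUNION ALL\n")

// e.g. buildFieldSummarySql("t", Seq("status_code", "id", "response_time"), 5)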

The resulting logical plan:

'Union false, false
:- 'Aggregate ['typeof('status_code)], [status_code AS Field#20, 'COUNT('status_code) AS Count#21, 'COUNT(distinct 'status_code) AS Distinct#22, 'MIN('status_code) AS Min#23, 'MAX('status_code) AS Max#24, 'AVG(cast('status_code as double)) AS Avg#25, 'typeof('status_code) AS Type#26, scalar-subquery#28 [] AS top_values#29, ('COUNT(1) - 'COUNT('status_code)) AS Nulls#30]
:  :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(status_code, 'status_code, count_status, 'count_status)), None)]
:  :     +- 'SubqueryAlias __auto_generated_subquery_name
:  :        +- 'GlobalLimit 5
:  :           +- 'LocalLimit 5
:  :              +- 'Sort ['count_status DESC NULLS LAST], true
:  :                 +- 'Aggregate ['status_code], ['status_code, 'COUNT(1) AS count_status#27]
:  :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
+- 'Aggregate ['typeof('id)], [id AS Field#31, 'COUNT('id) AS Count#32, 'COUNT(distinct 'id) AS Distinct#33, 'MIN('id) AS Min#34, 'MAX('id) AS Max#35, 'AVG(cast('id as double)) AS Avg#36, 'typeof('id) AS Type#37, scalar-subquery#39 [] AS top_values#40, ('COUNT(1) - 'COUNT('id)) AS Nulls#41]
   :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(id, 'id, count_id, 'count_id)), None)]
   :     +- 'SubqueryAlias __auto_generated_subquery_name
   :        +- 'GlobalLimit 5
   :           +- 'LocalLimit 5
   :              +- 'Sort ['count_id DESC NULLS LAST], true
   :                 +- 'Aggregate ['id], ['id, 'COUNT(1) AS count_id#38]
   :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
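
As a side note, an unresolved plan like the one above can be reproduced for inspection by parsing the generated SQL before analysis runs; a small sketch, assuming generatedSql holds the rewritten query text:

// Sketch: parse the generated SQL into an unresolved Catalyst plan and print
// its tree, without analyzing or executing it.
val plan = spark.sessionState.sqlParser.parsePlan(generatedSql)
println(plan.treeString)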

@YANG-DB YANG-DB added the 0.6 label Oct 9, 2024
@YANG-DB YANG-DB moved this from Todo to In Progress in PPL Commands Oct 10, 2024
YANG-DB (Member, Author) commented Oct 15, 2024

@LantaoJin I would appreciate your feedback here, since I suspect this type of query may incur significant compute cost. Are there any recommended best practices?

LantaoJin (Member) commented Oct 16, 2024

@YANG-DB I think we don't need to do this rewriting. Spark itself provides a similar API, Dataset.describe; you can refer to https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/api/Dataset.scala#L2486 to build a new implementation. Or, more specifically, try https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

Here is an example:

import spark.implicits._ // needed in scope for toDF

val sourceData = Seq(
  (1, "1066", "Failed password", "Engineering"),
  (2, "1815", "Failed password", "IT"),
  (3, "1916", "Session closed", null),
  (4, null, "Failed password", null),
  (5, "1690", "Session closed", "Engineering"),
  (6, "1090", "Session closed", "Engineering")
).toDF("id", "uid", "action", "department")

sourceData.describe("id", "uid", "action", "department").show()

The output is

+-------+------------------+------------------+---------------+-----------+
|summary|                id|               uid|         action| department|
+-------+------------------+------------------+---------------+-----------+
|  count|                 6|                 5|              6|          4|
|   mean|               3.5|            1515.4|           NULL|       NULL|
| stddev|1.8708286933869707|407.32394970097204|           NULL|       NULL|
|    min|                 1|              1066|Failed password|Engineering|
|    max|                 6|              1916| Session closed|         IT|
+-------+------------------+------------------+---------------+-----------+

The table layout above is more reasonable compared to the one below, since the layout below requires a union of per-field queries:

| Field      | count | mean   | stddev             | min             | max            |
|------------|-------|--------|--------------------|-----------------|----------------|
| id         | 6     | 3.5    | 1.8708286933869707 | 1               | 6              |
| uid        | 5     | 1515.4 | 407.32394970097204 | 1066            | 1916           |
| action     | 6     | NULL   | NULL               | Failed password | Session closed |
| department | 4     | NULL   | NULL               | Engineering     | IT             |
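
If the fieldsummary spec also needs distinct counts, Dataset.summary may be a closer fit than describe, since it lets the caller choose which statistics to compute. A minimal sketch, assuming a recent Spark (3.2+) where count_distinct and approx_count_distinct are supported summary statistics:

// Sketch only: summary() takes the statistics to compute; count_distinct is
// assumed available (Spark 3.2+), otherwise approx_count_distinct applies.
sourceData
  .summary("count", "count_distinct", "mean", "min", "max")
  .show()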

@YANG-DB YANG-DB moved this from In Progress to InReview in PPL Commands Oct 17, 2024
@YANG-DB YANG-DB moved this from InReview to Done in PPL Commands Oct 28, 2024
@YANG-DB YANG-DB closed this as completed by moving to Done in PPL Commands Oct 28, 2024