
[FEATURE] New fieldsummary PPL command #662

Closed · YANG-DB opened this issue Sep 14, 2024 · 3 comments
Labels: 0.6, enhancement (New feature or request), Lang:PPL (Pipe Processing Language support)

YANG-DB (Member) commented Sep 14, 2024

Describe the solution you'd like
We propose adding a new fieldsummary command to OpenSearch PPL that would provide summary statistics for all fields in the current result set.

This command should:

  1. Calculate basic statistics for each field (count, distinct count, min, max, avg for numeric fields)
  2. Determine the data type of each field
  3. Show the most frequent values and their counts for each field
  4. Calculate the percentage of events that contain each field

Additionally, the command should support the following key optional parameters:

  1. includefields:
    Specify which fields to include in the summary (e.g., | fieldsummary includefields="status_code,user_id,response_time")
  2. excludefields:
    Specify which fields to exclude from the summary (e.g., | fieldsummary excludefields="internal_id,debug_info")
  3. topvalues:
    Set the number of top values to display for each field (e.g., | fieldsummary topvalues=5)
  4. maxfields:
    Limit the number of fields to display (e.g., | fieldsummary maxfields=20)
  5. nulls:
    Include null/empty value counts (e.g., | fieldsummary nulls=true)

Example usage:

source = t
| where timestamp >= "2023-01-01" and timestamp < "2023-02-01"
| fieldsummary includefields="status_code,user_id,response_time" topvalues=3 nulls=true

This command would generate a table with summary statistics for the specified fields in the given date range, showing the top 3 values for each field and including null counts.

Example output:

| Field         | Count | Distinct | Min  | Max  | Avg  | Type   | Top Values                | Nulls |
|---------------|-------|----------|------|------|------|--------|---------------------------|-------|
| status_code   | 10000 | 4        | 200  | 503  | -    | short  | 200, 404, 500             | 0     |
| user_id       | 9500  | 1200     | -    | -    | -    | string | user123, user456, user789 | 500   |
| response_time | 10000 | 986      | 0.01 | 10.5 | 0.75 | float  | 0.5, 0.75, 1.0            | 0     |
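
Spec item 4 above (the percentage of events that contain each field) is not shown in this example output. As a minimal Spark Scala sketch of how that coverage could be computed, assuming df stands in for the current result set and coverage_pct is an illustrative column name:

import org.apache.spark.sql.functions.{col, count, lit}

// Illustrative only: share of events in which status_code is non-null,
// i.e. COUNT(status_code) * 100.0 / COUNT(*). df and coverage_pct are
// placeholders, not part of the proposed command.
val coverage = df.select(
  (count(col("status_code")) * 100.0 / count(lit(1))).as("coverage_pct")
)
coverage.show()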
YANG-DB (Member, Author) commented Oct 7, 2024

Given the above specifications, the pushed-down query that generates a fieldsummary should take the following form:

Query:

source = $testTable | fieldsummary includefields= status_code, id, response_time topvalues=5 nulls=true

Logical Plan:

-- Translate the field summary into the following query.
-- For each column, create a statement:
SELECT
    'column-1' AS Field,
    COUNT(column-1) AS Count,
    COUNT(DISTINCT column-1) AS Distinct,
    MIN(column-1) AS Min,
    MAX(column-1) AS Max,
    AVG(CAST(column-1 AS DOUBLE)) AS Avg,
    typeof(column-1) AS Type,
    (SELECT COLLECT_LIST(STRUCT(column-1, count_status))
     FROM (
        SELECT column-1, COUNT(*) AS count_status
        FROM $testTable
        GROUP BY column-1
        ORDER BY count_status DESC
        LIMIT 5
    )) AS top_values,
    COUNT(*) - COUNT(column-1) AS Nulls
FROM $testTable
GROUP BY typeof(column-1)

-- union all the per-column queries:
UNION ALL

SELECT
    'column-2' AS Field,
    COUNT(column-2) AS Count,
    COUNT(DISTINCT column-2) AS Distinct,
    MIN(column-2) AS Min,
    MAX(column-2) AS Max,
    AVG(CAST(column-2 AS DOUBLE)) AS Avg,
    typeof(column-2) AS Type,
    (SELECT COLLECT_LIST(STRUCT(column-2, count_column-2))
     FROM (
        SELECT column-2, COUNT(*) AS count_column-2
        FROM $testTable
        GROUP BY column-2
        ORDER BY count_column-2 DESC
        LIMIT 5
    )) AS top_values,
    COUNT(*) - COUNT(column-2) AS Nulls
FROM $testTable
GROUP BY typeof(column-2)

For each column name we produce one such summary query; the per-column queries are then combined with UNION ALL in the final response, as sketched below.
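
A minimal Scala sketch of that generation step; the helper names here are illustrative, not the actual implementation:

// Hypothetical sketch: build one summary SELECT per column and combine the
// per-column queries with UNION ALL. buildColumnSummarySql and
// buildFieldSummarySql are illustrative names, not the real code.
def buildColumnSummarySql(table: String, col: String, topValues: Int): String =
  s"""|SELECT
      |    '$col' AS Field,
      |    COUNT($col) AS Count,
      |    COUNT(DISTINCT $col) AS Distinct,
      |    MIN($col) AS Min,
      |    MAX($col) AS Max,
      |    AVG(CAST($col AS DOUBLE)) AS Avg,
      |    typeof($col) AS Type,
      |    (SELECT COLLECT_LIST(STRUCT($col, cnt))
      |     FROM (SELECT $col, COUNT(*) AS cnt FROM $table
      |           GROUP BY $col ORDER BY cnt DESC LIMIT $topValues)) AS top_values,
      |    COUNT(*) - COUNT($col) AS Nulls
      |FROM $table
      |GROUP BY typeof($col)""".stripMargin

def buildFieldSummarySql(table: String, cols: Seq[String], topValues: Int): String =
  cols.map(col => buildColumnSummarySql(table, col, topValues)).mkString("\nUNION ALL\n")

// e.g. buildFieldSummarySql("t", Seq("status_code", "id", "response_time"), 5)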

The resulting logical plan:

'Union false, false
:- 'Aggregate ['typeof('status_code)], [status_code AS Field#20, 'COUNT('status_code) AS Count#21, 'COUNT(distinct 'status_code) AS Distinct#22, 'MIN('status_code) AS Min#23, 'MAX('status_code) AS Max#24, 'AVG(cast('status_code as double)) AS Avg#25, 'typeof('status_code) AS Type#26, scalar-subquery#28 [] AS top_values#29, ('COUNT(1) - 'COUNT('status_code)) AS Nulls#30]
:  :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(status_code, 'status_code, count_status, 'count_status)), None)]
:  :     +- 'SubqueryAlias __auto_generated_subquery_name
:  :        +- 'GlobalLimit 5
:  :           +- 'LocalLimit 5
:  :              +- 'Sort ['count_status DESC NULLS LAST], true
:  :                 +- 'Aggregate ['status_code], ['status_code, 'COUNT(1) AS count_status#27]
:  :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
:  +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
+- 'Aggregate ['typeof('id)], [id AS Field#31, 'COUNT('id) AS Count#32, 'COUNT(distinct 'id) AS Distinct#33, 'MIN('id) AS Min#34, 'MAX('id) AS Max#35, 'AVG(cast('id as double)) AS Avg#36, 'typeof('id) AS Type#37, scalar-subquery#39 [] AS top_values#40, ('COUNT(1) - 'COUNT('id)) AS Nulls#41]
   :  +- 'Project [unresolvedalias('COLLECT_LIST(struct(id, 'id, count_id, 'count_id)), None)]
   :     +- 'SubqueryAlias __auto_generated_subquery_name
   :        +- 'GlobalLimit 5
   :           +- 'LocalLimit 5
   :              +- 'Sort ['count_id DESC NULLS LAST], true
   :                 +- 'Aggregate ['id], ['id, 'COUNT(1) AS count_id#38]
   :                    +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
   +- 'UnresolvedRelation [spark_catalog, default, flint_ppl_test], [], false
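
As a side note, an unresolved plan like the one above can be reproduced for inspection by parsing the generated SQL before analysis runs; a small sketch, assuming generatedSql holds the rewritten query text:

// Sketch: parse the generated SQL into an unresolved Catalyst plan and print
// its tree, without analyzing or executing it.
val plan = spark.sessionState.sqlParser.parsePlan(generatedSql)
println(plan.treeString)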

@YANG-DB YANG-DB added the 0.6 label Oct 9, 2024
@YANG-DB YANG-DB moved this from Todo to In Progress in PPL Commands Oct 10, 2024
YANG-DB (Member, Author) commented Oct 15, 2024

@LantaoJin I would appreciate your feedback here, since I suspect this type of query may incur significant compute cost. Are there any recommended best practices?

LantaoJin (Member) commented Oct 16, 2024

@YANG-DB I think we don't need to do this rewriting. Spark itself provides a similar API, Dataset.describe; you can refer to https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/api/Dataset.scala#L2486 to build a new implementation. Or, more specifically, try https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

Here is an example:

import spark.implicits._ // needed in scope for toDF

val sourceData = Seq(
  (1, "1066", "Failed password", "Engineering"),
  (2, "1815", "Failed password", "IT"),
  (3, "1916", "Session closed", null),
  (4, null, "Failed password", null),
  (5, "1690", "Session closed", "Engineering"),
  (6, "1090", "Session closed", "Engineering")
).toDF("id", "uid", "action", "department")

sourceData.describe("id", "uid", "action", "department").show()

The output is

+-------+------------------+------------------+---------------+-----------+
|summary|                id|               uid|         action| department|
+-------+------------------+------------------+---------------+-----------+
|  count|                 6|                 5|              6|          4|
|   mean|               3.5|            1515.4|           NULL|       NULL|
| stddev|1.8708286933869707|407.32394970097204|           NULL|       NULL|
|    min|                 1|              1066|Failed password|Engineering|
|    max|                 6|              1916| Session closed|         IT|
+-------+------------------+------------------+---------------+-----------+

The table layout above is more reasonable compared to the one below, since the layout below requires a union of per-field queries:

| Field      | count | mean   | stddev             | min             | max            |
|------------|-------|--------|--------------------|-----------------|----------------|
| id         | 6     | 3.5    | 1.8708286933869707 | 1               | 6              |
| uid        | 5     | 1515.4 | 407.32394970097204 | 1066            | 1916           |
| action     | 6     | NULL   | NULL               | Failed password | Session closed |
| department | 4     | NULL   | NULL               | Engineering     | IT             |
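
If the fieldsummary spec also needs distinct counts, Dataset.summary may be a closer fit than describe, since it lets the caller choose which statistics to compute. A minimal sketch, assuming a recent Spark (3.2+) where count_distinct and approx_count_distinct are supported summary statistics:

// Sketch only: summary() takes the statistics to compute; count_distinct is
// assumed available (Spark 3.2+), otherwise approx_count_distinct applies.
sourceData
  .summary("count", "count_distinct", "mean", "min", "max")
  .show()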

@YANG-DB YANG-DB moved this from In Progress to InReview in PPL Commands Oct 17, 2024
@YANG-DB YANG-DB moved this from InReview to Done in PPL Commands Oct 28, 2024
@YANG-DB YANG-DB closed this as completed by moving to Done in PPL Commands Oct 28, 2024