[BUG] Rare with group by cause job abort due to result size too large #611

seankao-az · 2024-08-29T21:27:37Z

What is the bug?
Running source=myglue_test.default.http_logs | rare request by clientip
gets error:

{
  "status": "FAILED",
  "error": "{\"Message\":\"Spark exception. Cause: Job aborted due to stage failure: Total size of serialized results of 118 tasks (1032.9 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)\"}"
}

However, an equivalent SQL query: select count(*) as cnt, request, clientip from myglue_test.default.http_logs group by request, clientip order by cnt asc limit 10 could get the result correctly:

SQL query result

{
	"status": "SUCCESS",
	"schema": [
		{
			"name": "cnt",
			"type": "long"
		},
		{
			"name": "request",
			"type": "string"
		},
		{
			"name": "clientip",
			"type": "string"
		}
	],
	"datarows": [
		[
			1,
			"GET /images/102321.gif HTTP/1.0",
			"120.125.16.0"
		],
		[
			1,
			"GET /images/102327.gif HTTP/1.0",
			"183.200.6.0"
		],
		[
			1,
			"GET /images/102321.gif HTTP/1.0",
			"89.57.14.0"
		],
		[
			1,
			"GET /images/102321.gif HTTP/1.0",
			"147.116.16.0"
		],
		[
			1,
			"GET /images/102321.gif HTTP/1.1",
			"224.58.14.0"
		],
		[
			1,
			"GET /images/102321.gif HTTP/1.0",
			"178.72.4.0"
		],
		[
			1,
			"GET /english/images/comp_bu_stage1n.gif HTTP/1.0",
			"195.128.16.0"
		],
		[
			1,
			"GET /english/images/comp_bu_stage1n.gif HTTP/1.0",
			"26.52.13.0"
		],
		[
			1,
			"GET /images/102321.gif HTTP/1.1",
			"77.76.14.0"
		],
		[
			1,
			"GET /english/images/team_bu_roster_on.gif HTTP/1.1",
			"142.58.14.0"
		]
	],
	"total": 10,
	"size": 10
}

The LIMIT 10 clause might cause a difference in SQL query, but I add it because rare command defaults to size 10 as well

Physical plan for PPL query:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(4) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#32]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda$3843/0x00007f070d20d1d0@4851585e, obj#31: java.lang.String
      +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, count(request)#8L, true, false, true), request#12.toString, clientip#11.toString, StructField(count(request),LongType,false), StructField(request,StringType,true), StructField(clientip,StringType,true)), obj#30: org.apache.spark.sql.Row
         +- *(3) Sort [request#12 DESC NULLS LAST], true, 0
            +- AQEShuffleRead coalesced
               +- ShuffleQueryStage 1
                  +- Exchange rangepartitioning(request#12 DESC NULLS LAST, 1000), ENSURE_REQUIREMENTS, [plan_id=98]
                     +- *(2) HashAggregate(keys=[request#12, clientip#11], functions=[count(request#12)], output=[count(request)#8L, request#12, clientip#11], schema specialized)
                        +- AQEShuffleRead coalesced
                           +- ShuffleQueryStage 0
                              +- Exchange hashpartitioning(request#12, clientip#11, 1000), ENSURE_REQUIREMENTS, [plan_id=52]
                                 +- *(1) HashAggregate(keys=[request#12, clientip#11], functions=[partial_count(request#12)], output=[request#12, clientip#11, count#34L], schema specialized)
                                    +- *(1) Project [clientip#11, request#12]
                                       +- FileScan json spark_catalog.default.http_logs[clientip#11,request#12,year#15,month#16,day#17] Batched: false, DataFilters: [], Format: JSON, Location: CatalogFileIndex(1 paths)[s3://..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<clientip:string,request:string>

Physical plan for SQL query:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(3) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#31]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda$3865/0x00007fa68d23a778@509772ad, obj#30: java.lang.String
      +- DeserializeToObject createexternalrow(staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cnt#8L, true, false, true), request#12.toString, clientip#11.toString, StructField(cnt,LongType,false), StructField(request,StringType,true), StructField(clientip,StringType,true)), obj#29: org.apache.spark.sql.Row
         +- TakeOrderedAndProject(limit=10, orderBy=[cnt#8L ASC NULLS FIRST], output=[cnt#8L,request#12,clientip#11])
            +- *(2) HashAggregate(keys=[request#12, clientip#11], functions=[count(1)], output=[cnt#8L, request#12, clientip#11], schema specialized)
               +- AQEShuffleRead coalesced
                  +- ShuffleQueryStage 0
                     +- Exchange hashpartitioning(request#12, clientip#11, 1000), ENSURE_REQUIREMENTS, [plan_id=48]
                        +- *(1) HashAggregate(keys=[request#12, clientip#11], functions=[partial_count(1)], output=[request#12, clientip#11, count#33L], schema specialized)
                           +- *(1) Project [clientip#11, request#12]
                              +- FileScan json spark_catalog.default.http_logs[clientip#11,request#12,year#15,month#16,day#17] Batched: false, DataFilters: [], Format: JSON, Location: CatalogFileIndex(1 paths)[s3:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<clientip:string,request:string>

What is your host/environment?

AOS: 2.15
EMR-S: emr-7.2.0
Flint: 0.5

The text was updated successfully, but these errors were encountered:

penghuo · 2024-08-29T22:22:14Z

LIMIT 10 is not the same as Rare. To achieve the desired result, we should use a window function to limit the output to 10 records per group.

SELECT 
    request, 
    clientip, 
    cnt
FROM (
    SELECT 
        request, 
        clientip, 
        COUNT(*) AS cnt, 
        ROW_NUMBER() OVER (PARTITION BY request, clientip ORDER BY cnt ASC) AS rn
    FROM 
        myglue_test.default.http_logs
    GROUP BY 
        request, clientip
) AS RankedLogs
WHERE 
    rn <= 10
ORDER BY 
    cnt ASC;

seankao-az · 2024-08-29T22:42:29Z

Fail to analyze query. Cause: [UNSUPPORTED_FEATURE.LATERAL_COLUMN_ALIAS_IN_WINDOW] The feature is not supported: Referencing a lateral column alias `cnt` in window expression "row_number() OVER (PARTITION BY request, clientip ORDER BY lateralAliasReference(cnt) ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)"

modifying the query to:

SELECT 
    request, 
    clientip, 
    cnt
FROM (
    SELECT 
        request, 
        clientip, 
        COUNT(*) AS cnt, 
        ROW_NUMBER() OVER (PARTITION BY request, clientip ORDER BY COUNT(*) ASC) AS rn
    FROM 
        myglue_test.default.http_logs
    GROUP BY 
        request, clientip
) AS RankedLogs
WHERE 
    rn <= 10
ORDER BY 
    cnt ASC;

results in:

{
	"status": "FAILED",
	"error": "{\"Message\":\"Spark exception. Cause: Job aborted due to stage failure: Total size of serialized results of 12 tasks (1905.8 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)\"}"
}

The physical plan being:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(5) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false, true) AS value#35]
   +- MapPartitions org.apache.spark.sql.Dataset$$Lambda$3910/0x00007f4501250f10@10b46c72, obj#34: java.lang.String
      +- DeserializeToObject createexternalrow(request#15.toString, clientip#14.toString, staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cnt#8L, true, false, true), StructField(request,StringType,true), StructField(clientip,StringType,true), StructField(cnt,LongType,false)), obj#33: org.apache.spark.sql.Row
         +- *(4) Sort [cnt#8L ASC NULLS FIRST], true, 0
            +- AQEShuffleRead coalesced
               +- ShuffleQueryStage 1
                  +- Exchange rangepartitioning(cnt#8L ASC NULLS FIRST, 1000), ENSURE_REQUIREMENTS, [plan_id=205]
                     +- *(3) Project [request#15, clientip#14, cnt#8L]
                        +- *(3) Filter (rn#9 <= 10)
                           +- Window [row_number() windowspecdefinition(request#15, clientip#14, _w0#21L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rn#9], [request#15, clientip#14], [_w0#21L ASC NULLS FIRST]
                              +- WindowGroupLimit [request#15, clientip#14], [_w0#21L ASC NULLS FIRST], row_number(), 10, Final
                                 +- *(2) Sort [request#15 ASC NULLS FIRST, clientip#14 ASC NULLS FIRST, _w0#21L ASC NULLS FIRST], false, 0
                                    +- *(2) HashAggregate(keys=[request#15, clientip#14], functions=[count(1)], output=[request#15, clientip#14, cnt#8L, _w0#21L], schema specialized)
                                       +- AQEShuffleRead coalesced
                                          +- ShuffleQueryStage 0
                                             +- Exchange hashpartitioning(request#15, clientip#14, 1000), ENSURE_REQUIREMENTS, [plan_id=97]
                                                +- *(1) HashAggregate(keys=[request#15, clientip#14], functions=[partial_count(1)], output=[request#15, clientip#14, count#37L], schema specialized)
                                                   +- *(1) Project [clientip#14, request#15]
                                                      +- FileScan json spark_catalog.default.http_logs[clientip#14,request#15,year#18,month#19,day#20] Batched: false, DataFilters: [], Format: JSON, Location: CatalogFileIndex(1 paths)[s3:..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<clientip:string,request:string>

Now the plan looks way different with the window function.

seankao-az · 2024-08-29T22:43:04Z

But regardless, same failure can happen to SQL as well. Not a PPL rare command issue. And it can be solved by scaling up driver node or changing spark.driver.maxResultSize as error message suggests

YANG-DB · 2024-10-21T22:40:20Z

The new sampletable command will help reduce the amount of scanned data in exchange to precision

seankao-az added bug Something isn't working untriaged labels Aug 29, 2024

seankao-az added this to PPL Commands Aug 29, 2024

seankao-az closed this as completed Aug 29, 2024

github-project-automation bot moved this to Done in PPL Commands Aug 29, 2024

seankao-az added the Lang:PPL Pipe Processing Language support label Aug 29, 2024

seankao-az moved this from Done to non-issue in PPL Commands Aug 29, 2024

YANG-DB removed the untriaged label Oct 21, 2024

YANG-DB self-assigned this Oct 21, 2024

YANG-DB reopened this Oct 21, 2024

github-actions bot added the untriaged label Oct 21, 2024

YANG-DB removed the untriaged label Oct 25, 2024

YANG-DB moved this from non-issue to In Progress in PPL Commands Oct 25, 2024

YANG-DB moved this from In Progress to Done in PPL Commands Nov 12, 2024

YANG-DB closed this as completed by moving to Done in PPL Commands Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Rare with group by cause job abort due to result size too large #611

[BUG] Rare with group by cause job abort due to result size too large #611

seankao-az commented Aug 29, 2024 •

edited

Loading

penghuo commented Aug 29, 2024

seankao-az commented Aug 29, 2024

seankao-az commented Aug 29, 2024 •

edited

Loading

YANG-DB commented Oct 21, 2024

[BUG] Rare with group by cause job abort due to result size too large #611

[BUG] Rare with group by cause job abort due to result size too large #611

Comments

seankao-az commented Aug 29, 2024 • edited Loading

penghuo commented Aug 29, 2024

seankao-az commented Aug 29, 2024

seankao-az commented Aug 29, 2024 • edited Loading

YANG-DB commented Oct 21, 2024

seankao-az commented Aug 29, 2024 •

edited

Loading

seankao-az commented Aug 29, 2024 •

edited

Loading