[SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full #42567

Closed
wants to merge 2 commits

Conversation

anishshri-db
Contributor

@anishshri-db anishshri-db commented Aug 18, 2023

What changes were proposed in this pull request?

Disable strict limit for RocksDB write manager to avoid insertion exception on cache full
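
For context, a minimal sketch (not the exact Spark code path) of how a shared cache and write buffer manager can be wired up through the RocksJava API with the strict capacity limit turned off; the memory budget and ratio values below are illustrative assumptions:

```scala
import org.rocksdb.{LRUCache, WriteBufferManager}

// Illustrative memory budget shared by block cache, write buffers and filter/index blocks.
val totalMemoryUsageInBytes: Long = 512L * 1024 * 1024
val writeBufferCacheRatio: Double = 0.5
val highPriorityPoolRatio: Double = 0.1

// strictCapacityLimit = false: a full cache no longer rejects insertions with
// "Insert failed due to LRU cache being full"; RocksDB may temporarily exceed
// the budget instead and rely on LRU eviction to come back under it.
val cache = new LRUCache(
  totalMemoryUsageInBytes,
  /* numShardBits = */ -1,
  /* strictCapacityLimit = */ false,
  highPriorityPoolRatio)

// Write buffer (memtable) memory is charged against the same cache; the manager
// would then be passed to the DB options via setWriteBufferManager.
val writeBufferManager = new WriteBufferManager(
  (totalMemoryUsageInBytes * writeBufferCacheRatio).toLong, cache)
```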

Why are the changes needed?

In some cases, when the memory limit is reached, we see the following exception on insert/get operations:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 9.0 failed 4 times, most recent failure: Lost task 42.3 in stage 9.0 (TID 2950) (96.104.176.55 executor 0): org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
	at org.rocksdb.RocksDB.get(Native Method)
	at org.rocksdb.RocksDB.get(RocksDB.java:2053)
	at org.apache.spark.sql.execution.streaming.state.RocksDB.get(RocksDB.scala:299)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$RocksDBStateStore.get(RocksDBStateStoreProvider.scala:55)

It seems this exception is thrown within RocksDB when the strict memory limit is enabled, here - https://github.com/facebook/rocksdb/blob/0fa0c97d3e9ac5dfc2e7ae94834b0850cdef5df7/cache/lru_cache.cc#L394

It seems this issue can only happen with the strict mode as described here - facebook/rocksdb#5048 (comment)

Seems like there is a pending issue for RocksDB around this as well - facebook/rocksdb#8670

There is probably a relevant fix, but not sure whether this addresses the issue completely - facebook/rocksdb#6619
(cc - @siying )

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

@anishshri-db anishshri-db changed the title [SPARK-44878] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full Aug 18, 2023
@anishshri-db
Contributor Author

cc - @HeartSaVioR, @siying - PTAL, thx

@siying
Contributor

siying commented Aug 21, 2023

When the block cache in strict mode is full and we try to insert more data, some exception will be thrown one way or another. I don't think SS has a way to handle the error, and the consequence will be task failure. If that is not a behavior we can afford, we should disable strict mode.

It is a trade-off: either RocksDB rejects queries and background tasks that require more memory than allowed, or we use more memory than configured. RocksDB might not have a better way to handle it. Strict mode is a less-used feature, mainly used by applications that would rather fail than use more memory than configured. For example, sometimes people are not comfortable letting an administrative daemon use more memory and crash a host that might be serving online queries.

I think it is a decision Spark SS needs to make. When users misconfigure a memory cap that turns out to be not enough, are we going to fail their queries, or are we going to allow RocksDB to use more memory than configured and potentially cause an OOM? I can see pros and cons either way.

@anishshri-db
Contributor Author

@siying - how does the cache eviction mechanism work for RocksDB, though? Since we are accounting everything towards the block cache, is it not possible to spill write buffer/block cache blocks to disk when we hit this cache-full case? Maybe we won't be able to evict pinned blocks, but what about other blocks?

Also, if we do decide to allow the tasks to fail, will some memory be reclaimed eventually? I.e., will future task attempts ever succeed, or do we need to restart the process?

@siying
Contributor

siying commented Aug 21, 2023

@anishshri-db the blocks in use cannot be spilled to disk. Those are objects that the higher-level code has fetched to use; that code accesses their contents through pointers into memory. After a block is used, it is returned to the block cache, and blocks returned to the block cache are evicted based on LRU. An insertion is rejected with an error only when the objects currently in use already take up all of the allocated capacity.

Whatever is returned to block cache can be evicted using LRU.
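
To make this failure mode concrete, here is a hedged RocksJava sketch (not code from this PR) of a block cache configured with the strict capacity limit; the capacity value is arbitrary:

```scala
import org.rocksdb.{BlockBasedTableConfig, LRUCache, Options}

// Deliberately small cache with strictCapacityLimit = true. Once the blocks
// currently pinned by readers account for the entire capacity, further
// insertions (and therefore gets that must load new blocks) fail with
// "org.rocksdb.RocksDBException: Insert failed due to LRU cache being full."
val strictCache = new LRUCache(
  /* capacity = */ 4L * 1024 * 1024,
  /* numShardBits = */ -1,
  /* strictCapacityLimit = */ true)

// Only blocks that have been released back to the cache are evictable via LRU;
// blocks that are still in use cannot be spilled or evicted.
val options = new Options()
  .setCreateIfMissing(true)
  .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(strictCache))
```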

@anishshri-db
Contributor Author

anishshri-db commented Aug 21, 2023

@siying - Thanks for the explanation. If that's the case, then I think we might have to set this setting to false, at least for now. Even if we fail this task, there is no guarantee that other blocks will be returned in time for eviction before further task attempts are made. Effectively this would mean that the query fails (after 4 consecutive attempts) and we would have to restart and re-instantiate the state (which might be quite expensive). We might be better off with a soft limit to avoid this kind of scenario in the large majority of cases.

In the long term, we probably need to address this by auto-scaling the cluster and moving partitions to other nodes so that a node's individual limit on allocated/assigned memory is not reached.

cc - @HeartSaVioR - thoughts?

@HeartSaVioR
Contributor

I'd still like to understand how this is different from not capping the memory at all. Does capping the memory prevent RocksDB from using memory excessively? Or is there no difference between capping with a soft limit vs. no capping at all?

Also, there is another aspect to think of - an OOM kills the executor, which could affect all stateful, stateless, and batch queries, whereas this error will only affect stateful queries. If people intend to set a limit on RocksDB memory usage with this fact in mind, soft limiting would break that intention, although they may still need to restart the cluster, or at least the executor, to apply a new RocksDB memory limit. It looks very tricky for users to adjust when the error happens...

Ideally we would rebalance the state when the memory limit is hit, but that probably won't happen in the short term.

@anishshri-db
Contributor Author

anishshri-db commented Aug 21, 2023

Does capping the memory prevent RocksDB from using memory excessively? Or is there no difference between capping with a soft limit vs. no capping at all?

I believe there is a difference. Without the write buffer manager, memory is not accounted against a single block cache but across however many DB instances are present on the node, along with the memory required for write buffers and other pinned blocks. With the soft limit, RocksDB can temporarily allocate more memory to serve gets if the existing blocks have not yet been returned to the cache; once they are returned, blocks should be evicted from the LRU cache.
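
To illustrate the difference being described (one shared executor-wide budget vs. per-instance, unbounded growth), here is a hedged sketch of enabling bounded memory usage for the RocksDB state store; the conf names and values are assumptions based on the bounded-memory feature and may vary by Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("rocksdb-bounded-memory-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  // Account all RocksDB instances on an executor against a single shared budget
  // (block cache + write buffers) instead of letting each instance grow independently.
  .config("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
  .config("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "500")
  .config("spark.sql.streaming.stateStore.rocksdb.writeBufferCacheRatio", "0.5")
  .config("spark.sql.streaming.stateStore.rocksdb.highPriorityPoolRatio", "0.1")
  .getOrCreate()
```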

It looks very tricky for users to adjust when the error happens

Not sure what you mean here. They might still need to restart the cluster to allow updated memory limit values to take effect, correct?

Ideally we would rebalance the state when the memory limit is hit, but that probably won't happen in the short term.

Yup

@HeartSaVioR
Contributor

Thanks for clarifying. So the memory cap could be temporarily exceeded, but RocksDB will try hard to respect it again as soon as it can. That is enough of a difference to justify capping the memory.

I'm OK with the soft limit. I'd say a hard limit is very useful to restrict the blast radius, and the streaming query should just fail if it cannot live within the hard limit despite proper rebalancing of state. But given that we don't have proper rebalancing of state, hard limiting may manifest as random failures depending on the scheduling of stateful partitions.

We should probably document the behavior though, so that some users would be able to plan for some margin. Could you please update the doc?

They might still need to restart the cluster to allow updated memory limit values to take effect, correct?

Yes, that's what I meant.

@HeartSaVioR
Contributor

(Maybe the conservative option is to have another conf determining whether the limit should be soft or strict, but I know we have too many confs already...)

@github-actions github-actions bot added the DOCS label Aug 22, 2023
Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to master.
