[SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full #42567

Closed
wants to merge 2 commits

Conversation

anishshri-db
Contributor

@anishshri-db anishshri-db commented Aug 18, 2023

What changes were proposed in this pull request?

Disable strict limit for RocksDB write manager to avoid insertion exception on cache full
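
For context, a minimal sketch (not the exact Spark code path) of how a shared cache and write buffer manager can be wired up through the RocksJava API with the strict capacity limit turned off; the memory budget and ratio values below are illustrative assumptions:

```scala
import org.rocksdb.{LRUCache, WriteBufferManager}

// Illustrative memory budget shared by block cache, write buffers and filter/index blocks.
val totalMemoryUsageInBytes: Long = 512L * 1024 * 1024
val writeBufferCacheRatio: Double = 0.5
val highPriorityPoolRatio: Double = 0.1

// strictCapacityLimit = false: a full cache no longer rejects insertions with
// "Insert failed due to LRU cache being full"; RocksDB may temporarily exceed
// the budget instead and rely on LRU eviction to come back under it.
val cache = new LRUCache(
  totalMemoryUsageInBytes,
  /* numShardBits = */ -1,
  /* strictCapacityLimit = */ false,
  highPriorityPoolRatio)

// Write buffer (memtable) memory is charged against the same cache; the manager
// would then be passed to the DB options via setWriteBufferManager.
val writeBufferManager = new WriteBufferManager(
  (totalMemoryUsageInBytes * writeBufferCacheRatio).toLong, cache)
```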

Why are the changes needed?

In some cases, when the memory limit is reached, we see the following exception on insert/get operations:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 42 in stage 9.0 failed 4 times, most recent failure: Lost task 42.3 in stage 9.0 (TID 2950) (96.104.176.55 executor 0): org.rocksdb.RocksDBException: Insert failed due to LRU cache being full.
	at org.rocksdb.RocksDB.get(Native Method)
	at org.rocksdb.RocksDB.get(RocksDB.java:2053)
	at org.apache.spark.sql.execution.streaming.state.RocksDB.get(RocksDB.scala:299)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$RocksDBStateStore.get(RocksDBStateStoreProvider.scala:55)

It seems this exception is thrown within RocksDB when the strict memory limit is enabled, here - https://github.com/facebook/rocksdb/blob/0fa0c97d3e9ac5dfc2e7ae94834b0850cdef5df7/cache/lru_cache.cc#L394

It seems this issue can only happen with the strict mode as described here - facebook/rocksdb#5048 (comment)

Seems like there is a pending issue for RocksDB around this as well - facebook/rocksdb#8670

There is probably a relevant fix, but not sure whether this addresses the issue completely - facebook/rocksdb#6619
(cc - @siying )

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests

@anishshri-db anishshri-db changed the title [SPARK-44878] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full [SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full Aug 18, 2023
@anishshri-db
Contributor Author

cc - @HeartSaVioR, @siying - PTAL, thx

@siying
Contributor

siying commented Aug 21, 2023

When the block cache in strict mode is full and we try to insert more data, some exception will be thrown one way or another. I don't think SS has a way to handle the error, and the consequence will be task failure. If that is not a behavior we can afford, we should disable strict mode.

It is a trade-off: either RocksDB rejects queries and background tasks that require more memory than allowed, or we use more memory than configured. RocksDB might not have a better way to handle it. Strict mode is a less-used feature, mainly used by applications that would rather fail than use more memory than configured. For example, sometimes people are not comfortable letting an administrative daemon use more memory and crash a host that might be serving online queries.

I think it is a decision Spark SS needs to make. When users misconfigure a memory cap that turns out to be not enough, are we going to fail their queries, or are we going to allow RocksDB to use more memory than configured and potentially cause an OOM? I can see pros and cons either way.

@anishshri-db
Contributor Author

@siying - how does the cache eviction mechanism work for RocksDB, though? Since we are accounting everything towards the block cache, is it not possible to spill write buffer/block cache blocks to disk when we hit this cache-full case? Maybe we won't be able to evict pinned blocks, but what about other blocks?

Also, if we do decide to allow the tasks to fail, will some memory be reclaimed eventually? I.e., will future task attempts ever succeed, or do we need to restart the process?

@siying
Contributor

siying commented Aug 21, 2023

@anishshri-db the blocks in use cannot be spilled to disk. Those are objects that the higher-level code has fetched to use; that code accesses their contents through pointers into memory. After a block is used, it is returned to the block cache, and blocks returned to the block cache are evicted based on LRU. An insertion is rejected with an error only when the objects currently in use already take up all of the allocated capacity.

Whatever is returned to block cache can be evicted using LRU.
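
To make this failure mode concrete, here is a hedged RocksJava sketch (not code from this PR) of a block cache configured with the strict capacity limit; the capacity value is arbitrary:

```scala
import org.rocksdb.{BlockBasedTableConfig, LRUCache, Options}

// Deliberately small cache with strictCapacityLimit = true. Once the blocks
// currently pinned by readers account for the entire capacity, further
// insertions (and therefore gets that must load new blocks) fail with
// "org.rocksdb.RocksDBException: Insert failed due to LRU cache being full."
val strictCache = new LRUCache(
  /* capacity = */ 4L * 1024 * 1024,
  /* numShardBits = */ -1,
  /* strictCapacityLimit = */ true)

// Only blocks that have been released back to the cache are evictable via LRU;
// blocks that are still in use cannot be spilled or evicted.
val options = new Options()
  .setCreateIfMissing(true)
  .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(strictCache))
```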

@anishshri-db
Contributor Author

anishshri-db commented Aug 21, 2023

@siying - Thanks for the explanation. If that's the case, then I think we might have to set this setting to false, at least for now. Even if we fail this task, there is no guarantee that other blocks will be returned in time for eviction before further task attempts are made. Effectively this would mean that the query fails (after 4 consecutive attempts) and we would have to restart and re-instantiate the state (which might be quite expensive). We might be better off with a soft limit to avoid this kind of scenario in the large majority of cases.

In the long term, we probably need to address this by auto-scaling the cluster and moving partitions to other nodes so that a node's individual limit on allocated/assigned memory is not reached.

cc - @HeartSaVioR - thoughts?

@HeartSaVioR
Contributor

I'd still like to understand how this is different from not capping the memory at all. Does capping the memory prevent RocksDB from using memory excessively? Or is there no difference between capping with a soft limit vs. no capping at all?

Also, there is another aspect to think of - an OOM kills the executor, which could affect all stateful, stateless, and batch queries, whereas this error will only affect stateful queries. If people intend to set a limit on RocksDB memory usage with this fact in mind, soft limiting would break that intention, although they may still need to restart the cluster, or at least the executor, to apply a new RocksDB memory limit. It looks very tricky for users to adjust when the error happens...

Ideally we would rebalance the state when the memory limit is hit, but that probably won't happen in the short term.

@anishshri-db
Contributor Author

anishshri-db commented Aug 21, 2023

Does capping the memory prevent RocksDB from using memory excessively? Or is there no difference between capping with a soft limit vs. no capping at all?

I believe there is a difference. Without the write buffer manager, memory is not accounted against a single block cache but across however many DB instances are present on the node, along with the memory required for write buffers and other pinned blocks. With the soft limit, RocksDB can temporarily allocate more memory to serve gets if the existing blocks have not yet been returned to the cache; once they are returned, blocks should be evicted from the LRU cache.
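
To illustrate the difference being described (one shared executor-wide budget vs. per-instance, unbounded growth), here is a hedged sketch of enabling bounded memory usage for the RocksDB state store; the conf names and values are assumptions based on the bounded-memory feature and may vary by Spark version:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("rocksdb-bounded-memory-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  // Account all RocksDB instances on an executor against a single shared budget
  // (block cache + write buffers) instead of letting each instance grow independently.
  .config("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
  .config("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "500")
  .config("spark.sql.streaming.stateStore.rocksdb.writeBufferCacheRatio", "0.5")
  .config("spark.sql.streaming.stateStore.rocksdb.highPriorityPoolRatio", "0.1")
  .getOrCreate()
```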

It looks very tricky for users to adjust when the error happens

Not sure what you mean here. They might still need to restart the cluster to allow updated memory limit values to take effect, correct?

Ideally we would rebalance the state when the memory limit is hit, but that probably won't happen in the short term.

Yup

@HeartSaVioR
Contributor

Thanks for clarifying. So the memory cap could be temporarily exceeded, but RocksDB will try hard to respect it again as soon as it can. That is enough of a difference to justify capping the memory.

I'm OK with the soft limit. I'd say a hard limit is very useful to restrict the blast radius, and the streaming query should just fail if it cannot live within the hard limit despite proper rebalancing of state. But given that we don't have proper rebalancing of state, hard limiting may manifest as random failures depending on the scheduling of stateful partitions.

We should probably document the behavior though, so that some users would be able to plan for some margin. Could you please update the doc?

They might still need to restart the cluster to allow updated memory limit values to take effect, correct?

Yes, that's what I meant.

@HeartSaVioR
Contributor

(Maybe the conservative option is to have another conf determining whether the limit should be soft or strict, but I know we have too many confs already...)

@github-actions github-actions bot added the DOCS label Aug 22, 2023
Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to master.
