[SPARK-44878][SS] Disable strict limit for RocksDB write manager to avoid insertion exception on cache full #42567
Conversation
cc - @HeartSaVioR, @siying - PTAL, thx
When the block cache is in strict mode and full and we try to insert more data, some exception will be thrown one way or another. I don't think SS has a way to handle the error, and the consequence will be task failure. If that is not a behavior we can afford, we should disable strict mode. It is a trade-off: either RocksDB rejects queries and background tasks that need more memory than allowed, or we use more memory than configured. RocksDB might not have a better way to handle it. Strict mode is a less-used feature, mainly meant for applications that would rather fail than use more memory than configured; for example, sometimes people don't feel comfortable letting an administrative daemon use extra memory and crash a host that might be serving online queries.

I think it is a decision Spark SS needs to make: when users misconfigure a memory cap that turns out to be not enough, do we fail their queries, or do we allow RocksDB to use more memory than configured and potentially cause an OOM? I can see pros and cons on both sides.
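(For context: in the RocksJava bindings, the strict/soft behavior is a flag on the LRU cache itself. The sketch below shows what the failure mode looks like from the application side; the path, sizes, and wiring are illustrative, not Spark's actual state store code.)

```scala
import org.rocksdb.{BlockBasedTableConfig, LRUCache, Options, RocksDB, RocksDBException}

object StrictCacheSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()

    // strictCapacityLimit = true: once in-use blocks fill the capacity, further
    // insertions fail with an error instead of letting the cache exceed the cap.
    val strictCache =
      new LRUCache(64L * 1024 * 1024, /* numShardBits = */ -1, /* strictCapacityLimit = */ true)

    val options = new Options()
      .setCreateIfMissing(true)
      .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(strictCache))

    val db = RocksDB.open(options, "/tmp/strict-cache-sketch")
    try {
      db.put("key".getBytes("UTF-8"), "value".getBytes("UTF-8"))
    } catch {
      case e: RocksDBException =>
        // No graceful recovery path at this level; in Structured Streaming
        // an error like this surfaces as a task failure.
        throw e
    } finally {
      db.close()
    }
  }
}
```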
@siying - how does the cache eviction mechanism work for RocksDB, though? Since we are accounting everything towards the block cache, is it not possible to spill write buffer/block cache blocks to disk if we hit this cache-full case? Maybe we won't be able to evict pinned blocks, but what about other blocks? Also, if we do decide to allow the tasks to fail, will some memory be reclaimed eventually? i.e., will such future task attempts ever succeed, or do we need to restart the process?
@anishshri-db The blocks in use cannot be spilled to disk. Those are objects that higher-level code has fetched to use; that code accesses the contents through pointers into cache memory, and once it is done, the blocks are returned to the block cache. Blocks returned to the block cache are evicted based on LRU. Insertion is rejected with an error only when the objects currently in use already take up all of the allocated capacity; whatever has been returned to the block cache can be evicted via LRU.
@siying - Thanks for the explanation. If that's the case, then I think we might have to set this setting to the non-strict mode. In the long term, we probably need to address this by auto-scaling the cluster and moving partitions to other nodes, so that an individual node's limit for the allocated/assigned memory is not reached. cc - @HeartSaVioR - thoughts?
I'd still like to understand how this is different from not capping the memory at all. Does capping the memory prevent RocksDB from using memory excessively, or is there no difference between capping with a soft limit and not capping at all?

Also, there is another aspect to think of: an OOM kills the executor, which could affect all stateful, stateless, and batch queries, whereas this error will only affect stateful queries. If people intend to set the limit on RocksDB memory usage with that fact in mind, soft limiting would break that intention, although they may still need to restart the cluster, or at least the executor, to apply a new RocksDB memory limit. It looks very tricky for users to adjust when the error happens... Ideally we would rebalance the state when the memory limit is hit, but that is probably not happening in the short term.
I believe there is a difference. Without the write buffer manager, memory is not accounted against a single block cache; instead, every DB instance present on the node allocates its own memory, along with the memory required for write buffers and other pinned blocks. With the soft limit, RocksDB can temporarily allocate more memory to serve gets if existing blocks have not yet been returned to the cache, but once they are returned, blocks will be evicted from the LRU cache as needed (see the sketch below this exchange).
Not sure what you mean here. They might still need to restart the cluster to allow updated memory limit values to take effect, correct?
Yup
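For concreteness, the "account everything towards one block cache" setup discussed above corresponds roughly to the following RocksJava usage; the object name, paths, and sizes are illustrative rather than the actual Spark RocksDBMemoryManager code.

```scala
import org.rocksdb.{BlockBasedTableConfig, LRUCache, Options, RocksDB, WriteBufferManager}

object SharedCacheSketch {
  RocksDB.loadLibrary()

  // One cache and one write buffer manager shared by every RocksDB instance on the node,
  // so data blocks and memtable (write buffer) memory are all charged against a single cap.
  // strictCapacityLimit = false is the "soft" behaviour discussed above.
  val sharedCache =
    new LRUCache(1L * 1024 * 1024 * 1024, /* numShardBits = */ -1, /* strictCapacityLimit = */ false)
  val sharedWriteBufferManager = new WriteBufferManager(256L * 1024 * 1024, sharedCache)

  def openInstance(path: String): RocksDB = {
    val tableConfig = new BlockBasedTableConfig().setBlockCache(sharedCache)
    val options = new Options()
      .setCreateIfMissing(true)
      .setTableFormatConfig(tableConfig)
      .setWriteBufferManager(sharedWriteBufferManager)
    RocksDB.open(options, path)
  }
}
```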
Thanks for clarifying. So the capped memory usage could temporarily be exceeded, but RocksDB will try hard to get back within the limit as soon as it can. That is enough of a difference to justify capping the memory, so I'm OK with the soft limit. I'd say a hard limit is very useful for restricting the blast radius, and the streaming query should just fail if it cannot live within the hard limit despite proper rebalancing of state. But given that we don't have proper rebalancing of state, a hard limit may show up as random failures depending on how stateful partitions are scheduled. We should document this behavior though, so that users can plan for some margin. Could you please update the doc?
Yes, that's what I meant.
(Maybe the conservative option is to have another conf to determine whether it should be soft or strict, but I know we have too many confs already...)
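If such a conf were added, it would presumably just thread a boolean down to where the shared cache is built, roughly like the sketch below; the conf shape and names are hypothetical and are not part of this PR or of Spark.

```scala
object BoundedMemoryConfSketch {
  import org.rocksdb.LRUCache

  // Hypothetical conf shape with the extra strict/soft knob; none of these names exist in Spark.
  case class BoundedMemoryConf(totalMemoryUsageMB: Long, strictCapacityLimit: Boolean)

  // Build the shared cache according to the conf, passing the strict/soft choice through.
  def buildSharedCache(conf: BoundedMemoryConf): LRUCache =
    new LRUCache(
      conf.totalMemoryUsageMB * 1024L * 1024L,
      /* numShardBits = */ -1,
      conf.strictCapacityLimit)
}
```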
+1
Thanks! Merging to master.
What changes were proposed in this pull request?
Disable strict limit for RocksDB write manager to avoid insertion exception on cache full
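In terms of the RocksJava API, the change amounts to constructing the shared LRU cache without the strict capacity limit. A simplified before/after sketch (variable names and the surrounding code are paraphrased, not the literal diff):

```scala
import org.rocksdb.LRUCache

val totalMemoryUsageInBytes = 1L * 1024 * 1024 * 1024  // illustrative cap

// Before: strict limit, so insertions could fail once in-use blocks filled the cap.
// val cache = new LRUCache(totalMemoryUsageInBytes, -1, /* strictCapacityLimit = */ true)

// After: soft limit, so RocksDB may temporarily exceed the cap instead of throwing.
val cache = new LRUCache(totalMemoryUsageInBytes, -1, /* strictCapacityLimit = */ false)
```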
Why are the changes needed?
In some cases, when the memory limit is reached, inserts/gets fail with a RocksDB insertion exception indicating the cache is full.
It seems this is thrown within RocksDB when the strict memory limit is enabled, here - https://github.com/facebook/rocksdb/blob/0fa0c97d3e9ac5dfc2e7ae94834b0850cdef5df7/cache/lru_cache.cc#L394
It seems this issue can only happen with the strict mode as described here - facebook/rocksdb#5048 (comment)
Seems like there is a pending issue for RocksDB around this as well - facebook/rocksdb#8670
There is probably a relevant fix, but not sure whether this addresses the issue completely - facebook/rocksdb#6619
(cc - @siying )
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests