[Enhancement] Optimize FieldDataCache removal flow #13862
Comments
@sgup432 - Thank you for describing the issue and potential solution. A few considerations:
@sohami @msfroh @Bukhtawar - Is there any specific reason for field data cache cleanup happening on the cluster applier thread?
Yeah, I thought about that too. I don't see any reason why a data node's cluster state update should fail during index removal just because the underlying fieldDataCache cleanup fails. Since the index has already been removed from the node, the worst case, if we simply swallowed the fieldDataCache cleanup exception, would be stale entries lingering in the cache. So doing this asynchronously shouldn't be a problem, IMO. I'd like to hear other opinions as well, though, as my understanding might be wrong. Also, field data cache cleanup seems unlikely to throw any exception at all, since the internal cache.invalidate appears to be safe even when a key is not present.
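To make the async idea concrete, here is a minimal sketch of handing cleanup to a separate thread and swallowing (but counting) failures. `AsyncCacheCleaner`, `submitCleanup`, `failureCount`, and `shutdownAndWait` are hypothetical names for illustration, not OpenSearch APIs, and a real change would presumably use an existing thread pool rather than a private executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not an OpenSearch class.
final class AsyncCacheCleaner {
    // Single daemon thread so cleanups run in order and never block JVM exit.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "field-data-cache-cleaner");
        t.setDaemon(true);
        return t;
    });
    private final AtomicInteger failures = new AtomicInteger();

    // Returns immediately. The caller (for example the applier thread) is not
    // blocked and never sees an exception thrown by the cleanup itself.
    void submitCleanup(Runnable cleanup) {
        executor.execute(() -> {
            try {
                cleanup.run();
            } catch (RuntimeException e) {
                // Worst case discussed above: stale entries linger in the cache.
                failures.incrementAndGet();
            }
        });
    }

    int failureCount() {
        return failures.get();
    }

    // Stops accepting work and waits for queued cleanups to finish.
    // Returns false on timeout or interruption.
    boolean shutdownAndWait(long timeoutMillis) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

A failed cleanup only increments the counter here; whether to log, retry, or surface it through a metric is left open.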
Any mutating operation on cluster state, including index creation/deletion, needs to be applied to the local data structures of the node applying the state. Generally these appliers are processed sequentially and in a blocking manner, to ensure that all local structures are successfully refreshed before a cluster state commit acknowledgement can be sent back. I think we need to look at this problem holistically and start with a mechanism like soft eviction or soft deletes, which just marks the entry as deleted or stale while the actual clean-up happens in the background.
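A rough sketch of what such a soft-delete mechanism could look like, assuming a flat map keyed by a composite key. `SoftKey`, `SoftEvictingCache`, `markIndexStale`, and `sweep` are hypothetical names, and keying on an index UUID is an assumption made so that a re-created index with the same name would not be hidden by an old stale marker.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical key: an index UUID plus an opaque id standing in for
// the shard and reader parts of the real composite key.
record SoftKey(String indexUuid, String entryId) {}

// Hypothetical sketch of soft eviction; not an OpenSearch class.
final class SoftEvictingCache<V> {
    private final Map<SoftKey, V> entries = new ConcurrentHashMap<>();
    private final Set<String> staleIndices = ConcurrentHashMap.newKeySet();

    void put(SoftKey key, V value) {
        entries.put(key, value);
    }

    // Entries of a stale index are invisible to readers even before they are removed.
    V get(SoftKey key) {
        return staleIndices.contains(key.indexUuid()) ? null : entries.get(key);
    }

    // Constant-time step: this is all the blocking applier path would need to do.
    void markIndexStale(String indexUuid) {
        staleIndices.add(indexUuid);
    }

    // Number of entries physically held, including stale ones not yet swept.
    int physicalSize() {
        return entries.size();
    }

    // Expensive full traversal, intended to run on a background thread.
    // Returns the number of entries physically removed.
    int sweep() {
        Set<String> snapshot = Set.copyOf(staleIndices);
        int removed = 0;
        for (Iterator<SoftKey> it = entries.keySet().iterator(); it.hasNext();) {
            if (snapshot.contains(it.next().indexUuid())) {
                it.remove();
                removed++;
            }
        }
        // Assumes nothing is added for an index once it has been marked stale.
        staleIndices.removeAll(snapshot);
        return removed;
    }
}
```

The stale set adds one extra lookup on every read, which is the price paid for making the applier-side step cheap.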
Is your feature request related to a problem? Please describe
FieldDataCache is a node-level cache that uses a composite key (indexFieldCache + shardId + IndexReader.CacheKey) to uniquely identify an item. IndexFieldCache in turn contains the fieldName and index-level details.
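For readers unfamiliar with the key layout, the composite key described above might be modeled roughly like this. `FieldCacheRef`, `FieldDataKey`, and `FieldDataKeyDemo` are simplified, hypothetical stand-ins: the shard id is reduced to an `int`, and the reader cache key is a plain `Object` that is assumed to be compared by identity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for indexFieldCache: field name plus index-level details.
record FieldCacheRef(String indexName, String fieldName) {}

// Hypothetical stand-in for the composite key. readerCacheKey is a plain Object,
// so two keys match on it only if they hold the very same instance.
record FieldDataKey(FieldCacheRef indexFieldCache, int shardId, Object readerCacheKey) {}

final class FieldDataKeyDemo {
    // Builds a tiny cache and returns how many distinct entries it ends up with.
    static int distinctEntries() {
        Object segmentA = new Object();
        Object segmentB = new Object();
        FieldCacheRef titleField = new FieldCacheRef("logs-2024", "title");

        Map<FieldDataKey, String> cache = new HashMap<>();
        cache.put(new FieldDataKey(titleField, 0, segmentA), "shard 0, segment A");
        cache.put(new FieldDataKey(titleField, 0, segmentB), "shard 0, segment B");
        cache.put(new FieldDataKey(titleField, 1, segmentA), "shard 1, segment A");
        // Same three components again: an equal key, so this overwrites an entry.
        cache.put(new FieldDataKey(new FieldCacheRef("logs-2024", "title"), 0, segmentA), "reloaded");
        return cache.size();
    }
}
```

Because the index is buried inside the key, nothing in this layout groups entries by index, which matters for the removal scenarios below.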
As of today, any item in fieldDataCache is removed in a blocking (synchronous) manner. This happens in the following scenarios:
Problem
We have already had issues where, during index removal, a data node dropped out of the cluster because a lot of time/CPU was spent clearing the fieldDataCache.
Scenario (observed in production): The cluster manager node sends a cluster state update task to a data node on index removal. The data node starts clearing the fieldData cache on the same clusterStateApplier (clusterApplierService#updateTask) thread. This takes a long time (due to the large cache size and an inefficient traversal of all keys), so the data node is unable to acknowledge back to the cluster manager node. This eventually resulted in the data node being removed from the cluster.
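The cost of the all-key traversal can be illustrated with a small sketch, assuming the cache behaves like a flat node-level map with the index embedded in each key. `ScanKey`, `FullScanRemoval`, `removeIndex`, and `sampleCache` are hypothetical names and this is not the actual OpenSearch code path.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical simplified key; the index is only one component of it.
record ScanKey(String indexName, int shardId, String segment) {}

final class FullScanRemoval {
    // Removes every entry belonging to one index and returns how many keys
    // had to be visited. The work grows with the total cache size on the node,
    // not with the number of entries of the index being removed.
    static int removeIndex(Map<ScanKey, byte[]> cache, String indexName) {
        int visited = 0;
        for (Iterator<ScanKey> it = cache.keySet().iterator(); it.hasNext();) {
            visited++;
            if (it.next().indexName().equals(indexName)) {
                it.remove();
            }
        }
        return visited;
    }

    // Builds a node-level cache holding entriesPerIndex entries for each index.
    static Map<ScanKey, byte[]> sampleCache(int indices, int entriesPerIndex) {
        Map<ScanKey, byte[]> cache = new ConcurrentHashMap<>();
        for (int i = 0; i < indices; i++) {
            for (int e = 0; e < entriesPerIndex; e++) {
                cache.put(new ScanKey("index-" + i, 0, "seg-" + e), new byte[0]);
            }
        }
        return cache;
    }
}
```

With 100 indices of 10 entries each, removing a single index visits all 1000 keys to drop 10 of them. When that loop runs on the applier thread, the acknowledgement to the cluster manager waits for the whole scan.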
Sample hot thread dump observed
Describe the solution you'd like
As a solution, I suggest we do the following:
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response