[BUG] Multiple Cluster state objects found in data node's heap snapshot for bulk request #13524

anshu1106 · 2024-05-03T06:56:04Z

Describe the bug

While analyzing a heap dump taken on a domain with a large number of nodes and 200k shards, we found that out of 16.1 GB, ~14.7 GB of the retained heap is attributable to TransportResponseHandlers. The dump is from a data node, and _bulk requests were running in the domain when the heap dump was captured.

Expanding TransportResponseHandler

Expanding a ConcurrentHashMap object shows that TransportBulkAction$ConcreteIndices takes ~63 MB, most of which is taken by ClusterState.

OpenSearch/server/src/main/java/org/opensearch/action/bulk/TransportBulkAction.java

Line 829 in 5e72e1d

private final ClusterState state;

The histogram below shows 215 ClusterState objects taking ~11 GB of heap.

The incoming reference for most of the ClusterState objects is TransportBulkAction$ConcreteIndices.

There seems to be a bug in the TransportBulkAction path that creates new ClusterState objects rather than referencing a single one.
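
To illustrate one way a histogram like this can arise even when nothing copies the state explicitly, here is a minimal sketch. It assumes that cluster state snapshots are immutable and are replaced by a new instance on every update, and that each in-flight request keeps a reference to the snapshot it started with. `StateSnapshot`, `RequestContext`, and `simulate` are hypothetical names for illustration only, not OpenSearch classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class SnapshotPinningSketch {

    /** Hypothetical stand-in for an immutable cluster state snapshot. */
    static final class StateSnapshot {
        final long version;

        StateSnapshot(long version) {
            this.version = version;
        }
    }

    /** Hypothetical per-request helper that keeps the snapshot it was created with. */
    static final class RequestContext {
        private final StateSnapshot state;

        RequestContext(StateSnapshot state) {
            this.state = state;
        }

        StateSnapshot state() {
            return state;
        }
    }

    /** Counts distinct snapshot instances (by identity) still reachable from in-flight requests. */
    static int distinctPinnedSnapshots(List<RequestContext> inFlight) {
        Set<StateSnapshot> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (RequestContext ctx : inFlight) {
            seen.add(ctx.state());
        }
        return seen.size();
    }

    /** Starts some requests against each snapshot, then publishes a new snapshot. */
    static int simulate(int updates, int requestsPerUpdate) {
        List<RequestContext> inFlight = new ArrayList<>();
        StateSnapshot current = new StateSnapshot(0);
        for (int i = 1; i <= updates; i++) {
            for (int r = 0; r < requestsPerUpdate; r++) {
                inFlight.add(new RequestContext(current));
            }
            // A state update replaces the current snapshot; older ones stay
            // alive for as long as any in-flight request still references them.
            current = new StateSnapshot(i);
        }
        return distinctPinnedSnapshots(inFlight);
    }

    public static void main(String[] args) {
        System.out.println(simulate(215, 3));
    }
}
```

In this toy model the number of retained snapshots tracks the number of state updates that happened while requests were outstanding, not the number of requests. Whether that is what happened on this node would need to be confirmed from the dump, for example by comparing the versions of the retained ClusterState objects.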

Related component

Indexing:Performance

Expected behavior

There should be at most 2 cluster state objects in the domain while updates are in progress. TransportBulkAction should not create new ClusterState objects.

shwetathareja · 2024-05-03T07:04:11Z

Thanks @anshu1106 for filing this issue. It is an interesting one.

shwetathareja · 2024-05-03T07:10:04Z

This looks like it is due to dynamic mapping: the cluster state is changing often, and hence that many different objects are present. One thing we should evaluate is whether, instead of passing the whole cluster state, we can pass a sub-object (indicesLookUp?) to the ConcreteIndices constructor, so that less heap is retained until the bulk request is processed.
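
A minimal sketch of that idea, for discussion only. `FullState`, `ResolverHoldingState`, and `ResolverHoldingLookup` are hypothetical stand-ins, not the actual OpenSearch classes, and the lookup is simplified to a plain name-to-index map. The point is only that a per-request helper can keep a reference to the one field it needs instead of the whole state.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class NarrowLookupSketch {

    /** Hypothetical stand-in for the full cluster state: a lookup plus a large remainder. */
    static final class FullState {
        final Map<String, String> indicesLookup; // alias or index name -> concrete index
        final byte[] everythingElse;             // placeholder for the rest of the state

        FullState(Map<String, String> lookup, int otherBytes) {
            this.indicesLookup = Collections.unmodifiableMap(new HashMap<>(lookup));
            this.everythingElse = new byte[otherBytes];
        }
    }

    /** Variant A: holds the whole state, so all of it stays reachable while the request lives. */
    static final class ResolverHoldingState {
        private final FullState state;

        ResolverHoldingState(FullState state) {
            this.state = state;
        }

        String resolve(String name) {
            return state.indicesLookup.get(name);
        }
    }

    /** Variant B: holds only the lookup, so the rest of the state can be collected earlier. */
    static final class ResolverHoldingLookup {
        private final Map<String, String> lookup;

        ResolverHoldingLookup(FullState state) {
            // Copy the reference to the sub-object only; do not keep `state` itself.
            this.lookup = state.indicesLookup;
        }

        String resolve(String name) {
            return lookup.get(name);
        }
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("logs-alias", "logs-000001");
        FullState state = new FullState(lookup, 1024);

        System.out.println(new ResolverHoldingState(state).resolve("logs-alias"));
        System.out.println(new ResolverHoldingLookup(state).resolve("logs-alias"));
    }
}
```

Both variants resolve names identically; they differ only in what they keep reachable. How much heap variant B would save in practice depends on what the real lookup structure itself references, which would need to be checked against the heap dump.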

anshu1106 added the bug and untriaged labels May 3, 2024

github-actions bot added the Indexing:Performance label May 3, 2024

shwetathareja removed the untriaged label May 3, 2024

shwetathareja changed the title ~~[BUG] Multiple Cluster state objects found in data node's heap snapshot~~ [BUG] Multiple Cluster state objects found in data node's heap snapshot for bulk request May 3, 2024
