
[BUG] Remote Purge threadpool taking too much memory in case of too much deleted indices cleanup #12253

Closed
gbbafna opened this issue Feb 8, 2024 · 4 comments
Labels: bug (Something isn't working), Storage:Remote, v2.14.0

gbbafna (Collaborator) commented Feb 8, 2024

Describe the bug

[Screenshot: heap dump showing Remote Purge thread instances]

We use the Remote Purge threadpool to delete segment data for deleted indices in shallow snapshots. When the number of such indices is huge, and the count of snapshots is also huge, we see a pile-up in the Remote Purge threadpool. In the above heap dump, we can see 30 million instances of Remote Purge threads hogging around 30 GB of memory.

Related component

Storage:Remote


Expected behavior

The Remote Purge threadpool should be bounded (see the sketch below for what a bounded pool could look like).

Shallow snapshot deletion also needs to be smarter, handling this cleanup in a scalable way.
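
As an illustration of what "bounded" could mean here, below is a minimal, generic java.util.concurrent sketch of a purge executor with a fixed worker count and a bounded work queue. The class name, pool sizes, queue capacity, and rejection policy are illustrative assumptions, not OpenSearch's actual Remote Purge configuration.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedPurgeExecutorSketch {
        // Hypothetical factory; the sizes below are illustrative, not OpenSearch defaults.
        public static ThreadPoolExecutor newBoundedPurgeExecutor() {
            return new ThreadPoolExecutor(
                4,                                // core pool size
                4,                                // max pool size
                60L, TimeUnit.SECONDS,            // keep-alive for excess idle threads (unused here since core == max)
                new ArrayBlockingQueue<>(10_000), // bounded queue caps the number of pending purge tasks
                // When the queue is full, run the task on the submitting thread,
                // which back-pressures producers instead of accumulating tasks on the heap.
                new ThreadPoolExecutor.CallerRunsPolicy()
            );
        }
    }

With a bounded queue, pending deletions can no longer grow into tens of millions of queued objects; submitters are throttled once the queue is full.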


harishbhakuni (Contributor) commented Feb 13, 2024

Thanks for creating this issue @gbbafna.
One more optimization I can think of: in the close() method of the RemoteSegmentStoreDirectory class,

    public void close() throws IOException {
        deleteStaleSegmentsAsync(0, ActionListener.wrap(
            r -> deleteIfEmpty(),
            e -> logger.error("Failed to cleanup remote directory")
        ));
    }

we clean up one segment file at a time, followed by the corresponding metadata (md) file, and only at the end do we clean up the directories. Since we already know the shard is being closed after deletion, we can instead directly clean up the directories using BlobContainer.delete(), which would internally use batch deletion in most repository implementations to clean up the individual objects.
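
A minimal sketch of this idea, assuming a BlobContainer that covers the shard-level remote directory; the class and field names below are illustrative, not the actual RemoteSegmentStoreDirectory wiring.

    import java.io.IOException;

    import org.opensearch.common.blobstore.BlobContainer;
    import org.opensearch.common.blobstore.DeleteResult;

    public class RemoteDirectoryCleanupSketch {
        // Container holding the shard's segment data and metadata files (assumed name).
        private final BlobContainer shardLevelContainer;

        public RemoteDirectoryCleanupSketch(BlobContainer shardLevelContainer) {
            this.shardLevelContainer = shardLevelContainer;
        }

        // Deletes the entire remote directory for a shard that is closed after deletion,
        // instead of removing segment and metadata files one by one.
        public void deleteShardDirectory() throws IOException {
            DeleteResult result = shardLevelContainer.delete();
            // DeleteResult reports how many blobs and bytes were removed; repository
            // implementations can satisfy delete() with their own batch deletion APIs.
        }
    }

This keeps a single bulk call on the cleanup path, so repositories that support batch deletes can remove the whole directory without queueing one task per file.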

Let me know if this makes sense. I will raise a draft PR for this in some time, along with some snapshot-deletion-side optimizations.

peternied (Member) commented, quoting "Let me know if this makes sense.":

@harishbhakuni This sounds like a solid mitigation that will reduce the overhead when running into this issue. I think a draft PR would be a great next step if you can spin one up.

gbbafna moved this from 🆕 New to 👀 In review in Storage Project Board, Mar 4, 2024
rramachand21 added the v2.13.0 (Issues and PRs related to version 2.13.0) label, Mar 8, 2024
gbbafna added the v2.14.0 label and removed the v2.13.0 label, Apr 4, 2024
ashking94 (Member) commented

[Storage Triage - attendees]

@harishbhakuni The linked PR is closed. Will there be further PRs, or can this issue be closed?

harishbhakuni (Contributor) commented

Hi @ashking94, this issue can be closed.

github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board, Apr 30, 2024