
[BUG] Remote Purge threadpool taking too much memory in case of too much deleted indices cleanup #12253

Closed
gbbafna opened this issue Feb 8, 2024 · 4 comments
Labels: bug (Something isn't working), Storage:Remote, v2.14.0

gbbafna (Collaborator) commented Feb 8, 2024

Describe the bug

[Screenshot: heap dump showing Remote Purge thread instances]

We use the Remote Purge threadpool to delete segment data for deleted indices in shallow snapshots. When the number of such indices is huge, and the count of snapshots is also huge, we see a pile-up in the Remote Purge threadpool. In the above heap dump, we can see 30 million instances of Remote Purge threads hogging around 30 GB of memory.

Related component

Storage:Remote


Expected behavior

The Remote Purge threadpool should be bounded (see the sketch below for what a bounded pool could look like).

Shallow snapshot deletion also needs to be smarter, handling this cleanup in a scalable way.
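
As an illustration of what "bounded" could mean here, below is a minimal, generic java.util.concurrent sketch of a purge executor with a fixed worker count and a bounded work queue. The class name, pool sizes, queue capacity, and rejection policy are illustrative assumptions, not OpenSearch's actual Remote Purge configuration.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedPurgeExecutorSketch {
        // Hypothetical factory; the sizes below are illustrative, not OpenSearch defaults.
        public static ThreadPoolExecutor newBoundedPurgeExecutor() {
            return new ThreadPoolExecutor(
                4,                                // core pool size
                4,                                // max pool size
                60L, TimeUnit.SECONDS,            // keep-alive for excess idle threads (unused here since core == max)
                new ArrayBlockingQueue<>(10_000), // bounded queue caps the number of pending purge tasks
                // When the queue is full, run the task on the submitting thread,
                // which back-pressures producers instead of accumulating tasks on the heap.
                new ThreadPoolExecutor.CallerRunsPolicy()
            );
        }
    }

With a bounded queue, pending deletions can no longer grow into tens of millions of queued objects; submitters are throttled once the queue is full.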


harishbhakuni (Contributor) commented Feb 13, 2024

Thanks for creating this issue @gbbafna.
One more optimization I can think of: in the close() method of the RemoteSegmentStoreDirectory class,

    public void close() throws IOException {
        deleteStaleSegmentsAsync(0, ActionListener.wrap(
            r -> deleteIfEmpty(),
            e -> logger.error("Failed to cleanup remote directory")
        ));
    }

we clean up one segment file at a time, followed by the corresponding metadata (md) file, and only at the end do we clean up the directories. Since we already know the shard is being closed after deletion, we can instead directly clean up the directories using BlobContainer.delete(), which would internally use batch deletion in most repository implementations to clean up the individual objects.
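
A minimal sketch of this idea, assuming a BlobContainer that covers the shard-level remote directory; the class and field names below are illustrative, not the actual RemoteSegmentStoreDirectory wiring.

    import java.io.IOException;

    import org.opensearch.common.blobstore.BlobContainer;
    import org.opensearch.common.blobstore.DeleteResult;

    public class RemoteDirectoryCleanupSketch {
        // Container holding the shard's segment data and metadata files (assumed name).
        private final BlobContainer shardLevelContainer;

        public RemoteDirectoryCleanupSketch(BlobContainer shardLevelContainer) {
            this.shardLevelContainer = shardLevelContainer;
        }

        // Deletes the entire remote directory for a shard that is closed after deletion,
        // instead of removing segment and metadata files one by one.
        public void deleteShardDirectory() throws IOException {
            DeleteResult result = shardLevelContainer.delete();
            // DeleteResult reports how many blobs and bytes were removed; repository
            // implementations can satisfy delete() with their own batch deletion APIs.
        }
    }

This keeps a single bulk call on the cleanup path, so repositories that support batch deletes can remove the whole directory without queueing one task per file.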

Let me know if this makes sense. I will raise a draft PR for this in some time, along with some snapshot-deletion-side optimizations.

peternied (Member) commented, quoting "Let me know if this makes sense.":

@harishbhakuni This sounds like a solid mitigation that will reduce the overhead when running into this issue. I think a draft PR would be a great next step if you can spin one up.

gbbafna moved this from 🆕 New to 👀 In review in Storage Project Board, Mar 4, 2024
rramachand21 added the v2.13.0 (Issues and PRs related to version 2.13.0) label, Mar 8, 2024
gbbafna added the v2.14.0 label and removed the v2.13.0 label, Apr 4, 2024
ashking94 (Member) commented

[Storage Triage - attendees]

@harishbhakuni The linked PR is closed. Will there be further PRs, or can this issue be closed?

harishbhakuni (Contributor) commented

Hi @ashking94, this issue can be closed.

github-project-automation bot moved this from 👀 In review to ✅ Done in Storage Project Board, Apr 30, 2024