Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix high cpu usage caused by fd leak #1581

Merged
merged 1 commit into from
May 27, 2023

Conversation

aftersnow
Copy link

We found a problem of high CPU usage of the supervisor. We believe it's same reason for #807 . This problem is caused by continuous polling of a wrong fd in the main loop of the supervisor. Busy polling leads to a CPU usage close to 100%. (We can confirm this problem through the strace tool)

This issue can be reproduced by:

  1. Continuously initiate arbitrary requests to supervisor through supervisorctl
  2. After the socket fd is closed, trigger the supervisor's subprocess to rotate the log (or reopen the file)
  3. If the above steps are completed within a single main loop of the supervisor, the problem can be triggered

The reason for the problem is that supervisor relies on using _ignore_invalid() in the main loop to close fds. This method has a flaw that if fd is reused before _ignore_invalid() is called, then the fd may always exist in the fd list of poll .

This commit fixes the problem. By checking the validity of the fd in the event list in the main loop, if the fd is not in the combined_map, it is considered to be an invalid fd and will be removed from the list.

We found a problem of high CPU usage of the supervisor. This problem is
caused by continuous polling of a wrong fd in the main loop of the
supervisor. Busy polling leads to a CPU usage close to 100%. (We can
confirm this problem through the strace tool)

This issue can be reproduced by:
1. Continuously initiate arbitrary requests to supervisor through
supervisorctl
2. After the socket fd is closed, trigger the supervisor's subprocess
to rotate the log (or reopen the file)
3. If the above steps are completed within a single main loop of the
supervisor, the problem can be triggered

The reason for the problem is that supervisor relies on using
_ignore_invalid() in the main loop to close fds. This method has a flaw
that if fd is reused before _ignore_invalid() is called, then the fd may
always exist in the fd list of poll .

This commit fixes the problem. By checking the validity of the fd in the
event list in the main loop, if the fd is not in the combined_map, it is
considered to be an invalid fd and will be removed from the list.
@justinpryzby
Copy link

Thanks. This looks promising, and I've deployed it a few places to test.

The warnings you added my be as important as the fix itself - if we had that 3 years ago, the bug probably would've been a lot more obvious. Are there any other warnings that should be added ? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen. I suspect there are more bugs with FDs, which may be rare and hard to hit, but it'll be amply easy to address them if logs are added to warn about inconsistencies.

I didn't understand what did you meant when you said "trigger the supervisor's subprocess to rotate the log" ? Do you mean by connecting to that process and causing it to write an adequately large logs to stdout ? Is it possible to consistently make the bug easier to hit by injecting a "sleep" command ?

@aftersnow
Copy link
Author

Thanks. This looks promising, and I've deployed it a few places to test.

The warnings you added my be as important as the fix itself - if we had that 3 years ago, the bug probably would've been a lot more obvious. Are there any other warnings that should be added ? For example, what about when an FD that isn't a pipe ends up on the list of FDs to be polled, as I've also seen. I suspect there are more bugs with FDs, which may be rare and hard to hit, but it'll be amply easy to address them if logs are added to warn about inconsistencies.

Yes, more warnings is needed, but what important is we need to unregister the FD from polling list after each event is handled (if it's need to), instead of unregistering FD by _ignore_invalid().

I didn't understand what did you meant when you said "trigger the supervisor's subprocess to rotate the log" ? Do you mean by connecting to that process and causing it to write an adequately large logs to stdout ? Is it possible to consistently make the bug easier to hit by injecting a "sleep" command ?

Yes, but the faster method is to use a shared variable or signal to reproduce:

  1. When the supervisorctl socket FD is closed, change the variable
  2. When handling POutputDispatcher's read event, the above variable tells the logger to rotate log file, this will cause supervisor to reopen file and get a new FD. The FD must same with the closed socket FD.
  3. Issue reproduced

@ollofx

This comment was marked as outdated.

@mnaberez

This comment was marked as outdated.

@mnaberez mnaberez mentioned this pull request Mar 27, 2023
@justinpryzby
Copy link

I've deployed this change to customers and saw no issues since last month.
Thanks to @aftersnow for diagnosing the issue.

@Tomo59
Copy link
Contributor

Tomo59 commented May 23, 2023

Hello, I deployed also this change and I confirm it fixes the issue.

Do you know when this will be merged ?

@mnaberez mnaberez merged commit 2a93d6b into Supervisor:main May 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants