Prevent FdData leaks in getNotification #99

Open

velveteer

The rationale for this change is a memory leak observed in a production
application using postgresql-simple-0.6.4. The related code is
proprietary but I will try to give the gist of the issue. If a minimal
repro is requested I could put one together.

When getNotification is blocking in an async-spawned thread, upon
cancelling and restarting the thread in a loop (reusing the same
connection) a heap profile shows an increasing amount of FdData
closures building up (along with a few other closures, e.g. the TVar Nothing created in threadWaitSTM).

Forgive my limited knowledge of
the GHC EventManager, but it seems that the registered callback made
by threadWaitReadSTM does not get removed from the EventManager
state when waitRead (I presume) is interrupted by the
AsyncCancelled.

This change somewhat mirrors how the non-STM threadWait handles
exceptions, and so I think it should be benign. I've tested this change
against our application and confirmed the FdData and related closures
are no longer hanging around.
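
As a rough illustration of the idea described above (not the actual diff in this PR), a cleanup-on-exception wrapper around the STM wait might look like the sketch below. `waitReadWithCleanup` is a hypothetical name; `threadWaitReadSTM` and `atomically` come from `GHC.Conc` in `base`. The assumption is that running the returned unregister action whenever the wait is interrupted is enough to drop the callback from the event manager.

```haskell
import Control.Exception (bracketOnError)
import GHC.Conc (atomically, threadWaitReadSTM)
import System.Posix.Types (Fd)

-- Hypothetical helper: block until the fd is readable, and make sure the
-- event manager registration is removed if the wait is interrupted by any
-- exception, including asynchronous ones such as AsyncCancelled.
waitReadWithCleanup :: Fd -> IO ()
waitReadWithCleanup fd =
  bracketOnError
    (threadWaitReadSTM fd)                 -- register; runs with async exceptions masked
    (\(_waitRead, unregister) -> unregister) -- runs only if the wait throws
    (\(waitRead, _unregister) -> atomically waitRead)
```

`bracketOnError` is used here instead of a bare `catch` so that an asynchronous exception arriving between registration and the blocking wait cannot skip the cleanup. On the success path nothing extra is run, on the assumption that the callback is already consumed once the fd becomes readable.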

return $ do
atomically waitRead `catch` (throwIO . setIOErrorLocation)
mapException setIOErrorLocation
Collaborator

why not

atomically waitRead `catch` \e -> do
  unregister
  throwIO (setIOErrorLocation e)

Or can unregister itself throw? Even then I wouldn't use mapException, as it uses unsafePerformIO, which isn't required here.

Author

@velveteer velveteer Sep 4, 2022

I tested your suggestion but saw no change in the heap. I think that's because it's not catching the async exception. I can still try to avoid mapException if you think it's that contentious (even though its documentation suggests it's safe).

@phadej
Collaborator

phadej commented Sep 4, 2022

> Forgive my limited knowledge of
> the GHC EventManager, but it seems that the registered callback made
> by threadWaitReadSTM does not get removed from the EventManager
> state when waitRead (I presume) is interrupted by the
> AsyncCancelled.

Have you asked on the GHC issue tracker? That sounds like a bug in EventManager.
(It's good that we can work around it, but it would be better to also fix the issue at the source.)

@phadej
Collaborator

phadej commented Sep 4, 2022

Also, a reproducer would be nice.

If I understand right, just running getNotification in an async and cancelling it a million times should not make the RTS use a lot of memory, but apparently it does? A standalone program doing that would be enough (i.e. to test manually with +RTS -s).
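
The shape of such a standalone program could be sketched roughly as follows. This is not the reproduction linked later in the thread: it assumes a plain pipe as a stand-in for the database socket and `forkIO`/`killThread` as a stand-in for `async`/`cancel`, and it has not been confirmed to show the same heap growth. `cancelWaiter` and `stressLoop` are hypothetical names.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Monad (replicateM_)
import GHC.Conc (atomically, threadWaitReadSTM)
import System.Posix.Types (Fd)

-- Start a thread that blocks in the STM wait on the fd, give it a moment
-- to register with the event manager, then kill it. The unregister action
-- is deliberately never run, which is the situation described above.
cancelWaiter :: Fd -> IO ()
cancelWaiter fd = do
  tid <- forkIO $ do
    (waitRead, _unregister) <- threadWaitReadSTM fd
    atomically waitRead
  threadDelay 100 -- microseconds; assumed long enough for the thread to block
  killThread tid

-- Repeat the wait-and-cancel cycle n times against the same fd.
stressLoop :: Int -> Fd -> IO ()
stressLoop n fd = replicateM_ n (cancelWaiter fd)
```

One way to use it would be to create a pipe with `createPipe` from `System.Posix.IO`, call `stressLoop 1000000` on the read end from `main`, and compare maximum residency in the `+RTS -s` output across different iteration counts.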

@phadej
Collaborator

phadej commented Sep 4, 2022

I also don't understand what we get by using threadWaitReadSTM here; the comments don't make sense to me.
(The fd can still change if the connection is renewed in between.) I have to look through the history.

Could you try your application with the threadWaitRead branch, i.e. removing the #if block?

@velveteer
Author

> I also don't understand what we get by using threadWaitReadSTM here, the comments don't make sense to me. (The fd can still change if the connection is renewed in between). I have to look through the history.
>
> Could you try your application with threadWaitRead branch, i.e. removing the #if block

Indeed, avoiding threadWaitReadSTM altogether resolves the issue.

Here's a minimal reproduction: https://github.com/velveteer/postgresql-simple-leak-repro
