Memory not released in channel read (maybe :) ) #174
Comments
Here is the bit of code I am using to call the driver:
Note that I always call
woof. that's not good :) yea lemme try to see if I can reproduce and see what's up. maybe you can try to just
I tried setting b to nil in channel.read, but it didn't solve the issue. I think maybe because
(For the sake of completeness, I think rb should also be set to nil in the driver: https://github.com/scrapli/scrapligo/compare/main...netixx:scrapligo:fix/b-memory-usage?expand=1)
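For reference, here is a minimal sketch of the idea being discussed (not the actual scrapligo code; `queue`, `readLoop`, and the callback are illustrative stand-ins): after handing the accumulated buffer to the queue, the read loop drops its own reference so only the queue keeps the data alive.

```go
// Sketch of the "set b to nil after Enqueue" idea, with made-up names.
package main

import "fmt"

type queue struct {
	items [][]byte
}

func (q *queue) Enqueue(b []byte) {
	q.items = append(q.items, b)
}

func readLoop(q *queue, next func() ([]byte, bool)) {
	var b []byte

	for {
		chunk, ok := next()
		if !ok {
			return
		}

		b = append(b, chunk...)
		q.Enqueue(b)

		// Hypothetical fix discussed above: drop this loop's reference so
		// only the queue keeps the backing array reachable.
		b = nil
	}
}

func main() {
	calls := 0
	q := &queue{}

	readLoop(q, func() ([]byte, bool) {
		calls++
		if calls > 3 {
			return nil, false
		}
		return []byte("some device output\n"), true
	})

	fmt.Println("queued chunks:", len(q.items))
}
```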
you've got good timing - was just going to report that the nil didn't change anything for my testing too! but I also didn't seem to really reproduce things either. running for 10m getting config every 1s I ended up w/ ~0.3MB usage -- when I tested shorter durations it was about the same as well, which seems generally "good" but confusing in this context I think. will check out the rb/replace all stuff now as well.

in both graphs there is like > 10 minutes where memory is totally chill. is there nothing going on in that part? what happens at that 10 minute mark? is there anything else in this container?

Also, seems like in my case the memory is getting chewed up in the regex bits (which feels reasonable) but only in the NETCONF driver read, not in the channel bits.
There are indeed other things in the container, mainly a web server (gin). The app receives queries every 3 hours; once the queries are handled, it more or less sits there waiting for the next few hours (I have another app that does a refresh every 3 hours for inventory purposes). Which means that a major part of the memory is being used by the read function (but really by the bytes.Replace function), and I expect that memory to be released once the queries are done, especially since I do a conn.Close() after each query. I built scrapligo with
This one caught my attention:
I don't see any
I was also thinking that this could be an issue: https://stackoverflow.com/questions/34436863/can-this-type-of-golang-string-slicing-leak-memory-in-underlying-byte-array. But since we expect to totally clear the
By the way, it does run OOM:
even though I made a small test case here: https://github.com/netixx/scrapligo/blob/fix/b-memory-usage/channel/read_test.go
We can see that the number of allocated objects is increasing at each call (not 100% sure what it shows yet, maybe a benchmark with -benchmem would be better, but it shows allocs and not actual memory in use).
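On the StackOverflow link above: the concern is that a small slice taken from a large read buffer keeps the whole backing array reachable. A small self-contained illustration of that effect (the sizes and function names are made up for the example):

```go
// Illustration of sub-slices retaining a large backing array.
package main

import "fmt"

func header(payload []byte) []byte {
	// This 8-byte slice still points into payload's ~10 MB backing array,
	// so the GC cannot reclaim it while the header is referenced.
	return payload[:8]
}

func headerCopy(payload []byte) []byte {
	// Copying detaches the result from the large backing array.
	h := make([]byte, 8)
	copy(h, payload[:8])
	return h
}

func main() {
	payload := make([]byte, 10<<20) // pretend this came from a device read
	h1 := header(payload)
	h2 := headerCopy(payload)

	fmt.Println(len(h1), cap(h1)) // cap is still ~10 MB
	fmt.Println(len(h2), cap(h2)) // cap is 8
}
```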
@netixx what Go version are you building this with?
I am just throwing things at the problem now, but maybe also run https://github.com/uber-go/goleak to see if there are goroutines that are leaked?
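For anyone following along, wiring in goleak is only a few lines. A sketch, assuming the scenario above (open a driver, run a query, close it) hidden behind a placeholder `openAndClose` helper; the package name is arbitrary:

```go
// Sketch of using go.uber.org/goleak to catch leaked goroutines in tests.
package leaktest

import (
	"testing"

	"go.uber.org/goleak"
)

func TestMain(m *testing.M) {
	// Fails the whole package if any unexpected goroutines survive the tests.
	goleak.VerifyTestMain(m)
}

func TestNoLeakedGoroutines(t *testing.T) {
	// Per-test variant: verify nothing is left running when this test ends.
	defer goleak.VerifyNone(t)

	openAndClose(t)
}

// openAndClose is a placeholder for opening the driver, running a query, and
// closing the connection.
func openAndClose(t *testing.T) {
	t.Helper()
}
```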
I tried, there does not seem to be any. I added heap profile dumps before and after running the test, and used
Actually, I couldn't find anything wrong with channel/read: I ran a heap profile dump before and after, compared the two, and there was no diff... So I tested driver/netconf/getconfig, and there I could see additional heap data allocated after the test was finished. However, once I added the defer clause here:
I couldn't see any inuse space difference anymore. So it leads me to think that my code maybe doesn't always call
looks good to me. Or should I call
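For reference, one way to take the before/after heap dumps described a couple of comments up, and then compare them with `go tool pprof -diff_base before.pprof after.pprof`. The helper and test names here are illustrative, not the actual test from the branch:

```go
// Sketch: write heap profiles before and after the code under test.
package heapdump

import (
	"os"
	"runtime"
	"runtime/pprof"
	"testing"
)

func writeHeapProfile(t *testing.T, name string) {
	t.Helper()

	runtime.GC() // force a collection so the in-use numbers are current

	f, err := os.Create(name)
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()

	if err := pprof.WriteHeapProfile(f); err != nil {
		t.Fatal(err)
	}
}

func TestHeapGrowth(t *testing.T) {
	writeHeapProfile(t, "before.pprof")

	// ... run the get-config calls under test here ...

	writeHeapProfile(t, "after.pprof")
}
```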
The defer won't help if you return the error before registering it. So close explicitly on the error path (a proper return) instead of relying on a defer when you have to close even when an error is present
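A minimal, self-contained illustration of that point, with generic names (this is not the actual scrapligo driver): if Open partially succeeds and then fails, a `defer d.Close()` placed after the error check is never registered, so the transport stays open unless you close it on the error path.

```go
// Close-on-error pattern: explicit Close in the error path, defer afterwards.
package main

import (
	"errors"
	"fmt"
)

// driver is a stand-in with just enough behavior to show the pattern.
type driver struct {
	opened bool
}

func (d *driver) Open() error {
	d.opened = true // pretend the transport started before the failure
	return errors.New("on-open step failed")
}

func (d *driver) Close() error {
	d.opened = false
	return nil
}

func (d *driver) GetConfig(source string) (string, error) {
	return "<config/>", nil
}

func getConfig(d *driver) (string, error) {
	if err := d.Open(); err != nil {
		// Open may have started a transport before failing; close explicitly
		// here, because the defer below was never registered.
		_ = d.Close()
		return "", err
	}
	defer d.Close()

	return d.GetConfig("running")
}

func main() {
	d := &driver{}
	if _, err := getConfig(d); err != nil {
		fmt.Println("open failed, transport closed anyway:", !d.opened)
	}
}
```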
on this last part (the err/close bits) I think there is defo a possibility (at the moment) of the transport being left open there. easy fix, and in fact we sorta handled this in channel but not well enough I think :)... and it needs to happen in all the driver places too as there is extra stuff that happens after opening channel/transport (on open funcs mostly) where we can fail, and a defer'd close call would probably never get hit like in this case.

do we know if this last bit is related to the main issue though? does the app just fail to open a few times and then leave dangling connections, or is this unrelated/just a new fun thing we found here :) ?
It's my assumption that
Regarding the origin of the issue:
A couple of errors I get (from the logs) are
If it can help, I captured goroutine usage while the app was sleeping (in between queries),
but it's only a small number, and those:
(that's 38/105 goroutines, so it seems more relevant).
ah, sorry, sorry, @netixx
ok nice. thank you so much for all this info! this is pretty interesting stuff :D (though I wish it "just worked" instead!) I just pushed this which I think won't "fix" things but cleans up some other stuff that all this brought to light.

a possible fix that I hate but may be the easiest/best fix is to use standard transport. system transport blocks forever on reads (fd.Read()) -- so we have a force close that ignores the lock around it and closes the fd, but I don't think that actually stops the read, because I think the file descriptor doesn't support "deadlines" (docs say that when you close an os.File, deadlines are set if supported and the act of closing will auto-stop any in-flight reads, but it doesn't seem that that happens for us). so, my hope/thought is/was that once the connection and transport go out of scope they'd get gc'd, but I'm not sure if that actually happens.

so... testing w/ standard would be good if possible since that could "just work" -- then that would confirm that system is for sure the problem and maybe give you a good path forward if that works for you. longer term it would be nice to fix system of course if this is really an issue -- though I'm not sure it is? not ideal certainly, but is one (or even many) reads that have nothing to read a problem? I think the answer is obviously "yes" it's not great, and if we can fix it of course we should, but I'm just unsure if it's actually impactful/meaningful.
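On the "deadlines" point: whether a given `*os.File` supports read deadlines can be probed directly, since `SetReadDeadline` returns `os.ErrNoDeadline` when it can't. The pipe/temp-file pair below is only there to show the call pattern; the pty fd used by the system transport is the case in question and may behave differently.

```go
// Probe whether a descriptor supports read deadlines.
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// supportsDeadlines reports whether read deadlines work on this descriptor;
// when they don't, closing the file is not guaranteed to unblock a Read.
func supportsDeadlines(f *os.File) bool {
	err := f.SetReadDeadline(time.Time{})
	return !errors.Is(err, os.ErrNoDeadline)
}

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}
	defer r.Close()
	defer w.Close()

	f, err := os.CreateTemp("", "deadline-probe")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	fmt.Println("pipe supports deadlines:        ", supportsDeadlines(r))
	fmt.Println("regular file supports deadlines:", supportsDeadlines(f))
}
```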
I will test your fix right away and report back. At least if we kill the transport (even if the channel remains stuck), then we resolve the memory issue (but we still have a goroutine leak).

The reason we use the system transport is that we need to go through a custom jumphost (i.e. not only ssh forwarding) https://github.com/ovh/the-bastion. It can definitely work with
Can we bring our own transport to Scrapligo? I was thinking something along the lines of
Side note:
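What netixx had in mind was elided above, but purely as an illustration of the "bring your own transport" idea, a user-pluggable transport contract might look something like the sketch below. This is NOT scrapligo's actual API; every name here is hypothetical.

```go
// Hypothetical sketch of a pluggable transport contract -- not scrapligo's API.
package transport

import "time"

// Implementation is a made-up minimal contract a custom transport (for
// example one that dials through a bastion) would have to satisfy.
type Implementation interface {
	Open() error
	Close() error
	IsAlive() bool
	Read(n int) ([]byte, error)
	Write(b []byte) error
}

// Args is a made-up bundle of connection settings a custom transport might
// receive from the driver.
type Args struct {
	Host    string
	Port    int
	User    string
	Timeout time.Duration
}
```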
The change you proposed didn't fix my issue, I still have goroutines running away with their RAM :)

I did a little bit more digging, and it looks like the pty lib has identified this issue: creack/pty#167; however they have reverted the fix for now (see creack/pty#174), so I don't know if we'll find a solution there... But I implemented a test to reproduce the issue (and I think I succeeded), and following that I tried the following fix: https://github.com/netixx/scrapligo/blob/4020535678b0f0158d8b1f149020c28bd5f9ec42/transport/system.go#L211, that is, killing the SSH command when closing the transport (my assumption is that this in turn will shut down everything). Sad thing though is that I still have a lot of goroutines stuck on the read (and memory not released)...
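A rough sketch of the approach in the linked fix, with illustrative names (this is not the actual scrapligo system transport): when the transport is closed, kill the underlying ssh process so a blocked pty read cannot keep everything alive, then reap it.

```go
// Kill the spawned ssh process when the transport is closed.
package main

import (
	"fmt"
	"os/exec"
)

type systemTransport struct {
	cmd *exec.Cmd
}

func (t *systemTransport) open() error {
	// The target host is just an example.
	t.cmd = exec.Command("ssh", "router.example.com")
	return t.cmd.Start()
}

func (t *systemTransport) close() error {
	if t.cmd == nil {
		return nil
	}

	if t.cmd.Process != nil {
		// Kill is safe even if the ssh process has already exited; it just
		// returns an error we can ignore or log.
		_ = t.cmd.Process.Kill()
	}

	// Reap the process so it does not linger as a zombie.
	_ = t.cmd.Wait()

	return nil
}

func main() {
	t := &systemTransport{}
	if err := t.open(); err != nil {
		fmt.Println("open failed:", err)
		return
	}
	fmt.Println("closed:", t.close())
}
```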
heh, yeah didn't figure that branch would fix this, but still, some hopefully-useful fixes came from this! :D killing it violently like this seems like an obvious thing to do now that you say it! funny how that works eh? hah. ok cool, lemme cherry-pick that and dump it on the other branch if that works for ya!?

then to recap where we are at this point: we will now (with your fix!) nicely close the session, so that is good. but we still have goroutines hanging somehow/somewhere. I guess I would assume/hope/expect that would free the reads of the transport but... guess not?! I'll read those creack issues too.

thanks so much @netixx you are the best (sorry roman, new champ in town)
copy! will check this out more tomorrow probably and check out updated branch!
My guess is that the goroutine is stuck here:
which is consistent with the log message I am getting, because the channel needs to be read by
I just realised that I didn't base my fix on your branch (must have fumbled my git commands). I will try a branch with both fixes (that should solve this IMO).
Ok, so I was able to test a combination of my "fix" and your branch, and I can report that I no longer have goroutines stuck in read (neither transport, nor channel). https://github.com/netixx/scrapligo/tree/test/system-transport-close-plus
I still have a couple of goroutines stuck here:
That's this piece of code:
I guess it needs to stop immediately when the driver is closed (and not block on done <- data)?
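A sketch of that non-blocking shutdown idea: instead of an unconditional `done <- data`, select on the result channel and a stop signal so the goroutine exits as soon as the driver is closed. All names here are illustrative, not the actual scrapligo code.

```go
// Select on a stop channel so the sender can bail out on close.
package main

import (
	"fmt"
	"time"
)

func fetch(done chan<- string, stop <-chan struct{}) {
	data := "rpc-reply" // pretend this came from the channel read

	select {
	case done <- data:
		// normal path: a reader picked up the result
	case <-stop:
		// driver was closed; give up instead of blocking forever
	}
}

func main() {
	done := make(chan string)
	stop := make(chan struct{})

	go fetch(done, stop)

	// Simulate the driver being closed before anyone reads from done.
	close(stop)
	time.Sleep(100 * time.Millisecond)

	fmt.Println("goroutine released without a reader on done")
}
```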
nice!! ok great -- I think I can get that one sorted today/tomorrow too. is that branch I cherry-picked your previous fix on up to date with all your fixes? or can ya make a PR to that to add whatever else it's missing maybe? then I'll fix this message id one and we should be good to go!
System transport can get stuck on the Read call because of pty implementation details (see scrapli#174 for more details). This makes sure that we kill the SSH process when the transport is closed, which in turn makes sure that the rest of the components (channel, drivers) can shut down properly. I also added a test simulating a blocking (stuck) SSH server and a non-blocking SSH server (the non-blocking one showed that it's fine to kill an already defunct process).
Signed-off-by: Francois Espinet <[email protected]>
This is just my test branch for reference. I tested the Kill stuff, and it's safe even if the SSH process is already dead. I added a test for that.
Here is the PR on top of your existing fixes: #177
There is an issue with the crypto/ssh version in go.mod.
awesome! yeah there is some reason I pinned back to old crypto... I don't remember what it is though :D oh look at my elaborate notes in go.mod haha:
helpful past carl... will check all this out and get stuff merged today/tomorrow and will cut new release once we are happy too. thanks so much for all the work!!
You're welcome, glad to do it to support the best GitHub network project ❤️
I merged all our combined stuff into main so I think we solved it!! huge thanks to y'all for this work!
Hello,
I am running a REST API that runs netconf in the backend.
I am investigating an issue where my app does not release a big chunk of its memory.
I used profiling to identify what is holding the memory, and I got the following:
It seems that the read function is holding a lot of memory, even though there are no more queries being processed. I am not sure why it doesn't get cleaned up by the garbage collector though. Profiling suggests that it's somewhere around here: scrapligo/channel/read.go, line 43 in ed01781.
My first guess is that after c.Q.Enqueue(b), b should be wiped? Do you have any idea why?
Would you have some time to troubleshoot together?
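For a long-running REST service like the one described above, one common way to get the kind of heap profile mentioned here is Go's built-in net/http/pprof handlers on a side port; the port below is just an example.

```go
// Expose pprof endpoints alongside the main API.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// Profiling endpoint, separate from the main API listener.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the actual REST API here ...
	select {}
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/heap` grabs a heap profile from the running service.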