Waiting for tunxxx to become free #18
Thanks for the report! It's probably the same as #17
Just had an instance of #16, where all processes died.
Of the 10 running processes, 8 were restarted by systemd just fine. Two of them are in an uninterruptible sleep (D), and those are the two processes where the warning is logged.
These are the last messages of a broken server:
It's very possible that this issue was solved by the latest changes we implemented. Any chance you could try to reproduce this one again with openvpn-2.6.2 and the latest master for ovpn-dco? Thanks a lot
This is still happening with 2.6.2 and the new DCO.
Happened on openvpn stop; the process is in D state and unkillable.
@bernhardschmidt thanks for reporting. The process is probably stuck on the iface-delete call, which won't return because the networking stack is busy waiting for tunxxx to be released, which probably is not going to happen. This is due to some refcounting imbalance... Did it happen with TCP or UDP?
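For readers unfamiliar with the symptom, here is a minimal sketch (purely for illustration, not taken from ovpn-dco) of how a single unbalanced netdev reference produces exactly this hang: any dev_hold() without a matching dev_put() keeps the interface's usage count above zero, so unregistering tunX waits forever.

```c
/* Illustrative sketch only -- not ovpn-dco code. A leaked reference
 * on the net_device is enough to make interface deletion hang with
 * "unregister_netdevice: waiting for tunX to become free". */
#include <linux/netdevice.h>

static void buggy_rx_path(struct net_device *dev, bool error)
{
	dev_hold(dev);		/* usage count goes up */

	if (error)
		return;		/* BUG: early return, dev_put() never runs */

	/* ... process the packet ... */

	dev_put(dev);		/* only reached on the success path */
}
```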
Good catch, the two hanging processes (out of 10) are the ones listening to TCP.
I'm also encountering the issue with the hanging interface.
After that, systemd was complaining:
Subsequently, I wasn't able to start it again, since:
Looking at kern.log, I found:
@mr-liusg this is not the same issue. The messages you see are about incoming traffic for a peer that is not known.
I can reproduce this issue reliably on my machine:
I added some dmesg logs: one from a successful run and one from a failure.
Oh, this is exciting news. Especially having instructions on how to reproduce this is really, really good. (My testing so far focused on heavy client connect/disconnect churn, which did trigger a few issues :-) - but yours seems to be a race condition between "(heavy) traffic inside the VPN" and "peer being torn down", which I did not specifically test yet.)
Yes, just one client is enough to reproduce.
Super exciting! Will give it a try today!
Is it related to
The refcount.log is the dmesg log. |
Most likely yes. There must be some non-trivial imbalance that is triggered by a race condition...
This is not an issue per se: when _hold() is called and the refcount is 0, the reference is simply not obtained because the peer is considered gone. This is why there is no put. The refcount in this case stays at zero because the object has reached its end of life.
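As a minimal sketch of that pattern (assuming the hold helper is built on the usual "get unless zero" primitive; struct layout and names here are illustrative, not the actual ovpn-dco definitions):

```c
/* Illustrative sketch of the "hold may fail" pattern described above;
 * names and layout are assumptions, not the real ovpn-dco code. */
#include <linux/kref.h>

struct peer {
	struct kref refcount;
	/* ... */
};

static bool peer_hold(struct peer *peer)
{
	/* Returns false once the refcount has already dropped to zero:
	 * the peer is at end of life, no reference is taken, and thus
	 * no matching put is required. */
	return kref_get_unless_zero(&peer->refcount);
}
```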
Mhh... we have an ovpn_peer_hold() in ovpn_tcp_read_sock() where we don't check the return value, which is what is happening later in your log at line 5739 and is indeed confirmed by line 5758:
Now the thing is: we have that _hold() without a return value check because we assume that at that point in time we must already be holding a ref to that peer. We should not be there if we don't hold a reference.
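For illustration, a defensive version of such a call site could bail out when the hold fails. This is a sketch only: ovpn_peer_put() and the surrounding context are assumptions, not the actual ovpn_tcp_read_sock() body.

```c
/* Illustrative only -- assumes the project's ovpn_peer_hold()/
 * ovpn_peer_put() helpers and its peer/skb types. */
static void rx_one_skb(struct ovpn_peer *peer, struct sk_buff *skb)
{
	if (!ovpn_peer_hold(peer)) {
		kfree_skb(skb);	/* peer is going away, drop the packet */
		return;
	}

	/* ... queue or decrypt the skb while holding the reference ... */

	ovpn_peer_put(peer);	/* balance the hold taken above */
}
```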
The platform I'm using above is a router with an MT7621A CPU. I just tried testing the same way on an x64 VM but haven't reproduced the issue yet... :-(
Still, the hints you provided point us towards a more specific direction. If I am unable to reproduce it as well, I may set up an old MT7621-based router and give it a shot.
Interestingly enough, on my virtual setup I can reproduce the "Waiting for tunX to become free" issue by following your steps, but I don't see the underflow.
I tried moving
Apply the following patch and set a shorter
This is racy: what guarantee do we have that, between scheduling ovpn_decrypt_work() and executing it, the peer isn't released, leaving us with a stale peer/timer pointer?
The statement above is exactly the reason why the reference is obtained before scheduling the worker: we want to be sure that the peer stays alive the whole time.
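A minimal sketch of that hold-before-schedule pattern, assuming per-peer rx_queue and decrypt_work fields and the ovpn_peer_hold()/ovpn_peer_put() helpers (field names are illustrative, not the exact ovpn-dco code):

```c
/* Illustrative sketch: one hold per queued skb, one put per skb the
 * worker consumes, so the peer cannot be freed under the worker. */
#include <linux/skbuff.h>
#include <linux/workqueue.h>

static void queue_for_decrypt(struct ovpn_peer *peer, struct sk_buff *skb)
{
	if (!ovpn_peer_hold(peer)) {
		kfree_skb(skb);		/* peer already dying, drop it */
		return;
	}
	skb_queue_tail(&peer->rx_queue, skb);
	schedule_work(&peer->decrypt_work);
}

static void ovpn_decrypt_work(struct work_struct *work)
{
	struct ovpn_peer *peer = container_of(work, struct ovpn_peer,
					      decrypt_work);
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&peer->rx_queue)) != NULL) {
		/* ... decrypt and deliver skb ... */
		ovpn_peer_put(peer);	/* one put per hold from the RX path */
	}
}
```

The key property is that the reference taken in the RX path is only dropped by the worker after it has consumed the corresponding skb, so the peer is guaranteed to outlive the scheduled work.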
This is indeed a problem. If only modify
Thanks for all these hints. They are very helpful!
The master branch contains what we believe to be a fix for this issue. @claw6148 would you be able to give it a try?
Wow, tested several times and no hangs, yet it still prints:
I think that's the problem reported in #29? Would you mind appending your log there? Thanks!
So I'd say this issue can be closed. Thank you all for the feedback provided so far!
As far as I understand, this is already known?
When stopping an OpenVPN instance, the kernel log gets messages like