You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
xl2tpd kills random processes in addition to tearing down tunnels which are still active, and don't tear down tunnels that have in fact been shut down
#255
Open
jkroonza opened this issue
Feb 16, 2023
· 0 comments
· May be fixed by #261
xl2tpd doesn't properly clean up pppd processes that have terminated for individual tunnels and then at a later stage (when what I suspect is the control channel gets torn down) kills all the PIDs associated with the tunnels for a specific control connection, often taking down other legitimate processes (radiusd in this example) as collateral damage.
The remote side in this case is either a CISCO or Huawei router bringing PPPoE onto L2TP. The way I understand this is there is a single "control" association between each LAC (up to about a hundred in this case, but typically no more than 10 or so) and the LNS. Each such association can (and in this case does) then create multiple tunnels (ie, multiple ppp sessions between the same LAC and LNS pair). Total per server at any point in time there is about 250 pppd's running.
As the pppds terminate it seems xl2tpd doesn't properly clean up internal structures and tunnels.
Over time this results in xl2tpd keeping a list of pid's for no longer existing pppd daemons.
random processes getting killed (we had a number of rounds on mysqld and radiusd, so far fortunately nothing deadly critical like sshd or crond). We can explain why some processes (sshd/crond) are more "immune" than others but in short, their PID values are really, really long lived where others presumably changes IP addresses over time due to perhaps log rotation, or having been restarted for some other reason, and then obtaining an IP that some pppd process had at some point in the "recent" past. Thus when this happens xl2tpd starts sending TERM signals to PIDs that's no longer valid.
"Proof", from the radius logs (and similarly in mysqld logs when it got struck):
Thu Feb 16 10:19:29 2023 : Info: Signalled to terminate
Thu Feb 16 10:19:29 2023 : Info: Exiting normally
And correlating this to system logs, we found this:
Feb 16 10:19:29 kerberos xl2tpd[20651]: control_finish: Connection closed to 196.43.32.250, serial 0 (Terminated by PPPoE: LCP stopping)
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 27385
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 26738
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 23504
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 22899
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 22867
Feb 16 10:19:29 kerberos pppd[27385]: Terminating on signal 15
Feb 16 10:19:29 kerberos pppd[26738]: Terminating on signal 15
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 22835
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 13792
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 13362
Feb 16 10:19:29 kerberos xl2tpd[20651]: Terminating pppd: sending TERM signal to pid 32044
We see this every time after serial 0 is terminated, but never on other numbers. Without looking into protocol details this could be the "control tunnel". This could thus either be a bug on the remote side (Terminated by PPPoE: LCP stopping) doesn't make sense for me for a control channel. The reason is always LCP stopping on serial 0, and usually "Terminated by PPPoE: Received PPPoE PADT" on others, but very rarely also "Terminated by PPPoE: Received RADIUS disconnect message". The total counts just to give an idea of the frequency of the problem, for a period of just over 100 hours on one of the servers on which I'm troubleshooting this, 13 serial 0 LCP stoppings, 1 RADIUS and 6736 PADTs.
If you're aware of the current status in South Africa, it's perfectly conceivable that the LACs are going down due to load shedding, and thus why we're closing on serial 0. Normally I'd expect the LAC/LNS to only close the control connection if there are no other tunnels left, which in the specific scenario would almost never happen.
The fact that non-pppd's are being sent a TERM indicates that xl2tpd isn't cleaning up properly (further supported by #252) is a huge problem as in it could result in something really dangerous getting killed at some point.
To Reproduce
Set up xl2tpd to accept a LAC and permit multiple tunnels (config below). In our case we don't control the remote LAC side so I've no idea how to do this.
Have a few tunnels come up and down (notably ppp connections will usually be shut down from the remote end - not sure if this makes a difference or not).
arrange to have serial 0 be torn down with existing pppds (optionally) still running.
Note that PIDs that no longer belongs to a pppd process will be sent a SIGTERM (radiusd for example), by way of multiple of the PIDs that gets sent a TERM not logging a Terminating on signal 15 which is a log line produced by pppd.
Expected behavior
I expect xl2tpd to clean up it's internal structures properly such as to not:
Deplete it's IP pool (we're delegating now to avoid this issue).
Kill random other PIDs when control connections goes down.
l2tp detail:
xl2tpd 1.3.16
However this is the only two actual code changes from 1.3.16 to 1.3.18:
The first is commit "Fix issue where in some cases xl2tpd stops trying to connect", and doesn't affect this in any way that I can imagine.
The other is a control.c error which fixes write errors. I'm not sure if this one MIGHT make a difference. It might if this is somehow caused by the write() failing, but the old code would result in a "write returned LONGER than buffer", which I'm not seeing happening, so I don't think this relates either.
xl2tpd.conf
[global]
port = 1701
auth file = /etc/xl2tpd/l2tp-secrets
rand source = dev
listen-addr = 10.0.0.1
[lns fno]
lac = 10.1.0.0-10.1.0.127
local ip = 10.0.0.1
assign ip = no
require chap = yes
require pap = yes
require authentication = yes
pppoptfile = /etc/ppp/options.no
pass peer = no
[lns default]
exclusive = no
assign ip = no
pass peer = yes
pppoptfile = /etc/ppp/options.l2tpd
The latter is protected with IPSec, the fno LAC is not. We run xl2tpd without arguments
The text was updated successfully, but these errors were encountered:
Unsure the cause, but we found that upon ppp terminating xl2tpd would
only reap the PID, but not actually close the inner call, then at a
later stage would issue a kill() for that PID.
In our environment with high call turnover this would eventually result
in xl2tpd kill()'ing other critical services like mariadb and/or
syslog-ng which would upon reloads and rotations have a tendency to
re-use PIDs that were previously used by pppd processes.
We also believe that this should sort out the problem where IPs wouldn't
get cycled and re-used.
Closes: xelerance#252Closes: xelerance#255
Signed-off-by: Jaco Kroon <[email protected]>
Describe the bug
xl2tpd doesn't properly clean up pppd processes that have terminated for individual tunnels and then at a later stage (when what I suspect is the control channel gets torn down) kills all the PIDs associated with the tunnels for a specific control connection, often taking down other legitimate processes (radiusd in this example) as collateral damage.
The remote side in this case is either a CISCO or Huawei router bringing PPPoE onto L2TP. The way I understand this is there is a single "control" association between each LAC (up to about a hundred in this case, but typically no more than 10 or so) and the LNS. Each such association can (and in this case does) then create multiple tunnels (ie, multiple ppp sessions between the same LAC and LNS pair). Total per server at any point in time there is about 250 pppd's running.
As the pppds terminate it seems xl2tpd doesn't properly clean up internal structures and tunnels.
Over time this results in xl2tpd keeping a list of pid's for no longer existing pppd daemons.
This has a few side effects:
the first one we picked up: control_finish: Out of IP addresses on tunnel 13157! #252 where IP pools aren't cleaned up and eventually the daemon runs out of IPs to allocate. We've since delegated IP allocation to radius so control_finish: Out of IP addresses on tunnel 13157! #252 no longer affects us.
random processes getting killed (we had a number of rounds on mysqld and radiusd, so far fortunately nothing deadly critical like sshd or crond). We can explain why some processes (sshd/crond) are more "immune" than others but in short, their PID values are really, really long lived where others presumably changes IP addresses over time due to perhaps log rotation, or having been restarted for some other reason, and then obtaining an IP that some pppd process had at some point in the "recent" past. Thus when this happens xl2tpd starts sending TERM signals to PIDs that's no longer valid.
"Proof", from the radius logs (and similarly in mysqld logs when it got struck):
And correlating this to system logs, we found this:
We see this every time after serial 0 is terminated, but never on other numbers. Without looking into protocol details this could be the "control tunnel". This could thus either be a bug on the remote side (Terminated by PPPoE: LCP stopping) doesn't make sense for me for a control channel. The reason is always LCP stopping on serial 0, and usually "Terminated by PPPoE: Received PPPoE PADT" on others, but very rarely also "Terminated by PPPoE: Received RADIUS disconnect message". The total counts just to give an idea of the frequency of the problem, for a period of just over 100 hours on one of the servers on which I'm troubleshooting this, 13 serial 0 LCP stoppings, 1 RADIUS and 6736 PADTs.
If you're aware of the current status in South Africa, it's perfectly conceivable that the LACs are going down due to load shedding, and thus why we're closing on serial 0. Normally I'd expect the LAC/LNS to only close the control connection if there are no other tunnels left, which in the specific scenario would almost never happen.
The fact that non-pppd's are being sent a TERM indicates that xl2tpd isn't cleaning up properly (further supported by #252) is a huge problem as in it could result in something really dangerous getting killed at some point.
To Reproduce
Expected behavior
I expect xl2tpd to clean up it's internal structures properly such as to not:
l2tp detail:
However this is the only two actual code changes from 1.3.16 to 1.3.18:
The first is commit "Fix issue where in some cases xl2tpd stops trying to connect", and doesn't affect this in any way that I can imagine.
The other is a control.c error which fixes write errors. I'm not sure if this one MIGHT make a difference. It might if this is somehow caused by the write() failing, but the old code would result in a "write returned LONGER than buffer", which I'm not seeing happening, so I don't think this relates either.
xl2tpd.conf
The latter is protected with IPSec, the fno LAC is not. We run xl2tpd without arguments
The text was updated successfully, but these errors were encountered: