freeswitch randomly overwrites files that it has open in r/w mode #2283
Comments
A core.db SQLite database that has not been overwritten should start with the standard SQLite header, as shown by $ od -X /dev/shm/core.db | head -1
Oh, and this is with sofia-sip-1.13.16-1.el9.x86_64, but the problem is also present in 1.13.9 and 1.13.14, and maybe in versions going back to 1.13.0 (I've gone 1.13.0, then .3, .6, .9, .13, .14 and now .16).
Thank you for the details. We will double check.
Thanks. FYI, I have attached the current fairly nasty patch I have been using to try to debug these issues. I've put a bodge in there for now that detects when the fd is pointing to a file whose target starts with / and skips the SSL_write, so the core.db overwrite is now bypassed but not fixed. That patch is very noisy but might help in understanding the issues. I've also put in a few extra checks that avoid it hanging in SSL_read and SSL_write, to try to make the service more usable. I suspect the real issue is higher up the tree, perhaps in tport_type_ws.c, as that's the guy that calls ws_init() with what should be a socket, but I have only briefly looked at that code, so it could be elsewhere.
I do have a 40MB log file from that debug code available, if it would be useful. It contains IP addresses, so it's not something I can share publicly.
We need to find out where file descriptors are closed by mistake, so that two different things then use them afterwards. WS keeps using the fd thinking the descriptor belongs to WS, but after the close it was actually reopened by SQLite.
Yeah, but I think it's more than the sqlite fd that's affected. I have 19,000 log entries that hit the debug message that says "readlink failed" vs 110k that write "WS ws_raw_write fd %d target %s\n" and are pointing to open fds for sockets/files. I did see a bunch of places where ws.c calls ws_close() directly, and I can see that when it returns to tport_type_ws.c, that then calls ws_destroy(), which calls ws_close() again. That looks a bit odd, but I'm not sure if it's related.
If ever there was a reason to make sure you never run freeswitch as root... |
I use the in-memory db in FreeSWITCH and I never hit a problem like this. I'm not using WSS, btw. The issue might be caused by an SSL close like this one: someone said switching back to an earlier version might fix this problem, see: freeswitch/sofia-sip#212 (comment)
This issue is due to WSS so if you do not run it, you will not be hit by it. |
does it help if you revert freeswitch/sofia-sip@e77e70d |
I don't know, but I have a working patch now that I think addresses the real cause of the problem. I need to tidy it up and remove all the unnecessary debug from it, and then I will PR. I think there are multiple places in the code that cause it, due to SSL_ERROR_SYSCALL/SSL_ERROR_SSL not being handled and the code attempting to use the sockets that received them, even though the man page for both says "no further I/O operations should be performed on the connection and SSL_shutdown() must not be called". Since applying this latest patch at 05:36:00 this morning, none of my debug messages that look at /proc/self/fd/$socket have triggered for files, nor have they said "readlink failed", vs about 22,000 of them in the previous 24 hours or so.
@themsley-voiceflex Please see if that PR helps freeswitch/sofia-sip#233 |
Applied, rebuilt, tested. It does not seem to work at all: it starts up, then does nothing. Lots of connections, but I cannot make calls.
@themsley-voiceflex are you sure the patch was applied correctly? |
Yes, sure. I backed out all my changes:
git diff -R ae810c8872dee7547ed5e8443080d416ac0ba348 > /tmp/sofia-sip-ae810c8872dee7547ed5e8443080d416ac0ba348.patch
then applied it, rebuilt, restarted, and got almost no output from anything. The last thing it wrote to the console was:
2023-10-25 23:58:29.889824 100.00% [INFO] switch_core.c:2503 FreeSWITCH Started
2023-10-26 00:00:02.500027 99.40% [NOTICE] mod_logfile.c:217 New log started.
No registrations in that entire 8 minute period. Backed out to my latest version with the code from PR freeswitch/sofia-sip#231 and restarted:
2023-10-26 00:06:36.531855 100.00% [INFO] switch_core.c:2503 FreeSWITCH Started
I checked my logs (with the current PR231 applied) after restart and it's very rare that a whole minute goes by without a single registration, usually it's 2 or more, up to about a dozen, maybe 20 every minute. With PR233 on there were none at all for 8 mins. |
About the only thing I am not sure is correct in PR231 is https://github.com/freeswitch/sofia-sip/pull/231/files#diff-75fafb0c367c42d57236bcc76287aed6d582556d527057bdf103ef75d33e75d6R560. That doesn't mean PR231 is the correct way to do it, but it does seem to work.
@themsley-voiceflex Besides the problem reported in the subject of this issue, does FS accept registrations from your clients when you're running sofia on current master?
I will build and test later this evening. |
thanks, please rebuild this PR and re-test too freeswitch/sofia-sip#233 |
This does not seem to fix the problem. I built from a fresh git clone, then applied PR 233 to it, then added a patch so I could make sure it was OK by printing out the fds it would try to SSL_write to.
The idea there is that immediately before it hits SSL_write, we know what it's about to write to. If it comes back with "readlink failed", then that is a dead fd which we should never be touching. When it says socket: something, then it is at least a socket, though we can't say from this whether it is the right one or not. If it says anything starting with /, then it may be about to overwrite an open file. So far I've had
Thankfully the only / matches are on my debug message
That's SSL_ERROR_SSL. I don't think we should ever get a failed readlink right before the SSL_write just after the "us closes the connection" comment. If we do it will fail because the fd is not there. There are 497 of those.
What I observe is that when an SSL call returns either SSL_ERROR_SYSCALL or SSL_ERROR_SSL, then the socket that those errors reference is already gone. I think that when man SSL_get_error says "If this error occurs then no further I/O operations should be performed on the connection and SSL_shutdown() must not be called.", they really really mean no further I/O!
From my debugging of this so far, you can't rely on SSL_get_shutdown() telling you anything useful in the case of SSL_ERROR_SSL or _SYSCALL |
Looks like your clients abruptly and randomly terminate connections without sending a WSOC_CLOSE msg (not even speaking of the close_notify alert). You should investigate that; it's not normal, and it shouldn't happen at the level of occurrences you've given figures for. However, the server should be resilient to that nasty behavior, so I'll keep you updated. Thanks for the effort you've put into this so far, much appreciated.
My PR freeswitch/sofia-sip#231 fixes this completely, did you see it? |
@themsley-voiceflex I've just pushed to freeswitch/sofia-sip#233, please rebuild on it and let me know how it goes. |
That looks better so far. Added my debug code to the rebuild and I've seen no readlink failures at all which is hopeful. I do see that it went away for a very long time again as per #1934
so something is still a bit odd. I suspect this is a wait that I saw in ws_raw_read, and I added a
just prior to the SSL_read, as we shouldn't really need to read from something that has nothing pending and is also marked as down, surely? Maybe we get there because something else isn't being tested, though. I had a debug message in that code and it went through there 777 times with SSL_get_pending() = 0 and 7 times with it = 1. On those 7 occasions it did not wait in the SSL_read. I'll see if I can catch a gcore of that next time.
We always call SSL_read() on a nonblocking BIO right now, so that shouldn't be the case.
Caught it gone to sleep again - been about 15 mins so far.
Last thing it wrote to fs_cli:
Last things it wrote from my added debug - I'd guess all the SSL errors are from things that were waiting and gave up and walked off:
Can you show me the exact line 870? (You've added your dbg logs, so it does not correspond to the code in the repo.)
Line 870 is the SSL_shutdown.
Did it a second time while I was out, same line. Hmm, actually slightly different, looks like a different caller I think
Thanks, that was really useful. Pushed, please re-test. |
Rebuilding. While I wait, since you seem familiar with this code: is it meant to be multithreaded? If it is, then I think I need to raise another bug! Specifically, I see that thread_setup() uses CRYPTO_num_locks(), which has returned only 1 since openssl 1.1.1d (or thereabouts). Build finished, reinstalled and bounced.
Sofia is single-threaded.
OK, so that is in and running and I'll leave it there. I have an awk script that can detect gaps in the log, so that will tell me later if it's hung and recovered while I wasn't looking. Hello weekend...
This does appear to be much better but I think there is still a problem lurking. I set a script that watches
ws.c:870 is SSL_shutdown() 3 lines after the "us closes the connection" comment. I've changed the script interval to 240s now just to make sure that if it fires again it really is stuck. |
I suspected this could happen in theory, and it happens in real life; the OS is so nasty.
@themsley-voiceflex just pushed the patch. Please re-test. |
Patched and rebuilt & deployed. Watching. |
Current situation: I have icinga monitoring the webrtc port once a minute on this just to retrieve the SSL cert and that has been OK now for 18h+ (when the patched version was installed and freeswitch bounced). I also have a watch on freeswitch.log that tells me when it's not been written to for more than 8s and so far during working hours (09:00 GMT +) I've had a 10s response time 3 times. Both of those numbers are a lot better than they have been. |
Just pushed the fix to non block on ws handshake. Please pull and rebuild. |
Did so this morning; it's been in and running since 2023-11-01 10:43:32, and my log watcher has reported 2 instances of freeswitch.log not being written to for 10s. One of those I can see it was still alive for; for the other, at 12:55:20, it does appear that it went away for 10s. So far so good.
Should be fixed by freeswitch/sofia-sip#233 |
Describe the bug
Freeswitch randomly writes SSL data to files that it has open in r/w mode due to some sort of race condition.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
freeswitch does not do random things!
Package version or git hash
1.10.10
This is the same issue previously reported in #420 which was mistakenly closed as a SQLite problem.
I am 100% certain that it is not, for the following reasons:
we have the patched version of SQLite installed and in use
the bug in SQLite pointed to from core db gets corrupted #420 https://www.philipotoole.com/how-i-found-a-bug-in-sqlite/ is in a specific SQLite area that needs special action to invoke, and freeswitch does not use it. This means the bug will probably never affect freeswitch, so a SQLite upgrade is unnecessary. The SQLite bug as reported in that link affects only databases using this specific set-up, https://www.sqlite.org/inmemorydb.html, and to the best of my knowledge this way of using SQLite is not used by freeswitch. It is 100% definitely not used by our installation. And even if it was (which it isn't!), we have the patched version running.
our freeswitch instance is set up to use param name="core-db-name" value="/dev/shm/core.db", which is an ordinary file-based SQLite database. It just happens to be on a filesystem that is 'in memory', but that is not the same thing as a SQLite 'in-memory' database as described above. All these reasons rule out a SQLite problem.
In addition I have added debug code to libsofia-sip-ua/tport/ws.c in the ws_close() function, immediately prior to the SSL_write() that it uses. This debug code is
and when the bug is hit it shows the following information
Oct 18 07:50:47 fstrtc01.voiceflex.com stdbuf[389441]: WS ws_close fd 97 target /dev/shm/core.db
Oct 18 12:20:39 fstrtc01.voiceflex.com stdbuf[395088]: WS ws_close fd 63 target /dev/shm/core.db
Oct 18 19:20:46 fstrtc01.voiceflex.com stdbuf[403552]: WS ws_close fd 58 target /dev/shm/core.db
The SSL_write in ws_close() is issued immediately after that debug message and is writing SSL data to random files, as per the similar problem that Facebook engineers found and debugged in their code: https://engineering.fb.com/2014/08/12/ios/debugging-file-corruption-on-ios/
"The SSL layer was writing to a socket that was already closed and subsequently reassigned to our database file. "
and
"Using a hex analyzer, we found a common prefix across the attachments: 17 03 03 00 28"
Our overwrite is not identical but similar enough for it to be the same problem:
Only /dev/shm/core.db-1697615458 does not start with 0x1703030012 and that has 0x1703030013 at +5 into the file.
Within a few seconds of those debug messages being issued we start to get
2023-10-18 07:34:41.466490 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.179.208.228
2023-10-18 07:50:47.946480 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is not a database]
BEGIN EXCLUSIVE
2023-10-18 07:50:48.026471 99.60% [CRIT] switch_core_sqldb.c:2109 ERROR [file is not a database], [db="/dev/shm/core.db",type="core_db"]
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active]
COMMIT
2023-10-18 07:50:48.086485 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
There are identical messages in the logs for the core.db overwrites at 12:20:39 and 19:20:46
Once this database is overwritten in this way, freeswitch will refuse to start up, as the core.db.dsn database is corrupted and cannot be opened. It has to be deleted/renamed for freeswitch to start up again.
I also see hangs in freeswitch where it stops responding to connection attempts, and a gcore taken at the time shows that it is stuck in ws_close(), in the middle of the SSL_write() call. I suspect this is related, and that we are sending to a socket that is not expecting us to write to it (wild abandoned guess!), but it can be stuck there waiting, sometimes for hours. I suspect this is also involved in #1934, where people are reporting hangs when attempting to use WSS. This is from the most recent gcore, taken for fd 49, which hung from Oct 19 15:23:25 to 16:26:43, when it woke up again.