freeswitch randomly overwrites files that it has open in r/w mode #2283

Closed
themsley-voiceflex opened this issue Oct 19, 2023 · 48 comments
Labels: bug

@themsley-voiceflex commented Oct 19, 2023

Describe the bug
Freeswitch randomly writes SSL data to files that it has open in r/w mode due to some sort of race condition.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a freeswitch profile to use WSS webrtc
  2. Expose it to the internet and allow people to attempt to use it
  3. After some time freeswitch will overwrite random filehandles that it has open with SSL data
  4. FreeSWITCH crashes with sqlite "file is not a database" messages in the majority of cases, as the filehandle that is most often used and in r/w mode is the one for /dev/shm/core.db. This bug could potentially overwrite any file that is open in r/w mode, although the more often a file is opened the more likely it is to be a victim, which is probably why it seems to pick on /dev/shm/core.db.

Expected behavior
freeswitch does not do random things!

Package version or git hash
1.10.10

This is the same issue previously reported in #420 which was mistakenly closed as a SQLite problem.

I am 100% certain that it is not, for the following reasons:

  1. we have the patched version of SQLite installed and in use

  2. the SQLite bug pointed to from #420 (core db gets corrupted), https://www.philipotoole.com/how-i-found-a-bug-in-sqlite/, is in a specific SQLite area that needs special action to invoke, and freeswitch does not use this. This means the bug will probably never affect freeswitch, so a SQLite upgrade is unnecessary. The SQLite bug as reported in that link affects only databases using this specific set up https://www.sqlite.org/inmemorydb.html and to the best of my knowledge this way of using SQLite is not used by freeswitch. It is 100% definitely not used by our installation. And even if it was (which it isn't!) we have the patched version running.

  3. our freeswitch instance is set up to use param name="core-db-name" value="/dev/shm/core.db" which is an ordinary file based SQLite database. It just happens to be on a filesystem that is 'in memory' but that is not the same thing as a SQLite 'in memory' database as described above. All these reasons rule out a SQLite problem.

In addition I have added debug code to libsofia-sip-ua/tport/ws.c in the ws_close() function, immediately prior to the SSL_write() that it uses. This debug code is

                /* check if no fatal error occurs on connection */
                char procbuf[1024];
                char fnbuf[1024];
                ssize_t fnlen;

                sprintf(procbuf, "/proc/self/fd/%d", wsh->sock);
                if ((fnlen = readlink(procbuf, fnbuf, sizeof(fnbuf) - 1)) != -1) {
                        fnbuf[fnlen] = '\0';
                        printf("WS ws_close fd %d target %s\n", wsh->sock, fnbuf);
                } else {
                        printf("WS ws_close fd %d readlink failed\n", wsh->sock);
                }
                code = SSL_write(wsh->ssl, buf, 1);

and when the bug is hit it shows the following information

Oct 18 07:50:47 fstrtc01.voiceflex.com stdbuf[389441]: WS ws_close fd 97 target /dev/shm/core.db
Oct 18 12:20:39 fstrtc01.voiceflex.com stdbuf[395088]: WS ws_close fd 63 target /dev/shm/core.db
Oct 18 19:20:46 fstrtc01.voiceflex.com stdbuf[403552]: WS ws_close fd 58 target /dev/shm/core.db

The SSL_write in ws_close() is issued immediately after that debug message and is writing SSL data to random files, the same problem that Facebook engineers found and debugged in their code: https://engineering.fb.com/2014/08/12/ios/debugging-file-corruption-on-ios/

"The SSL layer was writing to a socket that was already closed and subsequently reassigned to our database file. "
and
"Using a hex analyzer, we found a common prefix across the attachments: 17 03 03 00 28"

Our overwrite is not identical but similar enough for it to be the same problem:

$ for dbfile in $(ls /dev/shm/core.db-*); do echo $dbfile; od -X $dbfile | head -1;done
/dev/shm/core.db-1697017847
0000000 00030317 2299b112 b8d06713 442802c5
/dev/shm/core.db-1697359841
0000000 00030317 10e68912 769c4a0c 587aac68
/dev/shm/core.db-1697457656
0000000 00030317 cdc06712 26b9d175 b91379ec
/dev/shm/core.db-1697474154
0000000 00030317 0298eb12 10f4a06d a38d8a70
/dev/shm/core.db-1697477136
0000000 00030317 39243212 f67342eb 4eedfd89
/dev/shm/core.db-1697537157
0000000 00030317 bbc1a212 0dd52644 ae8bd737
/dev/shm/core.db-1697547645
0000000 00030317 c42d2612 233c9109 1cb91ff7
/dev/shm/core.db-1697569857
0000000 00030317 9e3fe612 3b797957 df9c0fba
/dev/shm/core.db-1697604657
0000000 00030317 af282512 3c08939a ac980ecf
/dev/shm/core.db-1697615458
0000000 694c5153 03031774 dd7c1300 2223e3b0
/dev/shm/core.db-1697631649
0000000 00030317 21e31112 bad65ac4 f6609114
/dev/shm/core.db-1697656856
0000000 00030317 42088b12 b27bce2d f3ed235c

Only /dev/shm/core.db-1697615458 does not start with the byte sequence 17 03 03 00 12, and that one has 17 03 03 00 13 at +5 into the file.
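
For decoding those dumps: od -X prints little-endian 4-byte words on x86, so the word 00030317 is the byte sequence 17 03 03 00, the start of a TLS application_data record, matching the Facebook 17 03 03 00 28 prefix quoted above. The first line of the first file viewed byte-by-byte (derived from the same dump) is:

    $ od -A x -t x1 /dev/shm/core.db-1697017847 | head -1
    000000 17 03 03 00 12 b1 99 22 13 67 d0 b8 c5 02 28 44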

Within a few seconds of those debug messages being issued we start to get

2023-10-18 07:34:41.466490 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.179.208.228
2023-10-18 07:50:47.946480 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is not a database]
BEGIN EXCLUSIVE
2023-10-18 07:50:48.026471 99.60% [CRIT] switch_core_sqldb.c:2109 ERROR [file is not a database], [db="/dev/shm/core.db",type="core_db"]
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active]
COMMIT
2023-10-18 07:50:48.086485 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233

There are identical messages in the logs for the core.db overwrites at 12:20:39 and 19:20:46.

Once this database is overwritten in this way, freeswitch will refuse to start up, as the core.db.dsn database is corrupted and cannot be opened. It has to be deleted/renamed for freeswitch to start up again.

I also see hangs in freeswitch where it stops responding to connection attempts, and a gcore taken at the time shows that it is stuck in ws_close() in the middle of the SSL_write() call. I suspect this is related and we are sending to a socket that is not expecting us to write to it (wild abandoned guess!), but it can be stuck there waiting for sometimes hours. I suspect this is also involved in #1934 where people are reporting hangs when attempting to use WSS. This is from the most recent gcore, taken for fd 49, which hung from Oct 19 15:23:25 to 16:26:43 when it woke up again.

(gdb) thread apply 18 bt full

Thread 18 (Thread 0x7f078bfff640 (LWP 450637)):
#0  0x00007f07bd53e91c in read () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f07bccdd091 in sock_read () from /lib64/libcrypto.so.3
No symbol table info available.
#2  0x00007f07bcccd44b in bread_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3  0x00007f07bccd0445 in bio_read_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4  0x00007f07bccd05c7 in BIO_read () from /lib64/libcrypto.so.3
No symbol table info available.
#5  0x00007f07bd130e8c in ssl3_read_n.part () from /lib64/libssl.so.3
No symbol table info available.
#6  0x00007f07bd132af9 in ssl3_read_bytes () from /lib64/libssl.so.3
No symbol table info available.
#7  0x00007f07bd1425c0 in state_machine.part () from /lib64/libssl.so.3
No symbol table info available.
#8  0x00007f07bd13076e in ssl3_write_bytes () from /lib64/libssl.so.3
No symbol table info available.
#9  0x00007f07bd1136e7 in SSL_write () from /lib64/libssl.so.3
No symbol table info available.
#10 0x00007f07bd2a5340 in ws_close (wsh=wsh@entry=0x7f0781f71240, reason=reason@entry=0) at tport/ws.c:900
        ssl_error = 0
        buf = 0x7f07bd2d41b6 "0"
        procbuf = "/proc/self/fd/49\000\000\316\201\a\177\000\000\320\347\377\213\a\177\000\000\310\347\377\213\a\177\000\000\204\236(\275\a\177\000\000\220y2\200\a\177\000\000\300\240\061\275\a\177\000\000P\303\316\201\a\177\000\000\200\350\377\213\a\177\000\000\000\000\000\000\000\000\000\000֢(\275\a\177\000\000\377\377\377\177\000\000\000\000\064Z)\275\a\177\000\000tag=ZNS5\337Q-\275\a\177\000\000\377\377\377\377\377\377\377\377?B\017\000\000\000\000\000P\303\316\201\a\177\000\000\000\326\354\367\376AN\245\377\377\377\177\000\000\000\000P\303\316\201\a\177\000\000\000\000\000\000\000\000\000\000\365\022*\275\a\177\000\000P\303\316\201\a\177\000\000\000\067a\202\a\177"...
        fnbuf = "socket:[9955738]\000\000\000\000\000\000\000\000d\002\000\000\000\000\000\000\001", '\000' <repeats 15 times>, "\307\002\000\000\000\000\000\000\325\371J\275\a\177\000\000\236\304\333\350\000\000\000\000\273\360\b\000\000\000\000\000@\345\377\213\a\177\000\000@\345\377\213\a\177\000\000\340\vy\202\a\177\000\000\361\001*\275\a\177\000\000\001\000\000\000\000\000\000\000\240\344\377\213\a\177\000\000\060\067a\202\a\177\000\000\001\201\001\200\a\177\000\000\312\326,\275\a\177\000\000\020\202\001\200\a\177\000\000d\002\000\000\000\000\000\000`.Ё\000\000\000\004\000\nV\022C\000\000\000\000\003\000\n]_|\006", '\000' <repeats 25 times>...
        fnlen = <optimized out>
        code = 0
#11 0x00007f07bd2a5bd3 in ws_destroy (wsh=0x7f0781f71240) at tport/ws.c:836
No locals.
#12 0x00007f07bd2a5ca3 in tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:536
        wstp = <optimized out>
        wstp = <optimized out>
        __func__ = {<optimized out> <repeats 26 times>}
#13 tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:530
        wstp = 0x7f0781f71050
        __func__ = "tport_ws_deinit_secondary"
#14 0x00007f07bd294df8 in tport_zap_secondary (self=0x7f0781f71050) at tport/tport.c:1101
        mr = <optimized out>
        __func__ = "tport_zap_secondary"
#15 0x00007f07bd287ed8 in su_timer_expire (timers=timers@entry=0x7f0780000ba8, timeout=timeout@entry=0x7f078bffec50, now=...) at su/su_timer.c:587
        t = 0x7f07836cc9e0
        f = 0x7f07bd296ed0 <tport_secondary_timer>
        n = 42
        __PRETTY_FUNCTION__ = "su_timer_expire"
#16 0x00007f07bd288165 in su_base_port_run (self=0x7f0780000b60) at su/su_base_port.c:339
        now = {tv_sec = <optimized out>, tv_usec = <optimized out>}
        tout = 15000
        tout2 = 0
        __PRETTY_FUNCTION__ = "su_base_port_run"
#17 0x00007f07bd28b1f3 in su_pthread_port_clone_main (varg=0x7f07b916b580) at su/su_pthread_port.c:343
        arg = 0x0
        task = {{sut_port = 0x7f0780000b60, sut_root = 0x7f07800013d0}}
        zap = 1
#18 0x00007f07bd49f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#19 0x00007f07bd43f450 in clone3 () from /lib64/libc.so.6

themsley-voiceflex added the bug label Oct 19, 2023
@themsley-voiceflex (Author)

A core.db SQLite database that has not been overwritten should start with:

$ od -X /dev/shm/core.db | head -1
0000000 694c5153 66206574 616d726f 00332074
or, more usefully, in characters: "SQLite format 3".
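
As a quick way to spot this corruption, a minimal sketch (a hypothetical helper, not freeswitch code) that checks whether a file still carries the 16-byte SQLite magic:

    #include <stdio.h>
    #include <string.h>

    /* Returns 1 if the file begins with "SQLite format 3\0" (the 16-byte
     * SQLite magic), 0 if not (e.g. clobbered with TLS records), -1 on
     * read error. Illustrative only. */
    static int has_sqlite_magic(const char *path) {
        static const char magic[16] = "SQLite format 3";
        char hdr[16];
        size_t n;
        FILE *f = fopen(path, "rb");
        if (!f)
            return -1;
        n = fread(hdr, 1, sizeof(hdr), f);
        fclose(f);
        if (n != sizeof(hdr))
            return -1;
        return memcmp(hdr, magic, sizeof(magic)) == 0;
    }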

@themsley-voiceflex (Author) commented Oct 19, 2023

I'm reading through the issues backlog and I suspect that the following may be related to this
#2247
#2035
#1934
#1920
#1871 maybe
#1558
#1513 possible but less likely
#922
#892 maybe but looks unlikely
and of course #420

@themsley-voiceflex (Author) commented Oct 19, 2023

Oh, and we're using sofia-sip-1.13.16-1.el9.x86_64, but the problem is also present with 1.13.9 and 1.13.14, and maybe versions going back to 1.13.0 (I've gone 1.13.0, then .3, .6, .9, .13, .14 and now .16).

andywolk self-assigned this Oct 20, 2023

@andywolk (Contributor)

Thank you for the details. We will double check.

@themsley-voiceflex (Author)

Thanks. FYI I have attached the current fairly nasty patch I have been using to try to debug these issues. I've put a bodge in there for now that detects when the fd is pointing to a file that starts with / and avoids the SSL_write, so the core.db overwrite is now bypassed but not fixed. That patch is very noisy but might help to understand the issues. I've also put in a few extra checks that avoid it hanging in SSL_read and SSL_write, to try to make the service more usable.

I suspect the real issue is higher up the tree, perhaps in tport_type_ws.c, as that's the guy that calls ws_init() with what should be a socket, but I have only briefly looked at that code, so it could be elsewhere.
sofia-sip-ws-ipaddr.patch.txt

@themsley-voiceflex (Author)

I do have a 40MB log file from that debug code available if it would be useful? It contains ip addresses so it's not something I can share publicly.

@andywolk (Contributor)

We need to find out where file descriptors are closed by mistake, so that two different things use them after that. WS keeps using the fd thinking the descriptor still belongs to ws, but it was actually opened by sqlite after the close.
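
To illustrate the suspected mechanism with a standalone sketch (hedged, not freeswitch code): once close() releases an fd number, the kernel hands the same number to the next open(), so a stale write through the old number lands in the new file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int sv[2], stale, db;

        /* stands in for the WSS client socket */
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        stale = sv[0];

        close(stale);                   /* connection torn down... */
        /* ...then "sqlite" opens its db and gets the lowest free fd,
         * which is the number just released */
        db = open("demo.db", O_RDWR | O_CREAT, 0600);
        printf("stale fd %d, db fd %d\n", stale, db);

        if (db == stale)                /* WS still holds the old number */
            write(stale, "\x17\x03\x03\x00\x28", 5); /* "SSL data" hits demo.db */

        close(sv[1]);
        close(db);
        return 0;
    }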

@themsley-voiceflex (Author)

Yeah, but I think it's more than the sqlite fd that is affected. I have 19,000 log entries that hit the debug message that says "readlink failed" vs 110k that write "WS ws_raw_write fd %d target %s\n" and are pointing to open fds for sockets/files.

I did see a bunch of places where ws.c calls ws_close() directly, and I can see that when it returns, tport_type_ws.c then calls ws_destroy(), which calls ws_close() again. That looks a bit odd, but I'm not sure if it's related.

@themsley-voiceflex (Author)

[root@fstrtc01 ~]# journalctl -u freeswitch -S  "2023-10-20 03:42:33" | grep '/' | grep -v ill
Oct 20 03:59:09 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 112 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 04:29:11 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 112 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 04:31:11 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 112 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 04:37:11 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 116 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 04:39:11 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 118 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 04:47:12 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 118 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 05:59:15 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 142 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 08:08:07 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 214 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 09:04:45 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 305 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 09:04:45 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 226 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 09:04:45 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 171 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 09:04:45 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 305 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 09:04:50 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 304 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 09:18:41 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 297 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 09:18:41 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 206 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 09:18:47 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 266 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 10:58:29 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 141 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 11:15:53 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 196 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 11:53:32 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 155 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 12:13:10 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 256 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 12:13:10 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 250 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 12:41:14 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 257 target /opt/freeswitch/sounds/music/32000/bleep/DialtoneCallWaiting32.wav
Oct 20 12:41:14 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 271 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 14:43:49 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 212 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 15:11:19 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 146 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 15:11:19 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 231 target /opt/freeswitch/sounds/music/16000/bleep/DialtoneCallWaiting16.wav
Oct 20 15:28:38 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 179 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 15:31:30 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 215 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 15:31:30 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 167 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav
Oct 20 15:57:33 fstrtc01.voiceflex.com stdbuf[483339]: WS ws_close fd 222 target /opt/freeswitch/sounds/music/8000/bleep/DialtoneCallWaiting8.wav

@themsley-voiceflex (Author)

If ever there was a reason to make sure you never run freeswitch as root...
Oct 21 14:13:47 fstrtc01.voiceflex.com stdbuf[517990]: WS ws_close fd 118 target /etc/passwd

@seven1240 (Collaborator)

I use an in-memory db in FreeSWITCH and I have never hit a problem like this. I'm not using wss, btw.

The issue might be caused by the ssl close, like this one:

freeswitch/sofia-sip#212

Someone said switching back to an earlier version might fix this problem, see: freeswitch/sofia-sip#212 (comment)

@themsley-voiceflex (Author)

This issue is due to WSS, so if you do not run it, you will not be hit by it.

@andywolk (Contributor)

Does it help if you revert freeswitch/sofia-sip@e77e70d?

@themsley-voiceflex (Author) commented Oct 22, 2023

I don't know, but I have a working patch now that I think addresses the real cause of the problem. I need to tidy it up and remove all the unnecessary debug from it, and then I will PR. I think there are multiple places in the code that cause it, due to SSL_ERROR_SYSCALL/SSL_ERROR_SSL not being handled, with code then attempting to use the sockets that received them even though the man page for both says "no further I/O operations should be performed on the connection and SSL_shutdown() must not be called".

Since applying this latest patch at 05:36:00 this morning, none of my debug messages that look at /proc/self/fd/$socket have triggered for files, nor have they said "readlink failed", vs about 22,000 of them in the previous 24 hours or so.

@andywolk (Contributor)

@themsley-voiceflex Please see if that PR helps freeswitch/sofia-sip#233

@themsley-voiceflex (Author)

Applied, rebuilt, tested. Does not seem to work at all. Starts up, does nothing. Lots of connections, cannot make calls.

@andywolk (Contributor)

@themsley-voiceflex are you sure the patch was applied correctly?

@themsley-voiceflex (Author)

Yes, sure. I backed out all my changes

git diff -R ae810c8872dee7547ed5e8443080d416ac0ba348 > /tmp/sofia-sip-ae810c8872dee7547ed5e8443080d416ac0ba348.patch

then applied it, rebuilt, restarted, got almost no output from anything.

Last thing it wrote to the console was

2023-10-25 23:58:29.889824 100.00% [INFO] switch_core.c:2503
FreeSWITCH Version 1.10.10-release~64bit (-release 64bit)

FreeSWITCH Started
Max Sessions [4000]
Session Rate [4000]
SQL [Enabled]

2023-10-26 00:00:02.500027 99.40% [NOTICE] mod_logfile.c:217 New log started.
2023-10-26 00:01:01.740280 99.27% [INFO] switch_time.c:627 Clock synchronized to system time.
2023-10-26 00:02:29.200025 99.50% [ERR] switch_odbc.c:529 ERR: [update sip_registrations set ping_expires = 1698278578 where hostname='fstrtc01.voiceflex.com' and profile_name='internaltcp' and ping_expires <= 1698278548 ]
[STATE: 40001 CODE 1213 ERROR: [ma-3.1.12][5.5.68-MariaDB]Deadlock found when trying to get lock; try restarting transaction
]
2023-10-26 00:02:29.200025 99.50% [ERR] switch_core_sqldb.c:728 [db="freeswitch";type="odbc"user="nebulaip";pass="Nebula1p"] ODBC SQL ERR [STATE: 40001 CODE 1213 ERROR: [ma-3.1.12][5.5.68-MariaDB]Deadlock found when trying to get lock; try restarting transaction
]
update sip_registrations set ping_expires = 1698278578 where hostname='fstrtc01.voiceflex.com' and profile_name='internaltcp' and ping_expires <= 1698278548
2023-10-26 00:06:17.181955 95.47% [INFO] switch_core.c:2891 Shutting down

No registrations in that entire 8 minute period. Backed out to my latest version with the code from PR freeswitch/sofia-sip#231 and restarted

2023-10-26 00:06:36.531855 100.00% [INFO] switch_core.c:2503
FreeSWITCH Version 1.10.10-release~64bit (-release 64bit)

FreeSWITCH Started
Max Sessions [4000]
Session Rate [4000]
SQL [Enabled]
2023-10-26 00:06:36.584927 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.219.37.21
2023-10-26 00:06:36.602676 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.210.146.241
2023-10-26 00:06:36.621780 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.102.11.112
2023-10-26 00:06:36.681802 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.24.145.7
2023-10-26 00:06:36.881801 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.103.181.164
2023-10-26 00:06:36.981824 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.25.214.147
2023-10-26 00:06:37.181780 100.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.97.66.47

@themsley-voiceflex (Author)

I checked my logs (with the current PR231 applied) after restart, and it's very rare that a whole minute goes by without a single registration; usually it's 2 or more, up to about a dozen, maybe 20, every minute. With PR233 on there were none at all for 8 mins.

@themsley-voiceflex (Author)

About the only thing I am not sure is correct in PR231 is https://github.com/freeswitch/sofia-sip/pull/231/files#diff-75fafb0c367c42d57236bcc76287aed6d582556d527057bdf103ef75d33e75d6R560

That doesn't mean PR231 is the correct way to do it, but it does seem to work.

@jakubkarolczyk (Collaborator)

@themsley-voiceflex besides the problem reported in the subject of this issue, does FS accept registrations from your clients when you're running sofia on current master?

@themsley-voiceflex (Author)

I will build and test later this evening.

@jakubkarolczyk (Collaborator)

Thanks. Please rebuild with this PR and re-test too: freeswitch/sofia-sip#233

@themsley-voiceflex (Author)

This does not seem to fix the problem. I built from a fresh git clone, then applied PR 233 to it, then added a patch so I could make sure that it was OK by printing out the fds it would try to SSL_write to.

--- libsofia-sip-ua/tport/ws.c.orig     2023-10-26 20:04:05.030422865 +0000
+++ libsofia-sip-ua/tport/ws.c  2023-10-26 20:06:33.677116924 +0000
@@ -669,6 +669,7 @@
                        if (code < 0) {
                                int ssl_err = SSL_get_error(wsh->ssl, code);
                                if (!SSL_WANT_READ_WRITE(ssl_err)) {
+                                       printf("WS estab -1 code <0 ssl_err %d errno %s fd %d\n",ssl_err, ssl_err == SSL_ERROR_SYSCALL ? strerror(errno):"N/A", wsh->sock);
                                        wss_error(wsh, ssl_err, "establish_logical_layer: SSL_accept");
                                        return -1;
                                }
@@ -833,6 +834,17 @@

                /* peer closes the connection */
                if (SSL_get_shutdown(wsh->ssl) & SSL_RECEIVED_SHUTDOWN) {
+                       char procbuf[1024], fnbuf[1024];
+                       ssize_t fnlen;
+                       sprintf(procbuf,"/proc/self/fd/%d",wsh->sock);
+                       if ((fnlen = readlink(procbuf,fnbuf,sizeof(fnbuf)-1)) != -1) {
+                               fnbuf[fnlen] = '\0';
+                               printf("WS ws_raw_write rs fd %d target %s\n",wsh->sock,fnbuf);
+                               }
+                       else
+                               {
+                               printf("WS ws_raw_write rs fd %d readlink failed\n",wsh->sock);
+                               }
                        code = SSL_write(wsh->ssl, buf, 1);
                        ssl_error = SSL_get_error(wsh->ssl, code);
                        if (ssl_error == SSL_ERROR_SYSCALL || ssl_error == SSL_ERROR_SSL) {
@@ -844,6 +856,17 @@
                }

                /* us closes the connection. We do bidirection shutdown handshake */
+               char procbuf[1024], fnbuf[1024];
+               ssize_t fnlen;
+               sprintf(procbuf,"/proc/self/fd/%d",wsh->sock);
+               if ((fnlen = readlink(procbuf,fnbuf,sizeof(fnbuf)-1)) != -1) {
+                       fnbuf[fnlen] = '\0';
+                       printf("WS ws_raw_write ss fd %d target %s\n",wsh->sock,fnbuf);
+                       }
+               else
+                       {
+                       printf("WS ws_raw_write ss fd %d readlink failed\n",wsh->sock);
+                       }
                code = SSL_write(wsh->ssl, buf, 1);
                ssl_error = SSL_get_error(wsh->ssl, code);
                if (ssl_error == SSL_ERROR_SYSCALL || ssl_error == SSL_ERROR_SSL) {

The idea there is that immediately before it hits SSL_write we know what it's about to write to. If it comes back with "readlink failed" then that is a dead fd, which we should never be touching. When it says socket:something then it is at least a socket, though we can't say from this if it is the right one or not. If it says anything starting with / then it may be about to overwrite an open file.
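
The probe boils down to a helper like this (a sketch; fd_target is an illustrative name, not something in the sofia-sip tree, and /proc/self/fd is Linux-only):

    #include <stdio.h>
    #include <unistd.h>

    /* Resolve what an fd currently points at. Returns NULL if the fd is
     * closed (readlink fails); otherwise "socket:[123]", "/dev/shm/core.db",
     * "anon_inode:[eventpoll]", etc. */
    static const char *fd_target(int fd, char *buf, size_t buflen) {
        char link[64];
        ssize_t n;

        snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
        n = readlink(link, buf, buflen - 1);
        if (n < 0)
            return NULL;
        buf[n] = '\0';
        return buf;
    }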

So far I've had

# for pattern in 'ss fd.*readlink' 'rs fd.*readlink' 'rs.*socket' "ss.*socket" '/' 'ssl_err'; do echo $pattern; journalctl -u freeswitch -S "2023-10-26 20:13:38" | grep -c "$pattern" ;done
ss fd.*readlink
497
rs fd.*readlink
0
rs.*socket
30
ss.*socket
762
/
9
ssl_err
9

Thankfully the only / matches are on my debug message

Oct 26 20:13:47 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 120
Oct 26 20:13:47 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 120
Oct 26 20:13:47 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 138
Oct 26 20:13:48 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 158
Oct 26 20:13:48 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 159
Oct 26 20:19:36 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 215
Oct 26 20:28:57 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 179
Oct 26 20:36:09 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 206
Oct 26 20:40:24 fstrtc01.voiceflex.com stdbuf[827898]: WS estab -1 code <0 ssl_err 1 errno N/A fd 147

That's SSL_ERROR_SSL.

I don't think we should ever get a failed readlink right before the SSL_write just after the "us closes the connection" comment. If we do, the SSL_write will fail because the fd is not there. There are 497 of those.

Oct 26 20:13:48 fstrtc01.voiceflex.com stdbuf[827898]: WS ws_raw_write ss fd 162 readlink failed
Oct 26 20:13:50 fstrtc01.voiceflex.com stdbuf[827898]: WS ws_raw_write ss fd 151 readlink failed
Oct 26 20:13:50 fstrtc01.voiceflex.com stdbuf[827898]: WS ws_raw_write ss fd 149 readlink failed

What I observe is that when an SSL call returns either SSL_ERROR_SYSCALL or SSL_ERROR_SSL, the socket that those errors reference is already gone. I think that when man SSL_get_error says "If this error occurs then no further I/O operations should be performed on the connection and SSL_shutdown() must not be called.", they really, really mean no further I/O!
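
In code terms that rule amounts to something like this (a hedged sketch of the general pattern the man page prescribes, not the actual sofia-sip fix):

    #include <openssl/ssl.h>
    #include <unistd.h>

    static void teardown(SSL *ssl, int sock, int last_ssl_error) {
        if (last_ssl_error == SSL_ERROR_SYSCALL || last_ssl_error == SSL_ERROR_SSL) {
            /* man SSL_get_error: no further I/O and no SSL_shutdown();
             * free the state and close our own fd exactly once */
            SSL_free(ssl);
            close(sock);
            return;
        }
        SSL_shutdown(ssl);  /* orderly bidirectional close only on a healthy session */
        SSL_free(ssl);
        close(sock);
    }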

@themsley-voiceflex (Author)

From my debugging of this so far, you can't rely on SSL_get_shutdown() telling you anything useful in the case of SSL_ERROR_SSL or _SYSCALL.

@jakubkarolczyk (Collaborator) commented Oct 26, 2023

Looks like your clients abruptly and randomly terminate connections without sending a WSOC_CLOSE msg (not even speaking of the close_notify alert). You should investigate that; it's not normal, and it shouldn't happen at the level of occurrences you've given figures for. However, the server should be resilient to that nasty behavior, so I'll keep you updated. Thanks for the effort you've put into this so far, much appreciated.

@themsley-voiceflex (Author)

My PR freeswitch/sofia-sip#231 fixes this completely, did you see it?

@jakubkarolczyk (Collaborator)

@themsley-voiceflex I've just pushed to freeswitch/sofia-sip#233, please rebuild on it and let me know how it goes.

@themsley-voiceflex (Author) commented Oct 27, 2023

That looks better so far. I added my debug code to the rebuild and I've seen no readlink failures at all, which is hopeful. I do see that it went away for a very long time again, as per #1934:

2023-10-27 20:16:25.311821 98.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.40.169.175
2023-10-27 20:16:26.591812 98.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.222.32.14
2023-10-27 20:16:27.251830 98.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.37.64.179
2023-10-27 20:32:27.911831 99.00% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.132.237.49

so something is still a bit odd. I suspect this is a wait that I saw in ws_raw_read, and I added a

+                       if ((!SSL_has_pending(wsh->ssl)) && wsh->down) {
+                               r = -1;
+                               goto end;
+                       }

just prior to the SSL_read, as surely we shouldn't need to read from something that has nothing pending and is also marked as down? Maybe we are there because something else isn't being tested, though.

I had a debug message in that code and it went through there 777 times with SSL_has_pending() = 0 and 7 times with it = 1. On those 7 times it did not wait on the SSL_read.

Will see if I can catch a gcore of that next time.

@jakubkarolczyk (Collaborator)

We always call SSL_read() on a nonblocking BIO right now, so that shouldn't be the case.

@themsley-voiceflex (Author)

Caught it gone to sleep again - it's been about 15 mins so far.

Thread 18 (Thread 0x7f9eaadfe640 (LWP 872606)):
#0  0x00007f9ec3f3e91c in read () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f9ec36dd091 in sock_read () from /lib64/libcrypto.so.3
No symbol table info available.
#2  0x00007f9ec36cd44b in bread_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3  0x00007f9ec36d0445 in bio_read_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4  0x00007f9ec36d05c7 in BIO_read () from /lib64/libcrypto.so.3
No symbol table info available.
#5  0x00007f9ec3b30e8c in ssl3_read_n.part () from /lib64/libssl.so.3
No symbol table info available.
#6  0x00007f9ec3b32af9 in ssl3_read_bytes () from /lib64/libssl.so.3
No symbol table info available.
#7  0x00007f9ec3b0e3d4 in ssl3_shutdown () from /lib64/libssl.so.3
No symbol table info available.
#8  0x00007f9ec3ca523e in ws_close (wsh=0x7f9e8c5979c0, reason=<optimized out>) at tport/ws.c:870
        rcode = <optimized out>
        ssl_error = <optimized out>
        code = <optimized out>
        n = 100
        block_n = 100
#9  0x00007f9ec3ca6dfe in tport_recv_stream_ws (self=0x7f9e8c5977d0) at tport/tport_type_ws.c:217
        msg = <optimized out>
        n = <optimized out>
        N = <optimized out>
        veclen = <optimized out>
        i = <optimized out>
        m = <optimized out>
        err = <optimized out>
        iovec = {{siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}}
        wstp = 0x7f9e8c5977d0
        data = 0x7f9e8d27c302 "\310L)\372˥"
        oc = WSOC_CLOSE
        __func__ = "tport_recv_stream_ws"
        __PRETTY_FUNCTION__ = "tport_recv_stream_ws"
#10 0x00007f9ec3c9aa13 in tport_recv_data (self=0x7f9e8c5977d0) at tport/tport.c:2890
No locals.
#11 tport_recv_event (self=self@entry=0x7f9e8c5977d0) at tport/tport.c:2904
        again = <optimized out>
        __func__ = "tport_recv_event"
#12 0x00007f9ec3c9abbb in tport_base_wakeup (self=0x7f9e8c5977d0, events=1) at tport/tport.c:2828
        error = 0
#13 0x00007f9ec3c9ad9d in tport_wakeup (magic=<optimized out>, w=0x7f9e8dc3f804, self=0x7f9e8c5977d0) at tport/tport.c:2807
        events = 1
        error = <optimized out>
        __PRETTY_FUNCTION__ = "tport_wakeup"
        __func__ = "tport_wakeup"
#14 0x00007f9ec3c86a3a in su_epoll_port_wait_events (self=0x7f9e8c000b60, tout=<optimized out>) at su/su_epoll_port.c:510
        ser = 0x7f9e8dc3f7e0
        magic = <optimized out>
        j = 0
        n = 1
        events = 0
        index = <optimized out>
        version = 67676
        M = 4
        ev = {{events = 1, data = {ptr = 0x8b, fd = 139, u32 = 139, u64 = 139}}, {events = 32670, data = {ptr = 0x653d0bb5, fd = 1698499509, u32 = 1698499509, u64 = 1698499509}}, {events = 982460586, data = {ptr = 0xe8e78a3500000000, fd = 0, u32 = 0, u64 = 16782534496657342464}}, {events = 0, data = {ptr = 0x7f9ec3c88057 <su_timer_expire+855>, fd = -1010270121, u32 = 3284697175, u64 = 140319866257495}}}
        __PRETTY_FUNCTION__ = "su_epoll_port_wait_events"
#15 0x00007f9ec3c88168 in su_base_port_run (self=0x7f9e8c000b60) at su/su_base_port.c:349
        tout = 195
        tout2 = 0
        __PRETTY_FUNCTION__ = "su_base_port_run"
#16 0x00007f9ec3c8b1b3 in su_pthread_port_clone_main (varg=0x7f9ec04d4580) at su/su_pthread_port.c:343
        arg = 0x0
        task = {{sut_port = 0x7f9e8c000b60, sut_root = 0x7f9e8c0013d0}}
        zap = 1
#17 0x00007f9ec3e9f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#18 0x00007f9ec3e3f450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

@themsley-voiceflex (Author)

Last thing it wrote to fs_cli:

2023-10-28 13:25:09.971835 98.43% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.94.26.32
2023-10-28 13:41:26.931796 99.20% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.132.228.124

Last things it wrote from my added debug - I'd guess all the SSL errors are from things that were waiting and gave up and walked off:

Oct 28 13:25:11 fstrtc01.voiceflex.com stdbuf[872587]: WS ws_raw_write ss fd 66 target socket:[18735453]
Oct 28 13:25:11 fstrtc01.voiceflex.com stdbuf[872587]: WS ws_raw_write ss fd 66 target socket:[18735453]
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 66
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 49
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS ws_raw_write ss fd 175 target socket:[18729103]
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS ws_raw_write ss fd 86 target socket:[18725147]
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 1 errno N/A fd 48
Oct 28 13:41:26 fstrtc01.voiceflex.com stdbuf[872587]: WS estab -1 code <0 ssl_err 5 errno Broken pipe fd 4

@jakubkarolczyk (Collaborator)

tport/ws.c:870

Can you show me the exact line 870? (You've added your debug logs, so it doesn't correspond to the code in the repo.)

@themsley-voiceflex (Author)

                /* us closes the connection. We do bidirection shutdown handshake */
                do {
                        restore_socket(wsh->sock);
                        code = SSL_shutdown(wsh->ssl);
                        if (code == 0) {
                                /* need to make sure there are no more data to read */
                                setup_socket(wsh->sock);
                                do {

870 is the SSL_shutdown.

@themsley-voiceflex (Author) commented Oct 28, 2023

It did it a second time while I was out, same line. Hmm, actually it's slightly different; it looks like a different caller, I think.

Thread 18 (Thread 0x7f9eaadfe640 (LWP 872606)):
#0  0x00007f9ec3f3e91c in read () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f9ec36dd091 in sock_read () from /lib64/libcrypto.so.3
No symbol table info available.
#2  0x00007f9ec36cd44b in bread_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3  0x00007f9ec36d0445 in bio_read_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4  0x00007f9ec36d05c7 in BIO_read () from /lib64/libcrypto.so.3
No symbol table info available.
#5  0x00007f9ec3b30e8c in ssl3_read_n.part () from /lib64/libssl.so.3
No symbol table info available.
#6  0x00007f9ec3b32af9 in ssl3_read_bytes () from /lib64/libssl.so.3
No symbol table info available.
#7  0x00007f9ec3b0e3d4 in ssl3_shutdown () from /lib64/libssl.so.3
No symbol table info available.
#8  0x00007f9ec3ca523e in ws_close (wsh=wsh@entry=0x7f9e8cf54b20, reason=reason@entry=0) at tport/ws.c:870
        rcode = <optimized out>
        ssl_error = <optimized out>
        code = <optimized out>
        n = 100
        block_n = 100
#9  0x00007f9ec3ca56de in ws_handshake (wsh=wsh@entry=0x7f9e8cf54b20) at tport/ws.c:348
        key = '\000' <repeats 255 times>
        version = "\000\000\000\000"
        proto = '\000' <repeats 255 times>
        proto_buf = '\000' <repeats 383 times>
        input = '\000' <repeats 511 times>
        output = '\000' <repeats 19 times>
        b64 = '\000' <repeats 255 times>
        respond = '\000' <repeats 1023 times>
        bytes = -1
        p = <optimized out>
        e = <optimized out>
#10 0x00007f9ec3ca615d in establish_logical_layer (wsh=0x7f9e8cf54b20) at tport/ws.c:715
        r = <optimized out>
        __PRETTY_FUNCTION__ = {<optimized out> <repeats 24 times>}
#11 establish_logical_layer (wsh=wsh@entry=0x7f9e8cf54b20) at tport/ws.c:652
        __PRETTY_FUNCTION__ = "establish_logical_layer"
        code = <optimized out>
        ssl_err = <optimized out>
        r = <optimized out>
#12 0x00007f9ec3ca67f2 in ws_read_frame (wsh=0x7f9e8cf54b20, oc=<optimized out>, data=<optimized out>) at tport/ws.c:951
        need = 2
        maskp = 0x0
        ll = <optimized out>
        frag = 0
        blen = <optimized out>
        again = <optimized out>
#13 0x00007f9ec3ca6dfe in tport_recv_stream_ws (self=0x7f9e8cf54930) at tport/tport_type_ws.c:217
        msg = <optimized out>
        n = <optimized out>
        N = <optimized out>
        veclen = <optimized out>
        i = <optimized out>
        m = <optimized out>
        err = <optimized out>
        iovec = {{siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}, {siv_base = 0x0, siv_len = 0}}
        wstp = 0x7f9e8cf54930
        data = 0x0
        oc = 32670
        __func__ = "tport_recv_stream_ws"
        __PRETTY_FUNCTION__ = "tport_recv_stream_ws"
#14 0x00007f9ec3c9aa13 in tport_recv_data (self=0x7f9e8cf54930) at tport/tport.c:2890
No locals.
#15 tport_recv_event (self=self@entry=0x7f9e8cf54930) at tport/tport.c:2904
        again = <optimized out>
        __func__ = "tport_recv_event"
#16 0x00007f9ec3c9abbb in tport_base_wakeup (self=0x7f9e8cf54930, events=1) at tport/tport.c:2828
        error = 0
#17 0x00007f9ec3c9ad9d in tport_wakeup (magic=<optimized out>, w=0x7f9e8e0e4e04, self=0x7f9e8cf54930) at tport/tport.c:2807
        events = 1
        error = <optimized out>
        __PRETTY_FUNCTION__ = "tport_wakeup"
        __func__ = "tport_wakeup"
#18 0x00007f9ec3c86a3a in su_epoll_port_wait_events (self=0x7f9e8c000b60, tout=<optimized out>) at su/su_epoll_port.c:510
        ser = 0x7f9e8e0e4de0
        magic = <optimized out>
        j = 0
        n = 1
        events = 0
        index = <optimized out>
        version = 70945
        M = 4
        ev = {{events = 1, data = {ptr = 0xb4, fd = 180, u32 = 180, u64 = 180}}, {events = 32670, data = {ptr = 0x653d15aa, fd = 1698502058, u32 = 1698502058, u64 = 1698502058}}, {events = 947926674, data = {ptr = 0xe8e7942a00000000, fd = 0, u32 = 0, u64 = 16782545444528979968}}, {events = 0, data = {ptr = 0x7f9ec3c88057 <su_timer_expire+855>, fd = -1010270121, u32 = 3284697175, u64 = 140319866257495}}}
        __PRETTY_FUNCTION__ = "su_epoll_port_wait_events"
#19 0x00007f9ec3c88168 in su_base_port_run (self=0x7f9e8c000b60) at su/su_base_port.c:349
        tout = 703
        tout2 = 0
        __PRETTY_FUNCTION__ = "su_base_port_run"
#20 0x00007f9ec3c8b1b3 in su_pthread_port_clone_main (varg=0x7f9ec04d4580) at su/su_pthread_port.c:343
        arg = 0x0
        task = {{sut_port = 0x7f9e8c000b60, sut_root = 0x7f9e8c0013d0}}
        zap = 1
#21 0x00007f9ec3e9f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#22 0x00007f9ec3e3f450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

@jakubkarolczyk (Collaborator)

Thanks, that was really useful. Pushed, please re-test.

@themsley-voiceflex (Author)

Rebuilding. While I wait, since you seem familiar with this code, is it meant to be multithreaded? If it is, then I think I need to raise another bug! Specifically, I see that thread_setup() uses CRYPTO_num_locks(), which has returned only 1 since openssl 1.1.1d (or thereabouts).

Build finished, reinstalled and bounced.

@jakubkarolczyk (Collaborator)

Sofia is single-threaded.

@themsley-voiceflex (Author)

OK, so that is in and running and I'll leave it there. I have an awk script that can detect gaps in the log, so that will tell me later if it hung and recovered while I wasn't looking. Hello weekend...

@themsley-voiceflex (Author)

This does appear to be much better, but I think there is still a problem lurking. I set up a script that watches stat against the freeswitch.log file and takes a gcore if it sees that the file hasn't been updated in 120s. That fired earlier on, and the bt looks very similar to the earlier ones.

Thread 18 (Thread 0x7fc1c6a7e640 (LWP 908670)):
#0  0x00007fc1dfb3e9bf in write () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007fc1df2dd8e3 in sock_write () from /lib64/libcrypto.so.3
No symbol table info available.
#2  0x00007fc1df2cd41b in bwrite_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3  0x00007fc1df2d0d26 in bio_write_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4  0x00007fc1df2d0e57 in BIO_write () from /lib64/libcrypto.so.3
No symbol table info available.
#5  0x00007fc1df72d8de in ssl3_write_pending () from /lib64/libssl.so.3
No symbol table info available.
#6  0x00007fc1df730120 in do_ssl3_write.localalias () from /lib64/libssl.so.3
No symbol table info available.
#7  0x00007fc1df70345f in ssl3_dispatch_alert () from /lib64/libssl.so.3
No symbol table info available.
#8  0x00007fc1df70e405 in ssl3_shutdown () from /lib64/libssl.so.3
No symbol table info available.
#9  0x00007fc1df8a5240 in ws_close (wsh=0x7fc1a92d9cf0, reason=reason@entry=0) at tport/ws.c:870
        rcode = <optimized out>
        ssl_error = <optimized out>
        code = <optimized out>
        n = 0
        block_n = 100
#10 0x00007fc1df8a5adf in ws_destroy (wsh=wsh@entry=0x7fc1a92d9cf0) at tport/ws.c:789
No locals.
#11 0x00007fc1df8a5b83 in tport_ws_deinit_secondary (self=0x7fc1a92d9b00) at tport/tport_type_ws.c:536
        wstp = <optimized out>
        wstp = <optimized out>
        __func__ = {<optimized out> <repeats 26 times>}
#12 tport_ws_deinit_secondary (self=0x7fc1a92d9b00) at tport/tport_type_ws.c:530
        wstp = 0x7fc1a92d9b00
        __func__ = "tport_ws_deinit_secondary"
#13 0x00007fc1df894db8 in tport_zap_secondary (self=0x7fc1a92d9b00) at tport/tport.c:1101
        mr = <optimized out>
        __func__ = "tport_zap_secondary"
#14 0x00007fc1df887e98 in su_timer_expire (timers=timers@entry=0x7fc1a8000ba8, timeout=timeout@entry=0x7fc1c6a7dc50, now=...) at su/su_timer.c:587
        t = 0x7fc1a9066510
        f = 0x7fc1df896e90 <tport_secondary_timer>
        n = 0
        __PRETTY_FUNCTION__ = "su_timer_expire"
#15 0x00007fc1df888125 in su_base_port_run (self=0x7fc1a8000b60) at su/su_base_port.c:339
        now = {tv_sec = <optimized out>, tv_usec = <optimized out>}
        tout = 15000
        tout2 = 0
        __PRETTY_FUNCTION__ = "su_base_port_run"
#16 0x00007fc1df88b1b3 in su_pthread_port_clone_main (varg=0x7fc1dc0bd580) at su/su_pthread_port.c:343
        arg = 0x0
        task = {{sut_port = 0x7fc1a8000b60, sut_root = 0x7fc1a80013d0}}
        zap = 1
#17 0x00007fc1dfa9f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#18 0x00007fc1dfa3f450 in clone3 () from /lib64/libc.so.6
No symbol table info available.

ws.c:870 is SSL_shutdown() 3 lines after the "us closes the connection" comment.

I've changed the script interval to 240s now just to make sure that if it fires again it really is stuck.
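
For context, both hangs are SSL_shutdown() parking inside read()/write() while exchanging close_notify, which can only happen on a blocking fd. Keeping the fd non-blocking around the shutdown avoids parking the tport thread; a sketch of the idea (assuming restore_socket()/setup_socket() in ws.c toggle blocking mode, which is my reading; the helper name is illustrative):

    #include <fcntl.h>

    /* With O_NONBLOCK set, SSL_shutdown() returns -1 with
     * SSL_ERROR_WANT_READ/WANT_WRITE instead of blocking in the BIO */
    static int set_nonblocking(int fd) {
        int flags = fcntl(fd, F_GETFL, 0);

        if (flags < 0)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }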

@jakubkarolczyk (Collaborator)

I suspected this could happen in theory, and it happens in real life; the OS is so nasty.

@jakubkarolczyk (Collaborator)

@themsley-voiceflex just pushed the patch. Please re-test.

@themsley-voiceflex (Author)

Patched and rebuilt & deployed. Watching.

@themsley-voiceflex (Author)

Current situation: I have icinga monitoring the webrtc port on this box once a minute, just to retrieve the SSL cert, and that has been OK now for 18h+ (since the patched version was installed and freeswitch bounced). I also have a watch on freeswitch.log that tells me when it's not been written to for more than 8s, and so far during working hours (09:00 GMT onwards) I've had a 10s response time 3 times. Both of those numbers are a lot better than they have been.

@jakubkarolczyk (Collaborator)

Just pushed the fix to not block on the ws handshake. Please pull and rebuild.

@themsley-voiceflex (Author)

Did so this morning; it's been in and running since 2023-11-01 10:43:32, and my log watcher has reported 2 instances of freeswitch.log not being written to for 10s. For one of those I can see it was still alive; for the other one, at 12:55:20, it does appear that it went away for 10s. So far so good.

@andywolk (Contributor) commented Nov 6, 2023

Should be fixed by freeswitch/sofia-sip#233
