
Segmentation fault at startup, release 3.2.1 #115

Open
adia opened this issue Sep 19, 2024 · 9 comments
adia commented Sep 19, 2024

Hello! I've been running mldonkey for some years in a Debian unstable VM. Recently, after an upgrade, it stopped starting up with a message about incompatible libraries, and I discovered the package had been dropped from Debian, so I thought I could compile it locally from source. I built it via opam as follows:

opam switch create mldonkey 4.14.2
eval $(opam env --switch=mldonkey)
opam install camlp4 conf-m4 conf-zlib num
./configure --disable-multinet --enable-batch
gmake

Unfortunately it always segfaults shortly after startup, regardless of the type of build chosen. Here's the backtrace:

2024/09/19 14:43:26 [cO] Logging in /home/adia/.mldonkey/mlnet.log
2024/09/19 14:43:26 [cCO] Options correctly saved
2024/09/19 14:43:26 [dMain] loading server.met from web_infos/server.met.gz
2024/09/19 14:43:26 [EDK] server.met loaded from http://www.gruk.org/server.met.gz
2024/09/19 14:43:26 [EDK] 6 servers found, 0 new ones inserted
2024/09/19 14:43:26 [dMain] loading guarding.p2p from web_infos/ipfilter.zip
2024/09/19 14:43:26 [IPblock] loading web_infos/ipfilter.zip
2024/09/19 14:43:26 [IPblock] guarding.p2p found in zip file
2024/09/19 14:43:27 [IPblock] 222266 ranges loaded - optimized to 192083
2024/09/19 14:43:27 [dMain] Check http://mldonkey.sf.net for updates
2024/09/19 14:43:27 [dMain] enabling networks:
2024/09/19 14:43:27 [dMain] ---- enabling Donkey ----
2024/09/19 14:43:27 [EDK] loading sources completed
2024/09/19 14:43:27 [dMain] using port 13966 (client_port TCP)
2024/09/19 14:43:27 [dMain] using port 13970 (client_port UDP)
2024/09/19 14:43:27 [dMain] using port 6252 (overnet_port TCP+UDP)
2024/09/19 14:43:27 [dMain] using port 15112 (kademlia_port UDP)
2024/09/19 14:43:27 [dMain] ---- enabling interfaces ----
2024/09/19 14:43:27 [dMain] using port 44080 (http_port)
2024/09/19 14:43:27 [dMain] using port 44000 (telnet_port)
2024/09/19 14:43:27 [dMain] using port 44001 (gui_port)
2024/09/19 14:43:27 [dMain] disabled networks: none
2024/09/19 14:43:27 [dMain] To command: telnet 127.0.0.1 44000
2024/09/19 14:43:27 [dMain] Or with browser: http://127.0.0.1:44080
2024/09/19 14:43:27 [dMain] For a GUI check out http://sancho-gui.sourceforge.net
2024/09/19 14:43:27 [dMain] Connect to IP 127.0.0.1, port 44001
2024/09/19 14:43:27 [dMain] If you connect from a remote machine adjust allowed_ips
2024/09/19 14:43:27 [cCO] Options correctly saved
2024/09/19 14:43:28 [dMain] Core started
[New Thread 0x7ffff3a006c0 (LWP 459427)]

Thread 1 "mlnet" received signal SIGSEGV, Segmentation fault.
0x0000555555b49998 in try_poll (fdlist=<optimized out>, timeout=<optimized out>) at src/utils/lib/stubs_c.c:136
136	       ufds[nfds].events = (must_read? POLLIN : 0) | (must_write ? POLLOUT:0);
(gdb) thread apply all bt

Thread 2 (Thread 0x7ffff3a006c0 (LWP 459427) "mlnet"):
#0  0x00007ffff7c9722e in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7ffff39ffd20, op=393, expected=0, futex_word=0x555555f24dcc <cond+44>) at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x555555f24dcc <cond+44>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7ffff39ffd20, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
#2  0x00007ffff7c972ab in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555f24dcc <cond+44>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7ffff39ffd20, private=private@entry=0) at ./nptl/futex-internal.c:139
#3  0x00007ffff7c99ca5 in __pthread_cond_wait_common (abstime=0x7ffff39ffd20, clockid=0, mutex=0x555555f24d60 <mutex>, cond=0x555555f24da0 <cond>) at ./nptl/pthread_cond_wait.c:503
#4  ___pthread_cond_timedwait64 (cond=cond@entry=0x555555f24da0 <cond>, mutex=mutex@entry=0x555555f24d60 <mutex>, abstime=abstime@entry=0x7ffff39ffd20) at ./nptl/pthread_cond_wait.c:643
#5  0x0000555555b497b3 in dns_thread (arg=<optimized out>) at src/utils/lib/stubs_c.c:832
#6  0x00007ffff7c9a732 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
#7  0x00007ffff7d152b8 in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 1 (Thread 0x7ffff58f7840 (LWP 459396) "mlnet"):
#0  0x0000555555b49998 in try_poll (fdlist=<optimized out>, timeout=<optimized out>) at src/utils/lib/stubs_c.c:136
#1  0x0000555555a93745 in camlBasicSocket__loop_1095 ()
#2  0x000055555586dffb in camlCommonMain__entry ()
#3  0x000055555586a259 in caml_program ()
#4  0x0000555555b72ead in caml_start_program ()
#5  0x0000555555b7324c in caml_startup_common (argv=0x7fffffffe188, pooling=<optimized out>, pooling@entry=0) at startup_nat.c:160
#6  0x0000555555b732cb in caml_startup_exn (argv=<optimized out>) at startup_nat.c:167
#7  caml_startup (argv=<optimized out>) at startup_nat.c:172
#8  caml_main (argv=<optimized out>) at startup_nat.c:179
#9  0x00005555558692dc in main (argc=<optimized out>, argv=<optimized out>) at main.c:37

I'm attaching the configure and build logs:
configure.txt
build.txt

I'd appreciate any ideas or directions. Thanks!

ygrek (Owner) commented Sep 25, 2024

No ideas so far; 64-bit Linux is a main target, so this kind of crash is quite weird. Try running with debug logs, maybe there is a clue. Do you have many peers connected?

adia (Author) commented Sep 29, 2024

Not sure how to see how many peers I have connected - I don't think I've ever changed any peer-related options. In any case, I think it crashes while starting up - it shouldn't have connected to many peers at that time, correct?

I tried deleting my ~/.mldonkey dir and running it cleanly with ./mlnet -stdout -verbosity 'verb mc mr ms sm net file unk loc share connect udp swarming hc hs act unexp' -telnet_port 14000 -gui_port 14001 -http_port 14080 (some of the default ports are taken by other servers); the -verbosity option gives fuller debug logs. Unfortunately I got the same segfault, and the log showed no significant differences. I also noticed some commented-out fprintf calls around the crash site (src/utils/lib/stubs_c.c:136) and enabled them. The last lines before the crash were now:

2024/09/29 08:59:13 [Ux32] flush all
2024/09/29 08:59:13 [cCO] Options correctly saved
2024/09/29 08:59:13 [dMain] Activated system signal handling
2024/09/29 08:59:13 [dMain] Starting with pid 4051471
2024/09/29 08:59:13 [dMain] Core started
FD in POLL added 3
Segmentation fault

In case the OCaml compiler version made a difference, I tried building with 4.10.2 and 4.08.0 (earlier ones wouldn't install for me, I guess because the rest of the system is too new), with the same results...

ygrek (Owner) commented Oct 1, 2024

Please show what the following patch outputs:

diff --git a/src/utils/lib/stubs_c.c b/src/utils/lib/stubs_c.c
index 73d7273c..382c1e03 100644
--- a/src/utils/lib/stubs_c.c
+++ b/src/utils/lib/stubs_c.c
@@ -42,6 +42,8 @@
 #define read XXXXXXXXX
 #define ftruncate XXXXXXXXX
 
+#include <stdio.h>
+#include <errno.h>
 
 
 /*******************************************************************
@@ -115,6 +117,7 @@ value try_poll(value fdlist, value timeout) /* ML */
   
   if(ufds == NULL){
     ufds_size = os_getdtablesize();
+    fprintf(stderr, "ufds_size %d errno %d\n", ufds_size, errno);
     ufds = (struct pollfd*) malloc (sizeof(struct pollfd) * ufds_size);
     pfds = (value*) malloc (sizeof(value) * ufds_size);
   }

adia (Author) commented Oct 1, 2024

Here you go:

2024/10/01 13:13:13 [dMain] Activated system signal handling
2024/10/01 13:13:13 [dMain] Starting with pid 90122
2024/10/01 13:13:13 [dMain] Core started
ufds_size 1073741816 errno 2
ufds_size 1073741816 errno 12
FD in POLL added 3
Segmentation fault

ygrek (Owner) commented Oct 1, 2024

That is a very generous file limit 🤔
What is your ulimit -n (in the session where mldonkey runs)?

adia (Author) commented Oct 1, 2024

I used to run it via a systemd user unit, but until it works again I'm running it from a normal login shell. There, "ulimit -n" returns 1024. However, I noticed the following in the debug log:

2024/10/01 17:37:18 [cCO] pass 1: checking max_opened_connections = 200 for validity
2024/10/01 17:37:18 [cCO] pass 1: file descriptors status: total allowed (ulimit -n) 1073741816
2024/10/01 17:37:18 [cCO] pass 1: - max_opened_connections 200 (30% indirect)
2024/10/01 17:37:18 [cCO] pass 1: - file cache size 789200097
2024/10/01 17:37:18 [cCO] pass 1: - reserved 21474836
2024/10/01 17:37:18 [cCO] pass 1: = 263066683 descriptors left
2024/10/01 17:37:18 [cCO] pass 1: checking max_opened_connections finished

ygrek (Owner) commented Oct 2, 2024

Do you mean it works without a segfault in the normal login shell with the 1024 ulimit? Can you still show the ulimit for the systemd unit?

adia (Author) commented Oct 3, 2024

Sorry I wasn't clear, I meant that since I couldn't get it to run from the login shell, I hadn't tried starting it via systemd. I just tried it and the results are the same (also ulimit -n is still 1024 when started from systemd).

But I hadn't tested upping the ulimit -n value. I tried with ulimit -n 1000000 and it seems to have started without problems. It's been running fine for the last five minutes. Also, the debug log now reports the correct value (i.e. 1000000) instead of 1073741816.

So, was the 1024 limit too low, or was something else causing it to pick up an incorrect value? If I'm following the code correctly, the value comes from getdtablesize(), whose man page states:

The glibc version of getdtablesize() calls getrlimit(2) and returns the current RLIMIT_NOFILE limit, or OPEN_MAX when that fails.
Portable applications should employ sysconf(_SC_OPEN_MAX) instead of this call.

Maybe that's what was happening?

ygrek (Owner) commented Oct 3, 2024

Yes, apparently something with limits, but I don't understand exactly what.
Actually, wait: 1073741816 is far too much, so malloc likely fails, and there is no error checking, hence the segfault. Please confirm with:

diff --git a/src/utils/lib/stubs_c.c b/src/utils/lib/stubs_c.c
index 73d7273c..3588e92f 100644
--- a/src/utils/lib/stubs_c.c
+++ b/src/utils/lib/stubs_c.c
@@ -42,6 +42,8 @@
 #define read XXXXXXXXX
 #define ftruncate XXXXXXXXX
 
+#include <stdio.h>
+#include <errno.h>
 
 
 /*******************************************************************
@@ -115,8 +117,11 @@ value try_poll(value fdlist, value timeout) /* ML */
   
   if(ufds == NULL){
     ufds_size = os_getdtablesize();
+    fprintf(stderr, "ufds_size %d errno %d\n", ufds_size, errno);
     ufds = (struct pollfd*) malloc (sizeof(struct pollfd) * ufds_size);
+    fprintf(stderr, "ufds %p errno %d\n", ufds, errno);
     pfds = (value*) malloc (sizeof(value) * ufds_size);
+    fprintf(stderr, "pfds %p errno %d\n", pfds, errno);
   }
   
