
snis_server using 100% cpu moving NPC ships #287

Open
smcameron opened this issue Jun 12, 2020 · 9 comments
@smcameron (Owner)

Caught snis_server using 100% cpu...

(gdb) bt
#0  0x000000000049c477 in space_partition_process ()
#1  0x000000000043af4d in process_potential_victim ()
#2  0x000000000049c480 in space_partition_process ()
#3  0x0000000000483868 in ship_move ()
#4  0x000000000040c602 in main ()
(gdb) bt
#0  0x000000000043b7ad in compute_danger_vectors ()
#1  0x000000000049c480 in space_partition_process ()
#2  0x0000000000482404 in ship_move ()
#3  0x000000000040c602 in main ()
(gdb) bt
#0  0x000000000043d6c0 in ship_security_avoidance ()
#1  0x000000000049c480 in space_partition_process ()
#2  0x000000000043af4d in process_potential_victim ()
#3  0x000000000049c480 in space_partition_process ()
#4  0x0000000000482317 in ship_move ()
#5  0x000000000040c602 in main ()

Not clear what's going on.

@smcameron (Owner Author)

A bit better backtrace (compiled with -g):

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00000000004a9fd1 in process_cell (p=0x2da2d80, entity=0x131a370 <go+7451056>, x=-248980.68216628116, y=-242912.58318146924, context=0x131a370 <go+7451056>, 
    fn=0x450da2 <ship_security_avoidance>, cell=655) at space-part.c:191
191			guy = ((unsigned char *) i) - p->offset;
(gdb) bt
#0  0x00000000004a9fd1 in process_cell (p=0x2da2d80, entity=0x131a370 <go+7451056>, x=-248980.68216628116, y=-242912.58318146924, context=0x131a370 <go+7451056>, 
    fn=0x450da2 <ship_security_avoidance>, cell=655) at space-part.c:191
#1  0x00000000004aa0c5 in space_partition_process (p=0x2da2d80, entity=0x131a370 <go+7451056>, x=-248980.68216628116, y=-242912.58318146924, 
    context=0x131a370 <go+7451056>, fn=0x450da2 <ship_security_avoidance>) at space-part.c:211
#2  0x0000000000441bf6 in process_potential_victim (context=0x7ffe2a35fe10, entity=0x131a370 <go+7451056>) at snis_server.c:3482
#3  0x00000000004a9ff3 in process_cell (p=0x2da2d80, entity=0x13f0270 <go+8327344>, x=-219584.94151009578, y=-258303.13463197066, context=0x7ffe2a35fe10, 
    fn=0x441adb <process_potential_victim>, cell=615) at space-part.c:192
#4  0x00000000004aa0c5 in space_partition_process (p=0x2da2d80, entity=0x13f0270 <go+8327344>, x=-219584.94151009578, y=-258303.13463197066, context=0x7ffe2a35fe10, 
    fn=0x441adb <process_potential_victim>) at space-part.c:211
#5  0x0000000000441f22 in find_nearest_victim (o=0x13f0270 <go+8327344>) at snis_server.c:3532
#6  0x0000000000448bb6 in calculate_threat_level (o=0x13f0270 <go+8327344>) at snis_server.c:5237
#7  0x000000000044a7aa in ai_flee_mode_brain (o=0x13f0270 <go+8327344>) at snis_server.c:5658
#8  0x0000000000450aef in ai_brain (o=0x13f0270 <go+8327344>) at snis_server.c:7192
#9  0x0000000000451bbc in ship_move (o=0x13f0270 <go+8327344>) at snis_server.c:7539
#10 0x0000000000492239 in move_objects (absolute_time=1592015263.2853091, discontinuity=1) at snis_server.c:25809
#11 0x00000000004a21ec in main (argc=9, argv=0x7ffe2a361118) at snis_server.c:30785

However, then I "continued", and now cannot break into the program again.

I notice space_partition_process is getting called recursively, which is a bit odd, leading to, at best, O(n^2) behavior.

@smcameron (Owner Author)

By "kill -2 pid-of-snis_server" a few times, I was able to break in with gdb again.

@smcameron (Owner Author)

I noticed:

(gdb) print pool->highest_object_number
$6 = 4999
...
grep MAX.*OBJ *.h
snis.h:#define MAXGAMEOBJS 5000

So objects are maxed out. Not sure if that's relevant.

Noticed this in the snis_server log a few times:

snis_server(default): BUG detected at snis_server.c:snis_queue_delete_object_helper:1969, o->id == -1, o->type = 1
snis_server(default): BUG detected at snis_server.c:snis_queue_delete_object_helper:1969, o->id == -1, o->type = 1
snis_server(default): BUG detected at snis_server.c:snis_queue_delete_object_helper:1969, o->id == -1, o->type = 1
snis_server(default): BUG detected at snis_server.c:snis_queue_delete_object_helper:1969, o->id == -1, o->type = 1

Might be related to hitting MAXGAMEOBJS limit.

@smcameron (Owner Author) commented Dec 25, 2020

FWIW, the condition of snis_server consuming large amounts of CPU after running for a long time can be cleared by issuing "clear all" or "regenerate" commands on the console. I suspect the issue is due to generating large numbers of persistent objects over time.

Captured a graph of network stats...

[graph: bytes/sec sent by server and latency, before and after "clear all" / "regenerate"]

Notice at the beginning, bytes/sec sent by server is 426k, and latency is 140ms. Then both drop to near zero (after doing "clear all") then resume more "normal" levels (after "regenerate"). So, if you just let snis_server run, it seems to accumulate a lot of objects, which generate a lot of network traffic.

@smcameron (Owner Author)

If you run with autowrangling turned on in snis_multiverse, then the problem is much more serious, as multiple snis_servers will begin consuming 100% CPU, and soon the entire system becomes completely unresponsive.

@smcameron (Owner Author)

Hmm, it seems that when running with autowrangling, snis_server runs into a problem where bind() on the socket listening for incoming connections always fails with 'Address already in use', and a bunch of the servers get stuck in a tight loop trying to create the listener thread.

@smcameron (Owner Author)

Huh, they all seem to be trying only 2 ports, 49950 and 49951.

If I change this line:

port = (snis_randn(INT_MAX) % (snis_server_port_max - snis_server_port_min)) + snis_server_port_min;

to this:

port = snis_randn(snis_server_port_max - snis_server_port_min) + snis_server_port_min;

The problem goes away. Something weird must happen when calling snis_randn(INT_MAX). On testing it, it either returns 0 or 65535. snis_randn() is only meant for ints up to SNIS_RAND_MAX, which is 32767.

So, now I need to audit all uses of snis_randn() to make sure they're all kosher. Or, improve snis_randn().

@smcameron (Owner Author) commented Apr 21, 2021

The bug in the port choosing code was introduced in commit 0461667
"Allow restricting snis_server port numbers to custom range"

@smcameron (Owner Author)

  • c5974bd Fix bug in port number choosing code in snis_server listener_thread_fn()

This should fix the port number problem with autowrangling. I believe this is unrelated to the CPU consumption problem this bug originally mentions.
