
[Bug]: upgrade from 1.3.2 to 1.4.* seems to cause segmentation fault when MongoDB Exception was thrown #2898

Closed
mirkojoshua opened this issue Oct 16, 2024 · 14 comments · Fixed by #2901
Labels
🐛 bug Something isn't working

Comments

mirkojoshua commented Oct 16, 2024

Bug report

Hello,
When I try to save a Mongo document twice, I expect a MongoDB\Driver\Exception\BulkWriteException with code 11000 due to the violation of a unique index (in my case, an email field). The exception is caught successfully by my try/catch with ddtrace 1.3.2 installed, or with no ddtrace at all, but with ddtrace 1.4.* the exception does not reach my catch and PHP fails with a segmentation fault. A minimal sketch of the failing pattern follows the list below.

The frequency of the segmentation fault seems to vary with the location of the thrown exception:

  • unit/functional test: 100% faults
  • message consumer: 100% faults
  • controller: a strange pattern. 4 successes and 2 faults, 4 successes and 2 faults, and so on...
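
Roughly, the failing pattern looks like this (a minimal sketch; the connection string, namespace, and the unique index on email are illustrative assumptions, not my real setup):

<?php
// Minimal sketch: inserting the same unique email twice should raise a
// BulkWriteException (duplicate key, code 11000) that lands in the catch.
$manager = new MongoDB\Driver\Manager('mongodb://127.0.0.1:27017');

// A BulkWrite object may only be executed once, so build a fresh one per insert.
$insert = function (MongoDB\Driver\Manager $manager): void {
    $bulk = new MongoDB\Driver\BulkWrite();
    $bulk->insert(['email' => 'user@example.com']);
    $manager->executeBulkWrite('app.users', $bulk);
};

try {
    $insert($manager); // first write succeeds
    $insert($manager); // violates the unique index on email
} catch (MongoDB\Driver\Exception\BulkWriteException $e) {
    // Reached with ddtrace 1.3.2 (or without ddtrace); with ddtrace 1.4.*
    // PHP segfaults before the exception gets here.
    echo 'caught: ', $e->getMessage(), PHP_EOL;
}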

PHP version

8.2.24

Tracer or profiler version

1.4.*

Installed extensions

[PHP Modules]
amqp
apcu
bcmath
calendar
Core
csv
ctype
curl
date
ddappsec
ddtrace
dom
fileinfo
filter
ftp
gd
gmp
hash
iconv
igbinary
imagick
intl
json
libxml
mbstring
memcached
mongodb
mysqli
mysqlnd
openssl
pcntl
pcov
pcre
PDO
pdo_mysql
pdo_pgsql
pdo_sqlite
Phar
posix
random
readline
redis
Reflection
session
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
tokenizer
uuid
xdebug
xml
xmlreader
xmlwriter
xsl
yaml
Zend OPcache
zip
zlib

[Zend Modules]
Xdebug
Zend OPcache
ddappsec
ddtrace

Output of phpinfo()

(nine screenshots of phpinfo() output, omitted)

Upgrading from

1.3.2

mirkojoshua added the 🐛 bug label Oct 16, 2024

mirkojoshua (Author) commented Oct 16, 2024

Hint: for a quick search, look for the email_index_unique dup key text in the trace log files :)

mirkojoshua (Author) commented Oct 16, 2024

This is the strange behavior of the controller:

172.18.0.10 -  16/Oct/2024:09:26:18 +0000 "GET /index.php" 200
2024-10-16T09:26:19.921650555Z 172.18.0.10 -  16/Oct/2024:09:26:19 +0000 "GET /index.php" 200
2024-10-16T09:26:20.646250676Z 172.18.0.10 -  16/Oct/2024:09:26:20 +0000 "GET /index.php" 200
2024-10-16T09:26:21.350051505Z 172.18.0.10 -  16/Oct/2024:09:26:21 +0000 "GET /index.php" 200
2024-10-16T09:26:22.229083097Z [16-Oct-2024 09:26:22] WARNING: [pool www] child 162 exited on signal 11 (SIGSEGV - core dumped) after 9.519361 seconds from start
2024-10-16T09:26:22.229875118Z [16-Oct-2024 09:26:22] NOTICE: [pool www] child 174 started
2024-10-16T09:26:39.487728564Z [16-Oct-2024 09:26:39] WARNING: [pool www] child 166 exited on signal 11 (SIGSEGV - core dumped) after 26.260312 seconds from start
2024-10-16T09:26:39.488601046Z [16-Oct-2024 09:26:39] NOTICE: [pool www] child 180 started
2024-10-16T09:26:40.299323007Z 172.18.0.10 -  16/Oct/2024:09:26:40 +0000 "GET /index.php" 200
2024-10-16T09:26:41.176908283Z 172.18.0.10 -  16/Oct/2024:09:26:41 +0000 "GET /index.php" 200
2024-10-16T09:26:41.823236903Z 172.18.0.10 -  16/Oct/2024:09:26:41 +0000 "GET /index.php" 200
2024-10-16T09:26:42.603944148Z 172.18.0.10 -  16/Oct/2024:09:26:42 +0000 "GET /index.php" 200
2024-10-16T09:26:43.506537471Z [16-Oct-2024 09:26:43] WARNING: [pool www] child 174 exited on signal 11 (SIGSEGV - core dumped) after 21.276563 seconds from start
2024-10-16T09:26:43.507513048Z [16-Oct-2024 09:26:43] NOTICE: [pool www] child 188 started
2024-10-16T09:27:57.673566175Z [16-Oct-2024 09:27:57] WARNING: [pool www] child 180 exited on signal 11 (SIGSEGV - core dumped) after 78.184707 seconds from start
2024-10-16T09:27:57.674301406Z [16-Oct-2024 09:27:57] NOTICE: [pool www] child 192 started
2024-10-16T09:28:19.499325270Z 172.18.0.10 -  16/Oct/2024:09:28:17 +0000 "GET /index.php" 200
2024-10-16T09:28:26.094245502Z 172.18.0.10 -  16/Oct/2024:09:28:25 +0000 "GET /index.php" 200
2024-10-16T09:28:27.376464239Z 172.18.0.10 -  16/Oct/2024:09:28:27 +0000 "GET /index.php" 200
2024-10-16T09:28:28.602915017Z 172.18.0.10 -  16/Oct/2024:09:28:28 +0000 "GET /index.php" 200
2024-10-16T09:28:29.861705253Z [16-Oct-2024 09:28:29] WARNING: [pool www] child 188 exited on signal 11 (SIGSEGV - core dumped) after 106.354369 seconds from start
2024-10-16T09:28:29.862567471Z [16-Oct-2024 09:28:29] NOTICE: [pool www] child 200 started
2024-10-16T09:31:28.863845231Z [16-Oct-2024 09:31:28] WARNING: [pool www] child 192 exited on signal 11 (SIGSEGV - core dumped) after 211.189485 seconds from start
2024-10-16T09:31:28.864649549Z [16-Oct-2024 09:31:28] NOTICE: [pool www] child 204 started
2024-10-16T09:31:29.821935282Z 172.18.0.10 -  16/Oct/2024:09:31:29 +0000 "GET /index.php" 200
2024-10-16T09:31:31.582381571Z 172.18.0.10 -  16/Oct/2024:09:31:31 +0000 "GET /index.php" 200
2024-10-16T09:31:32.553924568Z 172.18.0.10 -  16/Oct/2024:09:31:32 +0000 "GET /index.php" 200
2024-10-16T09:31:33.548798096Z 172.18.0.10 -  16/Oct/2024:09:31:33 +0000 "GET /index.php" 200
2024-10-16T09:31:34.824880390Z [16-Oct-2024 09:31:34] WARNING: [pool www] child 200 exited on signal 11 (SIGSEGV - core dumped) after 184.962233 seconds from start
2024-10-16T09:31:34.825695473Z [16-Oct-2024 09:31:34] NOTICE: [pool www] child 212 started
2024-10-16T09:31:36.009545698Z [16-Oct-2024 09:31:36] WARNING: [pool www] child 204 exited on signal 11 (SIGSEGV - core dumped) after 7.144963 seconds from start
2024-10-16T09:31:36.010515030Z [16-Oct-2024 09:31:36] NOTICE: [pool www] child 216 started
2024-10-16T09:31:37.046215263Z 172.18.0.10 -  16/Oct/2024:09:31:36 +0000 "GET /index.php" 200
2024-10-16T09:31:38.095714889Z 172.18.0.10 -  16/Oct/2024:09:31:37 +0000 "GET /index.php" 200
2024-10-16T09:31:39.018205480Z 172.18.0.10 -  16/Oct/2024:09:31:38 +0000 "GET /index.php" 200
2024-10-16T09:31:39.929339835Z 172.18.0.10 -  16/Oct/2024:09:31:39 +0000 "GET /index.php" 200
2024-10-16T09:31:41.076073256Z [16-Oct-2024 09:31:41] WARNING: [pool www] child 212 exited on signal 11 (SIGSEGV - core dumped) after 6.250307 seconds from start
2024-10-16T09:31:41.076813600Z [16-Oct-2024 09:31:41] NOTICE: [pool www] child 224 started
2024-10-16T09:31:42.118454201Z [16-Oct-2024 09:31:42] WARNING: [pool www] child 216 exited on signal 11 (SIGSEGV - core dumped) after 6.107952 seconds from start
2024-10-16T09:31:42.119300946Z [16-Oct-2024 09:31:42] NOTICE: [pool www] child 228 started

bluezod commented Oct 16, 2024

Same here. The root user's datadog-ipc-helper process crashed repeatedly and produced a bunch of core.[PID] dump files in the root path every minute.

bwoebi (Collaborator) commented Oct 16, 2024

@bluezod Is it possible for you to open these core dumps with gdb and give us a backtrace?

Like gdb /path/to/ddtrace.so -c /path/to/core-file, then bt.
In case gdb hangs when starting, you might need to run the application with DD_SPAWN_WORKER_USE_EXEC=1, then execute gdb on the newly generated core dumps.

mirkojoshua (Author) commented
In my /tmp folder I see only:

  • coredump-php-fpm.[PID]
  • coredump-php.[PID]

With gdb /usr/local/lib/php/extensions/no-debug-non-zts-20220829/ddtrace.so /tmp/coredump-php.161 the output is:

(gdb) bt
#0  __restore_sigs (set=set@entry=0x7ffecadd95f0) at ./arch/x86_64/syscall_arch.h:40
#1  0x00007d385df46845 in raise (sig=<optimized out>) at src/signal/raise.c:11
#2  0x00007d385c2bbbdd in datadog_crashtracker::collector::crash_handler::handle_posix_sigaction () at libdatadog/crashtracker/src/collector/crash_handler.rs:298
#3  <signal handler called>
#4  0x00005eec271e33ca in ?? ()
#5  0x00007d38577a4660 in ?? ()
#6  0xf9826bf705660024 in ?? ()
#7  0x00007ffecadda638 in ?? ()
#8  0x00007d3856e02460 in ?? ()
#9  0x00005eec2801e7e0 in ?? ()
#10 0x0000000000000000 in ?? ()

I have never used gdb, so I don't know if this can be useful to you.

mirkojoshua changed the title [Bug]: upgrade from 1.3.2 to 1.4.* seems to cause segmentation fault when MongoDB Exception whas thrown → [Bug]: upgrade from 1.3.2 to 1.4.* seems to cause segmentation fault when MongoDB Exception was thrown Oct 16, 2024
bluezod commented Oct 17, 2024

Thanks for the response @bwoebi

Here is the output I got following your debug instructions:

root@xxxxxxxx-yyyyyy:/# gdb /usr/lib/php/20210902/ddtrace.so -c /core.9378
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.2) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib/php/20210902/ddtrace.so...

warning: core file may not match specified executable file.
[New LWP 9378]
[New LWP 9384]

warning: Could not load shared library symbols for /tmp/.tmpWFV5Bk.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Dwarf Error: DW_FORM_strx1 found in non-DWO CU [in module /opt/datadog/dd-library/1.4.1/lib/libddappsec-helper.so]
Core was generated by `datadog-ipc-helper /tmp/.tmp0dpXyd /usr/lib/php/20210902/ddtrace.so - /tmp/.tmp'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f71fc769b80 (LWP 9378))]
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f71fc78d859 in __GI_abort () at abort.c:79
#2  0x00007f71fc3fbe2a in dds::runner::~runner() () from /opt/datadog/dd-library/1.4.1/lib/libddappsec-helper.so
#3  0x00007f71fc334791 in std::__1::shared_ptr<dds::runner>::~shared_ptr[abi:v160006]() () from /opt/datadog/dd-library/1.4.1/lib/libddappsec-helper.so
#4  0x00007f71fc7b1fde in __cxa_finalize (d=0x7f71fc5516f0 <__dso_handle>) at cxa_finalize.c:83
#5  0x00007f71fc333235 in __do_fini () from /opt/datadog/dd-library/1.4.1/lib/libddappsec-helper.so
#6  0x00007f71fcaf7952 in call_destructors (closure=closure@entry=0x21f4c030) at dl-close.c:125
#7  0x00007f71fc8cbba5 in __GI__dl_catch_exception (exception=exception@entry=0x0, operate=operate@entry=0x7f71fcaf7910 <call_destructors>, args=args@entry=0x21f4c030) at dl-error-skeleton.c:182
#8  0x00007f71fcaf7f36 in _dl_close_worker (map=map@entry=0x21f4c030, force=force@entry=false) at dl-close.c:297
#9  0x00007f71fcaf8f05 in _dl_close_worker (force=false, map=0x21f4c030) at dl-close.c:145
#10 _dl_close (_map=0x21f4c030) at dl-close.c:859
#11 0x00007f71fc8cbb48 in __GI__dl_catch_exception (exception=exception@entry=0x7ffe6687db00, operate=operate@entry=0x7f71fcad0420 <dlclose_doit>, args=args@entry=0x21f4c030) at dl-error-skeleton.c:208
#12 0x00007f71fc8cbc13 in __GI__dl_catch_error (objname=objname@entry=0x21f4b2d0, errstring=errstring@entry=0x21f4b2d8, mallocedp=mallocedp@entry=0x21f4b2c8, operate=operate@entry=0x7f71fcad0420 <dlclose_doit>,
    args=args@entry=0x21f4c030) at dl-error-skeleton.c:227
#13 0x00007f71fcad0b59 in _dlerror_run (operate=operate@entry=0x7f71fcad0420 <dlclose_doit>, args=0x21f4c030) at dlerror.c:170
#14 0x00007f71fcad0468 in __dlclose (handle=<optimized out>) at dlclose.c:46
#15 0x000000000040087c in <F as core::future::into_future::IntoFuture>::into_future () at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/into_future.rs:138
#16 hyper::client::client::Client<C,B>::send_request::{{closure}} () at /rust/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/client/client.rs:294
#17 hyper::client::client::Client<C,B>::retryably_send_request::{{closure}} () at /rust/cargo/registry/src/index.crates.io-6f17d22bba15001f/hyper-0.14.28/src/client/client.rs:209
#18 0x0000000000000000 in ?? ()
(gdb)

neoty commented Oct 17, 2024

I am using Kubernetes, and when deploying the latest version, memory fills up in several pods sequentially within 10 minutes, causing OOM errors. The cause is unknown, but it does not occur in version 1.3.1 for now.

bwoebi (Collaborator) commented Oct 17, 2024

@neoty We have observed (via our telemetry) that for some users the dd-ipc-helper process uses too much memory, but we do not yet know why. As a stop-gap measure we kill the process once it reaches 1 GB of RSS, but maybe even that is too much for you? What memory-usage numbers do you observe?
Also, is it possible for you to run php -r 'var_dump(dd_trace_internal_fn("stats_sidecar"));' once memory gets high? It should tell us which resources are being used.

bwoebi (Collaborator) commented Oct 17, 2024

@bluezod Looks like it's related to an abort() from the appsec helper; we'll have a look at handling it more gracefully. Thanks for the stack trace!

neoty commented Oct 17, 2024

Currently, due to stable operations and policies in place, additional testing is challenging (apologies for not being able to assist further).

However, we do have the following information:

1.	We are using PHP-FPM.
2.	During initial setup, we tested container memory from 2GB up to 8GB (OOM occurred in all cases).
3.	If there are 10 pods, memory spikes randomly in one of the pods within a few minutes or tens of minutes, causing an OOM.
4.	When all other variables were controlled and only the Datadog version was changed, the issue did not occur with version 1.3.

bwoebi (Collaborator) commented Oct 17, 2024

@mirkojoshua

==23435== Conditional jump or move depends on uninitialised value(s)
==23435==    at 0xAD7FF4: zend_std_write_property (zend_object_handlers.c:834)
==23435==    by 0x9E4477: zend_update_property (zend_API.c:4653)
==23435==    by 0xB2AFBDF: phongo_add_exception_prop (phongo_error.c:35)
==23435==    by 0xB2B14AF: phongo_execute_bulk_write (phongo_execute.c:327)
==23435==    by 0xB2EA8F3: zim_MongoDB_Driver_Manager_executeBulkWrite (Manager.c:590)
==23435==    by 0xA18E37: ZEND_DO_FCALL_SPEC_OBSERVER_HANDLER (zend_vm_execute.h:2061)
==23435==    by 0xA9D7E7: execute_ex (zend_vm_execute.h:56115)
==23435==    by 0xAA1ECB: zend_execute (zend_vm_execute.h:60439)
==23435==    by 0x9D21A3: zend_execute_scripts (zend.c:1840)
==23435==    by 0x923987: php_execute_script (main.c:2565)
==23435==    by 0xB66507: do_cli (php_cli.c:964)
==23435==    by 0xB6711B: main (php_cli.c:1333)

Thanks for reporting, I managed to reproduce the issue (which we saw in our crashtracking telemetry). I'll work on fixing it.

bwoebi added a commit that referenced this issue Oct 17, 2024
They're shared across all inheritors on PHP 8.1+ and thus only need to be incremented on the specific ce.

This issue arises when a property from an inherited internal class extending Exception is itself inherited by another class.

Also fix an accidentally discovered use-after-free with peer.service.

Fixes #2898.

Signed-off-by: Bob Weinand <[email protected]>
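
For illustration, a userland stand-in for the class shape the commit message describes (hypothetical class names; the real trigger involves internal classes of the MongoDB extension, where an internal class extending Exception declares a property that a further class inherits):

<?php
// Userland stand-in only, not the actual crashing code: a property declared
// on an intermediate Exception subclass is inherited by a further subclass.
class WriteFailure extends Exception {
    protected $writeResult; // declared on the intermediate class
}

class BulkWriteFailure extends WriteFailure {} // inherits the property again

try {
    throw new BulkWriteFailure('E11000 duplicate key error');
} catch (BulkWriteFailure $e) {
    // For the analogous internal-class hierarchy, ddtrace 1.4.0/1.4.1
    // mishandled the default properties shared across inheritors on
    // PHP 8.1+, crashing instead of reaching a catch block like this one.
    echo $e->getMessage(), PHP_EOL;
}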
bwoebi closed this as completed in ed256f3 Oct 17, 2024
bwoebi (Collaborator) commented Oct 17, 2024

@neoty I'll try to add more metrics (so that we can tackle that), but it's sadly nothing I can reproduce. :-/

@bluezod Thanks, we're preparing a fix for this in #2900.

bwoebi (Collaborator) commented Oct 21, 2024

@bluezod @mirkojoshua We've released 1.4.2, which fixes both issues.

@neoty I suppose what you observe matches #2902. I'll track that there.

mirkojoshua (Author) commented Oct 21, 2024

@bwoebi (my previous tag was wrong :p)
Yes, 1.4.2 seems to fix the issue! My tests with 1.4.2 are now all green.

Thank you!
