[Linux] SIGSEGV: segmentation violation during cgo execution of cgoLookupIP and getaddrinfo #41398

Open
cmacknz opened this issue Oct 23, 2024 · 11 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@cmacknz
Member

cmacknz commented Oct 23, 2024

We have an internal example of multiple Beats failing shortly after startup with a segmentation fault in cgo code. The exact code path leading to this is not yet clear because the fault occurs inside cgo, although we do have the stack trace, which is attached.

{"log.level":"info","@timestamp":"2024-10-18T15:10:23.373Z","message":"running under elastic-agent, per-beat lockfiles disabled","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.origin":{"file.line":443,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).launch"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Starting stats endpoint","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"api","log.origin":{"file.line":69,"file.name":"api/server.go","function":"github.com/elastic/beats/v7/libbeat/api.(*Server).Start"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Syscall filter successfully installed","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"seccomp","log.origin":{"file.line":125,"file.name":"seccomp/seccomp.go","function":"github.com/elastic/beats/v7/libbeat/common/seccomp.loadFilter"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Beat info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","system_info":{"beat":{"path":{"config":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components","data":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/run/filestream-monitoring","home":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components","logs":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components/logs"},"type":"filebeat","uuid":"5a0b058b-04d4-4e07-b5cd-3a4aef38a2f7"},"ecs.version":"1.6.0"},"log.logger":"beat","log.origin":{"file.line":1385,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Build info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1394,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"build":{"commit":"26daf71e4ec87172523af7f0e916cba9f79dc0d0","libbeat":"8.15.2","time":"2024-09-19T09:24:35.000Z","version":"8.15.2"},"ecs.version":"1.6.0"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.374Z","message":"Go runtime info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1397,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"ecs.version":"1.6.0","go":{"arch":"amd64","max_procs":8,"os":"linux","version":"go1.22.6"}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.375Z","message":"Host info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"system_info":{"ecs.version":"1.6.0","host":{"architecture":"x86_64","boot_time":"2024-10-18T11:12:02+02:00","containerized":false,"id":"3fe2439e8486446eabcfaac351556a64","ip":["127.0.0.1","::1","10.0.0.45","fd00::9250:6d5f:2a99:b767","fe80::2078:f5bd:8159:2e29","10.0.0.47","fd00::9402:7f04:e6ae:472c","fe80::14c1:3059:f370:301a"],"kernel_version":"6.11.3-arch1-1","mac":["f8:75:a4:52:86:80","f8:75:a4:52:86:7f","24:41:8c:35:dd:51"],"name":"antiope","native_architecture":"x86_64\n","os":{"build":"rolling","family":"arch","major":0,"minor":0,"name":"Arch Linux","patch":0,"platform":"arch","type":"linux","version":""},"timezone":"CEST","timezone_offset_sec":7200}},"log.logger":"beat","log.origin":{"file.line":1403,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.375Z","message":"Process info","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"beat","log.origin":{"file.line":1432,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.logSystemInfo"},"service.name":"filebeat","system_info":{"ecs.version":"1.6.0","process":{"capabilities":{"ambient":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"],"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read","perfmon","bpf","checkpoint_restore"]},"cwd":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/run/filestream-monitoring","exe":"/opt/Elastic/Agent/data/elastic-agent-8.15.2-621bbc/components/agentbeat","name":"agentbeat","pid":611948,"ppid":600393,"seccomp":{"mode":"filter","no_new_privs":true},"start_time":"2024-10-18T17:10:22.500+0200"}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Setup Beat: filebeat; Version: 8.15.2","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.origin":{"file.line":341,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).createBeater"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Metrics endpoint listening on: /opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock (configured: unix:///opt/Elastic/Agent/data/tmp/xTEtpJ7117ppc6OYvJCaYHbDW8mLjXGe.sock)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"api","log.origin":{"file.line":71,"file.name":"api/server.go","function":"github.com/elastic/beats/v7/libbeat/api.(*Server).Start.func1"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.376Z","message":"Output is configured through Central Management","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"service.name":"filebeat","ecs.version":"1.6.0","log.origin":{"file.line":373,"file.name":"instance/beat.go","function":"github.com/elastic/beats/v7/libbeat/cmd/instance.(*Beat).createBeater"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-10-18T15:10:23.378Z","message":"Beat name: antiope","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"log.logger":"publisher","log.origin":{"file.line":105,"file.name":"pipeline/module.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.LoadWithSettings"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"SIGSEGV: segmentation violation","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"PC=0x0 m=4 sigcode=1 addr=0x0","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-18T15:10:23.381Z","message":"signal arrived during cgo execution","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"filestream-monitoring","type":"filestream"},"log":{"source":"filestream-monitoring"},"ecs.version":"1.6.0"}

cgo_segfault.json

@cmacknz cmacknz added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Oct 23, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@cmacknz
Member Author

cmacknz commented Oct 23, 2024

@mauri870
Member

mauri870 commented Oct 23, 2024

Briefly looking at the logs, I can see references such as net.cgoLookupHostIP, which belongs to the cgo DNS resolver. We could opt in to the netgo resolver instead.

Edit: The crash seems to be triggered in the call to reflect.implements https://github.com/elastic/go-ucfg/blob/4fd3937/initializer.go#L39C29-L39C39

@cmacknz cmacknz changed the title [8.15.2][Linux] SIGSEGV: segmentation violation during cgo execution [Linux] SIGSEGV: segmentation violation during cgo execution of cgoLookupIP and getaddrinfo Oct 23, 2024
@rdner
Member

rdner commented Oct 23, 2024

Does the issue happen if GODEBUG=netdns=go is set?

@mauri870
Member

Does the issue happen if GODEBUG=netdns=go is set?

I'm also wondering about this. The cgo resolver uses threads, so in high-contention scenarios the netgo resolver might perform better by leveraging goroutines.

@cmacknz
Member Author

cmacknz commented Oct 23, 2024

Does the issue happen if GODEBUG=netdns=go is set?

Confirmed that setting GODEBUG=netdns=go stops this from happening.

@rdner
Member

rdner commented Oct 23, 2024

There is a chance that PR #41402 will fix this.
The PR updates glibc from 2.28 to 2.31.

@pierrehilbert
Collaborator

Sorry, I didn't follow up on this topic.
Did your PR fix the issue?

@rdner
Member

rdner commented Nov 4, 2024

@pierrehilbert it still needs to be tested. It's quite hard to reproduce, but I can try. This change was not included in 8.16 due to a product decision, so the only option is to build Filebeat from source and run it in the Linux environment where the crash happens.

@weltenwort would you mind sharing your OS configuration so I can reproduce the environment? Or perhaps you'd be willing to test it yourself?

We need your Linux distribution, version, glibc version, etc.

@weltenwort
Member

Hi @rdner 👋

I'm running arch with the stock kernel 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux and glibc version 2.40+r16+gaa533d58ff-2, but only because I was on PTO for a few days. The kernel will certainly have been updated by now.

If you can assist me in setting up the build environment I could certainly test it on my machine.

@rdner
Member

rdner commented Nov 5, 2024

@pierrehilbert I just had a call with @weltenwort and the issue persists despite the glibc update (2.28 to 2.31) in Beats 8.17 binaries. Looks like the only action we can take now is to update the documentation and tell users to use GODEBUG=netdns=go if this crash occurs.

The good news is that the crash is stable and deterministic, not flaky.
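
For a standalone Filebeat, one way to apply the GODEBUG=netdns=go workaround without changing the shell environment is a small launcher. This is a minimal sketch under stated assumptions: the wrapper program and the withGoResolver helper are hypothetical and not part of Beats, and Beats started by Elastic Agent would need the variable set differently. The helper appends netdns=go to any GODEBUG value that is already present rather than replacing it.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// withGoResolver returns a copy of env in which GODEBUG contains netdns=go.
// Existing GODEBUG settings are kept and netdns=go is appended after them.
func withGoResolver(env []string) []string {
	out := make([]string, 0, len(env)+1)
	found := false
	for _, kv := range env {
		if v, ok := strings.CutPrefix(kv, "GODEBUG="); ok {
			found = true
			if v == "" {
				kv = "GODEBUG=netdns=go"
			} else {
				kv = "GODEBUG=" + v + ",netdns=go"
			}
		}
		out = append(out, kv)
	}
	if !found {
		out = append(out, "GODEBUG=netdns=go")
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: wrapper <command> [args...]")
		os.Exit(2)
	}
	// Run the given command with the adjusted environment.
	cmd := exec.Command(os.Args[1], os.Args[2:]...)
	cmd.Env = withGoResolver(os.Environ())
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Assuming the binary is built as wrapper, it would be invoked as ./wrapper ./filebeat -e -c ./filebeat.yml.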

Steps to reproduce

OS: Arch Linux
Linux Kernel: 6.11.5-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 22 Oct 2024 18:31:38 +0000 x86_64 GNU/Linux
glibc: 2.40+r16+gaa533d58ff-2

  1. We need to test against a remote ES, the easiest way is Elastic Cloud: create a deployment.
  2. Use this filebeat.yml configuration and follow instructions in the comments:
filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    enabled: true
    paths:
      - "/var/log/*.log" # check if you have matching files on your machine, change if necessary
path.data: "/tmp/filebeat" # so, nothing is left after the test runs

logging:
  level: debug # can be noisy but we would like to see everything


output.elasticsearch:
  # in case you build from sources, disables the compatibility check
  allow_older_versions: true
  # Create an API key, pick the Beats format (!!!) and copy it to the config file
  api_key: "<FIRST>:<SECOND>"
  # Open the deployment management and copy the Elasticsearch endpoint to the config file
  hosts: ["https://<HOSTNAME>:443"] # keep the 443 port.
  3. Run ./filebeat -e -c ./filebeat.yml 2> output.json

If the issue is present, Filebeat stops almost instantly and the following stack trace appears at the end of output.json. The stack trace is identical between the 8.16 and 8.17 versions of Beats:

SIGSEGV: segmentation violation
PC=0x0 m=4 sigcode=1 addr=0x0
signal arrived during cgo execution

goroutine 52 gp=0xc00023c8c0 m=4 mp=0xc0000c1808 [syscall]:
runtime.cgocall(0x64f288625900, 0xc000f8b5a8)
	runtime/cgocall.go:157 +0x4b fp=0xc000f8b580 sp=0xc000f8b548 pc=0x64f2840a42cb
net._C2func_getaddrinfo(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
	_cgo_gotypes.go:105 +0x59 fp=0xc000f8b5a8 sp=0xc000f8b580 pc=0x64f28438e259
net._C_getaddrinfo.func1(0xc000d2ad30, 0x0, 0xc00113a270, 0xc0000be848)
	net/cgo_unix_cgo.go:78 +0x7a fp=0xc000f8b5f0 sp=0xc000f8b5a8 pc=0x64f28438ec5a
net._C_getaddrinfo(0xc0000141b0?, 0x9?, 0x0?, 0x0?)
	net/cgo_unix_cgo.go:78 +0x13 fp=0xc000f8b620 sp=0xc000f8b5f0 pc=0x64f28438eb93
net.cgoLookupHostIP({0x64f28862873e, 0x3}, {0xc0000141b0, 0x9})
	net/cgo_unix.go:168 +0x228 fp=0xc000f8b760 sp=0xc000f8b620 pc=0x64f284358028
net.cgoLookupIP.func1()
	net/cgo_unix.go:217 +0x25 fp=0xc000f8b790 sp=0xc000f8b760 pc=0x64f284358745
net.doBlockingWithCtx[...].func1()
	net/cgo_unix.go:56 +0x32 fp=0xc000f8b7e0 sp=0xc000f8b790 pc=0x64f28438efb2
runtime.goexit({})
	runtime/asm_amd64.s:1695 +0x1 fp=0xc000f8b7e8 sp=0xc000f8b7e0 pc=0x64f28411a8e1
created by net.doBlockingWithCtx[...] in goroutine 51
	net/cgo_unix.go:54 +0xd8
