-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] feat(proxy-wasm) host-managed property getters and setters using the Lua bridge #441
Conversation
This feature adds support for host-managed property getter and setter functions. These feature is implemented in three stages: * `ngx_proxy_wasm_properties.c`: in our implementation for proxy-wasm properties, we now allow the host to specify a function which gets called whenever proxy-wasm receives a `set_property` or `get_property` call matching the host-defined namespace. Those getter and setter functions are then responsible to return and store the values, and are able to produce any desired side-effects in the host. * `ngx_wasm_lua_ffi.c`: We then expose this functionality in our Lua FFI interface. This is a lightweight wrapper, which accepts a C function pointer directly, and associates this function to the request context object. * `lib/resty/wasmx/proxy_wasm.lua`: Finally, we expose a Lua API for setting a Lua functions as setters and getters. The interface is such that the host Lua program should set one setter function and/or getter function, and then within that function properly dispatch the behavior based on the received key. Internally, these functions use the fact that LuaJIT is able to convert a Lua function into a C function pointer. Due to lifetime management, those generated C callbacks are a limited resource in the LuaJIT VM, so we produce a single pair of these and then have them call the user-defined Lua function, which remains as a normal Lua closure. (See http://luajit.org/ext_ffi_semantics.html for more details on LuaJIT FFI C callback managment.) To ensure that the lifetime of strings produced by the host continue valid by the time the getter callback finished running and the Wasm engine reads the properties, getter/setter-produced values continue to be stored in the rbtree for host properties that we already had for plain `get_property`/`set_property`, introduced in da82087. Since we need to store this data on the C side anyway, we also add an extra feature as an optimization: the concept of "const" host-properties. If the Lua callback returns a truthy third-argument, that signals that that value should be flagged as constant. Further calls to `get_property` during the same request will get the property value directly from the ngx_wasm_module rbtree "cache" without hitting the getter callback again. Test cases in the test suite demonstrate this feature. In this PR, the host namespace continues to be statically defined at compile time via the NGX_WASM_HOST_PROPERTY_NAMESPACE define. The regular FFI functions for `set_property` and `get_property` introduced in 447ef13 also continue to work normally.
The Kong Gateway CI started seeing intermittent crashes with "bad callback" errors: See: https://luajit.freelists.narkive.com/sdhSLJSr/how-to-make-bad-callback-more-deterministic ``` The bad callback error happens because some JIT-compiled Lua code calls a C function which in turn calls an FFI callback. ``` https://lua-l.lua.narkive.com/qXJrNlpP/luajit-ffi-windows-bad-callback-error-in-msgwaitformultipleobjects-proof-of-concept From Mike Pall: ``` The problem is that a FFI callback cannot safely be called from a C function which is itself called via the FFI from JIT-compiled code. In your case this is the call to MsgWaitForMultipleObjects. I've put in a lot of heuristics to detect this, and it usually succeeds in disabling compilation for such a function. However in your case the loop is compiled before the callback is ever called, so the detection fails. The straighforward solution is to put the message loop into an extra Lua function and use jit.off(func) ``` See
lib/resty/wasmx/proxy_wasm.lua
Outdated
|
||
local rc = C.ngx_http_wasm_ffi_get_property(r, ckey, cvalue) | ||
local rc = tonumber(rcp[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me the issues we've been seeing prior to this change were maybe more likely to be related to an error in stack manipulation of L
or cur_co
than anything else... I spent some time looking at it but it warrants more investigation since something could be wrong in the Lua bridge and we never caught it through the resolver chunk due to it having a different number of arguments/return values (nrets
lua_resume arg).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe more likely to be related to an error in stack manipulation of L or cur_co than anything else...
I had a similar hunch. Reading the Lua bridge and the sources for the OpenResty functions it calls confused me a bit because I was seeing the nargs
value being passed to an argument called nrets
.
And yeah, this commit efe1c31 was just an attempt at an ugly workaround. I was drawn to it because i was consistently seeing backtraces with lj_BC_RET
on top:
(gdb) bt
#0 0x00007f9315407633 in lj_BC_RET () from /home/hisham/work/kong/ngx_wasm_module/work/buildroot/prefix/luajit/lib/libluajit-5.1.so.2
#1 0x00005573496c3e5c in ngx_http_lua_run_thread (L=0x7f9313fe0380, r=0x55734c3cf090, ctx=0x55734ade7ea0, nrets=0) at ../ngx_lua-0.10.21/src/ngx_http_lua_util.c:1185
#2 0x00005573496cd612 in ngx_http_lua_access_by_chunk (L=0x7f9313fe0380, r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:337
#3 0x00005573496cd1ea in ngx_http_lua_access_handler_inline (r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:185
#4 0x00005573496cd123 in ngx_http_lua_access_handler (r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:158
#5 0x00005573495bc512 in ngx_http_core_access_phase (r=0x55734c3cf090, ph=0x55734adede18) at src/http/ngx_http_core_module.c:1110
#6 0x00005573495bbb78 in ngx_http_core_run_phases (r=0x55734c3cf090) at src/http/ngx_http_core_module.c:885
#7 0x00005573495bbae1 in ngx_http_handler (r=0x55734c3cf090) at src/http/ngx_http_core_module.c:868
#8 0x00005573495cca05 in ngx_http_process_request (r=0x55734c3cf090) at src/http/ngx_http_request.c:2120
#9 0x00005573495cb076 in ngx_http_process_request_headers (rev=0x55734c202af0) at src/http/ngx_http_request.c:1498
#10 0x00005573495ca3b2 in ngx_http_process_request_line (rev=0x55734c202af0) at src/http/ngx_http_request.c:1165
#11 0x00005573495c897d in ngx_http_wait_request_handler (rev=0x55734c202af0) at src/http/ngx_http_request.c:503
#12 0x00005573495a2f15 in ngx_epoll_process_events (cycle=0x55734adc3030, timer=60000, flags=1) at src/event/modules/ngx_epoll_module.c:901
#13 0x000055734958eb39 in ngx_process_events_and_timers (cycle=0x55734adc3030) at src/event/ngx_event.c:257
#14 0x000055734959e9f8 in ngx_single_process_cycle (cycle=0x55734adc3030) at src/os/unix/ngx_process_cycle.c:323
#15 0x0000557349553c34 in main (argc=5, argv=0x7ffc47a67b38) at src/core/nginx.c:383
These can be seen via core dump and via Valgrind.
What's interesting is that changing the ffi.cdef
definitions changes the values being read by "Invalid read of size 1" reported by Valgrind. If I changed the return values from ngx_int_t
(64-bit) to int
(32-bit), I'd get 0xffffffffffffffff back. If I use the correct 64-bit return type, I get a different, non-all-ffff number (but which is also an invalid pointer).
...so I wanted to experiment to see what happened if I avoided touching the return value of the FFI call and indeed the code seems to continue to churn along.
A quick search led me to this old post of yours. I'm hoping we're just doing something silly that's putting the Lua stack in a bad state, and it's not anything like this again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I saw the same backtraces yesterday night. The nargs
is named as such because even though the underlying argument is nrets
, in our case "how many arguments do we want on the stack when the chunk if finished" is the equivalent of "how many arguments do we get when we do arg1, argn = ...
", which thus becomes the number of arguments we want to pass to the chunk. Hope that makes sense; it did make the bridge interface usage simpler to grasp.
The cause of the backtraces could be unrelated to the stack (I tried playing with the stack in openresty during my debugging yesterday), but I am not sure it could be related to that old patch... 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also build against a local OpenResty release archive with NGX_BUILD_OPENRESTY=/path/to/src/openresty-releases/openresty-1.21.4.2 NGX_BUILD_CC_OPT='-DNGX_LUA_USE_ASSERT -O0' make
; so you can look deep into ngx_http_lua_util.c
and what could be going wrong there. Ideally we either only use Lua C API or OpenResty C APIs, but not both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we either only use Lua C API or OpenResty C APIs, but not both.
Yeah, the mix in e077a66 is a bit of an exploration, to understand what's going on bit by bit.
You can also build against a local OpenResty release archive with NGX_BUILD_OPENRESTY=/home/chasum/src/openresty-releases/openresty-1.21.4.2 NGX_BUILD_CC_OPT='-DNGX_LUA_USE_ASSERT -O0' make; so you can look deep into ngx_http_lua_util.c and what could be going wrong there.
Thanks for the pointer — I will do a bit more exploring via the Lua/C API to understand some more of what's going on and then try to switch back to the OpenResty API. That should be useful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this new version seems to work!
Unfortunately we're not there yet! This version passes our tests in ngx_wasm_module, but then my original FFI-only version did too.
This one fails less when running the Gateway tests, but I'm still getting LuaJIT crashes. At least the failures are somewhat easy to trigger: out of a 23-test Busted file, I'm getting about 4 tests fail every time (but not always the same ones). The failures are LuaJIT segfaults, and some of them produce messages such as:
LuaJIT ASSERT lj_snap.c:140: snapshot_framelinks: broken frame chain
Given that our own @zhongweiy has just got stack frame fixes merged into LuaJIT (LuaJIT/LuaJIT@65c8493) and that the commit history has a bunch of very recent stack fixes, I'm starting to wonder if it's a matter of trying with a more recent LuaJIT... (Not looking forward to resync the OpenResty patched version, but if that's the only way to test it with the Gateway, I guess that's the way to go...)
Good news that the prop bridge approach seems ok so far, although it seems like said bridge has a bug, or the utilization of OpenResty C APIs is incomplete perhaps. Your conclusion too?
Maybe, I'm not sure. e2f2449 commit message has my thoughts on the Lua bridge change.
In any case, the code as it stands has a leak because I'm creating the coroutine with the Lua bridge API, then running it with lua_resume
and never cleaning it up (because the Lua bridge run function destroys it on its own and I'm not currently calling it.) But the current state was enough to check the Gateway tests, so I went for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(ah, further note: simply changing the lua_resume
back to the ngx_wasm_lua_thread_run
doesn't work, as it then makes the ngx_wasm_module tests fail.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not looking forward to resync the OpenResty patched version, but if that's the only way to test it with the Gateway, I guess that's the way to go...
So I have backported several dozens of commits from LuaJIT's upstream v2.1
into kong/build/openresty/patches
... I blanket-picked all commits from April 10 2023 onwards because the LuaJIT version in OpenResty is called LuaJIT-2.1.-20230410
. Many didn't apply cleanly, indicating quite some divergence between the OpenResty and upstream trees (and Mike's recent move to the "rolling release" model contains some pretty strong words against relying on patched versions).
Anyway, imagine my feeling when I finally got a clean build, ran the Gateway tests and they start running green... until one of them fails. Then I run again, and get 4 failures as usual. Then 2 failures. Then 4 again. Meh. It was a glimmer of hope. If anything, it feels like it crashes a little less on average, but it might be a biased perception from someone who wished those 99 patch files were worth something :laughsob:
Do you know if it is possible to run OpenResty nowadays with full vanilla LuaJIT? It's the next thing I thought of trying, which implies diving into Bazel and the OpenResty build system (...or hosting an alternative OpenResty package with the replaced LuaJIT and just changing a single URL in Bazel... :evilgrin: (yes, I'm inventing imaginary emojis)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, to confirm, if I add jit.off()
at the beginning of times in kong/init.lua
, the suite always passes. But then, it also did with the first (FFI-only) version of this commit. :\
With jit.on()
, both the FFI and Lua-bridge versions also trigger pretty much identical failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know if it still is possible to run a vanilla LuaJIT in OpenResty...
Also, to confirm, if I add jit.off() at the beginning of times in kong/init.lua, the suite always passes. But then, it also did with the first (FFI-only) version of this commit.
At least this seems to indicate that our stack manipulation isn't the issue, but sure, that it hints to the JIT is quite perplexing.
efe1c31
to
e077a66
Compare
As far as I can tell this shouldn't be necessary as the coroutine uses the same global state as its parent, but I was observing crashes in OpenResty code triggered by `ngx.log` at the log phase, where the `log` variable used by OpenResty's C code was set to one from already-freed data from a fake request created during `init_worker`. Forcing the Lua state's `r` to match the caller's here made the tests pass and Valgrind happy.
… fails..." This reverts commit efe1c31.
This makes other tests pass, but this is very WIP because it doesn't destroy the thread, etc. Some setup is going wrong when running this because this change produces segfaults with `r->connection->pool` being unset.
e077a66
to
7e803d9
Compare
Closed in favor of #431 |
No description provided.