Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] feat(proxy-wasm) host-managed property getters and setters using the Lua bridge #441

Closed
wants to merge 11 commits into from

Conversation

hishamhm
Copy link
Contributor

@hishamhm hishamhm commented Nov 7, 2023

No description provided.

This feature adds support for host-managed property getter and
setter functions.

These feature is implemented in three stages:

* `ngx_proxy_wasm_properties.c`: in our implementation for
  proxy-wasm properties, we now allow the host to specify a
  function which gets called whenever proxy-wasm receives a
  `set_property` or `get_property` call matching the host-defined
  namespace. Those getter and setter functions are then responsible
  to return and store the values, and are able to produce any
  desired side-effects in the host.

* `ngx_wasm_lua_ffi.c`: We then expose this functionality in our
  Lua FFI interface. This is a lightweight wrapper, which accepts
  a C function pointer directly, and associates this function
  to the request context object.

* `lib/resty/wasmx/proxy_wasm.lua`: Finally, we expose a Lua API
  for setting a Lua functions as setters and getters. The interface
  is such that the host Lua program should set one setter function
  and/or getter function, and then within that function properly
  dispatch the behavior based on the received key. Internally, these
  functions use the fact that LuaJIT is able to convert a Lua
  function into a C function pointer. Due to lifetime management,
  those generated C callbacks are a limited resource in the LuaJIT
  VM, so we produce a single pair of these and then have them
  call the user-defined Lua function, which remains as a normal
  Lua closure. (See http://luajit.org/ext_ffi_semantics.html for
  more details on LuaJIT FFI C callback managment.)

To ensure that the lifetime of strings produced by the host continue
valid by the time the getter callback finished running and the Wasm
engine reads the properties, getter/setter-produced values continue
to be stored in the rbtree for host properties that we already had
for plain `get_property`/`set_property`, introduced in da82087.

Since we need to store this data on the C side anyway, we also
add an extra feature as an optimization: the concept
of "const" host-properties. If the Lua callback returns a truthy
third-argument, that signals that that value should be flagged as
constant. Further calls to `get_property` during the same request
will get the property value directly from the ngx_wasm_module rbtree
"cache" without hitting the getter callback again. Test cases in the
test suite demonstrate this feature.

In this PR, the host namespace continues to be statically defined
at compile time via the NGX_WASM_HOST_PROPERTY_NAMESPACE define.
The regular FFI functions for `set_property` and `get_property`
introduced in 447ef13 also continue to work normally.
The Kong Gateway CI started seeing intermittent crashes with "bad callback" errors:

See:
https://luajit.freelists.narkive.com/sdhSLJSr/how-to-make-bad-callback-more-deterministic
```
The bad callback error happens because some JIT-compiled Lua code calls a C
function which in turn calls an FFI callback.
```

https://lua-l.lua.narkive.com/qXJrNlpP/luajit-ffi-windows-bad-callback-error-in-msgwaitformultipleobjects-proof-of-concept
From Mike Pall:
```
The problem is that a FFI callback cannot safely be called from a
C function which is itself called via the FFI from JIT-compiled
code. In your case this is the call to MsgWaitForMultipleObjects.

I've put in a lot of heuristics to detect this, and it usually
succeeds in disabling compilation for such a function. However in
your case the loop is compiled before the callback is ever called,
so the detection fails.

The straighforward solution is to put the message loop into an
extra Lua function and use jit.off(func)
```

See

local rc = C.ngx_http_wasm_ffi_get_property(r, ckey, cvalue)
local rc = tonumber(rcp[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me the issues we've been seeing prior to this change were maybe more likely to be related to an error in stack manipulation of L or cur_co than anything else... I spent some time looking at it but it warrants more investigation since something could be wrong in the Lua bridge and we never caught it through the resolver chunk due to it having a different number of arguments/return values (nrets lua_resume arg).

Copy link
Contributor Author

@hishamhm hishamhm Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe more likely to be related to an error in stack manipulation of L or cur_co than anything else...

I had a similar hunch. Reading the Lua bridge and the sources for the OpenResty functions it calls confused me a bit because I was seeing the nargs value being passed to an argument called nrets.

And yeah, this commit efe1c31 was just an attempt at an ugly workaround. I was drawn to it because i was consistently seeing backtraces with lj_BC_RET on top:

(gdb) bt
#0  0x00007f9315407633 in lj_BC_RET () from /home/hisham/work/kong/ngx_wasm_module/work/buildroot/prefix/luajit/lib/libluajit-5.1.so.2
#1  0x00005573496c3e5c in ngx_http_lua_run_thread (L=0x7f9313fe0380, r=0x55734c3cf090, ctx=0x55734ade7ea0, nrets=0) at ../ngx_lua-0.10.21/src/ngx_http_lua_util.c:1185
#2  0x00005573496cd612 in ngx_http_lua_access_by_chunk (L=0x7f9313fe0380, r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:337
#3  0x00005573496cd1ea in ngx_http_lua_access_handler_inline (r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:185
#4  0x00005573496cd123 in ngx_http_lua_access_handler (r=0x55734c3cf090) at ../ngx_lua-0.10.21/src/ngx_http_lua_accessby.c:158
#5  0x00005573495bc512 in ngx_http_core_access_phase (r=0x55734c3cf090, ph=0x55734adede18) at src/http/ngx_http_core_module.c:1110
#6  0x00005573495bbb78 in ngx_http_core_run_phases (r=0x55734c3cf090) at src/http/ngx_http_core_module.c:885
#7  0x00005573495bbae1 in ngx_http_handler (r=0x55734c3cf090) at src/http/ngx_http_core_module.c:868
#8  0x00005573495cca05 in ngx_http_process_request (r=0x55734c3cf090) at src/http/ngx_http_request.c:2120
#9  0x00005573495cb076 in ngx_http_process_request_headers (rev=0x55734c202af0) at src/http/ngx_http_request.c:1498
#10 0x00005573495ca3b2 in ngx_http_process_request_line (rev=0x55734c202af0) at src/http/ngx_http_request.c:1165
#11 0x00005573495c897d in ngx_http_wait_request_handler (rev=0x55734c202af0) at src/http/ngx_http_request.c:503
#12 0x00005573495a2f15 in ngx_epoll_process_events (cycle=0x55734adc3030, timer=60000, flags=1) at src/event/modules/ngx_epoll_module.c:901
#13 0x000055734958eb39 in ngx_process_events_and_timers (cycle=0x55734adc3030) at src/event/ngx_event.c:257
#14 0x000055734959e9f8 in ngx_single_process_cycle (cycle=0x55734adc3030) at src/os/unix/ngx_process_cycle.c:323
#15 0x0000557349553c34 in main (argc=5, argv=0x7ffc47a67b38) at src/core/nginx.c:383

These can be seen via core dump and via Valgrind.

What's interesting is that changing the ffi.cdef definitions changes the values being read by "Invalid read of size 1" reported by Valgrind. If I changed the return values from ngx_int_t (64-bit) to int (32-bit), I'd get 0xffffffffffffffff back. If I use the correct 64-bit return type, I get a different, non-all-ffff number (but which is also an invalid pointer).

...so I wanted to experiment to see what happened if I avoided touching the return value of the FFI call and indeed the code seems to continue to churn along.

A quick search led me to this old post of yours. I'm hoping we're just doing something silly that's putting the Lua stack in a bad state, and it's not anything like this again...

Copy link
Member

@thibaultcha thibaultcha Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I saw the same backtraces yesterday night. The nargs is named as such because even though the underlying argument is nrets, in our case "how many arguments do we want on the stack when the chunk if finished" is the equivalent of "how many arguments do we get when we do arg1, argn = ...", which thus becomes the number of arguments we want to pass to the chunk. Hope that makes sense; it did make the bridge interface usage simpler to grasp.
The cause of the backtraces could be unrelated to the stack (I tried playing with the stack in openresty during my debugging yesterday), but I am not sure it could be related to that old patch... 🤔

Copy link
Member

@thibaultcha thibaultcha Nov 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also build against a local OpenResty release archive with NGX_BUILD_OPENRESTY=/path/to/src/openresty-releases/openresty-1.21.4.2 NGX_BUILD_CC_OPT='-DNGX_LUA_USE_ASSERT -O0' make; so you can look deep into ngx_http_lua_util.c and what could be going wrong there. Ideally we either only use Lua C API or OpenResty C APIs, but not both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we either only use Lua C API or OpenResty C APIs, but not both.

Yeah, the mix in e077a66 is a bit of an exploration, to understand what's going on bit by bit.

You can also build against a local OpenResty release archive with NGX_BUILD_OPENRESTY=/home/chasum/src/openresty-releases/openresty-1.21.4.2 NGX_BUILD_CC_OPT='-DNGX_LUA_USE_ASSERT -O0' make; so you can look deep into ngx_http_lua_util.c and what could be going wrong there.

Thanks for the pointer — I will do a bit more exploring via the Lua/C API to understand some more of what's going on and then try to switch back to the OpenResty API. That should be useful!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this new version seems to work!

Unfortunately we're not there yet! This version passes our tests in ngx_wasm_module, but then my original FFI-only version did too.

This one fails less when running the Gateway tests, but I'm still getting LuaJIT crashes. At least the failures are somewhat easy to trigger: out of a 23-test Busted file, I'm getting about 4 tests fail every time (but not always the same ones). The failures are LuaJIT segfaults, and some of them produce messages such as:

LuaJIT ASSERT lj_snap.c:140: snapshot_framelinks: broken frame chain

Given that our own @zhongweiy has just got stack frame fixes merged into LuaJIT (LuaJIT/LuaJIT@65c8493) and that the commit history has a bunch of very recent stack fixes, I'm starting to wonder if it's a matter of trying with a more recent LuaJIT... (Not looking forward to resync the OpenResty patched version, but if that's the only way to test it with the Gateway, I guess that's the way to go...)

Good news that the prop bridge approach seems ok so far, although it seems like said bridge has a bug, or the utilization of OpenResty C APIs is incomplete perhaps. Your conclusion too?

Maybe, I'm not sure. e2f2449 commit message has my thoughts on the Lua bridge change.

In any case, the code as it stands has a leak because I'm creating the coroutine with the Lua bridge API, then running it with lua_resume and never cleaning it up (because the Lua bridge run function destroys it on its own and I'm not currently calling it.) But the current state was enough to check the Gateway tests, so I went for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ah, further note: simply changing the lua_resume back to the ngx_wasm_lua_thread_run doesn't work, as it then makes the ngx_wasm_module tests fail.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not looking forward to resync the OpenResty patched version, but if that's the only way to test it with the Gateway, I guess that's the way to go...

So I have backported several dozens of commits from LuaJIT's upstream v2.1 into kong/build/openresty/patches... I blanket-picked all commits from April 10 2023 onwards because the LuaJIT version in OpenResty is called LuaJIT-2.1.-20230410. Many didn't apply cleanly, indicating quite some divergence between the OpenResty and upstream trees (and Mike's recent move to the "rolling release" model contains some pretty strong words against relying on patched versions).

Anyway, imagine my feeling when I finally got a clean build, ran the Gateway tests and they start running green... until one of them fails. Then I run again, and get 4 failures as usual. Then 2 failures. Then 4 again. Meh. It was a glimmer of hope. If anything, it feels like it crashes a little less on average, but it might be a biased perception from someone who wished those 99 patch files were worth something :laughsob:

Do you know if it is possible to run OpenResty nowadays with full vanilla LuaJIT? It's the next thing I thought of trying, which implies diving into Bazel and the OpenResty build system (...or hosting an alternative OpenResty package with the replaced LuaJIT and just changing a single URL in Bazel... :evilgrin: (yes, I'm inventing imaginary emojis)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, to confirm, if I add jit.off() at the beginning of times in kong/init.lua, the suite always passes. But then, it also did with the first (FFI-only) version of this commit. :\

With jit.on(), both the FFI and Lua-bridge versions also trigger pretty much identical failures.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know if it still is possible to run a vanilla LuaJIT in OpenResty...

Also, to confirm, if I add jit.off() at the beginning of times in kong/init.lua, the suite always passes. But then, it also did with the first (FFI-only) version of this commit.

At least this seems to indicate that our stack manipulation isn't the issue, but sure, that it hints to the JIT is quite perplexing.

@hishamhm hishamhm force-pushed the feat/ffi-properties-bridge-wip branch from efe1c31 to e077a66 Compare November 8, 2023 19:22
As far as I can tell this shouldn't be necessary as the coroutine uses
the same global state as its parent, but I was observing crashes in OpenResty
code triggered by `ngx.log` at the log phase, where the `log` variable used
by OpenResty's C code was set to one from already-freed data from a fake
request created during `init_worker`. Forcing the Lua state's `r` to match
the caller's here made the tests pass and Valgrind happy.
This makes other tests pass, but this is very WIP because it doesn't
destroy the thread, etc. Some setup is going wrong when running this
because this change produces segfaults with `r->connection->pool`
being unset.
@hishamhm hishamhm force-pushed the feat/ffi-properties-bridge-wip branch from e077a66 to 7e803d9 Compare November 9, 2023 00:00
@thibaultcha thibaultcha added the pr/work-in-progress PR: Work in progress label Nov 16, 2023
@hishamhm
Copy link
Contributor Author

Closed in favor of #431

@hishamhm hishamhm closed this Nov 17, 2023
@hishamhm hishamhm deleted the feat/ffi-properties-bridge-wip branch November 17, 2023 20:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/work-in-progress PR: Work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants