-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Have FTR correctly wait for Kibana to be green before running tests #89828
Comments
Pinging @elastic/kibana-core (Team:Core) |
Option 2 makes the most sense to me. But i am not sure about any drawbacks to having the license plugin to only report a |
I agree that relying on the status API is probably the most reliable way for FTR to know when it can send its requests (using the API feels a bit more elegant than scraping logs).
How long are we blocking the event-loop when looping through plugin's setup / start methods? If this becomes slow enough we could cause some requests to timeout, e.g. the security plugin makes a request to the licensing service during start but we block the event-loop long enough for this request to timeout. Although plugins should retry failed requests anyway... this might cause some instability. If so we can free up the event-loop by using |
Technically, atm given the security
We're not fully blocking, as setup and start are still distinct phases, and the initialization of core services in-between are async. But |
I'm in favor of 3, because I agree with the assessment that relying on the status API is the most reliable way to know the status of the Kibana server, and the goal of this bit of code is to resolve the promise once the Kibana server is started and ready. I'm rooting for option 3 over 2 because the |
I also prefer |
So, I added logging entries when
So only adding this log and having the FTR await for it instead of the http server message would not change much. I wanted to avoid that, but I guess we will need to also implement the second part of the solution:
@restrry do you think that would make sense? I don't really like it tbh, because the whole licensing plugin API is based on observable, which means that the plugin is technically green even before having fetched the license, and we are only trying to work around the security plugin's auth hook implementation here, but... |
This sounds like the problem, relying on the API or the log isn't going to help us if the status doesn't properly represent that the Kibana server isn't actually usable. |
I agree that the moment when a license is not-fetched-yet, cannot be classified as |
In the legacy status implementation statuses started out uninitialized and relied on plugins to set a |
Yea, I think it's 'fine' to have the licensing plugin be 'not ready' until the license is retrieved, but I don't like the options we have regarding the status it should be until then. Adding a new status such as 'initializing' may sound overkill, but now that we deprecated asynchronous lifecycle, it could actually make sense to have a plugin notify it's performing some asynchronous initialization.
Yea, that a valid argument (even if that could also be addressed, I think, by having the security auth hook properly await the license)
I guess. I think I like this option even less because
But OTOH, this wouldn't affect production code, where changing / adding logic on the
We didn't reinvent the wheel on that one 😄 , that's pretty much what the KP status service does too. |
That doesn't scale well, there might be several plugins with async initialization and there will be a bunch of
It seems to be a viable option. Let's see whether it plays well with the status service logic. |
I guess it should go between |
I think the This discussion also overlaps with some work I have on the backburner for making the If we do both of these changes, we'll need to verify that the overall status reported by the API never transitions to |
In that case, I'm fine just not adding an additional status level.
In my POC, I only start listening to the status / log status messages at the end of core's start, to let plugins emit their initial status during their setup phase, so I think it would work ok. |
In #89562, we migrated most of the plugins to synchronous lifecycle, and we stopped asynchronously waiting for all plugins when invoking their
setup
orstart
methods (we only do for the few remaining async ones now).Problem
This had the unexpected consequence to cause some failures in FTR tests: some x-pack suites were failing, reporting that the first request against Kibana is returning a
503
.After some investigation, the 503 is coming from the
security
'sauth
hook, throws an error because it does not have received the license yetkibana/x-pack/plugins/security/server/authentication/authentication_service.ts
Lines 68 to 76 in c13d07d
Note that the
auth
hook is not actively waiting for the license, and is instead relying on a 'cache' based synchronous accessor:kibana/x-pack/plugins/security/common/licensing/license_service.ts
Lines 30 to 36 in c13d07d
The license observable is created by the
licensing
plugin:kibana/x-pack/plugins/licensing/server/plugin.ts
Lines 153 to 157 in 37d965a
After some tests, the (confirmed) conclusion is that the fact that we are no longer performing async-based operation when loading the plugins causes the node process to have less opportunities to get back to the event loop to process async and queued tasks. One of these tasks being the first emission of the timer that would cause the license updater to fetch the license:
kibana/x-pack/plugins/licensing/server/plugin.ts
Line 153 in 37d965a
That, combined with the fact that due to #89562 changes, Kibana starts approximatively between 1 and 2 seconds faster, and the fact that the FTR is just waiting for the
http server running
message before immediately starting the tests, causes some race condition between the license retrieval and the first http calls performed by FTRkibana/packages/kbn-test/src/functional_tests/lib/run_kibana_server.js
Lines 32 to 42 in 170a295
FWIW, as the Kibana server and the FTR runner are started in the same node process, the delay between the
http server running
and the first test execution is ~1ms (!)Solutions
In #89562 we went with the quick workaround to just add a delay between the call to
runKibanaServer
andrunFtr
kibana/packages/kbn-test/src/functional_tests/tasks.js
Lines 96 to 100 in 320dd5f
If that does the trick, as (imho) 5sec delay, as being way more than what we 'gained' with #89562 changes, this is definitely not a correct solution.
I see two options here:
1. Adapt the
auth
hook to actively wait for the license before performing its checksThat was tried in fdf73fe, and is confirmed to fix the problem. However this is only 'fixing' this specific part of the code, and we way later encounter more subtle side effects on event loop queue based asynchronous work in plugins initialization logic.
2. Use the
/status
endpoint to let FTR wait for the server's statusThis means two things:
licensing
plugin uses core'sstatus
API to only report agreen
status once the initial license fetch has been performed/status
endpoint and await for a globalgreen
status before starting the test suite.3. Have Kibana server logs a message the first time all plugins/services are green
An alternative to
2.
. Same idea, different way to retrieve the information. Note that we still need to ensure that our changes in plugins loading would not causecore
to have a 'false'green
status before the plugins status updaters emit their initial 'red' status if necessary.Overall, I think the correct approach would be the rely on the status API, so either
2.
or3.
, as this is basically the main purpose of thestatus
API+endpoint, and as it is way more generic, in the sense of any other plugin depending on asynchronous task before being truly ready would be able to report their status using the same API (even if atm, only a few, if any, plugins are using thestatus
API).Also, even if we were to go with
2.
, adding log messages from core's status service when the overall, or even a specific plugin's status changes, would probably make sense (in which case,3.
would just be less work as we would not need to perform the/status
endpoint based check and just rely on the log messages we would be adding anyway)wdyt?
cc @legrego @spalger @restrry @joshdover
The text was updated successfully, but these errors were encountered: