Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Have FTR correctly wait for Kibana to be green before running tests #89828

Closed
pgayvallet opened this issue Feb 1, 2021 · 15 comments · Fixed by #92568
Closed

Have FTR correctly wait for Kibana to be green before running tests #89828

pgayvallet opened this issue Feb 1, 2021 · 15 comments · Fixed by #92568
Assignees
Labels
Feature:Legacy Removal Issues related to removing legacy Kibana Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Copy link
Contributor

pgayvallet commented Feb 1, 2021

In #89562, we migrated most of the plugins to synchronous lifecycle, and we stopped asynchronously waiting for all plugins when invoking their setup or start methods (we only do for the few remaining async ones now).

Problem

This had the unexpected consequence to cause some failures in FTR tests: some x-pack suites were failing, reporting that the first request against Kibana is returning a 503.

After some investigation, the 503 is coming from the security's auth hook, throws an error because it does not have received the license yet

http.registerAuth(async (request, response, t) => {
if (!license.isLicenseAvailable()) {
this.logger.error('License is not available, authentication is not possible.');
return response.customError({
body: 'License is not available.',
statusCode: 503,
headers: { 'Retry-After': '30' },
});
}

Note that the auth hook is not actively waiting for the license, and is instead relying on a 'cache' based synchronous accessor:

this.licenseSubscription = license$.subscribe((nextRawLicense) => {
rawLicense = nextRawLicense;
});
return {
license: Object.freeze({
isLicenseAvailable: () => rawLicense?.isAvailable ?? false,

The license observable is created by the licensing plugin:

const intervalRefresh$ = timer(0, pollingFrequency);
const { license$, refreshManually } = createLicenseUpdate(intervalRefresh$, this.stop$, () =>
this.fetchLicense(clusterClient)
);

After some tests, the (confirmed) conclusion is that the fact that we are no longer performing async-based operation when loading the plugins causes the node process to have less opportunities to get back to the event loop to process async and queued tasks. One of these tasks being the first emission of the timer that would cause the license updater to fetch the license:

const intervalRefresh$ = timer(0, pollingFrequency);

That, combined with the fact that due to #89562 changes, Kibana starts approximatively between 1 and 2 seconds faster, and the fact that the FTR is just waiting for the http server running message before immediately starting the tests, causes some race condition between the license retrieval and the first http calls performed by FTR

await procs.run('kibana', {
cmd: getKibanaCmd(installDir),
args: filterCliArgs(collectCliArgs(config, options)),
env: {
FORCE_COLOR: 1,
...process.env,
...extendNodeOptions(installDir),
},
cwd: installDir || KIBANA_ROOT,
wait: /http server running/,
});

FWIW, as the Kibana server and the FTR runner are started in the same node process, the delay between the http server running and the first test execution is ~1ms (!)

Solutions

In #89562 we went with the quick workaround to just add a delay between the call to runKibanaServer and runFtr

es = await runElasticsearch({ config, options: opts });
await runKibanaServer({ procs, config, options: opts });
// wait 5 seconds to let kibana startup completes
await silence(log, 5000);
await runFtr({ configPath, options: opts });

If that does the trick, as (imho) 5sec delay, as being way more than what we 'gained' with #89562 changes, this is definitely not a correct solution.

I see two options here:

1. Adapt the auth hook to actively wait for the license before performing its checks

That was tried in fdf73fe, and is confirmed to fix the problem. However this is only 'fixing' this specific part of the code, and we way later encounter more subtle side effects on event loop queue based asynchronous work in plugins initialization logic.

2. Use the /status endpoint to let FTR wait for the server's status

This means two things:

  1. Have the licensing plugin uses core's status API to only report a green status once the initial license fetch has been performed
  2. Have FTR call the /status endpoint and await for a global green status before starting the test suite.

3. Have Kibana server logs a message the first time all plugins/services are green

An alternative to 2.. Same idea, different way to retrieve the information. Note that we still need to ensure that our changes in plugins loading would not cause core to have a 'false' green status before the plugins status updaters emit their initial 'red' status if necessary.

Overall, I think the correct approach would be the rely on the status API, so either 2. or 3., as this is basically the main purpose of the status API+endpoint, and as it is way more generic, in the sense of any other plugin depending on asynchronous task before being truly ready would be able to report their status using the same API (even if atm, only a few, if any, plugins are using the status API).

Also, even if we were to go with 2., adding log messages from core's status service when the overall, or even a specific plugin's status changes, would probably make sense (in which case, 3. would just be less work as we would not need to perform the /status endpoint based check and just rely on the log messages we would be adding anyway)

wdyt?

cc @legrego @spalger @restrry @joshdover

@pgayvallet pgayvallet added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:Legacy Removal Issues related to removing legacy Kibana labels Feb 1, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh
Copy link
Member

Bamieh commented Feb 1, 2021

Option 2 makes the most sense to me. But i am not sure about any drawbacks to having the license plugin to only report a green status to the api

@rudolf
Copy link
Contributor

rudolf commented Feb 1, 2021

I agree that relying on the status API is probably the most reliable way for FTR to know when it can send its requests (using the API feels a bit more elegant than scraping logs).

After some tests, the (confirmed) conclusion is that the fact that we are no longer performing async-based operation when loading the plugins causes the node process to have less opportunities to get back to the event loop to process async and queued tasks.

How long are we blocking the event-loop when looping through plugin's setup / start methods? If this becomes slow enough we could cause some requests to timeout, e.g. the security plugin makes a request to the licensing service during start but we block the event-loop long enough for this request to timeout. Although plugins should retry failed requests anyway... this might cause some instability. If so we can free up the event-loop by using setImmediate in between plugins similar to https://github.com/elastic/kibana/blob/master/src/core/server/saved_objects/migrations/core/migrate_raw_docs.ts#L61-L83 or use a worker thread.

@pgayvallet
Copy link
Contributor Author

But i am not sure about any drawbacks to having the license plugin to only report a green status to the api

Technically, atm given the security auth login handler implementation, not having performed the initial license fetch is causing security to not correctly handle incoming requests (this is the root cause of this whole issue). But I agree that ideally this should remains an implementation details of the security plugin.

How long are we blocking the event-loop when looping through plugin's setup / start methods?

We're not fully blocking, as setup and start are still distinct phases, and the initialization of core services in-between are async. But setup and start of the plugins is taking approx 1sec each.

@spalger
Copy link
Contributor

spalger commented Feb 1, 2021

I'm in favor of 3, because I agree with the assessment that relying on the status API is the most reliable way to know the status of the Kibana server, and the goal of this bit of code is to resolve the promise once the Kibana server is started and ready.

I'm rooting for option 3 over 2 because the @kbn/test code specifically is focused on starting the Kibana server process and doesn't currently know which port it runs on. It's possible we could add config parsing to @kbn/test, or figure out the port from the logs, but observing the log output for a status message is a pretty standard way for something running processes to know when they're ready in my experience. That functionality already exists, so if we could reuse it that would be ideal.

@mshustov
Copy link
Contributor

mshustov commented Feb 1, 2021

I'm rooting for option 3 over 2 because the @kbn/test code specifically is focused on starting the Kibana server process and doesn't currently know which port it runs on. It's possible we could add config parsing to @kbn/test, or figure out the port from the logs, but observing the log output for a status message is a pretty standard way for something running processes to know when they're ready in my experience. That functionality already exists, so if we could reuse it that would be ideal.

I also prefer 3. for simplicity, but worth noting that we will have to implement 2. eventually, if we want to support test cases with multiple Kibana instances.

@pgayvallet
Copy link
Contributor Author

So, I added logging entries when core.status.overall$ observer emits. However, as all plugins are currently considered green at startup, the added log just got displayed a few milliseconds after the http server running message:

server    log   [08:47:12.234] [info][server][Kibana][http] http server running at http://localhost:5601
server    log   [08:47:12.236] [info][status] Kibana status is now available

So only adding this log and having the FTR await for it instead of the http server message would not change much.

I wanted to avoid that, but I guess we will need to also implement the second part of the solution:

Have the licensing plugin uses core's status API to only report a green status once the initial license fetch has been performed

@restrry do you think that would make sense? I don't really like it tbh, because the whole licensing plugin API is based on observable, which means that the plugin is technically green even before having fetched the license, and we are only trying to work around the security plugin's auth hook implementation here, but...

@spalger
Copy link
Contributor

spalger commented Feb 8, 2021

all plugins are currently considered green at startup

This sounds like the problem, relying on the API or the log isn't going to help us if the status doesn't properly represent that the Kibana server isn't actually usable.

@mshustov
Copy link
Contributor

mshustov commented Feb 8, 2021

I agree that the moment when a license is not-fetched-yet, cannot be classified as degraded or unavailable (as per ServiceStatus.level). OTOH, it means there is no way to unambiguously determine that Kibana is ready for interaction with an end-user...What alternatives do we have? FTR performs polling of /status and licensing/info endpoints?

@spalger
Copy link
Contributor

spalger commented Feb 8, 2021

In the legacy status implementation statuses started out uninitialized and relied on plugins to set a red/yellow status synchronously on startup, but any plugin which didn't choose a status on startup would get a green status (the default). Could we do something similar here so that the status isn't available/green prematurely?

@pgayvallet
Copy link
Contributor Author

I agree that the moment when a license is not-fetched-yet, cannot be classified as degraded or unavailable

Yea, I think it's 'fine' to have the licensing plugin be 'not ready' until the license is retrieved, but I don't like the options we have regarding the status it should be until then.

Adding a new status such as 'initializing' may sound overkill, but now that we deprecated asynchronous lifecycle, it could actually make sense to have a plugin notify it's performing some asynchronous initialization.

OTOH, it means there is no way to unambiguously determine that Kibana is ready for interaction with an end-user

Yea, that a valid argument (even if that could also be addressed, I think, by having the security auth hook properly await the license)

What alternatives do we have? FTR performs polling of /status and licensing/info endpoints?

I guess. I think I like this option even less because

  • it's not future proof. What if another plugin have a similar behavior that licensing and its license fetching. We would be adding additional polling from FTR? Doesn't sounds very maintainable.
  • we'll need the FTR to know if it is running against oss or xpack Kibana, as the licensing/info is not (atm) reachable on oss

But OTOH, this wouldn't affect production code, where changing / adding logic on the licensing plugin status does.

to set a red/yellow status synchronously on startup, but any plugin which didn't choose a status on startup would get a green status (the default)

We didn't reinvent the wheel on that one 😄 , that's pretty much what the KP status service does too.

@mshustov
Copy link
Contributor

mshustov commented Feb 9, 2021

Yea, that a valid argument (even if that could also be addressed, I think, by having the security auth hook properly await the license)

That doesn't scale well, there might be several plugins with async initialization and there will be a bunch of await in the code.

Adding a new status such as 'initializing' may sound overkill, but now that we deprecated asynchronous lifecycle, it could actually make sense to have a plugin notify it's performing some asynchronous initialization.

It seems to be a viable option. Let's see whether it plays well with the status service logic.

@pgayvallet
Copy link
Contributor Author

Let's see whether it plays well with the status service logic.

I guess it should go between available and degrated. @joshdover wdyt?

@pgayvallet pgayvallet self-assigned this Feb 10, 2021
@joshdover
Copy link
Contributor

I think the degraded status would be fine for this use case and it matches how I've been thinking about that status level. It indicates that the licensing plugin is 'working' and it's API can be accessed, but that it's not able to serve complete data yet.

This discussion also overlaps with some work I have on the backburner for making the /api/status endpoint available immediately, before the start lifecycle kicks off (#79012). I've been meaning to get this PR up and working again here soon.

If we do both of these changes, we'll need to verify that the overall status reported by the API never transitions to available before the licensing plugin registers its custom status. I'm pretty sure this is the case today because the status for SavedObjects does not turn to available until migrations have completed which doesn't happen until start. This feels like it 'works by coincidence' rather than design, so we may want to consider changing the default initial status for all plugins to degraded (or uninitialized if we think this is necessary).

@pgayvallet
Copy link
Contributor Author

I think the degraded status would be fine for this use case and it matches how I've been thinking about that status level.

In that case, I'm fine just not adding an additional status level.

This feels like it 'works by coincidence' rather than design

In my POC, I only start listening to the status / log status messages at the end of core's start, to let plugins emit their initial status during their setup phase, so I think it would work ok.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Legacy Removal Issues related to removing legacy Kibana Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc
Projects
None yet
7 participants