[Discuss] Should we extend the purpose of the migrator node? #187312

Open
afharo opened this issue Jul 1, 2024 · 11 comments
Labels
discuss · Feature:Migrations · Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)

Comments

@afharo
Member

afharo commented Jul 1, 2024

During startup, Kibana sets up a lot of things: it creates indices and mappings, installs/upgrades Fleet packages, and migrates documents. Most of this setup happens in the ui node role, as we're trying to free the background-tasks nodes from extra, unnecessary load to prioritize stability in tasks and rules.

However, this also means that the ui node can be glitchy/overloaded during startup, as some of these setup processes can use a large amount of memory.

At the moment, our migrator role is only focused on the migration of Saved Objects. Is it worth extending the capabilities of this node role to take care of these setup-after-install/upgrade tasks?

Alternatively, would it make sense to introduce a 4th node role for this purpose?

cc @elastic/kibana-core @lukeelmers @kobelb

@afharo added the discuss, Team:Core, and Feature:Migrations labels on Jul 1, 2024
@jloleysens
Contributor

This sounds promising, but it's possible this could lead to some undesirable things we'd need to solve, like:

  • Migrator nodes crashing instead of UI nodes crashing. Today this would lead to "failed migration" alerts. We'd need to extract a differentiated signal from a migrator node crashing: was it due to migration failures or other issues? Otherwise, detecting actual migration failures will be trickier.
  • Originally, the migrator node was designed (and assumed) to be a non-blocking background step. It's possible that setup steps currently happening in UI nodes are relied on by code serving requests. We'd need to check, as the change in purpose could have design implications.

@afharo
Member Author

afharo commented Jul 2, 2024

++ there are many considerations to take into account. The ones you mentioned, plus others like: what's the signal to claim that all setup tasks are complete so that the migrator can exit the process? At the moment, we control when the SO migration is complete, and we intentionally finish the process. With the current Kibana architecture, where the synchronous plugin lifecycle methods lead to decoupled setup promises, we'd need a new way to know when setup is really complete (or has errored).

Thinking out loud, I wonder if it's worth considering a new core API where plugins/services can register all their setup methods.
Core could handle errors and perform retries on ES connectivity failures. This method could receive a "runsOn" (name TBD) option to declare whether the setup is needed for UI/BT/none (i.e. it can run on a migrator).
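
Something like this, as a very rough sketch (every name below, including the setupTasks service, registerSetupTask and runsOn, is a placeholder; ElasticsearchClient is just an example of a dependency Core could inject):

```ts
import type { ElasticsearchClient } from '@kbn/core/server';

// Hypothetical API sketch: names and shapes are all TBD.
interface SetupTaskDefinition {
  id: string;
  /**
   * Which node roles need this setup to have completed before serving their workload.
   * 'none' would mean it can run entirely on a migrator/setup node.
   */
  runsOn: 'ui' | 'background_tasks' | 'none';
  /** Core would await this, surface errors, and retry on ES connectivity failures. */
  run: (deps: { esClient: ElasticsearchClient }) => Promise<void>;
}

interface SetupTasksServiceSetup {
  registerSetupTask(task: SetupTaskDefinition): void;
}

// From a plugin's setup() lifecycle method (coreSetup.setupTasks is hypothetical):
declare const coreSetup: { setupTasks: SetupTasksServiceSetup };

coreSetup.setupTasks.registerSetupTask({
  id: 'fleet:install-prepackaged-assets', // illustrative id
  runsOn: 'none',
  run: async ({ esClient }) => {
    // e.g. create indices, install/upgrade packages, backfill documents...
  },
});
```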

@pgayvallet
Contributor

pgayvallet commented Jul 2, 2024

However, this also means that the ui node can be glitchy/overloaded during startup, as some of these setup processes can use a large amount of memory.

Just to get some context: are we thinking of anything specific / a problem we're already observing in production, or are we just trying to be proactive regarding this risk?

I wonder if it's worth considering a new core API where plugin/services can register all their setup methods.

Isn't it something already solved by taskManager?

@afharo
Member Author

afharo commented Jul 2, 2024

Just to get some context, are we thinking of anything specific / a problem we're already observing in production, or are we just trying to be proactive regardin this risk?

Known issue in our Serverless environment: Fleet installing packages sometimes OOMs our UI nodes (incident 352)

I wonder if it's worth considering a new core API where plugin/services can register all their setup methods.

Isn't it something already solved by taskManager?

Interesting take... I don't know the answer to this, tbh...

@pgayvallet
Contributor

Known issue in our Serverless environment: Fleet installing packages sometimes OOMs our UI nodes (incident 352)

Okay, so then I have to ask: how would what is proposed in this issue help in any way? It's not like the bg task nodes, or the migrator pod, have more allocated memory, or a lower minimum memory usage, than the UI nodes, or that it's less of a problem to have those node types OOM instead of the UI ones, is it? I'm not sure moving responsibilities around can really solve anything when some magnificently designed piece of engineering requires more memory than what we're ready to allocate on our serverless instances.

From my understanding, the real solution to the problem was to have this package installation require less memory, either by having smaller packages, by "streaming" the installation of the assets, or by any other technical approach that could solve the problem.

Now don't get me wrong - I'm all in for moving initialization logic out of the UI nodes. We've known for a while that it will be a prerequisite for multi-tenant Kibana, as we will have to init the cluster "remotely". But I doubt this is our primary intention here.

@jloleysens
Contributor

jloleysens commented Jul 4, 2024

Now don't get me wrong - I'm all in for moving initialization logic out of the UI nodes. We've known for a while that it will be a prerequisite for multi-tenant Kibana, as we will have to init the cluster "remotely".

Yeah, I share these concerns (per my original comment). I like the idea of moving code out of Kibana (this is indeed the promising part I was talking about), but it seems more like the path to "solution independence" will achieve this.

@afharo
Member Author

afharo commented Jul 8, 2024

It's not like the bg task nodes, or the migrator pod, have more allocated memory, or a lower minimum memory usage, than the UI nodes, or that it's less of a problem to have those node types OOM instead of the UI ones, is it?

IMO, bumping the allocated memory for a migrator pod is way cheaper than doing so for UI/background-tasks nodes. The fact that it's a job that eventually finishes and releases its resources makes it less of a problem to grant it more resources temporarily.

On top of that, our Serverless offering already supports ZDT, so, IMO, UI nodes should be able to serve traffic without being affected by the migration/setup phases.

Finally, I'm kind of worried that today we're overlooking the fact that the migrations may occur way after the setup has completed in the UI nodes. I wonder if we have any setup logic today that relies on the migrations being completed, and whether we're just being lucky (or blind) that it's not breaking anything so far.

@jloleysens
Contributor

jloleysens commented Jul 9, 2024

releases its resources makes it less of a problem to grant it more resources temporarily

That's true, but my understanding of the concern is that increasing memory for these setup processes may be affording/hiding things that should be fixed in code. There may be cases where we really do need more memory, and I agree that migrator nodes are one option for isolating such work; it might be worth checking with the control plane team what the options are.

I wonder if we have any setup logic today that relies on the migrations being completed

It's possible, but that would break a principle of our current ZDT design and should be fixed. Is there a case you have in mind?

@lukeelmers
Member

I agree with Pierre: if we want to have a project around moving init logic outside of ui nodes, let's have that discussion (my suspicion is that handling such logic in the existing migrator nodes is probably not the way to do this).

But the recent issues with memory pressure from package installation should not be our main motivation for doing this... we should be fixing that problem at the fleet & security solution level. Moving that work outside of Kibana makes the pain go away temporarily, but it doesn't address the root cause.

@afharo
Member Author

afharo commented Nov 6, 2024

Sorry for bringing this up again (I was going through old notifications).

Another motivation is that UI nodes autoscale to serve HTTP traffic. If a new node needs to run through setup checks that the UI doesn't require, that's time wasted when it matters the most.

Also: I agree that excessive memory usage should be fixed wherever the leak exists. However, I think it's still a good position to be able to (temporarily) bump the memory of jobs, be they migrator or setup jobs, while keeping leaner UI & Background Tasks pods.

@pgayvallet
Contributor

Another motivation is that UI nodes autoscale to serve HTTP traffic. If a new node needs to run through setup checks that the UI doesn't require, that's time wasted when it matters the most.

So, for some GenAI-related feature (that PR), I recently had to go through that exact exercise: I have a potentially resource-intensive task I need to run at startup (and also later on demand, but that's not the focus here), and I don't want to run it on UI nodes for obvious reasons.

In the end, I used task manager for that, and it worked relatively fine (well, I think 😅). I schedule the task during start, and it will eventually run, performing the "initialization logic". The only (well, main) drawback is that I need to retrieve/compute the outcome of that task "manually", as the TM APIs don't really allow keeping track of a task execution after it has run. But that aside, it served the purpose quite well.
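
Roughly, the shape of it (a simplified sketch: the task type and function names are illustrative, and the contract types/import path are written from memory):

```ts
import type {
  TaskManagerSetupContract,
  TaskManagerStartContract,
} from '@kbn/task-manager-plugin/server';

const TASK_TYPE = 'myPlugin:initialize-resources'; // illustrative task type

// During the plugin's setup(): register the task definition.
export function registerInitTask(taskManager: TaskManagerSetupContract) {
  taskManager.registerTaskDefinitions({
    [TASK_TYPE]: {
      title: 'My plugin resource initialization',
      createTaskRunner: () => ({
        run: async () => {
          // resource-intensive initialization logic, executed on a background-tasks node
        },
      }),
    },
  });
}

// During the plugin's start(): schedule it. `ensureScheduled` is idempotent for a given id,
// so every Kibana node can call it and the task still only gets scheduled once.
export async function scheduleInitTask(taskManager: TaskManagerStartContract) {
  await taskManager.ensureScheduled({
    id: TASK_TYPE, // single instance
    taskType: TASK_TYPE,
    params: {},
    state: {},
  });
}
```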

So I was wondering: what more would we gain from doing that on the migrator node / why wouldn't that "run as a TM task at startup" approach be sufficient? I hear the memory/resources argument, but I feel like any reasonable task should be able to run within the resources allocated to our TM nodes, shouldn't it (especially now that we can define a "cost" for tasks)?

And if it could be sufficient, then maybe we should be thinking of some "thin" wrapper around TM exposed from Core, specialized in running startup tasks and delegating to TM under the hood (of course, we would then need to figure out how we can register/use/re-expose TM from Core, as we did for, say, security).
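
Something along these lines, i.e. a rough sketch with purely illustrative names, delegating under the hood to the TM calls shown above:

```ts
// Hypothetical Core-owned service (name and shape TBD), specialized in one-shot startup work.
interface CoreStartupTasksSetup {
  /**
   * Core would register the TM task definition, `ensureScheduled` it on start,
   * and expose the outcome so callers don't have to track the task execution manually.
   */
  register(definition: { id: string; run: () => Promise<void> }): void;
}

// Usage from a plugin's setup():
declare const coreSetup: { startupTasks: CoreStartupTasksSetup };

coreSetup.startupTasks.register({
  id: 'myPlugin:warm-up-resources', // illustrative id
  run: async () => {
    // resource-intensive initialization, executed on a background-tasks node via TM
  },
});
```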

WDYT?
