Feature: React to node changes #182

Open
js-9 opened this issue Jul 22, 2022 · 5 comments


js-9 commented Jul 22, 2022

Ideally I would not need to refresh images periodically at all, since we tag our images. What I do need is for images to be cached on new nodes as they join the cluster. This does not happen if the image refresh time is set to 0s to disable periodic refreshing.

senthilrch self-assigned this on Nov 10, 2022
senthilrch added the "feature" (New feature) label on Nov 10, 2022
senthilrch added this to the v0.11.0 milestone on Nov 10, 2022
senthilrch changed the title from "react to node changes" to "Feature: React to node changes" on Nov 11, 2022

senthilrch commented Nov 11, 2022

Jotting down some of my thoughts around implementing this feature:

  • The kube-fledged controller already has a Node informer cache. Today it is used only to list nodes, not to track node lifecycle.
  • How do we detect when a new node is added to the cluster? Is it the kubelet that creates the Node resource? I assume the node's status is NotReady when it is created and is later updated to Ready?
  • If a new node is added while an ImageCache operation is ongoing (e.g. auto-refresh, update), we need a mechanism to queue the node and process it once that operation has completed.
  • If auto-refresh is enabled, should we ignore the new node or act upon it?
  • What should the status/reason/message of the ImageCache resource be when reacting to the addition of a new node?
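
For illustration, the detection and queueing questions above could be sketched roughly as follows. This is a stdlib-only Python model, not kube-fledged code (the controller itself is written in Go). `NodeTracker`, `busy`, `pending` and `dispatched` are hypothetical names, and nodes are plain dicts shaped like the Kubernetes Node object (`metadata.name`, `status.conditions`). The sketch assumes a node should be acted on the first time its Ready condition becomes "True", and that nodes arriving during another operation are simply deferred.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def is_ready(node: dict) -> bool:
    """True if the node carries a Ready condition whose status is "True"."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


@dataclass
class NodeTracker:
    """Hypothetical helper that decides when a node should get its images cached."""

    busy: bool = False  # True while a refresh/update operation is running
    pending: deque = field(default_factory=deque)  # nodes waiting for that operation
    dispatched: list = field(default_factory=list)  # nodes handed to the caching step

    def on_node_event(self, old: Optional[dict], new: dict) -> None:
        """Handle an add (old is None) or update event for a node."""
        was_ready = old is not None and is_ready(old)
        if was_ready or not is_ready(new):
            return  # ignore heartbeats and nodes that are not Ready yet
        name = new["metadata"]["name"]
        if self.busy:
            if name not in self.pending:
                self.pending.append(name)  # defer until the operation ends
        else:
            self.dispatched.append(name)

    def operation_finished(self) -> None:
        """Mark the running operation as done and release queued nodes in order."""
        self.busy = False
        while self.pending:
            self.dispatched.append(self.pending.popleft())
```

In a real controller the two methods would be driven by the Node informer's add/update callbacks and by the end of an ImageCache operation. Whether to act on new nodes while auto-refresh is enabled is left open here, as in the list above.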


ChevronTango commented Nov 25, 2022

This feature is hugely important to our project. Our new nodes need to be able to start caching images immediately on launch.

ChevronTango commented

> Jotting down some of my thoughts around implementing this feature:
>
>   • The kube-fledged controller already has a Node informer cache. Today it is used only to list nodes, not to track node lifecycle.
>   • How do we detect when a new node is added to the cluster? Is it the kubelet that creates the Node resource? I assume the node's status is NotReady when it is created and is later updated to Ready?
>   • If a new node is added while an ImageCache operation is ongoing (e.g. auto-refresh, update), we need a mechanism to queue the node and process it once that operation has completed.
>   • If auto-refresh is enabled, should we ignore the new node or act upon it?
>   • What should the status/reason/message of the ImageCache resource be when reacting to the addition of a new node?

Thinking about my use case, any delay in caching images would be undesirable, so I'd be keen to see a one-off job created as soon as a new node is detected. Once this job has completed, whether it succeeds or fails, maybe add a label to the node so that the controller includes it in regular operations going forward.

I wouldn't treat the node being unschedulable as a special case here, as that could happen to any node at any time, so it should be something you cater for at all times anyway.

As for the status message, a quick one-off message should be fine. If users are doing any logging, then hopefully they are tracking and logging all messages on the resource, and not just the latest.
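
A rough sketch of that one-off flow, in stdlib-only Python and not taken from kube-fledged: the label key `example.com/images-cached`, the function names, and the `pull` callback (standing in for whatever creates the image pull job) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical label key; kube-fledged does not define this label.
CACHED_LABEL = "example.com/images-cached"


def cache_new_node(
    node: dict, images: List[str], pull: Callable[[str, str], None]
) -> Dict[str, str]:
    """Run a one-off pull of every image on a new node, then label the node.

    The label is applied whether the pulls succeed or fail, so the node is
    picked up by regular operations afterwards either way.
    """
    name = node["metadata"]["name"]
    results: Dict[str, str] = {}
    for image in images:
        try:
            pull(name, image)
            results[image] = "succeeded"
        except Exception as exc:  # record the failure and keep going
            results[image] = f"failed: {exc}"
    node["metadata"].setdefault("labels", {})[CACHED_LABEL] = "true"
    return results


def nodes_for_regular_refresh(nodes: List[dict]) -> List[str]:
    """Names of nodes that already went through the one-off job."""
    return [
        n["metadata"]["name"]
        for n in nodes
        if n["metadata"].get("labels", {}).get(CACHED_LABEL) == "true"
    ]
```

The returned per-image results could feed the one-off status message on the ImageCache resource.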


ChevronTango commented Nov 25, 2022

The other approach, which would be a big change in strategy, would be for each node to have a separate timer, rather than all the jobs being created at once. That is arguably good from a network bandwidth perspective, but you'd need something like a daemonset to have the independence to time each node. It wouldn't answer your question about the status message on the cache resource, though, unless you deliberately listed each node's status independently as a separate line in the message.
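
To make the per-node timer idea concrete, here is a minimal stdlib-only Python sketch. `PerNodeSchedule` and `status_message` are hypothetical names, times are plain numbers of seconds, and the model assumes each node's first refresh is due the moment it joins. It says nothing about how a daemonset would actually drive this.

```python
import heapq
from typing import Dict, List, Tuple


class PerNodeSchedule:
    """Hypothetical scheduler in which every node has its own refresh timer."""

    def __init__(self, interval: float) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        self._heap: List[Tuple[float, str]] = []  # (next due time, node name)

    def add_node(self, name: str, joined_at: float) -> None:
        # A new node is due immediately, independent of all other nodes.
        heapq.heappush(self._heap, (joined_at, name))

    def pop_due(self, now: float) -> List[str]:
        """Return nodes whose timer has fired and re-arm each one from `now`."""
        due: List[str] = []
        while self._heap and self._heap[0][0] <= now:
            _, name = heapq.heappop(self._heap)
            due.append(name)
            heapq.heappush(self._heap, (now + self.interval, name))
        return due


def status_message(node_status: Dict[str, str]) -> str:
    """One line per node, sorted by node name."""
    return "\n".join(f"{name}: {node_status[name]}" for name in sorted(node_status))
```

Because each node is re-armed relative to its own last refresh, pulls spread out over time instead of starting together, which is the bandwidth benefit mentioned above.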

gaocegege commented

@ChevronTango We implemented a simple prototype for this:

https://github.com/tensorchord/kube-fledged
