
Agents should keep trying to send events to the principal until the principal acknowledges that the event has been processed and persisted to the backend. #117

Open
jgwest opened this issue Jul 10, 2024 · 1 comment
Labels: enhancement (New feature or request)
Milestone: v0.1.0

Comments

@jgwest
Member

jgwest commented Jul 10, 2024

At present, there are a couple of issues with the reliability of event communication between the principal and agent:

  • If the agent is unable to send a message to the principal, the agent will skip the message (not retry it).
    • Thus, the principal can get out of sync with agents for significant periods of time (depending on how often new events are generated for that Application).
    • For example, a very quiet Application (manual sync, with few K8s resource changes) may simply remain invisible to the principal if the principal misses the last creation/update event.
  • Even if the principal successfully receives a message from the agent, there is no guarantee that the message will be processed.
    • For example, if the principal container restarts, any unprocessed messages in the queue will be permanently lost.

In the GitOps Service, we solved this using a queue stored in an RDBMS (the 'Operations' table). The general algorithm is the same here, although the specifics are slightly different (because the principal/agent does not persist queue entries to disk, unlike an RDBMS).

To solve both these problems, agent <-> principal communication can work as follows:

  • A) When an agent event occurs, queue it to be sent to the principal, replacing any previously waiting events for that resource (see the sketch after this list):

    • If an older event for the same resource (e.g. Application) is already waiting, then replace the old resource event with the new one in the queue:
      • Example: for a resource A, if 3 update events have occurred, 1, 2, 3, and none of them have been sent, we only need to send 3: update 3 will include any updates that were made in 1 and 2.
    • That is, since each event is self-contained, and newer events necessarily include all changes from older events, there is no point in sending (or even storing in memory) old events.
    • Since we only ever need to keep track of the latest event for each resource, there is minimal memory cost.
  • B) Do not remove an event from the agent queue until the principal indicates that it has been processed AND stored in the principal backend:

    • When an agent sends a message to the principal, the principal must acknowledge that it has been processed and persisted.
    • The agent must wait for the principal to acknowledge that the event has been processed/persisted, before the agent removes it from the agent queue.
    • Why? This solves the following problem:
      • Agent sends events 1, 2, 3, 4, 5 to principal
      • Principal receives all events: 1, 2, 3, 4, 5
      • Principal processes events 1, 2, but then OOMs and the container is restarted.
        • Because the principal does not use persistent storage for queue events, events 3, 4, and 5 are lost.
      • Principal restarts, but the event queue is empty due to restart.
      • Agent does not resend events 3, 4, 5, because it already sent them.
      • Thus the principal permanently misses events 3, 4, and 5.
  • C) On agent startup, the agent must send the current state of all resources (presumably as update events)

    • Similar to the way that, on startup, Kubernetes controllers reconcile every existing resource on the cluster that they watch, the agent must likewise send update events for all resources.
    • This ensures that the principal is in sync with the agent whenever the agent restarts.
    • (We may already be doing this, just based on the way that informers work, but I thought I would make this part explicit)

These three behaviours work together to ensure that the eventual-consistency gap between agent and principal is as small as possible, and that no state changes are missed, even in the case of network instability or agent/principal container restarts.
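For illustration, here is a rough Go sketch of the agent-side queue described in (A) and (B). All names in it (ResourceEvent, EventQueue, Enqueue, Flush, SendFunc) are hypothetical and do not reflect the actual agent/principal types, wire format, or transport; the key assumption is that the send function returns nil only once the principal has confirmed the event was processed and persisted.

```go
package agentqueue

import (
	"sync"
	"time"
)

// ResourceEvent carries the latest self-contained state for one resource.
// (Hypothetical type for illustration only.)
type ResourceEvent struct {
	ResourceKey string // e.g. "argocd/my-app"
	Generation  int64  // bumped every time the queue entry is replaced
	Payload     []byte // full state; a newer event supersedes any older one
}

// EventQueue keeps at most one pending event per resource (behaviour A).
type EventQueue struct {
	mu      sync.Mutex
	pending map[string]ResourceEvent
}

func NewEventQueue() *EventQueue {
	return &EventQueue{pending: make(map[string]ResourceEvent)}
}

// Enqueue replaces any older, not-yet-acknowledged event for the same resource.
func (q *EventQueue) Enqueue(key string, payload []byte) {
	q.mu.Lock()
	defer q.mu.Unlock()
	gen := q.pending[key].Generation + 1
	q.pending[key] = ResourceEvent{ResourceKey: key, Generation: gen, Payload: payload}
}

// SendFunc is a placeholder for the real transport. It is assumed to return
// nil only after the principal confirms the event was processed AND persisted.
type SendFunc func(ev ResourceEvent) error

// Flush is intended to run in its own goroutine. It retries every pending
// event until the principal acknowledges it (behaviour B). An event is removed
// only after a successful ack, and only if it has not been replaced by a newer
// event for the same resource in the meantime.
func (q *EventQueue) Flush(send SendFunc, retryDelay time.Duration) {
	for {
		q.mu.Lock()
		batch := make([]ResourceEvent, 0, len(q.pending))
		for _, ev := range q.pending {
			batch = append(batch, ev)
		}
		q.mu.Unlock()

		for _, ev := range batch {
			if err := send(ev); err != nil {
				// No ack: leave the event queued; it will be retried on the
				// next pass (or replaced by a newer event for the resource).
				continue
			}
			q.mu.Lock()
			if cur, ok := q.pending[ev.ResourceKey]; ok && cur.Generation == ev.Generation {
				delete(q.pending, ev.ResourceKey)
			}
			q.mu.Unlock()
		}

		time.Sleep(retryDelay)
	}
}
```

Because only the newest event per resource is retained, memory use stays bounded by the number of resources, and a failed or unacknowledged send only delays delivery of the latest state rather than losing it.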

@jannfis
Collaborator

jannfis commented Jul 10, 2024

Thanks. This makes total sense to me. C is already tracked in #94 and can be refined there.

@jannfis jannfis added the enhancement New feature or request label Jul 10, 2024
@jannfis jannfis added this to the v0.1.0 milestone Jul 10, 2024