[YUNIKORN-2837] Log & Send Events, Improve logging #957
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff            @@
##           master     #957    +/-  ##
=======================================
  Coverage   80.84%   80.85%
=======================================
  Files          97       97
  Lines       12512    12517     +5
=======================================
+ Hits        10115    10120     +5
  Misses       2126     2126
  Partials      271      271
☔ View full report in Codecov by Sentry.
I'm -1 on this entire approach (see comments on JIRA for a detailed explanation).
I'm also -1, we need to be careful about what we send & how often.
pkg/scheduler/objects/allocation.go (Outdated)
func (a *Allocation) setPreemptionPreConditionsCheckPassed() {
	a.Lock()
	defer a.Unlock()
	if a.preemptionPreConditionsCheckFailed {
		a.preemptionPreConditionsCheckFailed = false
		a.askEvents.SendPreemptionPreConditionsCheckPassed(a.allocationKey, a.applicationID, a.allocatedResource)
	}
}

func (a *Allocation) setPreemptionPreConditionsCheckFailed() {
	a.Lock()
	defer a.Unlock()
	if !a.preemptionPreConditionsCheckFailed {
		a.preemptionPreConditionsCheckFailed = true
		a.askEvents.SendPreemptionPreConditionsCheckFailed(a.allocationKey, a.applicationID, a.allocatedResource)
	}
}

func (a *Allocation) setPreemptionQueueGuaranteesCheckPassed() {
	a.Lock()
	defer a.Unlock()
	if a.preemptionQueueGuaranteesCheckFailed {
		a.preemptionQueueGuaranteesCheckFailed = false
		a.askEvents.SendPreemptionQueueGuaranteesCheckPassed(a.allocationKey, a.applicationID, a.allocatedResource)
	}
}

func (a *Allocation) setPreemptionQueueGuaranteesCheckFailed() {
	a.Lock()
	defer a.Unlock()
	if !a.preemptionQueueGuaranteesCheckFailed {
		a.preemptionQueueGuaranteesCheckFailed = true
		a.askEvents.SendPreemptionQueueGuaranteesCheckFailed(a.allocationKey, a.applicationID, a.allocatedResource)
	}
}
These are all REQUEST events, which means they will be sent to the respective pods.
I think the only message which might be useful is an indication that a particular pod successfully triggered preemption. Or the other way around: we send a message to the terminated pod that it's been preempted to make room for pod X.
Everything else (guarantees passed/failed, precondition passed/failed) is just too much noise which should not be sent to K8s as a pod event.
The headroom/quota events are different because they're related to YK configuration. Hitting the quota is useful feedback for the user, because they'll know why a particular pod cannot be scheduled.
I agree we should not send events except in these cases:
- Request pod: "Preempting pod {namespace}/{pod} to make room" (there could be multiple of these if we preempt multiple pods)
- Victim pods: "Preempting pod to make room for {namespace}/{pod}" (only one of these per victim)
This allows users to see that their pod triggered preemption, or that their pod was preempted and by whom.
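A minimal sketch of how those two messages could be built, assuming hypothetical helper and pod names for illustration only (this is not the YuniKorn event API):
package main

import "fmt"

// preemptionMessages builds the two event texts suggested above: one for the
// requesting pod (repeated per victim) and one for each victim pod (sent once).
func preemptionMessages(requestPod, victimPod string) (forRequest, forVictim string) {
	// Sent to the request pod; there may be several of these if multiple pods are preempted.
	forRequest = fmt.Sprintf("Preempting pod %s to make room", victimPod)
	// Sent to the victim pod; only one per victim.
	forVictim = fmt.Sprintf("Preempting pod to make room for %s", requestPod)
	return forRequest, forVictim
}

func main() {
	reqMsg, victimMsg := preemptionMessages("ns-a/request-pod", "ns-b/victim-pod")
	fmt.Println(reqMsg)
	fmt.Println(victimMsg)
}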
As of now, there is only one log line, "Reserving node for ask after preemption", logged at the end of the preemption process. Other than this, we don't have any info.
We need the below info to debug preemption related issues:
- How do we know whether preemption has even been attempted?
- If attempted, is there any problem in passing through the important steps?
- If attempted and successful, how do we know? Answer: the above log line helps
I went through all these questions while trying to find the reason for test failures logged in https://issues.apache.org/jira/browse/YUNIKORN-2808.
I think the only message which might be useful is an indication that a particular pod successfully triggered preemption.
Yes, it definitely helps; we need something like this. I also had a similar thought, but doing it that way would always send an event. Hence, I followed the other usual model: 1. Send only one event, and only when the check fails; it does not repeat. 2. Send only one event when it succeeds (only if the first event happened); it does not repeat.
Request pod: "Preempting pod {namespace}/{pod} to make room" (there could be multiple of these if we preempt multiple pods)
Victim pods: "Preempting pod to make room for {namespace}/{pod}" (only one of these per victim)
Yes, it helps, but only in the successful case and only at the end.
We must not log for every attempt as this happens far too frequently. It will overwhelm the system. Currently in the scheduler we only log when action is taken or the state of the system changes.
Pod events definitely should only be sent once, and only when action is taken.
I think an acceptable approach is recording stuff like the allocation log for each request. That looks reasonable to me. We can have a type like AllocationLog and define fixed causes for why the preemption failed:
- "Preconditions checks failed" (preemptor.CheckPreconditions() returns false)
- "No guarantee to free up resources" (p.checkPreemptionQueueGuarantees() returns false)
- "Preemption shortfall" (p.ask.GetAllocatedResource().StrictlyGreaterThanOnlyExisting() returns false)
Every time something fails, you increase a counter for it. You also count the number of total attempts.
So I'm thinking something like:
// Mutable fields which need protection
allocated bool
allocLog map[string]*AllocationLogEntry
preemptionLog map[string]*AllocationLogEntry // new field
preemptionAttempts int64 // new field
preemptionTriggered bool
preemptCheckTime time.Time
...
var (
	ErrPreemptionPrecondFailed = errors.New("Preconditions checks failed")
	ErrPreemptionNoGuarantee   = errors.New("No guarantee to free up resources")
	ErrPreemptionShortfall     = errors.New("Preemption shortfall")
)
// OR perhaps better w/ typed errors
type PreemptionError int

const (
	ErrPreemptionPrecondFailed PreemptionError = iota
	ErrPreemptionNoGuarantee
	ErrPreemptionShortfall
)
Then in application.go:
if fullIterator != nil {
	request.IncPreemptionAttempts()
	result, err := sa.tryPreemption(headRoom, preemptionDelay, request, fullIterator, false)
	if err == nil {
		// preemption occurred, and possibly reservation
		return result
	}
	request.LogPreemptionFailure(err)
}
Obviously you need to make sure that problems propagate back so you can log them. Then this whole thing can be exposed on the REST API just like the allocation log.
It's obviously not as detailed as a separate log entry printed to stdout, but I see this as a good compromise.
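A rough sketch of that compromise, assuming hypothetical Request, IncPreemptionAttempts, and LogPreemptionFailure names taken from the snippet above rather than the actual YuniKorn types:
package main

import (
	"errors"
	"fmt"
	"sync"
)

// AllocationLogEntry mirrors the idea of the existing allocation log:
// a fixed message plus a counter of how often that cause was hit.
type AllocationLogEntry struct {
	Message string
	Count   int64
}

// Request stands in for the scheduler request object; only the fields
// relevant to the preemption log are shown.
type Request struct {
	sync.Mutex
	preemptionAttempts int64
	preemptionLog      map[string]*AllocationLogEntry
}

// IncPreemptionAttempts counts every preemption attempt for this request.
func (r *Request) IncPreemptionAttempts() {
	r.Lock()
	defer r.Unlock()
	r.preemptionAttempts++
}

// LogPreemptionFailure records one failure cause by bumping its counter
// instead of emitting a new log line or pod event each time.
func (r *Request) LogPreemptionFailure(err error) {
	r.Lock()
	defer r.Unlock()
	if r.preemptionLog == nil {
		r.preemptionLog = make(map[string]*AllocationLogEntry)
	}
	entry, ok := r.preemptionLog[err.Error()]
	if !ok {
		entry = &AllocationLogEntry{Message: err.Error()}
		r.preemptionLog[err.Error()] = entry
	}
	entry.Count++
}

func main() {
	errPrecond := errors.New("Preconditions checks failed")
	r := &Request{}
	r.IncPreemptionAttempts()
	r.LogPreemptionFailure(errPrecond)
	r.IncPreemptionAttempts()
	r.LogPreemptionFailure(errPrecond)
	fmt.Printf("attempts=%d, %q count=%d\n", r.preemptionAttempts, errPrecond.Error(), r.preemptionLog[errPrecond.Error()].Count)
}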
Why not just reuse AllocationLog? It already contains events pertaining to things like predicate failures; preemption attempts can be tracked there easily as well.
Yeah, that's fine too.
I had used allocationLog earlier too in a few places, in addition to events & logging. Now I am using only allocationLog and have removed the events and logs. I still see a gap for adding debug logs in a few places, especially after the guarantee check passes. Anyway, we can address this later depending on the need.
-1. Re-requesting a review when nothing has changed isn't helpful.
Force-pushed from 248573e to f996239
Looks good, just minor things
LGTM, I found only a nit
Force-pushed from 586ecc2 to fde38f3
Force-pushed from fde38f3 to 1365158
What is this PR for?
Send events in appropriate places, enhance logging, etc., to help understand the preemption flow while debugging preemption related issues.
What type of PR is it?
Todos
What is the Jira issue?
https://issues.apache.org/jira/browse/YUNIKORN-2837
How should this be tested?
Screenshots (if appropriate)
Questions: