[YUNIKORN-1990] Fixing intra-queue preemption: updating checkPreemption… #659

Closed
lixmgl wants to merge 1 commit


lixmgl
Contributor

@lixmgl lixmgl commented Sep 20, 2023

…QueueGuarantees to check both snapshot and currentQueue

What is this PR for?

Intra-queue preemption occurred in production and unexpectedly shut down a high-tier Spark application.

The previous logic does not seem to work when the snapshot belongs to a different queue than currentQueue.
More details can be found in the first comment on https://issues.apache.org/jira/browse/YUNIKORN-1990

This PR includes the following changes:

  1. A new variable availableResource is defined as a clone of p.headRoom.
  2. For each queue snapshot, it checks whether the snapshot is at or above its guaranteed resource before removing any allocation (alloc).
  3. If the snapshot is no longer at or above its guaranteed resource after the removal, the alloc that was removed is added back.
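
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `Resource`, `QueueSnapshot`, `Alloc`, and `selectVictims` are hypothetical stand-ins, resources are simplified to a plain map, and crediting freed resources to `availableResource` is an assumption about how the clone of the headroom would be used.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource quantity (name -> amount).
type Resource map[string]int64

// Clone returns an independent copy so the caller's value is never mutated.
func (r Resource) Clone() Resource {
	c := make(Resource, len(r))
	for k, v := range r {
		c[k] = v
	}
	return c
}

// Add adds o to r in place.
func (r Resource) Add(o Resource) {
	for k, v := range o {
		r[k] += v
	}
}

// Sub subtracts o from r in place.
func (r Resource) Sub(o Resource) {
	for k, v := range o {
		r[k] -= v
	}
}

// AtOrAbove reports whether r >= o for every resource type named in o.
func (r Resource) AtOrAbove(o Resource) bool {
	for k, v := range o {
		if r[k] < v {
			return false
		}
	}
	return true
}

// QueueSnapshot is a hypothetical per-queue view of usage and guarantee.
type QueueSnapshot struct {
	Name       string
	Allocated  Resource
	Guaranteed Resource
}

// Alloc is a hypothetical candidate allocation that could be preempted.
type Alloc struct {
	Queue string
	Size  Resource
}

// selectVictims keeps a candidate only if its queue stays at or above its
// guarantee once the allocation is removed; otherwise the removal is undone.
func selectVictims(headRoom Resource, queues map[string]*QueueSnapshot, candidates []Alloc) ([]Alloc, Resource) {
	// Step 1: work on a clone so the original headroom is untouched.
	availableResource := headRoom.Clone()
	var victims []Alloc
	for _, alloc := range candidates {
		qs, ok := queues[alloc.Queue]
		// Step 2: only consider queues currently at or above their guarantee.
		if !ok || !qs.Allocated.AtOrAbove(qs.Guaranteed) {
			continue
		}
		qs.Allocated.Sub(alloc.Size)
		// Step 3: if the removal drops the queue below its guarantee, add it back.
		if !qs.Allocated.AtOrAbove(qs.Guaranteed) {
			qs.Allocated.Add(alloc.Size)
			continue
		}
		// Assumption: freed resources are credited to the cloned headroom.
		availableResource.Add(alloc.Size)
		victims = append(victims, alloc)
	}
	return victims, availableResource
}

func main() {
	queues := map[string]*QueueSnapshot{
		"high": {Name: "high", Allocated: Resource{"memory": 10}, Guaranteed: Resource{"memory": 8}},
		"low":  {Name: "low", Allocated: Resource{"memory": 4}, Guaranteed: Resource{"memory": 5}},
	}
	candidates := []Alloc{
		{Queue: "high", Size: Resource{"memory": 2}},
		{Queue: "high", Size: Resource{"memory": 2}},
		{Queue: "low", Size: Resource{"memory": 1}},
	}
	victims, available := selectVictims(Resource{"memory": 1}, queues, candidates)
	fmt.Println(len(victims), available["memory"], queues["high"].Allocated["memory"])
}
```

In this sketch the second allocation from the "high" queue is restored because removing it would push the queue below its guarantee, and the "low" queue is never touched because it already sits below its guarantee.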

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

https://issues.apache.org/jira/browse/YUNIKORN-1990

How should this be tested?

Tests will be updated in the next iteration.

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@wilfred-s
Contributor

The Jira has been closed as Won't Fix, so closing the PR as well.
