Make nodepool concurrent ops scale better #12488
base: main
Conversation
This prevents Terraform from making a massive number of requests at the same time.
Hello! I am a robot. Tests will require approval from a repository maintainer to run. @ScottSuarez, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look. You can help make sure that review is quick by doing a self-review and by running impacted tests locally.
Hi there, I'm the Modular magician. I've detected the following information about your changes:
Diff report: Your PR generated some diffs in downstreams - here they are.
Tests analytics: Total tests: 219. Click here to see the affected service packages.
🟢 All tests passed! View the build log
We should look to upstream this locking behavior into retryWhileIncompatibleOperation.
This resource already has custom locking behavior on these calls. If we want to customize that behavior, we should customize that function. That would mean passing the operation back and doing the wait within that function, rather than introducing a secondary locking mechanism. Otherwise this will be difficult to interpret, and new additions will also need to remember to implement the same behavior you have here.
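For illustration, one possible shape of that upstreaming, sketched with a placeholder Operation type instead of *container.Operation; the locking and retry details are deliberately elided, and this is not the provider's actual implementation:

package sketch

import "time"

// Operation stands in for *container.Operation in this sketch.
type Operation struct{}

// One possible shape for the upstreamed helper: the caller passes in how to
// create the operation and how to wait on it, and this single function owns
// all locking and retrying, so call sites do not layer a second locking
// mechanism on top.
func retryWhileIncompatibleOperation(
	timeout time.Duration,
	lockKey string,
	createOpFunc func() (*Operation, error),
	waitOpFunc func(*Operation) error,
) error {
	// Acquire whatever locks the resource needs, create the operation,
	// then wait on it; retry on incompatible-operation errors until timeout.
	op, err := createOpFunc()
	if err != nil {
		return err
	}
	return waitOpFunc(op)
}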
I think it would be a good idea to sync over a GVC and talk about this a bit!
func retryWhileIncompatibleOperation(timeout time.Duration, lockKey string, clusterLockKey string, createOpFunc func() (*container.Operation, error), waitOpFunc func(*container.Operation) error) error {
	f := func() error {
		transport_tpg.MutexStore.Lock(clusterLockKey)
		op, err := createOpFunc()
		transport_tpg.MutexStore.Unlock(clusterLockKey)
		if err != nil {
			return err
		}
		transport_tpg.MutexStore.RLock(clusterLockKey)
		defer transport_tpg.MutexStore.RUnlock(clusterLockKey)
		return waitOpFunc(op)
	}
So I don't believe this actually changes anything, after going through the behavior loop in my head. LockedCall(createLockKey, ...) holds the lock for the entire operation (both creation and waiting). This means the additional locking/unlocking with waitLockKey inside the operation serves no purpose, as the outer lock already prevents concurrent execution.
Also, I don't believe concurrent updates are supported in the provider. All operations right now are synchronous.
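For context, the outer wrapper referred to here behaves roughly as below (a simplified sketch assuming a keyed RWMutex store in the spirit of the provider's MutexStore; not the provider's actual code). The point being made is that the outer lock already serializes the whole closure, so inner locking on the same key would add nothing:

package sketch

import "sync"

// A keyed lock store standing in for the provider's MutexStore;
// illustrative only.
var locks sync.Map // map[string]*sync.RWMutex

func getLock(key string) *sync.RWMutex {
	l, _ := locks.LoadOrStore(key, &sync.RWMutex{})
	return l.(*sync.RWMutex)
}

// lockedCall mirrors what the outer wrapper does per the diagram below:
// hold a single named lock for the whole of f, i.e. across both creating
// the operation and waiting on it.
func lockedCall(lockKey string, f func() error) error {
	l := getLock(lockKey)
	l.Lock()
	defer l.Unlock()
	return f()
}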
So the goal is that we don't want multiple node pools within the same cluster to create operations at exactly the same time. Currently, users may trigger creation of many node pools for a single cluster using Terraform, and they will all be created at the exact same time, overwhelming the GKE pipeline. The nodepool lock is fine, because we want sequential creation at the cluster level.
Also happy to chat more on GVC.
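To make that concrete, here is a minimal, self-contained sketch of the pattern being described, using a plain sync.RWMutex as the cluster lock rather than the provider's MutexStore: operation creation runs under the write lock (so creations are staggered), while waiting runs under the read lock (so waits from different node pools overlap). This is a demonstration of the idea only, not the provider's code.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var clusterLock sync.RWMutex
	var wg sync.WaitGroup

	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(nodePool int) {
			defer wg.Done()

			// "Create" the operation: serialized across the cluster.
			clusterLock.Lock()
			fmt.Printf("node pool %d: creating operation\n", nodePool)
			time.Sleep(100 * time.Millisecond) // stand-in for the API call
			clusterLock.Unlock()

			// "Wait" on the operation: read lock, so waits run concurrently.
			clusterLock.RLock()
			fmt.Printf("node pool %d: waiting on operation\n", nodePool)
			time.Sleep(300 * time.Millisecond) // stand-in for polling
			clusterLock.RUnlock()
		}(i)
	}
	wg.Wait()
}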
I believe I understand it a bit better after drawing a diagram. I worry a bit about the readability of this code, given that it takes a diagram to follow. I'm okay approving this as is, but could someone from your team also review, because the maintenance will fall more broadly there.
sequenceDiagram
participant Caller
participant Retry
participant LockedCall
participant MutexStore
participant CreateOp as createOpFunc
participant WaitOp as waitOpFunc
Caller->>Retry: retryWhileIncompatibleOperation(timeout, npLockKey, clusterLockKey, ...)
activate Retry
Note over Retry: Start retry loop with timeout
Retry->>LockedCall: LockedCall(npLockKey, f)
activate LockedCall
LockedCall->>MutexStore: Lock(npLockKey)
Note over MutexStore: Write lock on nodepool
LockedCall->>MutexStore: Lock(clusterLockKey)
Note over MutexStore: Write lock on cluster
LockedCall->>CreateOp: createOpFunc()
activate CreateOp
CreateOp-->>LockedCall: Return operation
deactivate CreateOp
LockedCall->>MutexStore: Unlock(clusterLockKey)
Note over MutexStore: Release cluster write lock
LockedCall->>MutexStore: RLock(clusterLockKey)
Note over MutexStore: Read lock on cluster
LockedCall->>WaitOp: waitOpFunc(op)
activate WaitOp
WaitOp-->>LockedCall: Return result
deactivate WaitOp
LockedCall->>MutexStore: RUnlock(clusterLockKey)
Note over MutexStore: Release cluster read lock
LockedCall->>MutexStore: Unlock(npLockKey)
Note over MutexStore: Release nodepool lock (via defer)
alt Operation succeeded
LockedCall-->>Retry: Return nil
Retry-->>Caller: Return nil
else IsFailedPreconditionError or IsQuotaError
LockedCall-->>Retry: Return error
Note over Retry: Retry operation
Retry->>LockedCall: LockedCall(npLockKey, f)
else Other error
LockedCall-->>Retry: Return error
Retry-->>Caller: Return error immediately
end
deactivate LockedCall
deactivate Retry
// Retries an operation while the canonical error code is FAILED_PRECONDTION or RESOURCE_EXHAUSTED which indicates
// there is an incompatible operation already running on the cluster or there are the number of allowed concurrent
// operations running on the cluster. These errors can be safely retried until the incompatible operation completes,
// and the newly requested operation can begin.
// The cluster lock is held during createOpFunc to make opeation creations sequencial, and cluster read lock is held
// during waitOpFunc to allow concurrency.
func retryWhileIncompatibleOperation(timeout time.Duration, lockKey string, clusterLockKey string, createOpFunc func() (*container.Operation, error), waitOpFunc func(*container.Operation) error) error {
Suggested change:

-// Retries an operation while the canonical error code is FAILED_PRECONDTION or RESOURCE_EXHAUSTED which indicates
-// there is an incompatible operation already running on the cluster or there are the number of allowed concurrent
-// operations running on the cluster. These errors can be safely retried until the incompatible operation completes,
-// and the newly requested operation can begin.
-// The cluster lock is held during createOpFunc to make opeation creations sequencial, and cluster read lock is held
-// during waitOpFunc to allow concurrency.
-func retryWhileIncompatibleOperation(timeout time.Duration, lockKey string, clusterLockKey string, createOpFunc func() (*container.Operation, error), waitOpFunc func(*container.Operation) error) error {
+// Retries an operation while the canonical error code is FAILED_PRECONDTION or RESOURCE_EXHAUSTED which indicates
+// there is an incompatible operation already running on the cluster or there are the number of allowed concurrent
+// operations running on the cluster. These errors can be safely retried until the incompatible operation completes,
+// and the newly requested operation can begin.
+// The npLockKey is held during createOpFunc to make operations sequential within the node pool, and
+// clusterLockKey is held during waitOpFunc to allow concurrency on a cluster.
+func retryWhileIncompatibleOperation(timeout time.Duration, npLockKey string, clusterLockKey string, createOpFunc func() (*container.Operation, error), waitOpFunc func(*container.Operation) error) error {
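As a rough illustration of the retry semantics that comment describes: keep re-running the locked create-and-wait closure while the error is one of the retryable kinds, up to the timeout. The isRetryable predicate here is a hypothetical stand-in for the provider's own checks (IsFailedPreconditionError / IsQuotaError in the diagram above); this is a sketch, not the helper's real body.

package retrysketch

import "time"

// retryWhileIncompatible keeps retrying f while the error is classified as
// retryable (e.g. an incompatible operation still running or a concurrency
// quota being hit), until the timeout elapses.
func retryWhileIncompatible(timeout time.Duration, isRetryable func(error) bool, f func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := f()
		if err == nil {
			return nil
		}
		if !isRetryable(err) || time.Now().After(deadline) {
			// Non-retryable errors, or running out of time, are returned
			// to the caller immediately.
			return err
		}
		// Back off before re-attempting the operation.
		time.Sleep(5 * time.Second)
	}
}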
The purpose of this change is to stagger operation creation while still keeping operations concurrent, so that operations are not bottlenecked on creation and do not fail.
To achieve the above, the cluster lock is held while an operation is created (making creations sequential within a cluster), and a cluster read lock is held while the operation is waited on (allowing waits to run concurrently).
Release Note Template for Downstream PRs (will be copied)
See Write release notes for guidance.