randomize initial zone selection #583
base: master
Conversation
Note: this comment has been redacted; see followup below
Example from recent outage:
^ note the exhaustion is in the region (Note that we have occasionally hit
I think this is only the case when a zone is specified in the worker config. In our case it looks like instances are pretty well distributed across zones already, no? I wonder if it would make more sense to have each worker create job instances in its own zone, if possible, since those are already automatically distributed by the managed instance group. Also, not all resource types (in particular GPUs) are available in all zones. So I'm not sure it makes sense to make zone pinning an all-or-nothing thing, because I believe that will cause problems when GPUs (and perhaps some specific CPUs) are specified.
Edit: I take it all back, I see there were plenty of these errors during the last outage:
Thanks for the feedback AJ.
I'm just about to hit the hay, but I thought I would also add that our Google reps recommended temporarily not using a zone when we hit these errors.
I'll write up a more detailed reply when I wake up.
On Mon, Apr 1, 2019 at 11:50 PM, AJ Bowen ***@***.***> wrote:
Edit: I take it all back, I see there were plenty of these errors during the last outage:
Mar 29 23:32:53 production-2-worker-org-gce-4mr9 travis-worker-wrapper: time="2019-03-30T04:32:53Z" level=error msg="couldn't start instance, attempting requeue" err="code=ZONE_RESOURCE_POOL_EXHAUSTED location= message=The zone 'projects/travis-ci-prod-2/zones/us-central1-c' does not have enough resources available to fulfill the request. Try a different zone, or try again later." job_id=123456798 job_path=xyz/xyz/jobs/123456798 pid=1 ***@***.*** repository=xyz/xyz self=step_start_instance start_timeout=8m0s uuid=84c6f99e-f021-4b13-baf1-ba101c22e3ab
This definitely makes more sense now that I know we had hit ZONE_RESOURCE_POOL_EXHAUSTED errors during the last outage.
"NETWORK": fmt.Sprintf("network name (default %q)", defaultGCENetwork), | ||
"PREEMPTIBLE": "boot job instances with preemptible flag enabled (default false)", | ||
"PREMIUM_MACHINE_TYPE": fmt.Sprintf("premium machine type (default %q)", defaultGCEPremiumMachineType), | ||
"PROJECT_ID": "[REQUIRED] GCE project id", | ||
"PROJECT_ID": "[REQUIRED] GCE project id (will try to auto detect it if not set)", |
In that case, [REQUIRED] can be dropped.
backend/gce.go
Outdated
@@ -119,7 +119,7 @@ var (
  "WARMER_URL": "URL for warmer service",
  "WARMER_TIMEOUT": fmt.Sprintf("timeout for requests to warmer service (default %v)", defaultGCEWarmerTimeout),
  "WARMER_SSH_PASSPHRASE": fmt.Sprintf("The passphrase used to decipher instace SSH keys"),
- "ZONE": fmt.Sprintf("zone name (default %q)", defaultGCEZone),
+ "ZONE": "zone in which to deploy job instaces into (default is to use all zones in the region)",
typo (instaces)
Found some fixes!
P.S. share your ideas, feedbacks or issues with us at https://github.com/fixmie/feedback (this message will be removed after the beta stage).
Co-Authored-By: fixmie[bot] <44270338+fixmie[bot]@users.noreply.github.com>
What is the problem that this PR is trying to fix?
Workers, when deployed, are bound to a primary zone. Whenever a job (and its VM) is started, the primary zone is tried first; only if that fails is an alternate zone selected.
The core problem arises when we start hitting resource-exhaustion errors for a zone. Because there is a pool of workers per zone, and each worker tries its primary zone first, the exhausted zone keeps receiving instance-create requests, which puts more pressure on the API and raises the risk of hitting API rate limits.
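To illustrate the pressure: a primary-zone-first strategy looks roughly like the sketch below (hypothetical names, not the actual worker code); every retry spends at least one API call on the exhausted primary zone before reaching an alternate.

```go
package main

import (
	"errors"
	"fmt"
)

// startWithPrimaryFirst always issues the first create call against the
// worker's primary zone, so while that zone is exhausted every retry still
// burns an API call on it before reaching an alternate zone.
func startWithPrimaryFirst(primary string, alternates []string, create func(zone string) error) error {
	var err error
	for _, zone := range append([]string{primary}, alternates...) {
		if err = create(zone); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	// Fake create call standing in for the GCE instance insert request.
	create := func(zone string) error {
		if zone == "us-central1-c" {
			return errors.New("ZONE_RESOURCE_POOL_EXHAUSTED")
		}
		return nil
	}
	fmt.Println(startWithPrimaryFirst("us-central1-c", []string{"us-central1-a", "us-central1-b"}, create))
}
```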
What approach did you choose and why?
This is a first step towards a fix, but not yet the full fix.
This removes the concept of a primary zone and instead picks a random zone on each attempt. A zone can still be defined in the config; if it is, that zone will be used every time.
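Roughly, the idea is something like the sketch below (an illustration, not the code in this PR; the function name and zone list are made up):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// selectZone honours a configured zone when one is set; otherwise it picks a
// zone uniformly at random from the region on every attempt, so no single
// zone is hammered first by every worker.
func selectZone(configuredZone string, regionZones []string) string {
	if configuredZone != "" {
		return configuredZone
	}
	return regionZones[rand.Intn(len(regionZones))]
}

func main() {
	rand.Seed(time.Now().UnixNano())
	zones := []string{"us-central1-a", "us-central1-b", "us-central1-c", "us-central1-f"}

	// No zone pinned in the config: each attempt may land in a different zone.
	for i := 0; i < 3; i++ {
		fmt.Println(selectZone("", zones))
	}

	// Zone pinned in the config: it is used every time.
	fmt.Println(selectZone("us-central1-b", zones))
}
```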
The follow-up work to this PR is to block an exhausted zone for 10 minutes across all workers (possibly using Redis), which should further reduce API usage and give us a better self-healing worker setup.
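Roughly, that Redis-based blocking could look like the sketch below (assumes the go-redis client with its pre-v8, context-free API; the key prefix, TTL constant, and helpers are made up for illustration and are not part of this PR):

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-redis/redis"
)

const zoneBlockTTL = 10 * time.Minute

// blockZone records that a zone returned ZONE_RESOURCE_POOL_EXHAUSTED so that
// every worker sharing this Redis instance skips it until the key expires.
func blockZone(client *redis.Client, zone string) error {
	return client.SetNX("worker:blocked-zone:"+zone, 1, zoneBlockTTL).Err()
}

// zoneBlocked reports whether a zone is still inside its cooldown window.
func zoneBlocked(client *redis.Client, zone string) (bool, error) {
	n, err := client.Exists("worker:blocked-zone:" + zone).Result()
	return n > 0, err
}

func main() {
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	if err := blockZone(client, "us-central1-c"); err != nil {
		fmt.Println("block failed:", err)
	}
	blocked, _ := zoneBlocked(client, "us-central1-c")
	fmt.Println("us-central1-c blocked:", blocked)
}
```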
How can you test this?
I've tested this locally by connecting it to staging, and it worked a treat.
What feedback would you like, if any?
This is for general discussion.
I also need help understanding how to fix the test failures.