google_cloud_run_v2_service: when resources are specified cpu_idle defaults to false #17246

Closed
mattmoor opened this issue Feb 10, 2024 · 8 comments

Comments


mattmoor commented Feb 10, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally
    to expedite investigation and resolution of this issue.

Terraform Version

TF: 1.5, 1.6

I think we're on ~5.14 of the Google provider.

Affected Resource(s)

google_cloud_run_v2_service

Terraform Configuration

resource "google_cloud_run_v2_service" "foo" {
  name     = "foo"
  project  = var.project_id
  location = var.region

  template {
    containers {
      image = "..."

      resources {
        limits = {
          cpu    = "2000m"
          memory = "8Gi"
        }
      }
    }
  }
}

Debug Output

No response

Expected Behavior

When the above is applied, it shows up in the console as idling the CPU when requests aren't in flight.

Actual Behavior

When the above is applied, it shows up in the console as an ALWAYS ON CPU.

Steps to reproduce

  1. Apply the above,
  2. Check the console for whether the CPU is "always allocated",
  3. Remove the resources block from the configuration above, and apply again,
  4. Check the console again.

Important Factoids

IMO it violates the principle of least surprise for the default of a knob I am not turning to CHANGE when I am turning other knobs (which is why this smells like a provider bug). I can see a 20-50x improvement in the CPU allocation metrics in some of our environments after correcting this for just a handful of services (still waiting to see the billing impact).

This is the past ~week's CPU allocation in one environment, where we fixed this for the handful of services that were specifying resources last night:
[Image: CPU allocation over the past week in one environment]

cc @steren

References

No response

b/324764802

@mattmoor mattmoor added the bug label Feb 10, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/run labels Feb 10, 2024

steren commented Feb 10, 2024

Unfortunately, this is consistent with the default CPU allocation of Cloud Run services. CPU is throttled outside of requests.

Jobs and future non-request based workloads default to CPU always on.


steren commented Feb 10, 2024

Unless the Cloud Run API changes its behavior, this is Working As Intended for the Terraform module.

@mattmoor
Author

> Unfortunately, this is consistent with the default CPU allocation of Cloud Run services. CPU is throttled outside of requests.

@steren what I am seeing is that when resources are explicitly specified, CPU stops being throttled outside of requests.

@edwardmedia edwardmedia self-assigned this Feb 10, 2024
@edwardmedia
Contributor

This behavior is controlled by the API. Forwarding the issue to the service team.

@edwardmedia edwardmedia removed their assignment Feb 10, 2024
@edwardmedia edwardmedia removed the forward/review In review; remove label to forward label Feb 10, 2024
@yanweiguo
Contributor

The problem is that cpu_idle is defined as a proto boolean in the Cloud Run API. If resources is not set, resources.cpu_idle defaults to true. If resources is set, we can't tell whether a false value of resources.cpu_idle was set explicitly by the user or comes from the default value of the proto boolean type, so the API just accepts the input value.

I'll update the documentation to call this out. Unfortunately, it can't be improved without an API change or a breaking change on the TF side to set the client-side default value to true.
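
For anyone landing here with the same symptom, the explanation above suggests setting cpu_idle explicitly whenever a resources block is present. The following is a minimal sketch rather than a confirmed fix from the thread: it reuses the configuration from the issue description and adds the provider's cpu_idle argument inside resources. The image value stays elided as in the original, and var.project_id and var.region are assumed to be defined elsewhere.

```hcl
resource "google_cloud_run_v2_service" "foo" {
  name     = "foo"
  project  = var.project_id
  location = var.region

  template {
    containers {
      image = "..."

      resources {
        limits = {
          cpu    = "2000m"
          memory = "8Gi"
        }

        # Set explicitly: per the comment above, once a resources block is
        # present an omitted cpu_idle is sent as false, which the API
        # accepts as "CPU always allocated".
        cpu_idle = true
      }
    }
  }
}
```

After applying, the console should report CPU as allocated only during request processing, which can be checked the same way as in the reproduction steps.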

@mattmoor
Author

I get it. It's an unfortunate design choice that creates a "footgun" where customers hoping to just express resource bounds end up potentially 10x-ing their bill. After finding this issue, I flagged a whole bunch of places where our folks were holding this wrong, and in all of them the bill dropped by a significant multiple.

Like I said, I think this violates the "principle of least surprise" and you are likely to end up with surprised and grumpy customers as a result of this, so you should think about ways to mitigate this if you can't change this.

We've worked around this ourselves, and I think you've heard the feedback, so we can close this issue out.

@melinath
Collaborator

Resolved with improved documentation via GoogleCloudPlatform/magic-modules#10005. Thanks!


I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 24, 2024