Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK: Adjust client RetryMax to not exceed timeout #1072

Merged
merged 1 commit into from
Sep 25, 2024

Conversation

magodo
Copy link
Contributor

@magodo magodo commented Sep 6, 2024

This PR modifies the logic about how the client.RetryMax is deduced. The main motivation is to avoid the retry exceeds the context dead line (if any).

This is useful for terraform-provider-azurerm users to get a useful error message instead of a useless "context deadline exceeded" error message, especially when doing a terraform plan, given all resources' read timeout is defined as 5min.

Reference

hashicorp/terraform-provider-azurerm#21464

Example

Given a storage account with PE correctly setup, while running in an errorneous network environment, that failed to resolve the storage blob data plane domain name. With this implementation, the retry log is as below:

2024-09-06T15:29:53.734+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service
2024-09-06T15:29:53.947+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:29:53.947+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 1s (9 left)
2024-09-06T15:29:55.109+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:29:55.109+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 2s (8 left)
2024-09-06T15:29:57.273+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:29:57.273+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 4s (7 left)
2024-09-06T15:30:01.440+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:30:01.440+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 8s (6 left)
2024-09-06T15:30:09.605+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:30:09.605+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 16s (5 left)
2024-09-06T15:30:26.150+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:30:26.150+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 32s (4 left)
2024-09-06T15:30:58.398+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:30:58.398+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 1m1s (3 left)
2024-09-06T15:31:59.673+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:31:59.673+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 1m1s (2 left)
2024-09-06T15:33:01.063+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:33:01.063+0800 [DEBUG] provider.terraform-provider-azurerm: [DEBUG] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service: retrying in 1m1s (1 left)
2024-09-06T15:34:02.274+0800 [DEBUG] provider.terraform-provider-azurerm: [ERR] GET https://mgd24984.blob.core.windows.net/?comp=properties&restype=service request failed: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:34:02.275+0800 [ERROR] provider.terraform-provider-azurerm: Response contains error diagnostic: @module=sdk.proto diagnostic_detail="" diagnostic_severity=ERROR
  diagnostic_summary=
  | retrieving static website properties for Storage Account (Subscription: "****"
  | Resource Group Name: "mgd24982"
  | Storage Account Name: "mgd24984"): executing request: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
   tf_proto_version=5.6 tf_provider_addr=registry.terraform.io/hashicorp/azurerm tf_req_id=27c512f9-69b0-b73b-6531-50703cc33451 tf_resource_type=azurerm_storage_account tf_rpc=ReadResource @caller=/home/magodo/github/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform-plugin-go/tfprotov5/internal/diag/diagnostics.go:58 timestamp="2024-09-06T15:34:02.275+0800"
2024-09-06T15:34:02.276+0800 [ERROR] vertex "azurerm_storage_account.main" error: retrieving static website properties for Storage Account (Subscription: "****"
Resource Group Name: "mgd24982"
Storage Account Name: "mgd24984"): executing request: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:34:02.276+0800 [ERROR] vertex "azurerm_storage_account.main (expand)" error: retrieving static website properties for Storage Account (Subscription: "****"
Resource Group Name: "mgd24982"
Storage Account Name: "mgd24984"): executing request: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
2024-09-06T15:34:02.276+0800 [WARN]  Planning encountered errors, so plan is not applyable
2024-09-06T15:34:02.276+0800 [INFO]  backend/local: plan operation completed

Planning failed. Terraform encountered an error while generating this plan.

╷
│ Warning: Resource targeting is in effect
│
│ You are creating a plan with the -target option, which means that the result of this plan may not represent all of the changes requested by the current configuration.
│
│ The -target option is not for routine use, and is provided only for exceptional situations such as recovering from errors or mistakes, or when Terraform specifically suggests to use it as part of an error message.
╵
╷
│ Error: retrieving static website properties for Storage Account (Subscription: "****"
│ Resource Group Name: "mgd24982"
│ Storage Account Name: "mgd24984"): executing request: Get "https://mgd24984.blob.core.windows.net/?comp=properties&restype=service": dial tcp: lookup mgd24984.blob.core.windows.net on 10.255.255.254:53: no such host
│
│   with azurerm_storage_account.main,
│   on main.tf line 17, in resource "azurerm_storage_account" "main":
│   17: resource "azurerm_storage_account" "main" {
│
╵
2024-09-06T15:34:02.289+0800 [DEBUG] provider.stdio: received EOF, stopping recv loop: err="rpc error: code = Unavailable desc = error reading from server: EOF"
2024-09-06T15:34:02.291+0800 [INFO]  provider: plugin process exited: plugin=/home/magodo/go/bin/terraform-provider-azurerm id=564845
2024-09-06T15:34:02.291+0800 [DEBUG] provider: plugin exited

@magodo magodo requested a review from a team as a code owner September 6, 2024 07:41
@github-actions github-actions bot added the release-once-merged The SDK should be released once this PR is merged label Sep 6, 2024
Copy link
Contributor

@manicminer manicminer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @magodo, thanks for looking at this, we can certainly improve our handling here to accommodate a wider range of deadlines. I've plotted the resulting times and they match what one would generally expect to see - minus network delays.

Network delay is the first issue with trying to extrapolate the time available, perhaps we should be slightly more conservative with these timings for safety? That won't account for things like DNS delays though, which can add seconds to the roundtrip.

The other problem is that the exponential backoff is only a fallback - in many (most?) cases, the API returns a Retry-After header, and all these timing go right out the window.

I am wondering if the only way to avoid the context expiring is to catch it before invoking the next retried request. We could extend RetryFunc to receive the context from the configured CheckRetry, then it can determine whether the next retry should take place?

I still think we can merge this, but I don't consider this fully solved just yet 🙂

@manicminer manicminer merged commit 95e46d5 into hashicorp:main Sep 25, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-once-merged The SDK should be released once this PR is merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants