Slow Network Performance with S3 Remote State when in Docker container on IAM Role-attached host #3458

ngearhart · 2024-10-08T21:18:14Z

Describe the bug

When using Terragrunt with S3 Remote State in a Docker container, Terragrunt needs to authenticate to AWS S3 directly (not via underlying terraform). When you are on an EC2 instance that has an IAM Role attached (not access keys), Terragrunt uses the EC2 Metadata API via the underlying AWS Go SDK. This results in very poor performance during the remote state initialization process. On AWS GovCloud us-gov-west-1, the remote state initialization takes >10 seconds in a Docker container, whereas it takes <1 second natively.

Steps To Reproduce

Use S3 Remote State.

remote_state {
  backend = "s3"

  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }

  config = {
    encrypt = true
    key     = format("data/%s/terraform.tfstate", path_relative_to_include())
    bucket  = ...
    region  = ...
    skip_bucket_public_access_blocking = true
    dynamodb_table = ...
  }
}

Create a Docker image with Terragrunt.

FROM alpine:3.20.1 AS builder

# Install curl (used for the downloads below), the AWS CLI, and bash (needed by the entrypoint)
RUN apk add --no-cache curl aws-cli bash

# Define the OpenTofu and Terragrunt versions to download
ARG TOFU_VERSION=1.8.3
ARG TERRAGRUNT_VERSION=0.67.16

# Download Tofu
RUN curl -LO https://github.com/opentofu/opentofu/releases/download/v${TOFU_VERSION}/tofu_${TOFU_VERSION}_amd64.apk && \
  mv tofu_${TOFU_VERSION}_amd64.apk /usr/local/bin/tofu.apk && \
  apk add --allow-untrusted /usr/local/bin/tofu.apk

# Download Terragrunt
RUN curl -LO https://github.com/gruntwork-io/terragrunt/releases/download/v${TERRAGRUNT_VERSION}/terragrunt_linux_amd64 && \
  mv terragrunt_linux_amd64 /usr/local/bin/terragrunt

# Make tofu and terragrunt executable
RUN chmod +x /usr/bin/tofu && chmod +x /usr/local/bin/terragrunt

# environment variables
ENV TERRAGRUNT_TFPATH="tofu"
ENV TERRAGRUNT_NON_INTERACTIVE="false"
ENV TERRAGRUNT_PROVIDER_CACHE=0
ENV TERRAGRUNT_PARALLELISM=1

# Set default entrypoint to bash
ENTRYPOINT ["/bin/bash"]

Create an EC2 instance with an IAM role attached with necessary permissions.
Start a shell in the Docker container on the EC2 instance with docker run -it ... bash
Inside the Docker container, run terragrunt init (or terragrunt plan, terragrunt apply, or any other command that uses remote state).
Notice that it takes significant time before the underlying terraform init runs.
This "significant time" is at least 10x as long as it would be outside the docker container. In fact, in a certain environment I operate in, it is 4-6 minutes which is unbearably long for each terragrunt operation. I can provide more details about this environment privately.

Expected behavior

The command should take at most a few seconds before actually running the underlying terraform command.

Logs

Here is an example of debug logs (sanitized for privacy).

$ terragrunt init --terragrunt-log-level debug --terragrunt-debug
21:22:45.713 DEBUG  Terragrunt Version: 0.67.1
21:22:45.725 DEBUG  Did not find any locals block: skipping evaluation.
21:22:45.731 DEBUG  Found locals block: evaluating the expressions.
21:22:45.741 DEBUG  Evaluated 2 locals (remaining 0): env, terraform_cache_dir
... env logs ...
21:22:49.344 DEBUG  Running command: tofu --version
21:22:49.420 DEBUG  tofu version: 1.8.1
21:22:49.420 DEBUG  Reading Terragrunt config file at terragrunt.hcl
21:22:49.421 DEBUG  Did not find any locals block: skipping evaluation.
21:22:49.424 DEBUG  Found locals block: evaluating the expressions.
21:22:49.431 DEBUG  Evaluated 2 locals (remaining 0): env, terraform_cache_dir
... env logs ...
21:22:49.464 DEBUG  Getting output of dependency .. for config terragrunt.hcl
... dependency logs ...
21:23:06.924 DEBUG  Found locals block: evaluating the expressions.
21:23:06.931 DEBUG  Evaluated 2 locals (remaining 0): env, terraform_cache_dir
21:23:06.936 DEBUG  Found locals block: evaluating the expressions.
21:23:06.937 DEBUG  Evaluated 2 locals (remaining 0): env, terraform_cache_dir
21:23:06.940 DEBUG  Included config ../../../terragrunt.hcl has strategy shallow merge: merging config in (shallow).
21:23:06.947 DEBUG  Found locals block: evaluating the expressions.
21:23:06.949 DEBUG  Evaluated 1 locals (remaining 0): env
21:23:06.953 DEBUG  Found locals block: evaluating the expressions.
21:23:06.961 DEBUG  Evaluated 1 locals (remaining 0): env
21:23:06.970 DEBUG  Included config ../../../_env/emr.hcl has strategy shallow merge: merging config in (shallow).
21:23:06.970 DEBUG  Detected 1 Hooks
21:23:06.970 INFO   Downloading Terraform configurations from ...
21:23:07.022 DEBUG  Detected 1 Hooks
21:23:07.024 DEBUG  Copying files from...
21:23:07.027 DEBUG  Setting working directory to ...
21:23:07.028 DEBUG  Generated file .terragrunt-cache/w_zPDJwXr8fxnrUd-w10tIHl8HM/Xz4P-Jhavj4obcO3eEDRzJIDlyI/providers.tf.
21:23:07.028 DEBUG  Generated file .terragrunt-cache/w_zPDJwXr8fxnrUd-w10tIHl8HM/Xz4P-Jhavj4obcO3eEDRzJIDlyI/backend.tf.
21:23:07.028 INFO   Debug mode requested: generating debug file terragrunt-debug.tfvars.json in working dir ...
21:23:07.071 DEBUG  The following variables were detected in the terraform module:
21:23:07.071 DEBUG  [...]
21:23:07.071 DEBUG  WARN: The variable ssl_certificate was omitted because it is not defined in the terraform module.
21:23:07.071 DEBUG  WARN: The variable immtua_endpoint was omitted because it is not defined in the terraform module.
21:23:07.071 DEBUG  WARN: The variable custom_logging_filename was omitted because it is not defined in the terraform module.
21:23:07.071 DEBUG  WARN: The variable cert_private_key was omitted because it is not defined in the terraform module.
21:23:07.071 DEBUG  Variables passed to terraform are located in "sanitized"
21:23:07.071 DEBUG  Run this command to replicate how terraform was invoked:
21:23:07.071 DEBUG      terraform -chdir="sanitized" init -var-file="sanitized"
21:23:07.072 DEBUG  Initializing remote state for the s3 backend
21:23:13.330 DEBUG  Verifying AWS S3 Bucket Versioning <bucket name>
21:23:13.337 DEBUG  Checking if SSE is enabled for AWS S3 bucket <bucket name>
21:23:13.358 DEBUG  Checking if bucket <bucket name> is have root access
21:23:13.366 DEBUG  Policy for RootAccess already exists for bucket <bucket name>
21:23:13.366 DEBUG  Checking if bucket <bucket name> is enforced with TLS
21:23:13.374 DEBUG  Policy for EnforcedTLS already exists for bucket <bucket name>
21:23:13.374 DEBUG  S3 bucket is already up to date
21:23:13.374 DEBUG  Verifying AWS S3 Bucket Versioning <bucket name>
21:23:19.665 DEBUG  Running command: tofu init
21:23:19.750 STDOUT tofu: Initializing the backend...
21:23:23.378 STDOUT tofu:
21:23:23.378 STDOUT tofu: Successfully configured the backend "s3"! OpenTofu will automatically
21:23:23.378 STDOUT tofu: use this backend unless the backend configuration changes.
21:23:23.467 STDOUT tofu: Initializing provider plugins...
21:23:23.468 STDOUT tofu: - Finding hashicorp/random versions matching "3.5.1"...
21:23:23.470 STDOUT tofu: - Finding hashicorp/null versions matching "3.2.1"...
... other providers ...
21:23:31.429 STDOUT tofu:
21:23:31.429 STDOUT tofu: OpenTofu has been successfully initialized!
21:23:31.429 STDOUT tofu:
21:23:31.429 STDOUT tofu: You may now begin working with OpenTofu. Try running "tofu plan" to see
21:23:31.429 STDOUT tofu: any changes that are required for your infrastructure. All OpenTofu commands
21:23:31.429 STDOUT tofu: should now work.
21:23:31.429 STDOUT tofu: If you ever set or change modules or backend configuration for OpenTofu,
21:23:31.429 STDOUT tofu: rerun this command to reinitialize your working directory. If you forget, other
21:23:31.429 STDOUT tofu: commands will detect it and remind you to do so if necessary.

Notice the roughly 6-second gap between "Initializing remote state for the s3 backend" and the next line, and a second gap of about 6 seconds before "Running command: tofu init". That may not seem too bad, but it is much worse than outside the Docker container.
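The gaps can also be pulled out of a debug log mechanically. Here is a small sketch, assuming each log line starts with an HH:MM:SS.mmm timestamp as in the excerpt above and that the run does not cross midnight. log_gaps is a hypothetical helper name, not part of Terragrunt.

```shell
# log_gaps MIN_SECONDS: read log lines on stdin and print every line that
# was logged at least MIN_SECONDS after the previous timestamped line.
# Assumption: lines start with HH:MM:SS.mmm; other lines are ignored.
log_gaps() {
  awk -v min="$1" '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9]+$/ {
      split($1, t, ":")
      now = t[1] * 3600 + t[2] * 60 + t[3]   # seconds since midnight
      if (seen && now - prev >= min)
        printf "%.3f s before: %s\n", now - prev, $0
      prev = now
      seen = 1
    }'
}

# Usage on a few lines taken from the log above:
log_gaps 1 <<'EOF'
21:23:07.072 DEBUG  Initializing remote state for the s3 backend
21:23:13.330 DEBUG  Verifying AWS S3 Bucket Versioning <bucket name>
21:23:13.374 DEBUG  S3 bucket is already up to date
21:23:13.374 DEBUG  Verifying AWS S3 Bucket Versioning <bucket name>
21:23:19.665 DEBUG  Running command: tofu init
EOF
# prints:
# 6.258 s before: 21:23:13.330 DEBUG  Verifying AWS S3 Bucket Versioning <bucket name>
# 6.291 s before: 21:23:19.665 DEBUG  Running command: tofu init
```

On this excerpt the two stalls add up to about 12.5 seconds, both inside the remote state initialization step and before tofu init starts.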

Versions

Terragrunt version: 0.67.16
OpenTofu version: 1.8.3
Environment details: AWS EC2 instance with IAM role attached, inside Docker container

Workaround

I found a workaround: run the Docker container with host networking (docker run --network host -it ... bash).

Additional context

I believe this is related to the AWS SDK calling the Instance Metadata service. When I run netstat, I see tons of calls to the .internal DNS name for the Instance Metadata service (169.254.169.254). My theory is that something is funny with the networking and it leads to slowness but not timeouts/errors.

Admittedly, this might be a problem with the underlying AWS SDK for Go, but I think that is unlikely.
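One way to test the theory above is to time requests to the Instance Metadata Service from inside the container and again on the host. Here is a rough sketch, assuming curl is available (the image above installs it). imds_time is a hypothetical helper name. The PUT request is the documented IMDSv2 session-token call. Whether the SDK bundled in Terragrunt stalls on that call in this environment is an assumption to check, not something confirmed here.

```shell
# imds_time METHOD PATH [extra curl options...]
# Print the HTTP status code and total request time for one IMDS request.
# A status of 000 means curl got no response within the 10 s limit.
imds_time() {
  method="$1"
  req_path="$2"
  shift 2
  curl -s -o /dev/null --max-time 10 -X "$method" "$@" \
    -w '%{http_code} %{time_total}s\n' \
    "http://169.254.169.254${req_path}"
}
```

Run both of the following inside the container and on the host, then compare the timings:

imds_time PUT /latest/api/token -H 'X-aws-ec2-metadata-token-ttl-seconds: 60'
imds_time GET /latest/meta-data/

If only the PUT is slow or times out inside the container, one possible cause is the instance's metadata PUT response hop limit, which is often 1 and can be too low for bridged container networking. That is a hypothesis, not a finding from this issue.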

yhakbar · 2024-10-15T15:30:29Z

Hey @ngearhart,

I believe that reaching out to instance metadata is one of the first steps in all AWS SDK implementations.

I think a more direct fix for your issue is to take advantage of the disable_bucket_update = true configuration, which will prevent all attempts to update your S3 + DynamoDB backend, avoiding the attempt to authenticate with AWS at all.
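For anyone looking for where that setting goes: disable_bucket_update is set inside the config map of the remote_state block. Here is a minimal sketch based on the configuration from the original report, with the other settings left out rather than guessed. Whether it removes every AWS call during initialization has not been verified here.

```hcl
remote_state {
  backend = "s3"

  config = {
    # Existing settings from the report (bucket, region, key, ...) stay as they are.

    # Tell Terragrunt not to check or update the existing S3 bucket.
    disable_bucket_update = true
  }
}
```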

Long term, the CLI shouldn't attempt to automatically make any adjustments to backend resources without explicit opt-in. I've shared a proposal to address that here: #3445

Closing this issue, as it's not really something that can be addressed with a change to how Terragrunt works.

ngearhart · 2024-10-15T15:33:03Z

@yhakbar Understood. Thanks for walking me through that! I'm comfortable with closing this too, and happy to have this record so if anyone else runs into this, they know the workaround and context.
Have a great day!

ngearhart added the bug label Oct 8, 2024

yhakbar closed this as completed Oct 15, 2024