
Use of private_endpoint_subnetwork in private GKE stops deployment with error #20429

Open
juangascon opened this issue Nov 21, 2024 · 10 comments

Labels: bug, forward/review (In review; remove label to forward), service/container

Comments

@juangascon

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.9.8
on linux_amd64

  • provider registry.terraform.io/hashicorp/google v6.10.0

This also happens with provider versions 5.44.2 and 6.1.0.
The issue has probably existed since v5.18, when the private_endpoint_subnetwork attribute became Optional instead of Read-Only.
I do not know whether this is related to issue #15422.

Affected Resource(s)

This happens in the google_container_cluster resource when the cluster is configured as private and the IP range of the control-plane subnetwork is given through the private_endpoint_subnetwork attribute instead of master_ipv4_cidr_block.

There are two ways of providing the CIDR range for the control plane endpoint (contrasted in the sketch after this list):

  1. master_ipv4_cidr_block
  2. private_endpoint_subnetwork
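
For illustration, a minimal sketch of the two alternatives inside private_cluster_config (only the relevant attributes are shown; everything else is omitted):

# Option 1: GKE creates a new /28 subnet from the CIDR you provide
private_cluster_config {
  enable_private_nodes   = true
  master_ipv4_cidr_block = "192.168.40.16/28"
}

# Option 2: GKE uses a subnetwork that you create and manage yourself
private_cluster_config {
  enable_private_nodes        = true
  private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
}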

The GCP documentation "Create a cluster and select the control plane IP address range" says:

  • If you use the private-endpoint-subnetwork flag, GKE provisions the control plane internal endpoint with an IP address from the range that you define.
  • If you use the master-ipv4-cidr flag, GKE creates a new subnet from the values that you provide. GKE provisions the control plane internal endpoint with an IP address from this new range.

So, if your organization's security constraints force you to enable VPC Flow Logs on all subnetworks, you, and not GCP, have to create the subnet in order to turn the feature on with Terraform; Terraform cannot (or only with real difficulty) modify the parameters of a resource created outside its scope.
However, if I create a subnet and set its name as the value of the private_endpoint_subnetwork attribute, I get the following error:

│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for module.gke.google_container_cluster.prototype to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/google"
│ produced an invalid new value for .private_cluster_config[0].enable_private_endpoint: was null, but now cty.False.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

Gemini 1.5 Pro explains the error:
The error message indicates that the google_container_cluster resource's private_cluster_config.enable_private_endpoint attribute is unexpectedly changing from null to false during the apply phase, even though it's not explicitly defined in your configuration.

Claude 3.5 Sonnet goes into more detail:
The problem seems to be in how the provider handles the private_cluster_config state during the plan and apply phases, specifically around PSC (Private Service Connect) clusters.

Until the bug is fixed in the provider, a quick workaround proposed by Gemini 1.5 Pro is to explicitly set enable_private_endpoint to null in the configuration:

resource "google_container_cluster" "prototype" {
  # ... other configurations ...
  private_cluster_config {
    enable_private_nodes        = true
    enable_private_endpoint     = null # Explicitly set to null
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }
  # ... rest of your configuration ...
}

Terraform Configuration

resource "google_compute_subnetwork" "cluster_control_plane" {
  name                     = local.control_plane_private_endpoint_subnet_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = var.private_control_plane_subnetwork_ip_cidr_range

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "prototype" {
        // ... other GKE configurations ...
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }
        // ... other GKE configurations ...
}

Debug Output

No response

Expected Behavior

The GKE deployment should finish correctly with the custom, user-created subnetwork associated with the private control plane.

Actual Behavior

The deployment stops with the following error:

│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for module.gke.google_container_cluster.prototype to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/google"
│ produced an invalid new value for .private_cluster_config[0].enable_private_endpoint: was null, but now cty.False.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

Steps to reproduce

  1. Configure google_container_cluster as a private cluster using the private_cluster_config block, declaring the control-plane subnet with the private_endpoint_subnetwork attribute instead of declaring a CIDR range with master_ipv4_cidr_block.
  2. Run terraform apply.

Important Factoids

It seems that the file where the bug lives is resource_container_cluster.go.

I am not a Go developer, so I do not know whether the solution proposed by Claude 3.5 Sonnet via GitHub Copilot is correct, nor can I submit a pull request.

Claude 3.5 Sonnet says:

Here's what needs to be fixed:

  1. In the containerClusterPrivateClusterConfigSuppress function, modify how suppression is handled for PSC clusters.
  2. Ensure the enable_private_endpoint field is properly handled when private_endpoint_subnetwork is specified.

Here's the fix:

// Modify the containerClusterPrivateClusterConfigSuppress function
func containerClusterPrivateClusterConfigSuppress(k, old, new string, d *schema.ResourceData) bool {
    o, n := d.GetChange("private_cluster_config.0.enable_private_endpoint")
    suppressEndpoint := !o.(bool) && !n.(bool)
 
    o, n = d.GetChange("private_cluster_config.0.enable_private_nodes")
    suppressNodes := !o.(bool) && !n.(bool)
 
    // Do not suppress diffs when private_endpoint_subnetwork is configured
    _, hasSubnet := d.GetOk("private_cluster_config.0.private_endpoint_subnetwork")
 
    // Do not suppress diffs when master_global_access_config is configured
    _, hasGlobalAccessConfig := d.GetOk("private_cluster_config.0.master_global_access_config")
 
    // Do not suppress diffs when master_ipv4_cidr_block is configured
    _, hasMasterCidr := d.GetOk("private_cluster_config.0.master_ipv4_cidr_block")
 
    if k == "private_cluster_config.0.enable_private_endpoint" {
        // Never suppress enable_private_endpoint when private_endpoint_subnetwork is set
        if hasSubnet {
            return false
        }
        return suppressEndpoint && !hasMasterCidr
    } else if k == "private_cluster_config.0.enable_private_nodes" {
        return suppressNodes && !hasSubnet && !hasMasterCidr
    } else if k == "private_cluster_config.#" {
        return suppressEndpoint && suppressNodes && !hasSubnet && !hasGlobalAccessConfig && !hasMasterCidr
    } else if k == "private_cluster_config.0.private_endpoint_subnetwork" {
        return tpgresource.CompareSelfLinkOrResourceName(k, old, new, d)
    }
    return false
}

And modify how we handle the private_cluster_config in the expansion:

// Note: the upstream function only receives `configured`; a *schema.ResourceData
// parameter is added here so that the d.GetOk call below is actually in scope.
func expandPrivateClusterConfigMasterIpv4CidrBlock(configured interface{}, d *schema.ResourceData, c *container.Cluster) *container.PrivateClusterConfig {
    v := configured.(string)

    config := &container.PrivateClusterConfig{
        EnablePrivateNodes: c.NetworkConfig.DefaultEnablePrivateNodes,
        ForceSendFields:    []string{"EnablePrivateNodes"},
    }

    // If master_ipv4_cidr_block is set
    if v != "" {
        config.MasterIpv4CidrBlock = v
        config.ForceSendFields = append(config.ForceSendFields, "MasterIpv4CidrBlock")
    }

    // If private_endpoint_subnetwork is set, ensure enable_private_endpoint is explicitly sent
    if _, ok := d.GetOk("private_cluster_config.0.private_endpoint_subnetwork"); ok {
        config.EnablePrivateEndpoint = false
        config.ForceSendFields = append(config.ForceSendFields, "EnablePrivateEndpoint")
    }

    return config
}
 
These changes should:

  • Prevent suppression of diffs when private_endpoint_subnetwork is specified
  • Ensure enable_private_endpoint is explicitly set when using PSC clusters
  • Better handle the state transitions between the plan and apply phases

The error occurs because the provider tries to be clever about when to send the enable_private_endpoint field to the API, which causes inconsistencies in the state. The fix ensures that when private_endpoint_subnetwork is used, the enable_private_endpoint value is sent explicitly.
After applying these changes, you should be able to use either master_ipv4_cidr_block or private_endpoint_subnetwork without hitting the inconsistent-plan error.
Remember that you can't use both master_ipv4_cidr_block and private_endpoint_subnetwork at the same time: they are mutually exclusive, since they represent different kinds of private cluster configuration (traditional private GKE vs. PSC-based private GKE).
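
To make that mutual exclusivity explicit, a module wrapper could guard its inputs before they reach google_container_cluster. This is only a hedged sketch: the variable names and the terraform_data guard resource are hypothetical, not something from the provider or from this issue.

variable "master_ipv4_cidr_block" {
  type    = string
  default = null
}

variable "private_endpoint_subnetwork" {
  type    = string
  default = null
}

# Fails the plan when both control-plane range options are set at once.
resource "terraform_data" "control_plane_range_guard" {
  lifecycle {
    precondition {
      condition     = var.master_ipv4_cidr_block == null || var.private_endpoint_subnetwork == null
      error_message = "Set only one of master_ipv4_cidr_block or private_endpoint_subnetwork."
    }
  }
}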

References

No response

@ggtisc
Collaborator

ggtisc commented Nov 22, 2024

Hi @juangascon!

I tried to replicate this issue with the following configuration, but everything worked without errors. Could you review it and try again?

resource "google_compute_network" "vpc_20429" {
  name = "vpc-20429"
  auto_create_subnetworks = false
  enable_ula_internal_ipv6 = true
}

resource "google_compute_subnetwork" "vpc_subnet_20429" { # this config is only available with ULA mandatory for 
  name                        = "vpc-subnet-20429"                                       # ipv6_access_type = "INTERNAL"
  region                      = "us-central1"
  network                     = google_compute_network.vpc_20429.name
  private_ip_google_access    = true
  ip_cidr_range               = "10.2.0.0/16"
  stack_type                  = "IPV4_IPV6"                                                  # change to ULA is only available through GCP console
  private_ipv6_google_access  = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type            = "INTERNAL"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "container_cluster_20429" {
  name = "container-cluster-20429"
  location = "us-central1"
  initial_node_count = 1
  deletion_protection = false
  network = google_compute_network.vpc_20429.name
  subnetwork = google_compute_subnetwork.vpc_subnet_20429.name

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    private_endpoint_subnetwork = google_compute_subnetwork.vpc_subnet_20429.name

    master_global_access_config {
      enabled = false
    }
  }
}

@juangascon
Author

Hello @ggtisc
Thanks for your answer.
I see that you use the same subnetwork for the cluster and for the private endpoint.
I will try that and get back to you.
Still, I am quite surprised this is allowed. I thought this second subnetwork had to be a separate one that does not overlap any other range and had to be a /28.
Well, that is what the documentation says for master_ipv4_cidr_block.
Maybe that constraint does not apply to the private_endpoint_subnetwork attribute, and the same subnetwork can be shared between the nodes and the master endpoint.
Maybe the documentation should be a little more explicit on this point.

Again, I will try and come back to you. :-)

@juangascon
Author

Hello @ggtisc
I tried the changes in my configuration and got the same error again.
I thought this might be because I use the configuration as a module, so that I have a single version to maintain and can reuse it in several work projects without copying files.
But that is not the reason. I also tried the source configuration directly, using the same subnetwork for both the cluster and the control plane, and I get the same error. 😢 So I am back to using two separate subnetworks for the nodes and the control plane.

As soon as I use master_ipv4_cidr_block = "192.168.40.16/28" or enable_private_endpoint = null, the deployment finishes perfectly.
If I combine enable_private_endpoint = false with private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name, the deployment crashes. 😢
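
A condensed sketch of the combinations described above (not a complete configuration, just the relevant block):

private_cluster_config {
  enable_private_nodes        = true

  # Works on the first apply:
  #   enable_private_endpoint = null
  # Also works (without private_endpoint_subnetwork):
  #   master_ipv4_cidr_block  = "192.168.40.16/28"
  # Crashes with "Provider produced inconsistent final plan":
  enable_private_endpoint     = false
  private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
}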

The original source configuration runs on these versions:
Terraform v1.9.8
on linux_amd64

  • provider registry.terraform.io/hashicorp/google v4.60.2

My configuration with the latest versions of Terraform and the Google provider is as follows:

resource "google_compute_network" "prototype" {
  name                     = "${var.name_root}-${random_string.vpc_suffix.result}"
  auto_create_subnetworks  = "false"
  enable_ula_internal_ipv6 = true # fixes CKV_GCP_76 - side effect - needed to allow ipv6 Internal
}

resource "google_compute_subnetwork" "prototype" {
  lifecycle {
    ignore_changes = [
      secondary_ip_range,
    ]
  }
  name                     = local.subnetwork_prototype_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = "10.40.1.0/24"
  secondary_ip_range {
    range_name    = local.pods_range_name
    ip_cidr_range = "10.240.0.0/14"
  }
  secondary_ip_range {
    range_name    = local.services_range_name
    ip_cidr_range = "10.244.0.0/20"
  }

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config { # fixes CHK_GCP_26
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_compute_subnetwork" "cluster_control_plane" {
  name                     = local.control_plane_private_endpoint_subnet_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = "192.168.40.16/28"

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config { # fixes CHK_GCP_26
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "prototype" {
  lifecycle {
    ignore_changes = [
      node_config,
      ip_allocation_policy,
    ]
  }
  deletion_protection      = false
  name                     = local.cluster_prototype_name
  location                 = local.cluster_zone
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.prototype.name
  subnetwork               = google_compute_subnetwork.prototype.name
  min_master_version       = data.google_container_engine_versions.prototype.release_channel_latest_version[var.cluster_release_channel] # fixes CKV_GCP_67
  release_channel {
    channel = var.cluster_release_channel
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  network_policy { # fixes CKV_GCP_12 - first part
    enabled = true
  }
  addons_config { # fixes CKV_GCP_12 - second part
    network_policy_config {
      disabled = false
    }
  }

  ip_allocation_policy {
    # Needed in private node
    # If the parameters are commented or not written or
    # if their values are empty (null), GCP will allocate them
    cluster_secondary_range_name  = local.pods_range_name
    services_secondary_range_name = local.services_range_name
  }

  private_cluster_config {
    enable_private_nodes        = true
    enable_private_endpoint     = null # Explicitly set to null to avoid a bug of the provider that
                                       # unexpectedly changes the attribute from null to false
                                       # during the apply phase
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }

  binary_authorization { # fixes CKV_GCP_66 with updated parameter
    evaluation_mode = var.cluster_binary_authorization ? "PROJECT_SINGLETON_POLICY_ENFORCE" : "DISABLED"
  }
  enable_intranode_visibility = true # fixes CHK_GCP_61
  enable_shielded_nodes       = true

  node_config {
    shielded_instance_config {
      enable_integrity_monitoring = true
      enable_secure_boot          = true
    }
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    labels = {
      name      = "catalyser"
      lifecycle = "ephemeral"
    }
    workload_metadata_config { # fixes CKV_GCP_69
      mode = "GKE_METADATA"
    }
    tags = ["ephemeral", "catalyser"]
  }

  vertical_pod_autoscaling {
    enabled = var.cluster_enable_vertical_pod_autoscaling
  }

  workload_identity_config { # fixes CKV_GCP_69 - side effect
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  resource_labels = {
    use_case = lower(var.use_case)
    creator  = lower(var.cluster_creator)
    scope    = "private"
  }
}

@ggtisc
Collaborator

ggtisc commented Nov 27, 2024

I'm ready to try again, but some values are missing that I need to replicate your configuration. Could you provide the following values, or confirm whether I can use any values and configuration?

  1. min_master_version = data.google_container_engine_versions.prototype.release_channel_latest_version[var.cluster_release_channel] # fixes CKV_GCP_67
  2. channel = var.cluster_release_channel
  3. cluster_secondary_range_name = local.pods_range_name
  4. services_secondary_range_name = local.services_range_name
  5. evaluation_mode = var.cluster_binary_authorization ? "PROJECT_SINGLETON_POLICY_ENFORCE" : "DISABLED"
  6. enabled = var.cluster_enable_vertical_pod_autoscaling
  7. use_case = lower(var.use_case)
  8. creator = lower(var.cluster_creator)

For sensitive data, you can use examples like the following, or specify that we can use any value and configuration:

@juangascon
Author

@ggtisc
Just to let you know that I have seen your message.
You are right.
I will try to respond with the values tonight.

@juangascon
Author

Hello @ggtisc
Indeed, for some parameters you can use whatever values you want, but others have to be unique. Here are the values Terraform is using, either given explicitly by me (tfvars) or built by Terraform itself (data.tf and locals.tf):

  1. min_master_version = "1.31.1-gke.2105000"
  2. channel = "REGULAR"
  3. cluster_secondary_range_name = "gke-tf-poc-oj1u-pods"
  4. services_secondary_range_name = "gke-tf-poc-oj1u-services"
  5. evaluation_mode = "DISABLED"
  6. enabled = true
  7. use_case = "prototype"
  8. creator = "juangascon"

In the "prototype" subnetwork, the range_name values of the secondary_ip_range blocks are the same as the secondary range names in the cluster:

  secondary_ip_range {
    range_name    = "gke-tf-poc-oj1u-pods"
    ip_cidr_range = "10.240.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-tf-poc-oj1u-services"
    ip_cidr_range = "10.244.0.0/20"
  }

The variables use_case and cluster_creator can have any value because they are only labels.
The values for cluster_secondary_range_name and services_secondary_range_name can be anything, but they have to be unique, as you know; that is why there is a random 4-character string in the name.
The other variables must take specific values from a list of allowed values.
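
For convenience, the same values expressed as variable assignments. This is only an illustrative sketch derived from the list above; my actual tfvars file is not shown here.

cluster_release_channel                 = "REGULAR"
cluster_binary_authorization            = false # so that evaluation_mode resolves to "DISABLED"
cluster_enable_vertical_pod_autoscaling = true
use_case                                = "prototype"
cluster_creator                         = "juangascon"
# min_master_version, cluster_secondary_range_name and services_secondary_range_name
# are computed from data.tf and locals.tf shown below.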

This is the code for data.tf:

# Data fetched from the GCP resources
# Get the available zones in the project region
data "google_compute_zones" "available" {
  project = var.project_id
  region  = var.region
}

# Obtain the available cluster versions in the zone
data "google_container_engine_versions" "prototype" {
  project  = var.project_id
  location = local.cluster_zone
}

and here is part of locals.tf:

resource "random_string" "cluster_suffix" {
  # Suffix for the cluster name
  length  = 4
  lower   = true
  upper   = false
  numeric = true
  special = false
}

locals {
  # Build the cluster name
  cluster_prototype_name = "tf-poc-${random_string.cluster_suffix.result}"

  # Define the zone where the cluster will be deployed
  # the cluster will be deployed in the first available zone in the region
  cluster_zone = data.google_compute_zones.available.names[0]

  # Define Private control plane's endpoint subnet name
  control_plane_private_endpoint_subnet_name = "gke-${local.cluster_prototype_name}-cp-subnet"
}

@ggtisc ggtisc assigned NickElliot and unassigned ggtisc Nov 28, 2024
@ggtisc
Collaborator

ggtisc commented Nov 28, 2024

I can't replicate this issue, so I'm passing it to the next on-call member @NickElliot

@juangascon
Author

OK. I do not understand what is different on my side that triggers this issue.
My cluster is a private one with a custom node pool, not the default one (remove_default_node_pool = true); see the sketch below.
I do not know if this is an important point.
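
For reference, a minimal, hypothetical sketch of the kind of separate node pool I use (the real one has more settings; the name, size and machine type here are illustrative only):

resource "google_container_node_pool" "prototype" {
  name       = "tf-poc-pool"
  location   = local.cluster_zone
  cluster    = google_container_cluster.prototype.name
  node_count = 1

  node_config {
    machine_type = "e2-medium"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}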

@NickElliot
Collaborator

I'm not sure I understand the issue: are you receiving the "unexpectedly change" error when you configure enable_private_endpoint to be explicitly null, or when it's absent from your state? And at which stage are you receiving the error? Could you provide the following two things in one post, without excerpts from LLMs, so it's easier to read:

  1. your .tf config file
  2. a log of your terminal from when you run terraform apply to when you receive the error message

@juangascon
Author

Sorry for the late answer. A very tragic personal event happened in my family 15 days ago and I did not check on this. Sorry.

The problem comes when, in the private_cluster_config block, we have enable_private_endpoint = false AND we use the private_endpoint_subnetwork parameter for the control-plane subnetwork instead of master_ipv4_cidr_block.
In fact, if we run terraform apply again right after it exits with the error, the deployment then finishes correctly. I wonder why.
If we set enable_private_endpoint = null, the deployment finishes smoothly on the first try.

My configuration is in this post:
#20429 (comment)

Though, for the tests, you have to change enable_private_endpoint back to false.
I set enable_private_endpoint = null to make it work, instead of the original enable_private_endpoint = false.
The error appears during terraform apply. I will post the whole log when I run the configuration again on my Linux PC.

Thanks a lot for taking care of this.
