
Use of private_endpoint_subnetwork in private GKE stops deployment with error #20429

Open
juangascon opened this issue Nov 21, 2024 · 10 comments

Labels: bug, forward/review (In review; remove label to forward), service/container

Comments

@juangascon

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.9.8
on linux_amd64

  • provider registry.terraform.io/hashicorp/google v6.10.0

This also happens with provider versions 5.44.2 and 6.1.0.
The issue has probably existed since v5.18, when the private_endpoint_subnetwork attribute became Optional instead of Read-Only.
I do not know whether this is related to issue #15422.

Affected Resource(s)

This happens in the google_container_cluster resource when the cluster is configured as private and the IP range of the control-plane subnetwork is given through the private_endpoint_subnetwork attribute instead of master_ipv4_cidr_block.

There are two ways of providing the CIDR range for the control plane endpoint (contrasted in the sketch after this list):

  1. master_ipv4_cidr_block
  2. private_endpoint_subnetwork
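
For illustration, a minimal sketch of the two alternatives inside private_cluster_config (only the relevant attributes are shown; everything else is omitted):

# Option 1: GKE creates a new /28 subnet from the CIDR you provide
private_cluster_config {
  enable_private_nodes   = true
  master_ipv4_cidr_block = "192.168.40.16/28"
}

# Option 2: GKE uses a subnetwork that you create and manage yourself
private_cluster_config {
  enable_private_nodes        = true
  private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
}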

The GCP documentation "Create a cluster and select the control plane IP address range" says:

  • If you use the private-endpoint-subnetwork flag, GKE provisions the control plane internal endpoint with an IP address from the range that you define.
  • If you use the master-ipv4-cidr flag, GKE creates a new subnet from the values that you provide. GKE provisions the control plane internal endpoint with an IP address from this new range.

So, if your organization's security constraints force you to enable VPC Flow Logs on all subnetworks, you, and not GCP, have to create the subnet in order to turn the feature on with Terraform; Terraform cannot (or only with real difficulty) modify the parameters of a resource created outside its scope.
However, if I create a subnet and set its name as the value of the private_endpoint_subnetwork attribute, I get the following error:

│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for module.gke.google_container_cluster.prototype to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/google"
│ produced an invalid new value for .private_cluster_config[0].enable_private_endpoint: was null, but now cty.False.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

Gemini 1.5 Pro explains the error:
The error message indicates that the google_container_cluster resource's private_cluster_config.enable_private_endpoint attribute is unexpectedly changing from null to false during the apply phase, even though it's not explicitly defined in your configuration.

Claude 3.5 Sonnet goes into more detail:
The problem seems to be in how the provider handles the private_cluster_config state during the plan and apply phases, specifically around PSC (Private Service Connect) clusters.

Until the bug is fixed in the provider, a quick workaround proposed by Gemini 1.5 Pro is to explicitly set enable_private_endpoint to null in the configuration:

resource "google_container_cluster" "prototype" {
  # ... other configurations ...
  private_cluster_config {
    enable_private_nodes        = true
    enable_private_endpoint     = null # Explicitly set to null
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }
  # ... rest of your configuration ...
}

Terraform Configuration

resource "google_compute_subnetwork" "cluster_control_plane" {
  name                     = local.control_plane_private_endpoint_subnet_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = var.private_control_plane_subnetwork_ip_cidr_range

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "prototype" {
        // ... other GKE configurations ...
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }
        // ... other GKE configurations ...
}

Debug Output

No response

Expected Behavior

The GKE deployment should finish correctly with the custom, user-created subnetwork associated with the private control plane.

Actual Behavior

The deployment stops with the following error:

│ Error: Provider produced inconsistent final plan
│ 
│ When expanding the plan for module.gke.google_container_cluster.prototype to include new values learned so far during apply, provider "registry.terraform.io/hashicorp/google"
│ produced an invalid new value for .private_cluster_config[0].enable_private_endpoint: was null, but now cty.False.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

Steps to reproduce

  1. Configure google_container_cluster as a private cluster using the private_cluster_config block, declaring the control-plane subnet with the private_endpoint_subnetwork attribute instead of declaring a CIDR range with master_ipv4_cidr_block.
  2. Run terraform apply.

Important Factoids

It seems that the file where the bug lives is resource_container_cluster.go.

I am not a Go developer, so I do not know whether the solution proposed by Claude 3.5 Sonnet via GitHub Copilot is correct, nor can I submit a pull request.

Claude 3.5 Sonnet says:

Here's what needs to be fixed:

  1. In the containerClusterPrivateClusterConfigSuppress function, modify how suppression is handled for PSC clusters.
  2. Ensure the enable_private_endpoint field is properly handled when private_endpoint_subnetwork is specified.

Here's the fix:

// Modify the containerClusterPrivateClusterConfigSuppress function
func containerClusterPrivateClusterConfigSuppress(k, old, new string, d *schema.ResourceData) bool {
    o, n := d.GetChange("private_cluster_config.0.enable_private_endpoint")
    suppressEndpoint := !o.(bool) && !n.(bool)
 
    o, n = d.GetChange("private_cluster_config.0.enable_private_nodes")
    suppressNodes := !o.(bool) && !n.(bool)
 
    // Do not suppress diffs when private_endpoint_subnetwork is configured
    _, hasSubnet := d.GetOk("private_cluster_config.0.private_endpoint_subnetwork")
 
    // Do not suppress diffs when master_global_access_config is configured
    _, hasGlobalAccessConfig := d.GetOk("private_cluster_config.0.master_global_access_config")
 
    // Do not suppress diffs when master_ipv4_cidr_block is configured
    _, hasMasterCidr := d.GetOk("private_cluster_config.0.master_ipv4_cidr_block")
 
    if k == "private_cluster_config.0.enable_private_endpoint" {
        // Never suppress enable_private_endpoint when private_endpoint_subnetwork is set
        if hasSubnet {
            return false
        }
        return suppressEndpoint && !hasMasterCidr
    } else if k == "private_cluster_config.0.enable_private_nodes" {
        return suppressNodes && !hasSubnet && !hasMasterCidr
    } else if k == "private_cluster_config.#" {
        return suppressEndpoint && suppressNodes && !hasSubnet && !hasGlobalAccessConfig && !hasMasterCidr
    } else if k == "private_cluster_config.0.private_endpoint_subnetwork" {
        return tpgresource.CompareSelfLinkOrResourceName(k, old, new, d)
    }
    return false
}

And modify how we handle the private_cluster_config in the expansion:

// Note: the upstream function only receives `configured`; a *schema.ResourceData
// parameter is added here so that the d.GetOk call below is actually in scope.
func expandPrivateClusterConfigMasterIpv4CidrBlock(configured interface{}, d *schema.ResourceData, c *container.Cluster) *container.PrivateClusterConfig {
    v := configured.(string)

    config := &container.PrivateClusterConfig{
        EnablePrivateNodes: c.NetworkConfig.DefaultEnablePrivateNodes,
        ForceSendFields:    []string{"EnablePrivateNodes"},
    }

    // If master_ipv4_cidr_block is set
    if v != "" {
        config.MasterIpv4CidrBlock = v
        config.ForceSendFields = append(config.ForceSendFields, "MasterIpv4CidrBlock")
    }

    // If private_endpoint_subnetwork is set, ensure enable_private_endpoint is explicitly sent
    if _, ok := d.GetOk("private_cluster_config.0.private_endpoint_subnetwork"); ok {
        config.EnablePrivateEndpoint = false
        config.ForceSendFields = append(config.ForceSendFields, "EnablePrivateEndpoint")
    }

    return config
}
 
These changes should:

  • Prevent suppression of diffs when private_endpoint_subnetwork is specified
  • Ensure enable_private_endpoint is explicitly set when using PSC clusters
  • Better handle the state transitions between the plan and apply phases

The error occurs because the provider tries to be clever about when to send the enable_private_endpoint field to the API, which causes inconsistencies in the state. The fix ensures that when private_endpoint_subnetwork is used, the enable_private_endpoint value is sent explicitly.
After applying these changes, you should be able to use either master_ipv4_cidr_block or private_endpoint_subnetwork without hitting the inconsistent-plan error.
Remember that you can't use both master_ipv4_cidr_block and private_endpoint_subnetwork at the same time: they are mutually exclusive, since they represent different kinds of private cluster configuration (traditional private GKE vs. PSC-based private GKE).
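
To make that mutual exclusivity explicit, a module wrapper could guard its inputs before they reach google_container_cluster. This is only a hedged sketch: the variable names and the terraform_data guard resource are hypothetical, not something from the provider or from this issue.

variable "master_ipv4_cidr_block" {
  type    = string
  default = null
}

variable "private_endpoint_subnetwork" {
  type    = string
  default = null
}

# Fails the plan when both control-plane range options are set at once.
resource "terraform_data" "control_plane_range_guard" {
  lifecycle {
    precondition {
      condition     = var.master_ipv4_cidr_block == null || var.private_endpoint_subnetwork == null
      error_message = "Set only one of master_ipv4_cidr_block or private_endpoint_subnetwork."
    }
  }
}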

References

No response

@ggtisc
Collaborator

ggtisc commented Nov 22, 2024

Hi @juangascon!

I tried to replicate this issue with the following configuration, but everything worked without errors. Could you review it and try again?

resource "google_compute_network" "vpc_20429" {
  name = "vpc-20429"
  auto_create_subnetworks = false
  enable_ula_internal_ipv6 = true
}

resource "google_compute_subnetwork" "vpc_subnet_20429" { # this config is only available with ULA mandatory for 
  name                        = "vpc-subnet-20429"                                       # ipv6_access_type = "INTERNAL"
  region                      = "us-central1"
  network                     = google_compute_network.vpc_20429.name
  private_ip_google_access    = true
  ip_cidr_range               = "10.2.0.0/16"
  stack_type                  = "IPV4_IPV6"                                                  # change to ULA is only available through GCP console
  private_ipv6_google_access  = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type            = "INTERNAL"

  log_config {
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "container_cluster_20429" {
  name = "container-cluster-20429"
  location = "us-central1"
  initial_node_count = 1
  deletion_protection = false
  network = google_compute_network.vpc_20429.name
  subnetwork = google_compute_subnetwork.vpc_subnet_20429.name

  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    private_endpoint_subnetwork = google_compute_subnetwork.vpc_subnet_20429.name

    master_global_access_config {
      enabled = false
    }
  }
}

@juangascon
Author

Hello @ggtisc
Thanks for your answer.
I see that you use the same subnetwork for the cluster and for the private endpoint.
I will try that and get back to you.
Still, I am quite surprised this is allowed. I thought this second subnetwork had to be a separate one that does not overlap any other range and had to be a /28.
Well, that is what the documentation says for master_ipv4_cidr_block.
Maybe that constraint does not apply to the private_endpoint_subnetwork attribute, and the same subnetwork can be shared between the nodes and the master endpoint.
Maybe the documentation should be a little more explicit on this point.

Again, I will try and come back to you. :-)

@juangascon
Author

Hello @ggtisc
I tried the changes in my configuration and got the same error again.
I thought this might be because I use the configuration as a module, so that I have a single version to maintain and can reuse it in several work projects without copying files.
But that is not the reason. I also tried the source configuration directly, using the same subnetwork for both the cluster and the control plane, and I get the same error. 😢 So I am back to using two separate subnetworks for the nodes and the control plane.

As soon as I use master_ipv4_cidr_block = "192.168.40.16/28" or enable_private_endpoint = null, the deployment finishes perfectly.
If I combine enable_private_endpoint = false with private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name, the deployment crashes. 😢
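
A condensed sketch of the combinations described above (not a complete configuration, just the relevant block):

private_cluster_config {
  enable_private_nodes        = true

  # Works on the first apply:
  #   enable_private_endpoint = null
  # Also works (without private_endpoint_subnetwork):
  #   master_ipv4_cidr_block  = "192.168.40.16/28"
  # Crashes with "Provider produced inconsistent final plan":
  enable_private_endpoint     = false
  private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
}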

The original source configuration runs on these versions:
Terraform v1.9.8
on linux_amd64

  • provider registry.terraform.io/hashicorp/google v4.60.2

My configuration with the latest versions of Terraform and the Google provider is as follows:

resource "google_compute_network" "prototype" {
  name                     = "${var.name_root}-${random_string.vpc_suffix.result}"
  auto_create_subnetworks  = "false"
  enable_ula_internal_ipv6 = true # fixes CKV_GCP_76 - side effect - needed to allow ipv6 Internal
}

resource "google_compute_subnetwork" "prototype" {
  lifecycle {
    ignore_changes = [
      secondary_ip_range,
    ]
  }
  name                     = local.subnetwork_prototype_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = "10.40.1.0/24"
  secondary_ip_range {
    range_name    = local.pods_range_name
    ip_cidr_range = "10.240.0.0/14"
  }
  secondary_ip_range {
    range_name    = local.services_range_name
    ip_cidr_range = "10.244.0.0/20"
  }

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config { # fixes CHK_GCP_26
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_compute_subnetwork" "cluster_control_plane" {
  name                     = local.control_plane_private_endpoint_subnet_name
  region                   = var.region
  network                  = google_compute_network.prototype.name
  private_ip_google_access = true
  ip_cidr_range            = "192.168.40.16/28"

  stack_type                 = "IPV4_IPV6"
  private_ipv6_google_access = "ENABLE_OUTBOUND_VM_ACCESS_TO_GOOGLE"
  ipv6_access_type           = "INTERNAL"

  log_config { # fixes CHK_GCP_26
    aggregation_interval = "INTERVAL_10_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_container_cluster" "prototype" {
  lifecycle {
    ignore_changes = [
      node_config,
      ip_allocation_policy,
    ]
  }
  deletion_protection      = false
  name                     = local.cluster_prototype_name
  location                 = local.cluster_zone
  remove_default_node_pool = true
  initial_node_count       = 1
  network                  = google_compute_network.prototype.name
  subnetwork               = google_compute_subnetwork.prototype.name
  min_master_version       = data.google_container_engine_versions.prototype.release_channel_latest_version[var.cluster_release_channel] # fixes CKV_GCP_67
  release_channel {
    channel = var.cluster_release_channel
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  network_policy { # fixes CKV_GCP_12 - first part
    enabled = true
  }
  addons_config { # fixes CKV_GCP_12 - second part
    network_policy_config {
      disabled = false
    }
  }

  ip_allocation_policy {
    # Needed in private node
    # If the parameters are commented or not written or
    # if their values are empty (null), GCP will allocate them
    cluster_secondary_range_name  = local.pods_range_name
    services_secondary_range_name = local.services_range_name
  }

  private_cluster_config {
    enable_private_nodes        = true
    enable_private_endpoint     = null # Explicitly set to null to avoid a bug of the provider that
                                       # unexpectedly changes the attribute from null to false
                                       # during the apply phase
    private_endpoint_subnetwork = google_compute_subnetwork.cluster_control_plane.name
    master_global_access_config {
      enabled = false
    }
  }

  binary_authorization { # fixes CKV_GCP_66 with updated parameter
    evaluation_mode = var.cluster_binary_authorization ? "PROJECT_SINGLETON_POLICY_ENFORCE" : "DISABLED"
  }
  enable_intranode_visibility = true # fixes CHK_GCP_61
  enable_shielded_nodes       = true

  node_config {
    shielded_instance_config {
      enable_integrity_monitoring = true
      enable_secure_boot          = true
    }
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    labels = {
      name      = "catalyser"
      lifecycle = "ephemeral"
    }
    workload_metadata_config { # fixes CKV_GCP_69
      mode = "GKE_METADATA"
    }
    tags = ["ephemeral", "catalyser"]
  }

  vertical_pod_autoscaling {
    enabled = var.cluster_enable_vertical_pod_autoscaling
  }

  workload_identity_config { # fixes CKV_GCP_69 - side effect
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  resource_labels = {
    use_case = lower(var.use_case)
    creator  = lower(var.cluster_creator)
    scope    = "private"
  }
}

@ggtisc
Collaborator

ggtisc commented Nov 27, 2024

I'm ready to try again, but some values are missing that I need to replicate your configuration. Could you provide the following values, or confirm whether I can use any values and configuration?

  1. min_master_version = data.google_container_engine_versions.prototype.release_channel_latest_version[var.cluster_release_channel] # fixes CKV_GCP_67
  2. channel = var.cluster_release_channel
  3. cluster_secondary_range_name = local.pods_range_name
  4. services_secondary_range_name = local.services_range_name
  5. evaluation_mode = var.cluster_binary_authorization ? "PROJECT_SINGLETON_POLICY_ENFORCE" : "DISABLED"
  6. enabled = var.cluster_enable_vertical_pod_autoscaling
  7. use_case = lower(var.use_case)
  8. creator = lower(var.cluster_creator)

For sensitive data, you can use examples like the following, or specify that we can use any value and configuration:

@juangascon
Author

@ggtisc
Just to let you know that I have seen your message.
You are right.
I will try to respond with the values tonight.

@juangascon
Author

Hello @ggtisc
Indeed, for some parameters you can use whatever values you want, but others have to be unique. Here are the values Terraform is using, either given explicitly by me (tfvars) or built by Terraform itself (data.tf and locals.tf):

  1. min_master_version = "1.31.1-gke.2105000"
  2. channel = "REGULAR"
  3. cluster_secondary_range_name = "gke-tf-poc-oj1u-pods"
  4. services_secondary_range_name = "gke-tf-poc-oj1u-services"
  5. evaluation_mode = "DISABLED"
  6. enabled = true
  7. use_case = "prototype"
  8. creator = "juangascon"

In the "prototype" subnetwork, the range_name values of the secondary_ip_range blocks are the same as the secondary range names in the cluster:

  secondary_ip_range {
    range_name    = "gke-tf-poc-oj1u-pods"
    ip_cidr_range = "10.240.0.0/14"
  }
  secondary_ip_range {
    range_name    = "gke-tf-poc-oj1u-services"
    ip_cidr_range = "10.244.0.0/20"
  }

The variables use_case and cluster_creator can have any value because they are only labels.
The values for cluster_secondary_range_name and services_secondary_range_name can be anything, but they have to be unique, as you know; that is why there is a random 4-character string in the name.
The other variables must take specific values from a list of allowed values.
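
For convenience, the same values expressed as variable assignments. This is only an illustrative sketch derived from the list above; my actual tfvars file is not shown here.

cluster_release_channel                 = "REGULAR"
cluster_binary_authorization            = false # so that evaluation_mode resolves to "DISABLED"
cluster_enable_vertical_pod_autoscaling = true
use_case                                = "prototype"
cluster_creator                         = "juangascon"
# min_master_version, cluster_secondary_range_name and services_secondary_range_name
# are computed from data.tf and locals.tf shown below.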

This is the code for data.tf:

# Data fetched from the GCP resources
# Get the available zones in the project region
data "google_compute_zones" "available" {
  project = var.project_id
  region  = var.region
}

# Obtain the available cluster versions in the zone
data "google_container_engine_versions" "prototype" {
  project  = var.project_id
  location = local.cluster_zone
}

and here is part of locals.tf:

resource "random_string" "cluster_suffix" {
  # Suffix for the cluster name
  length  = 4
  lower   = true
  upper   = false
  numeric = true
  special = false
}

locals {
  # Build the cluster name
  cluster_prototype_name = "tf-poc-${random_string.cluster_suffix.result}"

  # Define the zone where the cluster will be deployed
  # the cluster will be deployed in the first available zone in the region
  cluster_zone = data.google_compute_zones.available.names[0]

  # Define Private control plane's endpoint subnet name
  control_plane_private_endpoint_subnet_name = "gke-${local.cluster_prototype_name}-cp-subnet"
}

@ggtisc ggtisc assigned NickElliot and unassigned ggtisc Nov 28, 2024
@ggtisc
Collaborator

ggtisc commented Nov 28, 2024

I can't replicate this issue, so I'm passing it to the next on-call member @NickElliot

@juangascon
Author

OK. I do not understand what is different on my side that triggers this issue.
My cluster is a private one with a custom node pool, not the default one (remove_default_node_pool = true); see the sketch below.
I do not know if this is an important point.
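
For reference, a minimal, hypothetical sketch of the kind of separate node pool I use (the real one has more settings; the name, size and machine type here are illustrative only):

resource "google_container_node_pool" "prototype" {
  name       = "tf-poc-pool"
  location   = local.cluster_zone
  cluster    = google_container_cluster.prototype.name
  node_count = 1

  node_config {
    machine_type = "e2-medium"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
  }
}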

@NickElliot
Collaborator

I'm not sure I understand the issue: are you receiving the "unexpectedly change" error when you configure enable_private_endpoint to be explicitly null, or when it's absent from your state? And at which stage are you receiving the error? Could you provide the following two things in one post, without excerpts from LLMs, so it's easier to read:

  1. your .tf config file
  2. a log of your terminal from when you run terraform apply to when you receive the error message

@juangascon
Author

Sorry for the late answer. A very tragic personal event happened in my family 15 days ago and I did not check on this. Sorry.

The problem comes when, in the private_cluster_config block, we have enable_private_endpoint = false AND we use the private_endpoint_subnetwork parameter for the control-plane subnetwork instead of master_ipv4_cidr_block.
In fact, if we run terraform apply again right after it exits with the error, the deployment then finishes correctly. I wonder why.
If we set enable_private_endpoint = null, the deployment finishes smoothly on the first try.

My configuration is in this post:
#20429 (comment)

Though, for the tests, you have to change enable_private_endpoint back to false.
I set enable_private_endpoint = null to make it work, instead of the original enable_private_endpoint = false.
The error appears during terraform apply. I will post the whole log when I run the configuration again on my Linux PC.

Thanks a lot for taking care of this.
