Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

google_service_networking_connection uses servicenetworking.delete when google console utilizes networks.removePeering #18834

Open
joekiller opened this issue Jul 24, 2024 · 16 comments

Comments

@joekiller
Copy link

joekiller commented Jul 24, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version & Provider Version(s)

Terraform v1.9.0
on darwin_arm64

  • provider registry.terraform.io/hashicorp/google v5.38.0
  • provider registry.terraform.io/hashicorp/google-beta v5.38.0
  • cdktf v0.20.8

Affected Resource(s)

google_service_networking_connection

Terraform Configuration

The details in #16275 (comment) are still applicable.

Debug Output

No response

Expected Behavior

No response

Actual Behavior

No response

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

The issue "google_service_networking_connection destroy calls appear to always fail in 5.x despite guidance" (#16275) was closed by GoogleCloudPlatform/magic-modules#9765 despite that being a bandaid and not a real fix.

Researching this issue, it appears that the google console still utilizes the networks.removePeering as the struct indicates it is monitoring gceRemoveNetworkPeering

[
  {
    "results": [
      {
        "data": {
          "response": {
            "name": "operations/flow/abc123",
            "metadata": {
              "@type": "type.googleapis.com/com.google.cloud.clientapi.flows.api.FlowMetadataProto",
              "code": "PENDING",
              "phantomData": {
                "@type": "type.googleapis.com/google.protobuf.Struct",
                "value": {
                  "phantomRows": [
                    {
                      "networkUrl": "projects/redacted/global/networks/network-vpc",
                      "peeringName": "servicenetworking-googleapis-com",
                      "tooltipMessageType": "RESOURCE_BEING_DELETED"
                    }
                  ]
                }
              },
              "createTime": "2024-07-24T15:03:01.471Z",
              "description": {
                "descriptionKey": "gceRemoveNetworkPeering",
                "descriptionArgs": {
                  "resourceName": "servicenetworking-googleapis-com"
                }
              },
              "project": {
                "projectNumber": "redacted11111",
                "projectId": "redacted",
                "projectName": "redacted"
              },
              "flowType": "cloud-console.coliseum.poller"
            },
            "done": false,
            "error": null,
            "response": null,
            "result": "resultNotSet"
          }
        },
        "path": []
      }
    ],
    "responseContext": {
      "eti": ""
    }
  }
]'

@mike-callahan suggested that the situation dictates

either state is abandoned or add back removePeering functionality that uses the compute api

and the former is not a good solution. I advocate that the latter is implemented being that even google's console utilizes the latter technique.

There is a certain irony that even Google found this API call to not work and just worked around it instead of getting it fixed: GoogleCloudPlatform/magic-modules#8904

b/362749609

@joekiller joekiller added the bug label Jul 24, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/service-networking labels Jul 24, 2024
@mike-callahan
Copy link

@joekiller I can't find the original issue with details. But we decided to go with abandon state because removePeering has its own set of problems. When a peering is removed it is not deleted. It then sits in this in-between state. There is a limit on the number of stale peerings that can exist. Customers were getting API errors because too many peerings existed (everytime they terraform applied and terraform destroyed a new one was created but not deleted).

While abandon doesn't "fix" the problem, it gives the terraform pipeline an opportunity to succeed without leaving any resources in an in-between state. A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)

@johanblumenberg
Copy link

100% reproducible using hashicorp/google v5.42.0 and terraform v1.9.2 on darwin_arm64 and linux_amd64.

To reproduce:

terraform apply -var project=$GOOGLE_PROJECT_ID
terraform destroy

Error message:

╷
│ Error: Unable to remove Service Networking Connection, err: Error waiting for Delete Service Networking Connection: Error code 9, message: Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection.
│ Help Token: AQAb6By_iOkEaXMwmQNzXPulToF-cG4Rtk9gdS7ayT_D8ziGThjT9lT_ZCwcIVUH7Usrmj_Fb400_krKcE2eV4ROLJftZKd8RtniyGZm1yy745HD
│ 
│ 
╵

Terraform config:

provider "google" {
  project = var.project
  region  = "europe-north1"
  zone    = "europe-north1-a"
}

variable "project" {
  default = "veritru-dev-332314"
}

resource "google_compute_network" "vpc" {
  name                    = "vpc"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "cloud_sql_ip_range" {
  project       = var.project
  provider      = google
  name          = "cloud-sql-ip-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "172.20.0.0"
  prefix_length = 16
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "cloud_sql_vpc_connection" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.cloud_sql_ip_range.name]
}

resource "random_integer" "db_suffix" {
  min = 1000
  max = 1999
}
resource "google_sql_database_instance" "postgresql-db01" {
  project             = var.project
  name                = "postgresql-db-${random_integer.db_suffix.result}"
  region              = "europe-north1"
  database_version    = "POSTGRES_13"
  deletion_protection = false

  settings {
    tier = "db-f1-micro"

    location_preference {
      zone = "europe-north1-a"
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.self_link
    }
  }
  depends_on = [google_service_networking_connection.cloud_sql_vpc_connection]
}

@joekiller
Copy link
Author

@mike-callahan, I understand what you are saying and while I agree with the following vis a vie ensuring servicenetworking.delete works correctly I do not agree that choosing abandon and removing the networks.removePeering call as a better resolution.

A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)

As I mentioned originally, Google utilizes the networks.removePeering call when you click the delete peering button in the actual console.

I had deployed and fully removed a cloudsql instance from the project and was only able to remove the peering via the console, which utilized the networks.removePeering function. There was no association left. I agree that Google should fix the apparently dangling association. However, I do not think if a choice is to be made regarding relying on servicenetworking.delete instead of networks.removePeering vs abandoning the connection when even Google doesn't use servicenetworking.delete that abandoning is the best point of action.

I advocate that network.removePeering be used. I imagine that the in between situation where the peering connection is in flux due to infrastructure still utilizing it only existed when the dependsOn attribute was not properly applied which is specified in the documentation regarding this relationship needing to be specially applied when utilizing a private IP. I'm happy to change my opinion if you can surface the situations outside of that case. My deployment was properly using the dependsOn and deleted the cloudsql instance prior to attempting to delete the peering. Regardless of if I did the servicenetworking.delete two hours later or manually, it always failed. As soon as I used network.removePeering it cleaned up and didn't exist in a transient state.

@ggtisc
Copy link
Collaborator

ggtisc commented Aug 27, 2024

hi @joekiller!

I tried to replicate this issue with this example. But everything works fine without errors creating and destroying the resources. The unique difference is that I added this property: auto_create_subnetworks = false

This is the used terraform code, if you have something different share it with us to try to replicate this issue again:

resource "google_compute_network" "cn_18834" {
  name = "cn-18834"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "cga_18834" {
  name          = "cga-18834"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.cn_18834.id
}

resource "google_service_networking_connection" "snc_18834" {
  network                 = google_compute_network.cn_18834.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.cga_18834.name]
}

@joekiller
Copy link
Author

@ggtisc, I agree that it works in the limited context that you have cited. The the removal fails if you try to reproduce utilizing the example Johnan provided earlier or the config I referenced, #16275 (where the decision to abandon vs utilize removePeering).

My argument is that since Google console still uses networks.removePeering instead of servicenetworking.delete that it is reasonable for Terraform to also still use the removePeering method vs delete + abandon.

Yes this is very much in confluence with google_sql_database_instance, there may be other situations. If a user utilizes google_sql_database_instance with the appropriate depends_on notation it is reasonable for the user to expect to not to have to abandon the peering interface when it could be deleted.

@mike-callahan
Copy link

mike-callahan commented Aug 27, 2024

@mike-callahan
Copy link

mike-callahan commented Aug 27, 2024

Thanks for your thoughts @joekiller

I think it would be good to solve this "correctly". At least have terraform mirror the functionality of the console. I will investigate further. It would likely be a breaking change.

@joekiller
Copy link
Author

joekiller commented Aug 28, 2024

Thanks for fishing up those edge cases Mike. 😅 Seems like patching was definitely an improvement for those update situations. Hopefully a combination of patch for update, abandon options, and delete otherwise could generate the most satisfying conclusion.

@mike-callahan
Copy link

mike-callahan commented Aug 28, 2024

@rileykarson @zli82016 @c2thorn @roaks3

Service networking connection has been a persistent issue for a while now with various changes having unintended side-effects. We have used a lot of work arounds like abandon state and the latest update peering ranges fix. I propose we coordinate a comprehensive solution to this.

Firstly I want to document the behavior of a service networking connection flow with CloudSQL so we are all in agreement.

Private Service Access for CloudSQL and others works as follows (https://cloud.google.com/service-infrastructure/docs/enabling-private-services-access)

The consumer

  1. Enables Service Networking API
  2. Allocates an IP range
  3. Creates a "private connection to services"

From the consumer perspective step 3 does more than one thing including:

  1. Peering the consumer VPC with a Google managed VPC
  2. Importing routes to the subnet containing the managed instance (automatic/non-configurable)

The producer

  1. Creates a project for service networking
  2. Creates a project for CloudSQL
  3. Makes the service networking project a Shared VPC host
  4. Attaches the CloudSQL project to the Shared VPC host project
  5. Deploys a SQL instance onto the Shared VPC subnet
  6. Peers with the consumer project
  7. Exports routes (automatic/non-configurable)

Because Google managed services are deployed as service projects to a peered host project, only one service networking connection is required. New CloudSQL instances and other PSA services will deploy as service projects. That causes the following:

  1. Creating and deleting a service networking connection is actually creating and deleting a Google Shared VPC. Because CloudSQL is sitting on the Shared VPC it must be fully deleted before the Shared VPC can be deleted. This causes the behavior in issue-1947177558
  2. removePeering deletes the peering between the consumer VPC and the Google Shared VPC but does not delete the Google Shared VPC. This causes the behavior in b/305256825
  3. Consumers using a their own Shared VPC service projects unintentionally overwrite each others service networking configs issue 16735

@mike-callahan
Copy link

mike-callahan commented Aug 28, 2024

Users should be able to:

  1. Create and destroy CloudSQL instances repeatedly
  2. Consume a service networking connection without having to create it or hard-code its self-link
  3. Update ip allocations without breaking other users terraform

I propose some changes/additions to address these items:

  1. Offer a service_networking_connection data block. Network admins can create the service networking connection along with their VPC since its a VPC-wide configuration. Users can reference the datablock in their CloudSQL configuration instead of trying to create a new one.
  2. Allow the optional breakout of the networking connection config to allow additive ranges (basically the ask in issue 16735. issue 10830 being fixed and available in 6.0 set us up well for this by fixing the default CRUD action.
  3. Update the CloudSQL terraform documentation. The documentation inadvertently suggests that a new service_networking_connection should exist for each CloudSQL instance. It should clearly state that service_networking_connection is a VPC wide resource.

Example

# Keep the current google_service_networking_connection as the CRUD resource
resource "google_service_networking_connection" "default" {
  network                = google_compute_network.peering_network.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}

# Have an accompanying data resource 
data "google_service_networking_connection" "default" {
  network                = google_compute_network.peering_network.id
  name                 = "servicenetworking-googleapis-com"
}

# Offer an additive config option
# Having multiple of these blocks will append IP ranges to service networking connection
resource "google_service_networking_connection_config" "default" {
  network                = google_compute_network.peering_network.id
  name                 =  data.google_service_networking_connection.default
  reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}

@mike-callahan
Copy link

mike-callahan commented Aug 28, 2024

As an aside, after digging more into this and looking back at the beginning of this issue I don't believe the console uses removePeering. It looks to use servicenetworking.delete:
Screenshot 2024-08-27 at 9 06 35 PM

(Assuming you delete it from the servicenetworking pane)
Screenshot 2024-08-27 at 9 11 29 PM

@mike-callahan
Copy link

mike-callahan commented Aug 28, 2024

If you delete the VPC peering directly (equivalent of removePeering) (by using the "VPC NETWORK PEERING" pane) instead of using the service networking pane it will "delete" the service networking connection. The connection is still likely there though and this is a GUI bug (might be the cause of #15260 and b/305256825?).
Screenshot 2024-08-27 at 9 13 14 PM
Screenshot 2024-08-27 at 9 13 35 PM

@mike-callahan
Copy link

Directly deleting the VPC peering does seem to be an issue as it left CloudSQL in an error state
Screenshot 2024-08-27 at 9 46 22 PM

Interestingly I can add back the peering manually and it accepts
Screenshot 2024-08-27 at 9 52 27 PM

If I then try to re-create the service networking connection after the peering is added manually it claims "success" but nothing happens. Looks to be in a broken state
Screenshot 2024-08-27 at 9 55 54 PM
Screenshot 2024-08-27 at 9 55 58 PM

@ggtisc ggtisc assigned mike-callahan and unassigned ggtisc Aug 28, 2024
@ggtisc
Copy link
Collaborator

ggtisc commented Aug 28, 2024

Is it good to forward this issue, or just leave it as it is for now?

@mike-callahan
Copy link

Sure thanks

@roaks3
Copy link
Collaborator

roaks3 commented Aug 28, 2024

Yea, let's forward this to the service team so that they can weigh in on the proposed solution (FWIW the 3 changes do make sense to me, and don't look like breaking changes).

@mike-callahan I'm going to unassign you for now, and if you end up working on an implementation to solve this, we can add you back

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants