Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[scd] Simultaneous creation of temporally proximate intents seems to cause DSS to return 404 on DELETE even when intent is present and successfully deleted #1069

Open
callumdmay opened this issue Aug 15, 2024 · 7 comments

Comments

@callumdmay
Copy link

callumdmay commented Aug 15, 2024

Describe the bug
During load testing of our system simulating simultaneous intent creation originating at the same point at ~ 3QPS, we observed that subsequent deletion of that intent returned 404 errors from the DSS. On observation of outbound requests from our system, we only made the Delete request once. After confirming with @BradNicolle it appears that the delete was successful inside the DSS, even though a 404 was returned.

To Reproduce
There was nothing particularly special about our load testing configuration, it was 3 worker threads make Create() and then Delete() requests at random points in 1 second intervals, simulating ~3QPS. Each intent had an initial volume overlapping at the same geographical point, but at non-overlapping times.

Expected behavior
The DSS should return a 2XX response if a deletion was successful.

DSS URL: https://dss.us.pre-qual.interuss.opensky.dev
Operator: Wing
Intent ID: "0a5d60ef-cbd3-46a5-9241-200ded546fba"
OVN: "huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_"
Outbound request timestamp: 2024-08-08 12:44:29.974325-0700

@mickmis
Copy link
Contributor

mickmis commented Aug 20, 2024

Hi @callumdmay,

Thanks for your report. Could you clarify a few things to help us understand the issue? Notably:

  • What DSS version are you using?
  • What's the overall setup of your DSS pool?
  • Any chance the operational intent has expired between the time it was created and the time it was tentatively deleted?
  • Are you able to reliably reproduce the issue, does it appear randomly, or did it happen only once?
  • Do you have any logs at hand relevant to the issue?
  • Do you have a dump of the culprit request and response? Among other, the message in the 404 response could be interesting.

Thank you

@callumdmay
Copy link
Author

callumdmay commented Aug 27, 2024

What DSS version are you using?
v0.16.0-rc1-ce9e97d

What's the overall setup of your DSS pool?

Setup of the pool is a single cluster on GKE, using the default Tanka config (i.e. 3 pods of each of the core-service and cockroachdb).

Any chance the operational intent has expired between the time it was created and the time it was tentatively deleted?

No these actions were only 2 seconds apart

Are you able to reliably reproduce the issue, does it appear randomly, or did it happen only once?

I was able to reproduce it with a parallel load qps above 3 QPS

Do you have any logs at hand relevant to the issue?

Are you able to view https://cloudlogging.app.goo.gl/9jStWPozE5c7vHkC8?

image

Do you have a dump of the culprit request and response? Among other, the message in the 404 response could be interesting.

Request:

{
  "insertId": "funxwmf5jv4vj",
  "jsonPayload": {
    "statusDetails": "response_sent_by_backend",
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "cacheDecision": [
      "RESPONSE_HAS_CONTENT_TYPE",
      "REQUEST_HAS_AUTHORIZATION",
      "CACHE_MODE_USE_ORIGIN_HEADERS"
    ],
    "remoteIp": "74.125.210.2",
    "backendTargetProjectNumber": "projects/229325454109"
  },
  "httpRequest": {
    "requestMethod": "PUT",
    "requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba",
    "requestSize": "99894",
    "status": 201,
    "responseSize": "1948",
    "userAgent": "Wing-DSS",
    "remoteIp": "74.125.210.2",
    "serverIp": "10.64.1.4",
    "latency": "1.176242s"
  },
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "project_id": "wing-utm-external-staging",
      "forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
      "zone": "global"
    }
  },
  "timestamp": "2024-08-08T19:44:27.286542Z",
  "severity": "INFO",
  "logName": "projects/wing-utm-external-staging/logs/requests",
  "trace": "projects/wing-utm-external-staging/traces/0a277f79d85bc9aabeb827f26e9931eb",
  "receiveTimestamp": "2024-08-08T19:44:30.770288499Z",
  "spanId": "3c8b2db3a4a424e8"
}

Response:

{
  "insertId": "8musn9f5y0xt9",
  "jsonPayload": {
    "remoteIp": "74.125.210.1",
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "backendTargetProjectNumber": "projects/229325454109",
    "statusDetails": "response_sent_by_backend",
    "cacheDecision": [
      "RESPONSE_HAS_CONTENT_TYPE",
      "REQUEST_HAS_AUTHORIZATION",
      "CACHE_MODE_USE_ORIGIN_HEADERS"
    ]
  },
  "httpRequest": {
    "requestMethod": "DELETE",
    "requestUrl": "https://dss.wing-internal.interuss.opensky.dev/dss/v1/operational_intent_references/0a5d60ef-cbd3-46a5-9241-200ded546fba/huHqtAVBkqIgErF2peFF-uhGsCvLatVyybgfKs8dgz0_",
    "requestSize": "1460",
    "status": 404,
    "responseSize": "325",
    "userAgent": "Wing-DSS",
    "remoteIp": "74.125.210.1",
    "serverIp": "10.64.2.3",
    "latency": "19.837230s"
  },
  "resource": {
    "type": "http_load_balancer",
    "labels": {
      "zone": "global",
      "forwarding_rule_name": "k8s2-fs-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "url_map_name": "k8s2-um-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "backend_service_name": "k8s1-fd370d60-dss-main-core-service-8080-ff9079c6",
      "target_proxy_name": "k8s2-ts-ecv8cqrz-dss-main-https-ingress-wtlkqi84",
      "project_id": "wing-utm-external-staging"
    }
  },
  "timestamp": "2024-08-08T19:44:29.333991Z",
  "severity": "WARNING",
  "logName": "projects/wing-utm-external-staging/logs/requests",
  "trace": "projects/wing-utm-external-staging/traces/231d30f78a0e53caba232f61d166213b",
  "receiveTimestamp": "2024-08-08T19:44:50.565327121Z",
  "spanId": "3826e6628999bceb"
}

@Shastick
Copy link
Contributor

Thank you for the details, @callumdmay.

Regarding the request and response you provided, they don't seem to contain the payload that was sent and received from the DSS, but instead contain what seems to be metadata pertaining to the system that is issuing the requests, or through which the requests flow?

The only directly relevant details they contain are the requestMethod, requestUrl and status in the httpRequest object.

Would it be possible for you to obtain the json payload that is sent in the first PUT, as well as the json response that is received by the DELETE that ends in a 404?

@BenjaminPelletier
Copy link
Member

I was going to say the issue could be that the op intent might not have been at the specified OVN, but actually I think we have a bug where the specified OVN is not checked when deleting an operational intent reference; @Shastick could you track this as a separate bug fix? We should return 404 (or possibly 409) with a helpful message if the op intent exists but the OVN doesn't match.

So, it seems like the issue is definitely that op intent 0a5d60ef-cbd3-46a5-9241-200ded546fba didn't exist at the time of the second request. @callumdmay are you sure there was no previous attempt to delete this operational intent? It would be extremely useful to search the DSS logs for instances of 0a5d60ef-cbd3-46a5-9241-200ded546fba between roughly 2024-08-08T19:44:27.286542Z and 2024-08-08T19:44:29.333991Z -- in addition to the two requests you've indicated above, I expect there is likely a third request to delete the operational intent reference between those two requests.

@callumdmay
Copy link
Author

I only see one delete:
image

@callumdmay
Copy link
Author

The log search is by OID, so all requests should appear

@BenjaminPelletier
Copy link
Member

Are those the DSS logs or the client USS logs? I'm expecting log messages formatted like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants