
Intermittent failures causing node to panic #443

Open
danielloader opened this issue Aug 21, 2024 · 4 comments

Comments

danielloader commented Aug 21, 2024

Bit of a strange one, but I figured I'd ask: if a Node.js microservice sits idle for some time (I haven't yet found a pattern for how long) and then receives a request that hits the /zitadel.management.v1.ManagementService/GetMyOrg method, there's a high chance the call crashes the Node.js process with an uncaught exception.

  • Happens using Node 22 and Node 20
  • Happens with the latest @zitadel/node package version.

If you need any more information I'll try to provide it for you.

I've enabled the following environment variables to gather more detail:

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG

Debug level log for the failed call:

D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createResolvingCall [4600] method="/zitadel.management.v1.ManagementService/GetMyOrg", deadline=Infinity
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Created
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Deadline: Infinity
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] write() called with message of length 0
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] halfClose called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createRetryingCall [4601] method="/zitadel.management.v1.ManagementService/GetMyOrg"
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | resolving_call | [4600] Created child [4601]
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createLoadBalancingCall [4602] method="/zitadel.management.v1.ManagementService/GetMyOrg"
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] Created child call [4602] for attempt 1
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] start called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] Pick called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] Pick result: COMPLETE subchannel: (8) 104.21.88.227:443 status: undefined undefined
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | retrying_call | [4601] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | load_balancing_call | [4602] startRead called
D 2024-08-21T12:49:03.834Z | v1.10.9 28 | transport_flowctrl | (3) 104.21.88.227:443 local window size: 61879 remote window size: 2147483407
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | transport_internals | (3) 104.21.88.227:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] Created child call [4603]
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | retrying_call | [4601] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] write() called with message of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] sending data chunk of length 5
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | retrying_call | [4601] halfClose called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | load_balancing_call | [4602] halfClose called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] end() called
D 2024-08-21T12:49:03.835Z | v1.10.9 28 | subchannel_call | [4603] calling end() on HTTP/2 stream
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | subchannel_call | [4603] HTTP/2 stream closed with code 8
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | subchannel_call | [4603] ended with status: code=1 details="Call cancelled"
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | load_balancing_call | [4602] Received status
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | load_balancing_call | [4602] ended with status: code=1 details="Call cancelled" start time=2024-08-21T12:49:03.834Z
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] Received status from child [4602]
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [4602] in state ACTIVE
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | retrying_call | [4601] ended with status: code=1 details="Call cancelled" start time=2024-08-21T12:49:03.834Z
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | resolving_call | [4600] Received status
D 2024-08-21T12:49:03.836Z | v1.10.9 28 | resolving_call | [4600] ended with status: code=1 details="Call cancelled"
node:internal/process/promises:391
triggerUncaughtException(err, true /* fromPromise */);
^
ClientError: /zitadel.management.v1.ManagementService/GetMyOrg CANCELLED: Call cancelled
at wrapClientError (/app/node_modules/nice-grpc/lib/client/wrapClientError.js:9:16)
at Object.callback (/app/node_modules/nice-grpc/lib/client/createUnaryMethod.js:27:66)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:193:36)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
at process.processTicksAndRejections (node:internal/process/task_queues:77:11) {
path: '/zitadel.management.v1.ManagementService/GetMyOrg',
code: 1,
details: 'Call cancelled'
}
Node.js v22.6.0
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
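The triggerUncaughtException(err, true /* fromPromise */) frame above shows the process is dying from an unhandled promise rejection rather than from anything inside grpc-js itself; since Node 15, an unhandled rejection terminates the process by default. A minimal sketch of that failure mode and the fix, using a hypothetical flakyCall as a stand-in for the nice-grpc client method:

```javascript
// Since Node 15, an unhandled promise rejection crashes the process,
// which is what the stack trace above shows. Awaiting inside try/catch
// keeps the process alive.
async function flakyCall() {
  // Hypothetical stand-in for the nice-grpc call that rejects with
  // ClientError; code 1 is the gRPC CANCELLED status.
  const err = new Error('CANCELLED: Call cancelled');
  err.code = 1;
  throw err;
}

async function main() {
  try {
    await flakyCall();
    return 0;
  } catch (err) {
    // Handled: the process survives and can decide what to do next.
    console.log('caught gRPC error, code:', err.code);
    return err.code;
  }
}

const done = main();
```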

Debug level log for a successful call of the same endpoint (after hitting refresh a few times after the container is live again):

D 2024-08-21T12:53:49.368Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createResolvingCall [12] method="/zitadel.management.v1.ManagementService/ListUserMemberships", deadline=Infinity
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] Created
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] Deadline: Infinity
D 2024-08-21T12:53:49.368Z | v1.10.9 28 | resolving_call | [12] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] write() called with message of length 20
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] halfClose called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createRetryingCall [13] method="/zitadel.management.v1.ManagementService/ListUserMemberships"
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | resolving_call | [12] Created child [13]
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | channel | (1) dns:login.example.io:443 createLoadBalancingCall [14] method="/zitadel.management.v1.ManagementService/ListUserMemberships"
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] Created child call [14] for attempt 1
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] start called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Pick called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Pick result: COMPLETE subchannel: (1) 104.21.88.227:443 status: undefined undefined
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] startRead called
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | transport_flowctrl | (1) 104.21.88.227:443 local window size: 64976 remote window size: 2147483612
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | transport_internals | (1) 104.21.88.227:443 session.closed=false session.destroyed=false session.socket.destroyed=false
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] Created child call [15]
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | retrying_call | [13] write() called with message of length 25
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | load_balancing_call | [14] write() called with message of length 25
D 2024-08-21T12:53:49.369Z | v1.10.9 28 | subchannel_call | [15] write() called with message of length 25
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] sending data chunk of length 25
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | retrying_call | [13] halfClose called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | load_balancing_call | [14] halfClose called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] end() called
D 2024-08-21T12:53:49.370Z | v1.10.9 28 | subchannel_call | [15] calling end() on HTTP/2 stream
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | subchannel_call | [15] Received server headers:
:status: 200
date: Wed, 21 Aug 2024 12:53:49 GMT
content-type: application/grpc
cache-control: no-store
expires: Wed, 21 Aug 2024 12:53:49 GMT
grpc-metadata-cache-control: no-store
grpc-metadata-expires: Wed, 21 Aug 2024 12:53:49 GMT
grpc-metadata-pragma: no-cache
pragma: no-cache
trailer: Grpc-Status, Grpc-Message, Grpc-Status-Details-Bin
x-robots-tag: none
traceparent: 00-18f66a253324a3a3908ec887209acdf3-83531992162624ce-00
x-cloud-trace-context: 18f66a253324a3a3908ec887209acdf3/9462935357263062222
via: 1.1 google
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-cache-hit: uncacheable
alt-svc: h3=":443"; ma=86400
cf-cache-status: DYNAMIC
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=qCHEhO4Kiu8j1%2Ff7KzH4dNSa9W2ZC3cy3Voh5Ii%2BKyYQu4irGrfEnxa9RvQ4QgrEr34Du7W8Ql3RcaPgZl4ewrl%2FzUuNbq%2FlMVWmxR7jFvTtP6Aa7CKKIED5HKGZVyf45WzgG0CiLw%3D%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
expect-ct: max-age=86400, enforce
referrer-policy: same-origin
x-content-type-options: nosniff
x-frame-options: SAMEORIGIN
x-xss-protection: 1; mode=block
server: cloudflare
cf-ray: 8b6ac4a78b5c6545-LHR
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | load_balancing_call | [14] Received metadata
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | retrying_call | [13] Received metadata from child [14]
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | retrying_call | [13] Committing call [14] at index 0
D 2024-08-21T12:53:49.416Z | v1.10.9 28 | resolving_call | [12] Received metadata
D 2024-08-21T12:53:49.474Z | v1.10.9 28 | subchannel_call | [15] receive HTTP/2 data frame of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] parsed message of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] pushing to reader message of length 355
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] Received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | retrying_call | [13] Received message from child [14]
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | resolving_call | [12] Received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | resolving_call | [12] Finished filtering received message
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] Received server trailers:
grpc-status: 0
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] received status code 0 from server
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | subchannel_call | [15] ended with status: code=0 details=""
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] Received status
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | load_balancing_call | [14] ended with status: code=0 details="" start time=2024-08-21T12:53:49.369Z
D 2024-08-21T12:53:49.475Z | v1.10.9 28 | retrying_call | [13] Received status from child [14]
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | retrying_call | [13] state=COMMITTED handling status with progress PROCESSED from child [14] in state ACTIVE
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | retrying_call | [13] ended with status: code=0 details="" start time=2024-08-21T12:53:49.369Z
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | resolving_call | [12] Received status
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | resolving_call | [12] ended with status: code=0 details=""
D 2024-08-21T12:53:49.476Z | v1.10.9 28 | subchannel_call | [15] HTTP/2 stream closed with code 0
{"level":30,"time":1724244829480,"pid":28,"hostname":"risks-586d7dd84b-x9lhh","req":{"id":"afbc2f33-d4ff-47ac-989b-8e1c8a4357f0","method":"GET","url":"/insurance-products"},"res":{"statusCode":200},"responseTime":4,"msg":"request completed"}
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (2) [2606:4700:3031::6815:58e3]:443 refcount 1 -> 0
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (3) 172.67.153.180:443 refcount 1 -> 0
D 2024-08-21T12:53:58.862Z | v1.10.9 28 | subchannel_refcount | (4) [2606:4700:3033::ac43:99b4]:443 refcount 1 -> 0
danielloader (Author)

I wondered if this is related: deeplay-io/nice-grpc#609

Though the package.json in this repo is already utilising [email protected].


davidfiala commented Aug 21, 2024

EDIT: looks like package-lock.json has grpc-js 1.10.8 at the current head:

"version": "1.10.8",

My notes indicated that some 1.10 versions may cause issues (deeplay-io/nice-grpc#609 (comment)), though v1.10.8 was not listed at the time; I've forgotten why.

You could try bumping grpc-js to the latest version and see what happens. But it may be a deeper issue, since I think I had a reason for listing only those two bad versions at the time and not a wider range.


danielloader commented Aug 21, 2024

D 2024-08-21T20:36:06.867Z | v1.10.9 28 | load_balancing_call | [22] Received metadata

Unfortunately, as evidenced by this log, adding my own dependency hasn't changed the fact that the zitadel-node package still uses 1.10.9.

I just ended up with both versions in the yarn.lock:


"@grpc/grpc-js@1.11.1":
  version "1.11.1"
  resolved "https://registry.yarnpkg.com/@grpc/grpc-js/-/grpc-js-1.11.1.tgz#a92f33e98f1959feffcd1b25a33b113d2c977b70"
  integrity sha512-gyt/WayZrVPH2w/UTLansS7F9Nwld472JxxaETamrM8HNlsa+jSLNyKAZmhxI2Me4c3mQHFiS1wWHDY1g1Kthw==
  dependencies:
    "@grpc/proto-loader" "^0.7.13"
    "@js-sdsl/ordered-map" "^4.4.2"

"@grpc/grpc-js@^1.10.8":
  version "1.10.9"
  resolved "https://registry.yarnpkg.com/@grpc/grpc-js/-/grpc-js-1.10.9.tgz#468cc1549a3fe37b760a16745fb7685d91f4f10c"
  integrity sha512-5tcgUctCG0qoNyfChZifz2tJqbRbXVO9J7X6duFcOjY3HUNCxg5D0ZCK7EP9vIcZ0zRpLU9bWkyCqVCLZ46IbQ==
  dependencies:
    "@grpc/proto-loader" "^0.7.13"
    "@js-sdsl/ordered-map" "^4.4.2"

Not sure how to override this to force it to use 1.11.1.
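One way to confirm which grpc-js build the process actually resolves at runtime, regardless of what package.json or yarn.lock request (a sketch; the try/catch guards the case where the package isn't resolvable from the current directory):

```javascript
// Print the grpc-js version the running process actually loads; this is
// the version that matters when two entries coexist in the lockfile.
let version = 'not installed';
try {
  version = require('@grpc/grpc-js/package.json').version;
} catch (_) {
  // package not resolvable from this working directory
}
console.log('loaded @grpc/grpc-js version:', version);
```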

Edit: I've since learned about Yarn resolutions; added this to package.json and am trying again:

  "resolutions": {
      "@grpc/grpc-js": "1.11.1"
  }

Edit 2: I've come back with bad news:

D 2024-08-21T20:52:05.326Z | v1.11.1 28 | retrying_call | [17] halfClose called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | load_balancing_call | [18] halfClose called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | subchannel_call | [19] end() called
D 2024-08-21T20:52:05.327Z | v1.11.1 28 | subchannel_call | [19] calling end() on HTTP/2 stream
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | subchannel_call | [19] HTTP/2 stream closed with code 8
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | subchannel_call | [19] ended with status: code=1 details="Call cancelled"
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | load_balancing_call | [18] Received status
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | load_balancing_call | [18] ended with status: code=1 details="Call cancelled" start time=2024-08-21T20:52:05.326Z
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] Received status from child [18]
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] state=TRANSPARENT_ONLY handling status with progress PROCESSED from child [18] in state ACTIVE
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | retrying_call | [17] ended with status: code=1 details="Call cancelled" start time=2024-08-21T20:52:05.326Z
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | resolving_call | [16] Received status
D 2024-08-21T20:52:05.328Z | v1.11.1 28 | resolving_call | [16] ended with status: code=1 details="Call cancelled"
node:internal/process/promises:391
    triggerUncaughtException(err, true /* fromPromise */);
    ^
ClientError: /zitadel.management.v1.ManagementService/GetMyOrg CANCELLED: Call cancelled
    at wrapClientError (/app/node_modules/nice-grpc/lib/client/wrapClientError.js:9:16)
    at Object.callback (/app/node_modules/nice-grpc/lib/client/createUnaryMethod.js:27:66)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client.js:193:36)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:360:141)
    at Object.onReceiveStatus (/app/node_modules/@grpc/grpc-js/build/src/client-interceptors.js:323:181)
    at /app/node_modules/@grpc/grpc-js/build/src/resolving-call.js:129:78
    at process.processTicksAndRejections (node:internal/process/task_queues:77:11) {
  path: '/zitadel.management.v1.ManagementService/GetMyOrg',
  code: 1,
  details: 'Call cancelled'
}
Node.js v22.6.0
error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.

Same issue, and the log is nice enough to show it's coming from the newer version.

danielloader (Author)

We've ended up wrapping the calls this library makes in a try/catch in our own code and returning a 503 when an exception is raised, so the client can decide to retry the request.

Since retry-friendly exceptions are bubbled up from nice-grpc, it would be nice if this library had its own retry mechanism to smooth out transient issues. But as the workaround mitigates our customer-affecting issue well enough, I'm happy to leave this issue up for future comments or closure.
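The mitigation above can be sketched roughly like this, assuming an Express-style handler; managementClient, getMyOrgHandler, and the set of retryable codes are illustrative, not the actual service code:

```javascript
// gRPC status codes we treat as transient: 1 = CANCELLED, 14 = UNAVAILABLE.
const RETRYABLE_GRPC_CODES = new Set([1, 14]);

function isRetryable(err) {
  return typeof err.code === 'number' && RETRYABLE_GRPC_CODES.has(err.code);
}

async function getMyOrgHandler(req, res) {
  try {
    // managementClient is assumed to be the nice-grpc Zitadel client in scope.
    const org = await managementClient.getMyOrg({});
    res.json(org);
  } catch (err) {
    if (isRetryable(err)) {
      // Surface a 503 so the caller knows the request is safe to retry.
      res.status(503).json({ error: 'upstream unavailable, please retry' });
      return;
    }
    throw err; // genuinely unexpected errors still surface
  }
}
```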
