Downloading > 1 GiB files from buildomat fails some (most?) of the time #36

Open
jgallagher opened this issue Sep 5, 2023 · 11 comments

@jgallagher

Attempting to curl this URL from atrium https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip fails some of the time at almost exactly 1 GiB:

% curl -OL https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 90 1146M   90 1034M    0     0  4079k      0  0:04:47  0:04:19  0:00:28 5366k
curl: (18) HTTP/2 stream 1 was reset

It doesn't always fail, particularly from the lab, but it seems to usually fail from outside the lab. From @jmpesp:

$ curl -OL https://buildomat.eng.oxide.computer/wg/0/artefact/01H8QBDTBWH4XS1ZSXBP9N47V7/Telf7HF0eNoZK443nD4ykEv6gbv3tqvhvPvFZXEae2g7yxZp/01H8QBFKMQ9VYJDS1M45Q66ZAY/01H8QJ02TAB8APZPVY7262W2Z4/repo-pvt1.zip
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 89 1146M   89 1030M    0     0  1477k      0  0:13:14  0:11:54  0:01:20  796k
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

@augustuswm (on this exact URL) and @leftwo (on different buildomat URLs pointing to artifacts of similar size) also reported similar errors.

@jgallagher
Author

That might have a different underlying cause; dendrite-asic.tar.gz is much less than 1 GiB (I think??), and in the original issue every reproduction failed at just over 1 GiB. I agree the error message looks like the symptom is the same though, so maybe I'm overindexing on the 1 GiB thing.

@jclulow
Collaborator

jclulow commented Mar 25, 2024

I have (finally) redone the downloading logic a bit here for two reasons:

  • to support logging a message at the end of a download, whether it succeeded or failed, including some statistics (e.g., how long it took, how many bytes we expected, how many we actually sent, and whether the client or the backend ended up terminating the connection if it failed)
  • to support HTTP range requests for resuming downloads (see also Support HTTP range requests for downloading artifacts #31)
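
As a rough illustration of what range-request support means for a client, here is a minimal Python sketch of the resume logic. The function names (`resume_headers`, `plan_write`) are hypothetical and are not part of buildomat; the sketch only assumes standard HTTP semantics, where a `206` response carries a `Content-Range` header and a `200` response means the server ignored the range and is sending the whole file.

```python
import os
import re

# Matches a Content-Range value such as "bytes 1000-4999/5000".
CONTENT_RANGE = re.compile(r"^bytes (\d+)-(\d+)/(\d+|\*)$")


def resume_headers(partial_path):
    """Return (headers, offset) for resuming a download into partial_path."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # An open-ended range requests everything from `offset` to the end of the file.
    headers = {"Range": f"bytes={offset}-"} if offset > 0 else {}
    return headers, offset


def plan_write(status, content_range, offset):
    """Decide how to write the response body: 'append' or 'restart'."""
    if status == 206:
        match = CONTENT_RANGE.match(content_range or "")
        # Only append if the server starts exactly where the local file ends.
        if match and int(match.group(1)) == offset:
            return "append"
        raise ValueError(f"unexpected Content-Range: {content_range!r}")
    if status == 200:
        # The server ignored the Range header and is sending the whole file.
        return "restart"
    raise ValueError(f"unexpected status for a download: {status}")
```

This is roughly the check that `curl -C -` performs on the client's behalf: it uses the size of the existing output file as the starting offset.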

We'll get log messages like this, now, if there's an interrupted connection:

06:20:05.188Z ERRO github-server: download failed: published file: owner oxidecomputer/omicron series rot-all version b4e1a285ef812bc0376959e177c7ab3f90893e73 name repo.zip.parta: interrupted on client side
    bytes_expected = 1023973427
    bytes_transferred = 11763402
    download = url
    hdr_x_forwarded_for = 66.117.152.2:12520
    local_addr = 0.0.0.0:4021
    method = GET
    msec = 2362
    offset = 49768397
    rate_mb = 4.748702113165932
    remote_addr = 172.31.43.126:46187
    req_id = 59608fab-62d2-446b-b5ec-cf2eab1845c3
    uri = /public/file/oxidecomputer/omicron/rot-all/b4e1a285ef812bc0376959e177c7ab3f90893e73/repo.zip.parta
06:20:05.191Z ERRO buildomat: download failed: published file: user 01FV089DQ9F11ETVWFXWW3GYAD series rot-all version b4e1a285ef812bc0376959e177c7ab3f90893e73 name repo.zip.parta: interrupted on client side
    bytes_expected = 1023973427
    bytes_transferred = 17287276
    download = s3
    hdr_x_forwarded_for = 44.227.183.26:21131
    local_addr = 0.0.0.0:9979
    method = GET
    msec = 2368
    offset = 49768397
    rate_mb = 6.961349156131005
    remote_addr = 172.31.43.126:50081
    req_id = 1bf03bcf-bf68-4bae-ba9e-a915709fba10
    uri = /0/public/file/gong-238580629/rot-all/b4e1a285ef812bc0376959e177c7ab3f90893e73/repo.zip.parta
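
For anyone reading these records by hand, the following Python sketch parses the indented `key = value` fields and recomputes the rate. It assumes, based only on the numbers shown above and not on the buildomat source, that `rate_mb` is `bytes_transferred` divided by the elapsed time, expressed in MiB per second. The helper names are hypothetical.

```python
def parse_fields(lines):
    """Parse indented 'key = value' lines into a dict of strings."""
    fields = {}
    for line in lines:
        key, sep, value = line.strip().partition(" = ")
        if sep:
            fields[key] = value
    return fields


def rate_mib_per_sec(bytes_transferred, msec):
    """Average transfer rate in MiB/s over the elapsed time."""
    return bytes_transferred / (msec / 1000.0) / (1024 * 1024)


# Fields copied from the first record above.
record = parse_fields([
    "    bytes_expected = 1023973427",
    "    bytes_transferred = 11763402",
    "    msec = 2362",
    "    rate_mb = 4.748702113165932",
])
computed = rate_mib_per_sec(int(record["bytes_transferred"]), int(record["msec"]))
print(f"computed {computed:.3f} MiB/s, logged {float(record['rate_mb']):.3f} MiB/s")
```

The recomputed value agrees with the logged one to within 0.01 MiB/s. The small difference is consistent with `msec` being a truncated form of the real elapsed time, though that is also an inference.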

I have tested interrupting and resuming downloads fairly extensively using curl -C - ... and I think it's all pretty solid. I also haven't been able to reproduce this issue, at least today while screwing around, so maybe the underlying cause is not as prominent anymore (touch wood!).

Either way, I think we need to start trying to do this routinely again now that the instrumentation is in place so that we can catch it again if it's still occurring.

@jgallagher
Author

This happened to me downloading from jeeves; unfortunately I was running curl -fsSL so all I got error-wise was:

curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

Just wanted to note this here in case having a timestamp is useful (this error is from a few minutes before I posted this).

@jclulow
Collaborator

jclulow commented Apr 15, 2024

> This happened to me downloading from jeeves; unfortunately I was running curl -fsSL so all I got error-wise was:
>
> curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
>
> Just wanted to note this here in case having a timestamp is useful (this error is from a few minutes before I posted this).

@jgallagher What URL would you have been downloading at the time?

@jclulow
Collaborator

jclulow commented Apr 15, 2024

Huh, this is exciting!

2024/04/15 16:45:20 [error] 20162#1: *71970046 upstream prematurely closed connection while reading upstream, client: 66.117.152.2, server: buildomat.eng.oxide.computer, request: "GET /public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip HTTP/2.0", upstream: "http://172.31.100.193:4021/public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip", host: "buildomat.eng.oxide.computer"
16:45:19.646Z ERRO github-server: download failed: published file: owner oxidecomputer/omicron series rot-all version bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd name repo.zip: backend error: request or response body error: error reading a body from connection: end of file before message length reached
    bytes_expected = 1688983946
    bytes_transferred = 1084817164
    download = url
    hdr_x_forwarded_for = 66.117.152.2:61273
    local_addr = 0.0.0.0:4021
    method = GET
    msec = 245996
    offset = 0
    rate_mb = 4.205605238039445
    remote_addr = 172.31.43.126:36004
    req_id = ec1b9fd3-b25f-4e07-bb92-063af2084a49
    uri = /public/file/oxidecomputer/omicron/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip
16:45:01.287Z ERRO buildomat: download failed: published file: user 01FV089DQ9F11ETVWFXWW3GYAD series rot-all version bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd name repo.zip: interrupted on client side
    bytes_expected = 1688983946
    bytes_transferred = 1090458095
    download = s3
    hdr_x_forwarded_for = 44.227.183.26:43318
    local_addr = 0.0.0.0:9979
    method = GET
    msec = 227644
    offset = 0
    rate_mb = 4.568272464098826
    remote_addr = 172.31.43.126:57618
    req_id = 58e7f56e-38d7-4e08-a2da-29d8deb8fcfe
    uri = /0/public/file/gong-238580629/rot-all/bb7b1eb4b63833e7a8614b6947b3bdef5336cfbd/repo.zip

So the interruption appears to have been somewhere in here:

you -----> buildomat-web0 ----> buildomat0 ---------> buildomat-web0 -----> buildomat0 -------> s3
             nginx        buildomat-github-server        nginx            buildomat-server
                                               \~~~~~~~~~~~~~~~~~~~~~~~/
                                                     interruption
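
The reasoning behind that diagram can be sketched in Python as follows. This is only an illustration of the inference and not anything buildomat does: a hop that logs a backend error implies the break is on the side it fetches from, and a hop that logs a client-side interruption implies the break is on the side it serves. The hop list is taken from the diagram; the `suspect_span` function and the report labels are hypothetical.

```python
# Hops from the diagram, in request order (client first, origin last).
HOPS = [
    "you",
    "nginx (buildomat-web0)",
    "buildomat-github-server",
    "nginx (buildomat-web0)",
    "buildomat-server",
    "s3",
]


def suspect_span(reports):
    """Narrow down where a transfer broke.

    `reports` maps a hop index to "backend" (the hop blamed the side it
    fetches from) or "client" (the hop blamed the side it serves).
    """
    lo, hi = 0, len(HOPS) - 1
    for index, blame in reports.items():
        if blame == "backend":
            lo = max(lo, index)  # the break is at or beyond this hop
        elif blame == "client":
            hi = min(hi, index)  # the break is at or before this hop
        else:
            raise ValueError(f"unknown report: {blame!r}")
    if lo > hi:
        raise ValueError("reports contradict each other")
    return HOPS[lo:hi + 1]


# buildomat-github-server logged a backend error; buildomat-server logged
# an interruption on the client side.
print(suspect_span({2: "backend", 4: "client"}))
```

With the two log records above, the span comes out as buildomat-github-server, the second nginx hop, and buildomat-server, which matches the region marked in the diagram.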

I have some things I can tweak, at least, to improve things here. Thanks for the heads up!

@labbott

labbott commented Apr 16, 2024

Hit the same issue:

laura@jeeves /staff/lab/madrid/laura-update-2024-04-16 $ OMICRON_COMMIT="69ed7ad871969912e44d02620430ed2e3e7c2fdd" /staff/lab/madrid/download-tuf-repo.sh 
Grabbing tuf repo artifacts (omicron@69ed7ad871969912e44d02620430ed2e3e7c2fdd)
  Downloading TUF manifest...
  Downloading TUF repo (resumably)...
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

@jclulow
Collaborator

jclulow commented Apr 16, 2024

Were you able to resume the download with range requests?

@jgallagher
Author

I was yesterday, yeah. Looks like Laura was using the same script I was:

  Downloading TUF repo (resumably)...

@labbott

labbott commented Apr 16, 2024

> Were you able to resume the download with range requests?

Yes, I ran the script again and it completed.
