
🐛 Bug Report: Website and other uploads hang while an upload is finalizing #660

Open
jwdoll opened this issue Oct 29, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@jwdoll

jwdoll commented Oct 29, 2024

👟 Reproduction steps

To trigger in one session, upload two files, say one 500 MB zip and one 5 GB zip. When the 500 MB zip has finished uploading and is "finalizing", the other upload errors, then resumes after the first has finished finalizing. Same result across a variety of chunk sizes.

👍 Expected behavior

Finalizing an upload should not block the web client or interfere with ongoing uploads (I presume it blocks and overflows a buffer). I'm not a web programmer, but it seems like something that should be async isn't.

👎 Actual Behavior

While an upload is finalizing, the site hangs for other clients, and when it finishes, you get an internal server error. Fortunately, it does not appear to interrupt ongoing downloads.

📜 Logs

No response

@jwdoll jwdoll added the bug Something isn't working label Oct 29, 2024
@stonith404
Owner

Are there any errors in the logs on the server?

@jwdoll
Author

jwdoll commented Oct 29, 2024

Here's what the frontend looks like as it happens. I'll try to find the backend logs today.

[screenshot: frontend errors during the upload]

@jwdoll
Author

jwdoll commented Oct 29, 2024

Hmm, sorry, but where's the backend log file? I'm not seeing it in the file system or in the db.

Edit: oh, it's within Docker.

@jwdoll
Author

jwdoll commented Oct 29, 2024

The only thing I see in the Docker logs is premature closes, which appear to be from clients navigating away without finishing downloads.

@stonith404
Owner

OK, thanks. Are you sure this happens only when you upload files simultaneously, as you described? Could you share the response body of the errored request? You can get it from the network tab of the dev tools.

@jwdoll
Author

jwdoll commented Oct 30, 2024

Yes, I'm sure. Although if you were uploading a single file to each of two separate shares, whether from two tabs or two computers, the second upload to finish would error while the first finalizes.

I didn't capture the 400 error that happens after the 403 in the first screenshot. They seem to always come in pairs.

{"message":"Cannot GET /api/shares/cyMTgzM/files?id=66133106-c592-446e-b1cf-ad2d50312b11&name=Tiffs.zip&chunkIndex=13&totalChunks=137","error":"Not Found","statusCode":404}

{"message":"Cannot GET /api/shares/cyMTgzM/files?id=66133106-c592-446e-b1cf-ad2d50312b11&name=Tiffs.zip&chunkIndex=0&totalChunks=137","error":"Not Found","statusCode":404}

Here is the 403:
[screenshot: 403 response details]

And the 400:
[screenshot: 400 response details]

@stonith404
Owner

Just to confirm, am I understanding correctly: if you upload the same file to two shares simultaneously, the second upload fails once the first one completes?

Because I'm a bit confused by your reproduction steps:

To trigger in one session, upload two files, say one 500 MB zip and one 5 GB zip. When the 500 MB zip has finished uploading and is "finalizing", the other upload errors, then resumes after the first has finished finalizing. Same result across a variety of chunk sizes.

If I'm wrong, please explain the reproduction steps step by step or create a screen recording.

@jwdoll
Author

jwdoll commented Oct 31, 2024

It doesn't need to be the same file. The point about two shares is that whenever the server is finalizing an upload of sufficient size from any source, any other uploads block or error.

There may be more relevant errors in the logs than I realized. It seems like a deadlock on the SQLite db accessed through Prisma, though I don't really understand Node's threading model, or lack thereof. I will try to set up an environment and step through.

Reproduction: upload one 1 GB file and one 5 GB file. The 1 GB file finishes first, and while it is finalizing, it produces an error in the second upload and hangs the site.

@jwdoll jwdoll closed this as completed Oct 31, 2024
@jwdoll jwdoll reopened this Oct 31, 2024
@jwdoll
Author

jwdoll commented Nov 1, 2024

Small update: the error isn't presenting with my Mac as the host and the backend running in the VS Code debugger. It also doesn't present when I upload to my production host (Docker, Ubuntu, HDDs) over wifi. That leads me to think there is a synchronous IO operation happening when the first file upload completes that blocks long enough on a slow HDD for a fast network connection to overflow a buffer with the second file. Renaming from tmp_chunk causing a physical move? Calling .size requiring a re-count? I'll keep experimenting and trying to repro on a different host (especially in case it has something to do with ZFS).
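To make that theory concrete, here's a minimal sketch of the two patterns in Node (hypothetical function and path names, not Pingvin Share's actual code). The synchronous variant parks the event loop for the full disk round trip, which would explain other requests stalling on a slow HDD; the fs/promises variant hands the same work to libuv's thread pool.

```ts
import * as fs from "fs";
import * as fsp from "fs/promises";

// Blocking: the event loop is stuck until the disk responds, so every
// other request (including the second upload's chunks) has to wait.
function finalizeBlocking(tmpPath: string, finalPath: string): number {
  fs.renameSync(tmpPath, finalPath); // cheap within one filesystem, but still a blocking syscall
  return fs.statSync(finalPath).size; // also blocks
}

// Non-blocking: the same work runs on libuv's thread pool, leaving the
// event loop free to keep serving other clients.
async function finalizeNonBlocking(tmpPath: string, finalPath: string): Promise<number> {
  await fsp.rename(tmpPath, finalPath);
  const { size } = await fsp.stat(finalPath);
  return size;
}
```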

Thanks for reading. Please let me know if you have any guidance as I'm highly unfamiliar with node.js and the rest of the stack.

@stonith404
Owner

Thank you for sharing this detailed information. I just tested the upload on my Mac using an external HDD (formatted as APFS, not ZFS), and I didn’t encounter the error. Are you able to mount another HDD or SSD to your server? If so, you could try setting the Pingvin Share data folder to the mounted drive. This way, we could test if the hard drive might be causing the issue.

You also mentioned an error related to an SQLite deadlock. Could you share more details about that error?

I don’t have much experience with disks, so I’m not sure whether a specific operation could cause issues on slower HDDs. But if we narrow the problem down and find that it only happens on slow HDDs, I’ll read up on the topic.

@jwdoll
Author

jwdoll commented Nov 11, 2024

OK, I'm back after an embarrassing number of hours figuring out how to get Docker to run local code and change DATA_DIRECTORY and DATABASE_URL.

I had the same idea and tested a slow USB hard drive on the Mac, and it didn't reproduce the bug.

What I could see on the Ubuntu host was Prisma throwing timeout errors (is its default 5 s?) while performing the zip, which I believe caused the first 403 error. So the frontend tried sending chunk 0 again, which resulted in the 500 error with an "Unexpected chunk received" message being returned; the expectedChunkIndex in the payload got it to send the right chunk next, which successfully resumed the upload.
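For reference, Prisma's interactive transactions do default to a 5 s timeout (and a 2 s maxWait for acquiring a connection). Assuming the finalize bookkeeping runs inside one, and using a hypothetical share model rather than the project's actual schema, the ceiling can be raised per transaction:

```ts
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

async function finalizeShareRecord(shareId: string): Promise<void> {
  await prisma.$transaction(
    async (tx) => {
      // ... whatever bookkeeping the finalize step does (hypothetical model) ...
      await tx.share.update({ where: { id: shareId }, data: {} });
    },
    {
      maxWait: 5_000, // ms to wait for a connection (default 2000)
      timeout: 30_000, // ms before the transaction is rolled back (default 5000)
    },
  );
}
```

That would only paper over the stall, of course; the underlying write starvation would remain.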

I tried changing docker-compose.yml and moving pingvin-share.db to a separate drive from the share data, and that resolved the issue. I think the issue was ZFS batching writes and not allowing reads from pingvin-share.db until after Prisma timed out.

It also seems like createZip could be problematic, since the archiver calls are synchronous. I forgot to check whether the UI loading issue is also fixed while finalizing: is the server event loop blocked, or are IOs to pingvin-share.db just left awaiting until the writes finish? Again, I don't really understand Node that well, but something seems funny about createZip.
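For what it's worth, archiver's documented API is stream-based, so a createZip written along these lines shouldn't block the event loop (heavy compression still costs CPU, though). A minimal sketch, with hypothetical paths rather than the project's actual createZip:

```ts
import * as fs from "fs";
import archiver from "archiver";

// Streams the zip to disk; the promise settles once the output file is closed.
function createZipStreaming(srcDir: string, outPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const output = fs.createWriteStream(outPath);
    const archive = archiver("zip", { zlib: { level: 9 } });

    output.on("close", () => resolve());
    archive.on("error", reject);

    archive.pipe(output);
    archive.directory(srcDir, false); // add the share's files at the zip root
    void archive.finalize();
  });
}
```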

I guess the question to ask is whether you want Pingvin Share to support all IO behaviors. ZFS isn't super common, but it's a first-class citizen on Ubuntu, and I'm sure there are other file systems with strange behaviors, plus whatever is used in the cloud. Interested to hear your thoughts.

By the way, another thing that was confusing: after updating the production repo to the newest release, I wasn't able to log in over LAN, but I still could over the internet. It turns out the secure cookies feature was blocking access, because the internet URL is HTTPS but LAN IPs are not. As a workaround, on my fork I added an option to the config menu to require HTTPS connections (maybe I should just say "require HTTPS for cookies"?). I guess it's not technically an issue, since if I were more competent I could get LAN addresses with certificates.
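Roughly, the idea is to make the cookie's secure flag configurable instead of always on. A sketch of that shape (the REQUIRE_HTTPS setting and cookie name are illustrative, not the actual fork code):

```ts
import type { Response } from "express";

// Illustrative setting, not an existing Pingvin Share option.
const requireHttps = process.env.REQUIRE_HTTPS !== "false";

function setSessionCookie(res: Response, token: string): void {
  res.cookie("access_token", token, {
    httpOnly: true,
    sameSite: "lax",
    // With secure: true the browser refuses to send the cookie over plain
    // HTTP, which matches the LAN login failure described above.
    secure: requireHttps,
  });
}
```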

@stonith404
Owner

Apologies for the delayed response; your message got lost somehow.

I don’t have much experience with different file systems, so I’m not sure I can provide a solid answer about the issue.

That said, your theory seems plausible: if createZip somehow blocks writes to the file system, that would cause Prisma to time out. Could you test this behavior with two separate ZFS partitions? For example, you could place the database on one partition and the shares folder on the other. If that resolves the issue, it would suggest that ZFS blocks reads while createZip is running.

Like I said, I’m not very familiar with file systems, so this suggestion might not make sense. Let me know what you think.

As for the secure cookies issue, this should be fixed in the latest version.
