Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open for Discussion: Folder Sending #137

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ryanmcgrath
Copy link
Contributor

I was looking at this after some time away, and wanted to get (proper) folder sending working. Opening this PR as it's easier to have a discussion with code being linked, but this isn't truly a PR in the correct sense.

Anyway.

This change has a send_zip_file method in transfer.rs, which takes an already zipped file and sends it via the PeerMessage directory mode. The actual mechanics of zipping up a file are something I need to extract from my project - and I'm happy to include this in that PR, once I can do so - but this method should be testable by just redirecting send_file_or_folder to send_zip for a .zip.

So ultimately some questions:

  1. This actually works, as I use it and files/folders are transferred without issue. That said, I'm sure I could be missing something about how the protocol works here, so I wanted to get more seasoned eyes on it. I do compute a hash of this elsewhere that I could pass in to send_zip_file if the checking logic currently in the bottom of send_folder is needed.

  2. My work that I need to extract and add on to here mimics the Python implementation by writing to a temporary spooled file, which spills to disk after a default of ~10MB. I assume that'd be useful to include as well?

  3. This is completely tangential, but is there any chance this repo could be migrated from async_std to tokio? The fractured async ecosystem in Rust is already a pain, and trying to look at ways to do something like async-zip-file-writing is a headache when the bulk of the ecosystem is in Tokio.

Or support both, I suppose, but that seems like more hassle than it'd be worth.

@codecov

This comment has been minimized.

@piegamesde
Copy link
Member

I initially didn't want to implement the part of the protocol with the zip files, because it's ugly and has a lot of downsides. Most notably, not everything can be zipped and building large zip files is really annoying (especially if the transfer fails and you have to rebuild it). Instead, I hacked something using tarballs sent as files that at least support all files and can be streamed from/to disk. Of course, this has its downsides too. I am open to a zip based solution though. Maybe we can keep both implementations in a worst best of both worlds approach?

  • Try to send the folder as zip file
  • If the folder is too large and zipping would take too long, or the folder contains files that cannot be zipped, fall back to sending a tarball
  • Maybe add some CLI flags to manually override the behavior

If we instead go full in on the zip file approach, I'd like to see the following improvements (can be done in follow-up PRs):

  • As you said, try to zip in RAM first and spill to disk when needed
  • If zipping takes long, don't delete the file on a failed transfer, so that the retry won't have to zip it again

This is completely tangential, but is there any chance this repo could be migrated from async_std to tokio?

Not likely, sorry. If we choose Tokio, we indeed have more dependency choices, but we also force it upon all dependents. Async-std is a lot easier there, because it spins up its own runtime. I'd be willing to work towards more executor-agnostic code though. This probably won't be possible for the whole transfer part, but I'm positive we could do it for the core: The Client-Client and Client-Server protocols only need WebSockets, which could be replaced through dependency injection to remove the async-std dependency.


Heads up: I have started factoring out the transfer protocol into versioned submodules in preparation of v2. You will probably have to rebase onto 6359efc or some variation of it at some point. Sorry for the inconvenience.

@ryanmcgrath
Copy link
Contributor Author

I initially didn't want to implement the part of the protocol with the zip files, because it's ugly and has a lot of downsides.

Yeah, I can see the logic against it in general, but it's how other clients work and it'd be nice to interop with them properly even if it's viewed as a "legacy" routine.

Maybe we can keep both implementations in a worst best of both worlds approach?

This makes the most sense to me, yeah.

If the folder is too large and zipping would take too long, or the folder contains files that cannot be zipped, fall back to sending a tarball
Maybe add some CLI flags to manually override the behavior

Logic makes sense though I'll defer to you on what those size limits and/or flags should be, since I wager you have more insight into what feels more natural for a wormhole client anyway.

As you said, try to zip in RAM first and spill to disk when needed
If zipping takes long, don't delete the file on a failed transfer, so that the retry won't have to zip it again

This is mostly automatic on both counts due to how I built it (using tempfile - just waiting on this). So we should be good here barring anything that comes up.

Async-std is a lot easier there, because it spins up its own runtime.

I'm definitely open to missing something here, but I don't see anything in here that couldn't be done with tokio. Can you point me to what's easier with async-std?

but we also force it upon all dependents.

I think my point is actually illustrated here... by using the less de-facto async executor, you're making (forcing) it harder for things to interop well.

Regardless, I'm not looking to start a religious war about async executors - more capable interop would probably work too. If I still feel like tokio is the better choice I guess I can fork and replace down the road.

Heads up: I have started factoring out the transfer protocol into versioned submodules in preparation of v2. You will probably have to rebase onto 6359efc or some variation of it at some point. Sorry for the inconvenience.

Heh, yeah, I didn't see that PR until it was too late. I don't think rebasing should be too bad though, so I'll probably just keep adding to this and then rebase on that whenever it's merged. Can push out my app in the meantime and just peg it to my fork and then update it when your stuff is good to go.

One last question: the hash building in the current send_folder method, and subsequent check, is missing in my current send_zip_file method. This feels like an oversight on my part (i.e the existing approach is likely there for a reason) yet transfers work just fine. Am I overthinking this or is there a modification needed there?

Thanks again!

@piegamesde
Copy link
Member

Yeah, I can see the logic against it in general, but it's how other clients work and it'd be nice to interop with them properly even if it's viewed as a "legacy" routine.

Just to be clear: the current hack is compatible. It only reduced user experience, and not interoperability :D

Can you point me to what's easier with async-std?

Yes. It's easier to use an async-std library in a tokio project than the other way around.

the hash building in the current send_folder method, and subsequent check, is missing in my current send_zip_file method. This feels like an oversight on my part (i.e the existing approach is likely there for a reason) yet transfers work just fine

The hash sum is part of my hack to allow streaming sending: the problem is that we need to know the size of the compressed archive beforehand, because of a protocol quirk. To work around having to build a temporary archive file, I simply archive the file twice: once to see how big it will get, and the second time while actually sending it. The hash is only there to detect any potential file system smear (i.e. concurrent file modifications) that would invalidate the process.

@ryanmcgrath
Copy link
Contributor Author

Just to be clear: the current hack is compatible. It only reduced user experience, and not interoperability :D

Heh, we'll have to agree to disagree there. Regardless, I appreciate you considering this PR.

The hash sum is part of my hack to allow streaming sending: the problem is that we need to know the size of the compressed archive beforehand, because of a protocol quirk. To work around having to build a temporary archive file, I simply archive the file twice: once to see how big it will get, and the second time while actually sending it. The hash is only there to detect any potential file system smear (i.e. concurrent file modifications) that would invalidate the process.

Ah, got it - this is helpful so I appreciate the clarification.

It looks like my changes to tempfile were merged and released, so I'll try to find time this weekend to bump this PR with updates. Do you know whenabouts you're looking to merge your other PR?

@piegamesde
Copy link
Member

Heh, we'll have to agree to disagree there

I'm not sure, is the current code incompatible with other implementations in some way?

Do you know whenabouts you're looking to merge your other PR?

Not pretty, but I've "merged" (pushed) the relevant changes onto master.

@ryanmcgrath
Copy link
Contributor Author

Mmm, compatible and interoperable are two different takes here - I'm moreso saying the latter. It's of course compatible insofar as transferring a file works.

If it doesn't unzip on the other end, then no, I wouldn't necessarily consider it interoperable - I.e you're not sending a folder in the sense of how both sides see it. There's then additional work for another user to go through.

I think there's a comment in the codebase that even explains that it doesn't unpack it on the other side, but I'm mobile so can't confirm at the moment.

I can see your logic for a tarball ultimately being better, I just value working with the existing expectations more, hence the agree to disagree. 'S not meant to be a dig.

@piegamesde
Copy link
Member

I'm increasingly in favor of throwing my crappy tar ball hack out of the window, and also send the zip files without being too smart about it. Just get the feature working for compatibility, and focus on transfer-v2 for all the real improvements.

@Oppen
Copy link

Oppen commented May 4, 2024

Will transfer-v2 remove the whole prepack into archives or something? I feel streaming files makes more sense here, both in terms of memory, storage and CPU usage. In terms of bandwidth it probably makes little difference.

That said, I would expect the zip to be a Write, so we could probably do it while streaming it, right?
Re: hashing: you could compute it during the send itself (with some kind of tee behavior) and send it after the archive is done. You lose the concurrent filesystem smearing check to a point, but the protection was weak in the first place (only needed because you read things twice), but you still get to send some integrity of what you sent vs what they received.

@meejah
Copy link
Member

meejah commented May 5, 2024

Will transfer-v2 remove the whole prepack into archives or something? I feel streaming files makes more sense here, both in terms of memory, storage and CPU usage. In terms of bandwidth it probably makes little difference.

Yes.

The future here is Dilated File Transfer which doesn't depend on any archive format but instead uses the protocol itself to bundle files into cohesive offers in any order/grouping the client software decides. It is not worth time tweaking the v1 transfer, IMO -- either implementations support it, or they don't.

I certainly agree that ZIP isn't an ideal format here -- but at least it has wide support, and you can stream with it (see recent Python for example). While more or different performance characteristics could be gotten from tweaking the exact archive format and options, any effort I have in this direction will be spent on v2 / next-gen / Dilated file transfer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants