Checksums on files / reliable up/download #11138

Open
tobiasKaminsky opened this issue Sep 10, 2018 · 33 comments

@tobiasKaminsky
Member

tobiasKaminsky commented Sep 10, 2018

To have reliable up- and downloads, generating a checksum on the server is needed (a rough sketch of the upload flow, as a plain WebDAV request, follows further down in this comment).

Upload

  • client generates a checksum
  • client uploads the file together with the checksum
  • server verifies the checksum and stores it after a successful upload

Download

  • client retrieves the file and its checksum
  • client verifies the checksum

On backward compatibility:

Upload:

  • if no checksum is provided (e.g. by legacy apps / 3rd-party apps) -> generate the checksum on the server

Download:

  • legacy clients can simply ignore the checksum

Additionally, for Android Files we can check whether files have already been uploaded (instead of relying on file names alone).

(this was briefly discussed in the NC15 planning meeting)
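For illustration, here is a minimal sketch of the upload half of this flow as a plain WebDAV PUT, assuming the OC-Checksum request header used by the sync clients; the URL, credentials and whether the server actually verifies and stores the value are assumptions (the latter being exactly what this issue asks for):

```php
<?php
// Hypothetical client-side upload with a checksum, following the flow above.
// Assumes a Nextcloud WebDAV endpoint and the "OC-Checksum" request header;
// whether the server verifies/stores the value is what this issue is about.
$localFile = '/tmp/photo.jpg';
$remoteUrl = 'https://cloud.example.com/remote.php/dav/files/alice/photo.jpg';

$checksum = 'SHA1:' . sha1_file($localFile);               // client generates checksum

$ch = curl_init($remoteUrl);
curl_setopt_array($ch, [
    CURLOPT_USERPWD        => 'alice:app-password',
    CURLOPT_UPLOAD         => true,
    CURLOPT_INFILE         => fopen($localFile, 'rb'),
    CURLOPT_INFILESIZE     => filesize($localFile),
    CURLOPT_HTTPHEADER     => ['OC-Checksum: ' . $checksum],  // client uploads file with checksum
    CURLOPT_RETURNTRANSFER => true,
]);
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo ($status >= 200 && $status < 300) ? "uploaded\n" : "upload failed ($status)\n";
```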


Additionally, the ability to request the checksum of a file via an endpoint, from #25949 (a rough sketch of such a check follows below):

  • client detects that a file with the same name already exists
  • we do not know whether it is the same file
  • client requests a checksum for just this file via a new(?) endpoint
  • client compares it with its local checksum
  • only if they differ does it present the conflict resolution dialog
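A minimal sketch of that per-file check, assuming the existing oc:checksums WebDAV property is what such an endpoint would expose; the property name, URL and credentials are assumptions, and the property is only populated when a checksum was stored at upload time:

```php
<?php
// Hypothetical conflict check: fetch the server-side checksum of a single file
// via PROPFIND and compare it with the checksum of the local file.
$localFile = '/tmp/photo.jpg';
$remoteUrl = 'https://cloud.example.com/remote.php/dav/files/alice/photo.jpg';

$body = '<?xml version="1.0"?>
<d:propfind xmlns:d="DAV:" xmlns:oc="http://owncloud.org/ns">
  <d:prop><oc:checksums/></d:prop>
</d:propfind>';

$ch = curl_init($remoteUrl);
curl_setopt_array($ch, [
    CURLOPT_USERPWD        => 'alice:app-password',
    CURLOPT_CUSTOMREQUEST  => 'PROPFIND',
    CURLOPT_HTTPHEADER     => ['Depth: 0', 'Content-Type: application/xml'],
    CURLOPT_POSTFIELDS     => $body,
    CURLOPT_RETURNTRANSFER => true,
]);
$response = curl_exec($ch);
curl_close($ch);

// If a checksum was stored, the multistatus body contains something like
// <oc:checksum>SHA1:ab12...</oc:checksum>.
$serverSha1 = preg_match('/SHA1:([0-9a-f]{40})/i', (string)$response, $m) ? strtolower($m[1]) : null;
$localSha1  = sha1_file($localFile);

if ($serverSha1 === null) {
    echo "no checksum stored on the server, fall back to the conflict dialog\n";
} elseif ($serverSha1 === $localSha1) {
    echo "same content, no conflict dialog needed\n";
} else {
    echo "content differs, show the conflict resolution dialog\n";
}
```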
@nextcloud-bot
Member

GitMate.io thinks possibly related issues are #56 (File Drop: Create confirmation / show file checksum), #3475 (question about file count), #6129 (When using the REST API to find out checksums of files, a 404 is returned.), #7629 (Create Folders when moving files), and #1371 (No file-uploading possible at all).

@jospoortvliet
Member

I believe the clients checksum files; I'm not sure if they sync the checksums and/or if the server verifies them. But that's more meant to check whether files were transferred correctly.

@tobiasKaminsky
Member Author

This should be done on the server side: even if all clients computed & uploaded a checksum, the web UI would still be missing one, and then we could not use this at all.

@tobiasKaminsky tobiasKaminsky changed the title Checksums on files Checksums on files / reliable up/download Oct 10, 2018
@tobiasKaminsky
Member Author

@icewind1991 we talked about this at the NC15 meeting.
Do you think it is still feasible to get this into NC15?

@MorrisJobke
Member

This should be done on the server side: even if all clients computed & uploaded a checksum, the web UI would still be missing one, and then we could not use this at all.

The big problem with this is that it is nearly impossible to generate the checksums for external storages. We can't do anything there, and updating them in a timely manner is also nearly out of scope. For a pure "all data goes through NC" setup this would be no problem, but the external storages are a real problem.

@MorrisJobke
Member

cc @rullzer @icewind1991 because we discussed quite a lot about this.

@tribut
Contributor

tribut commented Jan 3, 2019

FYI the fact that NC does not have this apparently causes owncloud-client to warn when connecting to a nextcloud instance, which is a problem in Debian because there is no nextcloud-client (yet):

https://alioth-lists.debian.net/pipermail/pkg-owncloud-maintainers/2019-January/003498.html

@MorrisJobke MorrisJobke self-assigned this Jan 3, 2019
@MorrisJobke MorrisJobke modified the milestones: Nextcloud 16, Nextcloud 17 Mar 6, 2019
@skjnldsv skjnldsv added the "1. to develop" (Accepted and waiting to be taken care of) label Jun 17, 2019
@rullzer rullzer modified the milestones: Nextcloud 17.0.4, Nextcloud 17.0.5 Mar 11, 2020
@swordfish90s

Hi, can I +1 this feature?

using: NC 22.2.3

My NC uses local storage, and I was trying to:
rclone sync google-drive:folder nc:folder
and found that all the files synced again.
Running:
rclone hashsum md5 nc:folder
shows "Hash unsupported: hash type not supported".

The thread here suggests that non-local storage presents a problem; however, many of those storage platforms can present hashsums. Where this is true, a dynamic call could be made to retrieve the hashsum if it is not already stored in NC. The other suggestion of opportunistic hashsum calculation the next time a file is streamed seems reasonable.

It may be an unsupportable case where the storage is non-local AND the file is never streamed via NC. So document it as such, and make the feature available for everyone else.
In the case of files added to storages outside of NC, they will need occ files:scan anyway, so have an option to add them without a hashsum if the storage platform does not support it AND the user doesn't want to stream the file in order to calculate it.
The default would be "do not stream to calculate the hashsum if the storage provider does not provide one", but you can opt in.

Best of both worlds.

Once this feature is implemented, it would be good to have a way of adding hashsums to existing files via occ.

Thanks, and looking forward to this feature in NC.

Rob

@jospoortvliet
Member

For NC 24 we're likely adding hashing on the server (I know it's in the plan). But note that the client already computes, stores and checks a hash.

I do think some users might have bigger expectations of these hashes than is realistic. We are likely only very rarely going to not sync a file because of the hash - I just wouldn't trust it completely. So it won't save much data transfer. Then the only thing it really does is, in theory, warn when bits are changed during data transfer or storage. It won't FIX those situations, or tell you what the right file is, and it's super unlikely these things happen. And IF they happen, there IS already a checksum on http(s) transfers, so you're actually not adding any more reliability. Maybe a good feeling and a bit more load on the server and client is all you get.

Bugs like the one mentioned, mistakes that cause wrong dates or 0-byte files - these wouldn't somehow be fixed by having these checksums. Not saying they have no benefits, just saying it's not magic. Honestly I think it would be a good idea to update the topic on top with our actual GOAL: what are we trying to accomplish? Because as it's written now, it seems to be a double-check that a file transfer happened correctly. But the HTTP protocol already does that, so that seems quite pointless.

@swordfish90s

Hi @jospoortvliet ,

Thanks for the update.
Yes, I think the title is a bit off. As per your comments, using a hash to confirm data integrity after transfer is not required. The real need is to do file comparison beyond the file size + date stamp + filename check.

There are many synchronisation tools (rclone being one of them) that are able to compare hashes to check whether a data transfer is needed. In my scenario NC would need to support MD5 and SHA1 for greatest compatibility. For example, Google Drive only supports MD5, so a direct hash comparison is not possible in that case if NC only supports SHA1.

Note you mentioned that the "client already does compute, store and check a hash"... I am looking for the hash to be exposed via WebDAV.

Thanks, Rob.

@paroj

paroj commented Apr 5, 2022

I think I just ran into this.

It won't FIX those situations, or tell you what the right file is, and it's super unlikely these things happen. And IF they happen, there IS already a checksum on http(s) transfers, so you're actually not adding any more reliability.

My use case is updating the tags of my MP3s. Here, one often uses a "preserve timestamp" option so as not to mess with the "order by latest" view in various music players - so this specific piece of metadata falls flat.

Furthermore, I removed some tags, which I assume means zeroing out some metadata frame, so the file size stays the same as well.

Now I would love to get a prompt for a client-server data conflict, so I know that the data on the server is outdated.

I guess there are more file formats that are prone to this silent not-being-synced issue.

@luzfcb

luzfcb commented Mar 2, 2024

I had the bug nextcloud/android#11974 (the version of the Android app with possible fixes was not yet available when I sent this comment).

For whatever reason, the Android app or the server's conflict resolver did something wrong, and this is what I got in several of my photos.

It basically replaced the original photo on my smartphone and on the server with a black photo:

[image: one of the affected photos]

@GlassedSilver

I had the bug nextcloud/android#11974 (the version of the Android app with possible fixes was not yet available when I sent this comment).

For whatever reason, the Android app or the server's conflict resolver did something wrong, and this is what I got in several of my photos.

It basically replaced the original photo on my smartphone and on the server with a black photo:

[image: one of the affected photos]

Let me guess, are they 0 Bytes? :)

Happened to a good chunk of mine for some months until I finally pieced together what possibly causes this. I had never expected NC to be the culprit - an app that's massively adopted, the base for many organizations' own clouds using NC as a backend, with an obvious need for mobile workflows.

I completely distrusted my phone's NAND over NC for a while.

The fact that I need to do manual conflict resolution on unchanged files, that NC zeroes local files before the server has a proper upload...

I'm puzzled by all of this.

@Siress

Siress commented Mar 3, 2024

See my feedback here: nextcloud/android#11974 (comment)

"Performance issues are tolerable; data loss is not."
It's a shame, because NextCloud looked like a really beautiful, mature technology - but it lacks essential performance for a cloud app.

@MelonPie

MelonPie commented May 1, 2024

Just had an incident because of the lack of an integrity check. If one uses NC to store many files, this feature is essential.

@JulienFS

I'm jumping on the train. I ran a JPEG check script on my NC files and discovered that a few % of the images I uploaded during a specific 7-month period were corrupted with bunches of 0s. Not fully blank, but some zeroed segments, sometimes as big as ~500 bytes. JPEGs will be "heal-able", but for some other files it's going to be tough to put the right placeholder in the blank spot (PNGs are tough, PDFs are hell).

I think that the focus on synchronization left us with a risky upload tool. The explanation that "migration would be complicated for big systems" whenever adding hashes is suggested is a bit infuriating. A feature flag with opt-in, a major release with breaking changes - that's not something new to the industry. Plus it somehow closes the door to volunteer work on this matter.

But I'm open to disruptive approaches. Maybe a Nextcloud app that gathers clients' hashes on a voluntary basis and checks that the files that land on the server are OK would be good enough, without having to rely on the core synchronization mechanism.

@GlassedSilver

The explanation that "migration would be complicated for big systems"

Not nearly as complicated as cleaning up after potentially months of hundreds or thousands of people relying on sync working when it was silently corrupting files...

Unrecoverably so, because folks trusted Nextcloud instances to run on well-backed-up drives while their phones have little storage. Too bad the good data never reached Nextcloud under certain circumstances.

Yikes... PR nightmare for EVERYONE involved. With all due respect.

@JulienFS

I did a first exploration of the code server-side. NC heavily relies on Sabre. There are many apps/plugins here and there that tune the default behavior. It is going to take some time to get a global overview of everything related to upload and to set a plan (chunked upload, bulk upload...).
However, it is already obvious that the upload process is complex. Even with a careful review here and now guaranteeing that the process won't mess with file content, there would still be a lot of room for an application error that could result in data loss. The hook nature of the framework's API makes me think that any "crappy" installed app could mess with file uploads. Checksums at the right level should offer a guarantee that, whatever happens at the application level, data integrity is preserved. Ideally the mechanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

@rugk

rugk commented May 18, 2024

Ideally the mechanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well, what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata, or similar features? AFAIK extensibility should allow this, and you likely cannot prevent them from doing anything bad; you need to trust the installed apps.

@GlassedSilver

Ideally the mechanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well, what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata, or similar features? AFAIK extensibility should allow this, and you likely cannot prevent them from doing anything bad; you need to trust the installed apps.

The point of this mechanism is to ensure the transmission process is graceful and error-free. Whatever happens afterward, such as automated or manual editing of media stored within Nextcloud, should be a completely different aspect. What we don't want, however, is some app interfering with the data while it is still in transmission. I think that was @JulienFS's point (correct me if I misunderstood you).

@JulienFS

Ideally the mechanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well, what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata, or similar features? AFAIK extensibility should allow this, and you likely cannot prevent them from doing anything bad; you need to trust the installed apps.

I would advocate for leaving the decision to the end user. My reasoning is: bringing hashes to the core upload mechanism might require a costly major rework, while doing it opportunistically with lightweight plugin-based tweaks might not offer strong enough integrity guarantees. Having a side mechanism that checks end-to-end integrity might be good enough. In this regard I would say: I'd rather have compressing/whatever apps integrate with the integrity app than let them potentially mess with the integrity of my data. By the same reasoning, I'd rather choose between integrity and post-upload modification than not have a reliable integrity/sync check.

Obviously integrity is not the only thing involved in using hashes: silent file modification and efficient manual resyncs are also worth considering. At this point I'm trying to figure out whether there's an easy path to an opt-in solution for people who care more about these problems than about possible post-upload modifications by other apps.

I'm not an NC developer, simply a FOSS enthusiast who happens to also be a professional dev. I'm willing to put some effort into solving these things, but it might be a big piece of work and I'm unsure whether I'll have the resources to achieve it. Finding a workaround here and now would at least give some of us enough time to build or wait for a core feature without having to leave NC.

Ideally the mechanism should rely on a part of the API whose behavior is not customizable, as we don't want any bad app to be able to mess with it.

Well, what about apps that e.g. want to offer a feature like converting/compressing images upon upload? Or removing metadata, or similar features? AFAIK extensibility should allow this, and you likely cannot prevent them from doing anything bad; you need to trust the installed apps.

The point of this mechanism is to ensure the transmission process is graceful and error-free. Whatever happens afterward, such as automated or manual editing of media stored within Nextcloud, should be a completely different aspect. What we don't want, however, is some app interfering with the data while it is still in transmission. I think that was @JulienFS's point (correct me if I misunderstood you).

That would be the "end game". Obviously, keeping NC's modularity while fixing hash-related problems is the ideal goal.

I'll keep investigating; maybe adding hashes is not that big a deal, but I prefer to be careful as I'm not yet familiar with the code base.

PS: if anyone landing here has some product management skills, feel free to start aggregating things related to the lack of hashes and to get me in the dev loop.

@JulienFS

JulienFS commented May 20, 2024

Update from the rabbit hole.
There might be something we can do with hashes, as there are plenty of events to plug into on the Sabre side. At least I feel confident that it is possible to intercept all file writes with enough context. There's a priority mechanism for the event handling which would allow an app to intercept writes before any other app, assuming other apps are "educated" and don't register at a low priority unless needed (a rough sketch of such a listener follows below). I still need to check whether we can inspect the registered apps to see if anyone has a higher priority; that would be a great way to at least notice if another app could mess things up.
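To make that concrete, here is a rough sketch of such a high-priority listener using upstream sabre/dav's event names and signatures (beforeCreateFile); the plugin name is made up, Nextcloud's actual wiring may differ, and the hash is only logged here:

```php
<?php
// Sketch of a sabre/dav plugin hooking file creation at a high priority,
// based on upstream sabre/dav events; this is not Nextcloud's actual API.
use Sabre\DAV\Server;
use Sabre\DAV\ServerPlugin;

class UploadHashSketchPlugin extends ServerPlugin {

    public function initialize(Server $server) {
        // Lower number = higher priority: this listener runs before the
        // default-priority (100) listeners registered by other plugins.
        $server->on('beforeCreateFile', [$this, 'onCreate'], 10);
    }

    /**
     * Signature follows sabre/dav's beforeCreateFile event:
     * ($uri, &$data, ICollection $parent, &$modified).
     */
    public function onCreate($uri, &$data, $parent, &$modified) {
        if (is_resource($data)) {
            // php://input is not seekable; a real implementation would wrap
            // the stream and hash it while it is consumed instead of buffering.
            $body = stream_get_contents($data);
            $data = $body;        // pass the buffered body on to later listeners
            $modified = true;
        } else {
            $body = (string)$data;
        }
        // Compare against a client-provided checksum header here, or store it.
        error_log(sprintf('upload %s sha1=%s', $uri, sha1($body)));
        return true;              // returning false would abort the request
    }
}
```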

I found something disturbing during my little exploration: the file creation and update part on the Sabre side. Basically it uses PHP's php://input stream to get the file body, which is fine. But it uses a simple file_put_contents without even checking the result. Maybe worse than that, it looks like an update of an existing file also uses a plain file_put_contents on the existing file. I have no clue what would happen if the input stream were ever interrupted, especially since there's no error handling here. But even with error handling we would still have overwritten the existing file.
https://github.com/sabre-io/dav/blob/7a736b73bea7040a7b6535fd92ff6de9981e5190/lib/DAV/FS/File.php#L25C9-L25C26
https://github.com/sabre-io/dav/blob/7a736b73bea7040a7b6535fd92ff6de9981e5190/lib/DAV/FS/Directory.php#L46C9-L46C26
ERRATUM: see the following message

At this point I can't really be sure that there's no extra mechanism plugged in somewhere that makes the situation better, so let's not jump to conclusions. However, I'm still trying to figure out how some zeros ended up in my files, and a stream interruption on an update combined with a zero-filled write buffer would match the pattern.

EDIT: I forgot to mention that the bulk upload feature does not use the same code as Sabre's PUT handler; it re-implements things on top of some of it. For instance, to my understanding, files created with bulk upload don't seem to trigger Sabre events, but NC events instead.

@JulienFS

After digging a little bit more: it's not Sabre's default file handling that is being used. I had missed the CachingTree passed to the default implementation's constructor, which is the starting point for NC-style file handling.

public function put($data) {

public function createFile($name, $data = null) {

Glad to see that the implementation is more complex than a simple file_put_contents.
A few things to note at this point:

  • there is indeed server-side error handling: a decent client implementation should be well informed if something went wrong and can act accordingly. For example, there is a check on the effective file size
  • there is a hash mechanism: if the client sends the X-HASH header, it should receive an X-Hash-* style header in the response (that it can check against) - see the sketch after this list
  • it still looks like we are writing straight onto the existing file, but I need to check the whole path generation logic plus the versioning app before reaching any conclusion
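As a rough way to probe the second point, a request like the following could be sent; the X-HASH / X-Hash-* header names and their value format are taken from the observation above and are not verified here:

```php
<?php
// Hypothetical probe for the hash headers mentioned above: send an X-HASH
// request header on a PUT and dump any X-Hash-* response headers that come back.
$remoteUrl = 'https://cloud.example.com/remote.php/dav/files/alice/probe.txt';
$payload   = "hello checksum\n";

$ch = curl_init($remoteUrl);
curl_setopt_array($ch, [
    CURLOPT_USERPWD        => 'alice:app-password',
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_HTTPHEADER     => ['X-HASH: sha1'],   // header name as observed, unverified
    CURLOPT_HEADER         => true,               // keep response headers in the output
    CURLOPT_RETURNTRANSFER => true,
]);
$response = (string)curl_exec($ch);
curl_close($ch);

foreach (preg_split('/\r?\n/', $response) as $line) {
    if (stripos($line, 'X-Hash') === 0) {
        echo $line, "\n";                         // e.g. an X-Hash-SHA1 value, if supported
    }
}
echo 'expected sha1: ', sha1($payload), "\n";
```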

I still need to dig deeper.

  • Bulk upload needs to be checked to see whether it uses the same code as the regular DAV endpoint. At least from what I've seen, the hash header is not available for bulk upload.
  • Chunked upload also needs some exploration: clients split big files and send them as chunks. This is not standard HTTP chunked encoding; it is application-level chunking introduced to make it easier to resume an interrupted upload of a large file (from what I've read).

At this point it is starting to look clear to me that a well-implemented client should be able to do reliable uploads, especially since there's an optional hash check mechanism. I also think reliability could be increased by not writing to the final file directly; this way we would preserve the existing content and not rely on a re-upload by the client to fix the file (but this part still needs investigation).
There's a chance that my data corruption was caused by a combination of a bad (outdated) client, a bad network, and writing the final data before the sanity checks. The bad client is a big problem that is going to be hard to solve with server-side work only, as even an end-to-end check requires the client end to be sane. In addition to the rest, I'm starting to think about some sort of client user-agent check on the server side. We should probably offer an option to block outdated clients, or at least ring an alarm. For example, the client shipped with Ubuntu 22.04 (which is not that old) is a few years old and has some bad behavior.

When it comes to faster resyncs, they could be handled by supplying a client hash to an extra endpoint/HTTP method. It has been said previously that "(you) wouldn't trust a file to exist based on a hash", but I would personally trust a modern secure hash. Obviously we would be drifting a bit from standard DAV, but as long as it is optional it shouldn't be a problem (and I think chunked transfer is not that standard either, same for bulk upload). That would be a reasonable trade-off between network usage and CPU usage if we don't store the hash. Only people with a backend storage that bills more for reads than writes might have a problem, but they could simply not opt in (or opt out).
Maybe we would have a problem with HTTP timeouts, as checksums for big files can take a lot of time to compute (but there are solutions to that). One thing is sure: I would trust a "live" hash, but that would be harder for a stored hash, as it brings more complexity and one more risk of silent data corruption in case of a hash desync.

We could also send a hash from the client, in addition to the "write final data after sanity checks" change: this way we could ensure data integrity before writing the final content to the final path. I know there are already plenty of CRC/checksum mechanisms in the underlying layers (TCP has been cited), but here we could safeguard against application-level errors.

@JulienFS

JulienFS commented May 21, 2024

It looks like bulk upload is not using the DAV file handling but the native OC file handling. Here is where I landed by trying to follow the epic thread of hooks, inherited classes and autoloaded stuff:

public function file_put_contents($path, $data) {

So bulk upload doesn't get some of the handy stuff from NC's Sabre implementation, but it has its own handy stuff. There's an 'in-stream' checksum that is computed for each embedded file. The problem is that it operates on the raw stream instead of the actual file data, which is fetched from the stream later. If something gets messed up in the actual fetching of the data, that earlier checksum won't detect it. This is better than nothing, as it somewhat safeguards against dramatic stream-handling problems, but it could be enhanced.

Chunked upload follows a different path than the regular one and doesn't have the checksum compute/check option. I might be wrong, but it looks like it blindly pushes any 'oc-checksum' header into the file info cache without checking it. Thus we could already have checksums sitting somewhere that are simply wrong.

I don't really know why there are filesystem-related classes for DAV/Sabre and other FS-related classes in the ownCloud lib, and why one is not using the other.

At some point I was afraid that NC had drifted from OC and that we had ended up with a mess, but things look pretty similar upstream. That being said, maybe OC has shifted its resources to its "Infinite Scale" backend...

So the first thing would be to ensure that we have a hash coming from the client for the four upload scenarios:

  • bulk upload: the hash needs to be per embedded file; nice, we already have it. It's MD5, but that's probably good enough, as hostile hash forging is not a major threat here
  • regular upload: a header would probably fit; there's already something computed, we just need an extra header with the expected checksum
  • chunked upload: we could probably pass the whole-file checksum in a header with every chunk, so we would be sure it's there when reassembling the file
  • browser upload: it is in fact the same as chunked upload (at least I think so), but the browser is the client

For all these scenarios, we should compute a last-minute checksum before writing anything to the final destination. That involves writing somewhere other than the final destination during the upload, which is obviously already the case for chunked upload. In this latter case we would need to read the reassembled file twice: once to compute the checksum, then a second time to write it to the final destination.

Ideally, in all these scenarios the final computed checksum should be returned to the client; a header would fit. That would be a nice way to check that things didn't go crazy (a rough sketch of this "verify before committing" idea follows below).
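A minimal sketch of that "verify before committing" step for the regular upload case, assuming a client-provided SHA1 and plain local storage; the function name and its wiring are illustrative, not Nextcloud's actual API:

```php
<?php
// Sketch: stream the upload to a temporary file, compare its checksum with the
// value the client sent, and only then move it over the final path, so a failed
// or corrupted transfer never touches the existing file.
function commitUpload($inputStream, string $finalPath, string $expectedSha1): bool {
    $tmpPath = $finalPath . '.part-' . bin2hex(random_bytes(8));

    $tmp = fopen($tmpPath, 'wb');
    $ctx = hash_init('sha1');
    while (!feof($inputStream)) {
        $chunk = fread($inputStream, 8192);
        if ($chunk === false) {
            fclose($tmp);
            unlink($tmpPath);
            return false;                    // read error: the existing file is untouched
        }
        hash_update($ctx, $chunk);
        fwrite($tmp, $chunk);
    }
    fclose($tmp);

    if (hash_final($ctx) !== strtolower($expectedSha1)) {
        unlink($tmpPath);                    // mismatch: discard, keep the existing file
        return false;                        // and report the failure to the client
    }
    return rename($tmpPath, $finalPath);     // atomic on the same filesystem
}
```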

To be extra safe we would probably need an extra round-trip from the client to the server: we could check that what is written on disk is actually what should be written on disk. It might sound paranoid to do this extra check, but with the number of inherited filesystem classes involved in a simple write operation, it might be an option.

The endpoint used for this extra check could also be used for periodic or manually triggered sync checks. It could also help when bootstrapping a synchronized directory, to detect files that already exist remotely.

Considering the mess that filesystem handling is at this point, I would strongly advise against server-side hash storage. Maybe there's an opportunity for reliable hash storage with some object storage backends, but otherwise it would be way too hard to guarantee that any modification to a file, even through vanilla NC, would properly update the hash.

Most likely, before any hash storage is implemented, we should consider refactoring the whole filesystem handling. Anyway, "live" hashes should be good enough.

I'm going to take some time to think about all of this, then I'll probably start writing some code if my schedule allows. If anyone has suggestions, feel free to share.

@sk1806

sk1806 commented Aug 1, 2024

I don't necessarily need Nextcloud to do the safety checks; I just want to be able to access the checksum from the server, so I can write my own scripts that deal with checksum checks the way I want. Is there no way at all to retrieve the checksum at the moment (without reading the file into memory and computing it, which would be way too heavy)? It is really frustrating not being able to implement safety checks. I don't understand how the checksum is not already stored; it feels like very basic information that any storage solution should have. Nextcloud always screws up a small percentage of uploads/downloads in a way that it doesn't clean up after itself, which makes it pretty much unusable.

@JulienFS

JulienFS commented Oct 9, 2024

@sk1806 storing hashes would bring another class of problems, though. What if the stored hash doesn't match the actual file? This is especially problematic since Nextcloud can deal with many storage backends. And to be fair, with the way hashes are currently dealt with, we could end up storing header-provided hashes without even checking that the file was actually written properly...

FTR I still plan to work on this hash thing, but my job and my family have been very demanding for the past few months. I really hope I can get back to this topic by the end of the year. At least at this point I have a good overview of the code base, especially the parts related to file upload. If any dev passing by has some bandwidth, please reach out - I can surely put someone on a fast track.
