I suggest doing speed/size comparison with other dedup. solutions #114
Comments
Hey @safinaskar, thanks for your suggestion. Did you document the steps for reproducing your benchmark results?
Yes, everything is in that borg bug report. First you need to create the files 00, 01, etc.; here are the instructions: borgbackup/borg#7674 (comment). But, of course, it is fine to use some other realistic data instead. Keep in mind that I use roughly 10 similar snapshots of 10 GiB of data. Or I can simply send them to you (I just realized I can very easily re-generate them). Then run runner/benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1 (from borgbackup/borg#7674 (comment)). Keep in mind that runner uses […]. More details on reproducing are in that bug report (borgbackup/borg#7674). If you have any questions, ask; I will happily answer.
In my terminology, "compression" (a.k.a. "store") covers all the steps needed to place a source file into the "repo". It may include chunking, deduplication and the actual compression. For example, […]
Typical output from runner/benchmark looks like this:

{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538}

"size" is the resulting size, "total_compression" is the "total ctime" (in microseconds), and the last element of "compression" is the "last ctime" (in microseconds).

I don't know whether puzzlefs can add a file to an existing image. If that is not possible (i.e. if the only way to create a puzzlefs image is to create everything at once), then "last ctime" makes no sense for puzzlefs. Also, if separate files are processed in parallel when creating a puzzlefs image, then "total ctime" has no direct puzzlefs counterpart either. (But comparing puzzlefs image creation time with the "total ctime" of the other deduppers is still worthwhile; it is possible that even in this situation the other deduppers will still be faster at creating a repo.)

Also note that my whole benchmark is biased toward my particular original problem. I was searching for a dedupper for my "docker, but for VMs", which I call Jocker. Jocker always stores and extracts snapshots one by one. Also, the blobs themselves are snapshots of a single VM. This means that the offset of a particular file inside a snapshot doesn't change, and thus CDC doesn't have any advantage over fixed-size chunking. No surprise, then, that my dedupper azwyon, which uses fixed-size chunking, "won".

When benchmarking I used the instructions for stabilizing benchmark results from https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux and https://llvm.org/docs/Benchmarking.html .

Some other random notes: […]
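For clarity, here is a minimal sketch of how the metrics above could be read out of one such JSON record. This is not the actual runner code; the field names follow the sample output, while the serde structs, the crate choice (serde with derive, serde_json) and the file path are illustrative.

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    method: String,
    size: u64,
    compression: Vec<u64>,   // per-snapshot store times, microseconds
    decompression: Vec<u64>, // per-snapshot extract times, microseconds
    total_compression: u64,
    total_decompression: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "benchmark.json" is a hypothetical file holding one record per line.
    for line in std::fs::read_to_string("benchmark.json")?.lines() {
        let rec: Record = serde_json::from_str(line)?;
        let last_ctime = rec.compression.last().copied().unwrap_or(0);
        let last_dtime = rec.decompression.last().copied().unwrap_or(0);
        println!(
            "{}: size={} total ctime={}us last ctime={}us total dtime={}us last dtime={}us",
            rec.method, rec.size, rec.total_compression, last_ctime,
            rec.total_decompression, last_dtime
        );
    }
    Ok(())
}
```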
Thanks for the detailed reply! A quick note: I did try to use rayon to parallelize the puzzlefs build, but the single-threaded version was faster. I didn't save the test results, but I still have the branch: #84
I suggest reading the current version of azwyon: https://paste.gg/p/anonymous/3c43cb97f37440839d19392ca7f8dff8 . Consider it public domain. It is nearly 150 lines of Rust. It differs from previous versions: I switched to […]. Here are my comments on your code: […]
As far as I understand, […]

Also: when writing azwyon, I studied the borg source code in an attempt to understand why it is so fast, and I borrowed some ideas. For example, borg memoizes the hashes of sequences of zero bytes. Also, borg detects holes in input files (I didn't implement this). And borg drops input files from the OS cache (sometimes this boosts performance, see borgbackup/borg#7674 for details). See borgbackup/borg#7674 for other ideas.
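As an illustration of that zero-run trick, here is a minimal sketch. It is not borg's or azwyon's actual code; the struct is made up, and it assumes the blake3 crate is available.

```rust
use std::collections::HashMap;

/// Caches the hash of an all-zero chunk per length, so long runs of zeros
/// (common in sparse VM images) are hashed only once per chunk length.
struct ChunkHasher {
    zero_hashes: HashMap<usize, blake3::Hash>,
}

impl ChunkHasher {
    fn new() -> Self {
        Self { zero_hashes: HashMap::new() }
    }

    fn hash_chunk(&mut self, chunk: &[u8]) -> blake3::Hash {
        if chunk.iter().all(|&b| b == 0) {
            // Reuse the previously computed hash for a zero run of this length.
            return self
                .zero_hashes
                .entry(chunk.len())
                .or_insert_with(|| blake3::hash(chunk))
                .clone();
        }
        blake3::hash(chunk)
    }
}
```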
Also: as far as I understand, a "slow puzzlefs" already exists. It is borg. Borg has a command, "borg2 mount", which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than "borg2 mount". Same for […]
rayon authors just acknowledged that […]
I've looked at both casync and borg; here are some differences: […]
@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced fs, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems which don't need […]

If you try to submit your fs to the kernel, it is quite possible that kernel developers will simply say to you: "We already have EROFS, why is there a need for puzzlefs?" So it may be a good idea not to spend your time on puzzlefs anymore. I'm not trying to insult you; I'm just trying to save your time. I think EROFS is what you need for your original production task. But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for the EROFS competition. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and […]

I'm not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to the core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.
@ariel-miculas , here is an article on how hard it is to get your filesystem into the kernel: https://lwn.net/Articles/933616/ . Notable phrase: […]
Yes, there is this possibility; also, be aware that I've already submitted two versions of PuzzleFS to the LKML: here's the first one and here's the second one. The latest version is available on GitHub; I have not sent it to the mailing list because I expect some changes to the filesystem abstractions, since there were disagreements on the proposed APIs. You can read the entire thread here. If you want to take a look at the latest PuzzleFS version, I've compiled a list of resources on the PuzzleFS page of the Rust for Linux project, where you will find the latest PuzzleFS branch, as well as other resources such as the LWN article on PuzzleFS and my presentation at Open Source Summit Europe.
There are other container solutions which use EROFS, such as composefs and nydus. EROFS doesn't work for the problem we are trying to solve, i.e. deduplicating data using a content-defined chunking algorithm so we can store multiple versions of the same image as efficiently as possible. EROFS uses deduplication, but only inside a single image, whereas we want to deduplicate across multiple versions of the same image. With the help of content-defined chunking, we can essentially store only the deltas between multiple versions of the same image instead of storing them separately, which is how archive formats do it (e.g. tar, squashfs, erofs). We want to redesign the container image format so that a more efficient representation based on variable-sized chunks is possible. You can see the results in my PuzzleFS presentation, in the Steps: […]
I'm aware it will be difficult to convince the kernel maintainers to merge PuzzleFS; they've already complained that they get too many container filesystems submitted to them. Time will tell whether they consider PuzzleFS worthy of upstreaming.
As I've mentioned above, I've already submitted multiple versions of the PuzzleFS driver and presented it at multiple conferences, so at the very least I have brought awareness to the PuzzleFS project. And other people have written articles about it, such as Phoronix and LWN. Wedson Almeida is working on his own filesystem in Rust, tarfs, so I'm not the only one trying to upstream new filesystems.
I think it's worth exploring what a new filesystem implemented in Rust can achieve, since kernel maintainers sometimes argue that they want to see new things implemented in Rust rather than having existing implementations duplicated in both C and Rust. Of course, it's understandable that they don't want multiple filesystems solving the same container image problem, so pushback is to be expected. @safinaskar thanks for presenting your concerns; we definitely have to make a strong case for PuzzleFS if we ever want it to be included in the Linux kernel.
Hi @ariel-miculas , I'm not sure why you just ignored all my replies on the mailing list. First, EROFS supports […]. And encoded EROFS already supports variable-sized chunks + CDC. I'm not sure you have fully understood all of EROFS's current features. [1] https://hackmd.io/@cyphar/ociv2-brainstorm
I think some experienced file system developers asked for an ext2 filesystem implementation first.
Hey @hsiangkao
Apologies for making you feel unheard; rest assured it is not my intention to ignore your feedback. Would you kindly point me to the messages that you've sent and that I didn't reply to, so I can understand how I might have missed them? Please also note that the open-source projects sponsored by Cisco are interested in integrating erofs, e.g. stacker [8]. And if we can implement the features of PuzzleFS using EROFS instead of a custom filesystem representation, it would save us the headache of upstreaming another container-related FS into the Linux kernel, which is no small feat. At some point I would also like to see a reimplementation of EROFS in Rust, which I see there's already interest in [9]. That's because one of our design goals with PuzzleFS is memory safety (this is what led to the decision to implement it in Rust).
I did look at Nydus, and I was hoping to see the docs for […]
It is quite possible that I don't understand all the EROFS features. I was under the impression that the erofs CDC only works with a block device. Could you help me with some documentation/examples about referencing external blobs? I can't tell how this works from the erofs documentation [4]. What PuzzleFS does is store references to external files in its metadata (the filenames are a sha256 digest) and then recreate the original filesystem by reassembling the multiple chunks that a file is composed of. If I'm not mistaken, composefs [5] uses a feature of overlayfs for redirecting file accesses to external blobs.
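For what it's worth, here is a minimal userspace sketch of that reassembly idea. It is not PuzzleFS's actual metadata format or code; the struct, the digest-named blob layout and the length check are illustrative.

```rust
use std::fs::File;
use std::io::{self, copy, Write};
use std::path::Path;

/// One chunk of a file: the blob holding its bytes is stored at
/// <blob_dir>/<digest> (illustrative layout).
struct ChunkRef {
    digest: String,
    len: u64,
}

/// Rebuild the original file by concatenating its chunks in order.
fn reassemble(blob_dir: &Path, chunks: &[ChunkRef], out: &mut impl Write) -> io::Result<()> {
    for chunk in chunks {
        let mut blob = File::open(blob_dir.join(&chunk.digest))?;
        let copied = copy(&mut blob, out)?;
        if copied != chunk.len {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "blob length mismatch"));
        }
    }
    Ok(())
}
```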
Sure, Wedson Almeida Filho has already started working on it [2][3]. As you probably know, he recently stepped down as a Rust for Linux maintainer, citing "nontechnical nonsense" and giving one such example [7]. I believe reimplementing the ext2 filesystem would only be a worthwhile effort if the kernel community could agree that in the end, after doing this work, the filesystem abstractions would be merged. Otherwise, people keep moving the goalposts and we'll never make progress with Rust filesystem abstractions. [1] https://github.com/dragonflyoss/nydus/blob/master/docs/nydus-design.md#2-rafs
I think I responded in the email thread and the LWN reply, for example:
IMHO, a filesystem design is language-neutral; it shouldn't be hard to work out a C, Rust or Go version. Yiyang has worked out a very preliminary Rust version of the core EROFS format, and I think it's a good start anyway. But also note that, IMHO, even Rust cannot avoid quite complex livelock or deadlock issues (currently, in some complex FS paths there can already be a couple of VFS/MM locks nested).
Both RAFS v5 and v6 provide the same functionality (tiny metadata + external blobs); it's just that v5 was a customized format used with FUSE, while v6 uses the EROFS format.
Actually, previously all in-kernel filesystems were block-device-based. In addition, historically the kernel side once refused file-backed mounts; if you look at the comment on filp_open(), it clearly documents […]
Until now, I have still believed that a file-backed mount feature for a generic read-write filesystem is totally broken (that is why loop devices are needed, to use workers as another kernel context), but EROFS doesn't need to follow this, since it is a completely immutable fs (and there seem to be enough container use cases for deploying EROFS in files, such as the current composefs and other Android use cases, and maybe later app sandboxes). So as of Linux v6.12, EROFS already supports file-backed mounts, as in https://www.phoronix.com/news/EROFS-Linux-6.12 and https://lwn.net/SubscriberLink/990750/1715b5d6307ee41a/. Previously, EROFS supported "EROFS over fscache" to utilize a kernel caching framework (in a CAS way) to refer to external blobs, but that is somewhat complicated, and I think we can just ignore it since it tends to be deprecated.
Honestly, I think that is the only difference between the current EROFS and PuzzleFS. I will explain the current status and why. For each EROFS filesystem, there is an on-disk external device/blob table (at most 65536 external blobs), and you can think of an erofs filesystem as a linear address space, like: […]
You can see that the linear address space forms the filesystem LBAs, and each variable-sized chunk can point to an arbitrary LBA in that linear address space, so it can refer to any data in any part of (blob0..n). In other words, each EROFS filesystem can reuse data from at most 65536 blobs. The only difference is that PuzzleFS supports per-extent blobs (so in extreme cases there could be a million or more blobs in a puzzlefs filesystem). The reason why EROFS doesn't add per-extent blob support is that (from the kernel developer perspective) it's hard to convince kernel folks because of corner-case deadlock issues in the page fault path (most kernel people will just say "no"; also, it seems the puzzlefs kernel version doesn't implement mmap page faults at all). I just discussed this with an FS expert and XFS core developer, Dave Chinner, and I have to post it here for reference in case you don't believe me: […]
Yeah, but EROFS supports finer-grained deduplicated data references by using external blobs; it's just that Composefs doesn't use that, since they'd like to share the page cache among images too.
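To make the mapping described above concrete, here is a rough conceptual model in Rust. It is purely illustrative and is not EROFS's actual on-disk layout or naming.

```rust
// Conceptual model only: a filesystem-wide blob table plus chunks that
// reference offsets in the "linear address space" formed by laying the
// blobs out back to back.

struct BlobInfo {
    id: String,  // name of the external blob, e.g. a digest
    length: u64, // size of the blob in bytes
}

struct BlobTable {
    blobs: Vec<BlobInfo>, // at most 65536 entries in the scheme described above
}

/// A variable-sized chunk of a file, pointing into the linear address space.
struct ChunkRef {
    logical_offset: u64, // offset of this chunk within the file
    length: u64,
    linear_offset: u64, // offset into blob0..blobN laid out back to back
}

impl BlobTable {
    /// Resolve a linear offset to (blob index, offset within that blob).
    fn resolve(&self, linear_offset: u64) -> Option<(usize, u64)> {
        let mut base = 0u64;
        for (i, blob) in self.blobs.iter().enumerate() {
            if linear_offset < base + blob.length {
                return Some((i, linear_offset - base));
            }
            base += blob.length;
        }
        None
    }
}
```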
As a filesystem developer of many years, I don't think the discussion is totally "nontechnical".
I think all kernel developers were hoping for a full-featured Rust ext2 filesystem in the beginning.
I've explained in the previous reply why having each EROFS/PuzzleFS file backed by multiple backing files is controversial. Another thing I wonder about here is the real benefit of subfile-based deduplication over file-based deduplication, because as far as I can see, most dynamic executable code, like dynamic libraries, won't see any benefit, as I mentioned in opencontainers/image-spec#1190 (comment). So I guess ComposeFS may already meet your use cases? In short, if an idea sounds reasonable to implement in the kernel, I see no reason to avoid landing it in EROFS. But there are many things that are indeed considered unsafe in the kernel; for those cases, a userspace approach is preferred, and at least it won't hang the kernel due to corner-case paths.
hi @hsiangkao, thanks for the feedback.
Of course the goal would eventually be to implement mmap(), so I think the criticisms are valid. One simple solution is to open all of a file's member blobs during open(), rather than in fault() or read() or whatever. Of course, for very large files spanning many blobs this is ugly, but we do not generally expect those in container filesystems. Does that sound reasonable?
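A Unix-only userspace sketch of that design choice, where every backing blob is opened eagerly so the read path never has to open a file. All names and the layout tuple are illustrative, not the puzzlefs driver's actual code.

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;
use std::path::Path;

/// One chunk of a file, with its backing blob already opened.
struct OpenChunk {
    blob: File,
    blob_offset: u64, // where this chunk's bytes start inside the blob
    len: u64,
}

struct OpenedFile {
    chunks: Vec<OpenChunk>, // opened eagerly, in file order
}

impl OpenedFile {
    /// Open every backing blob up front; layout is (digest, blob offset, length).
    fn open(blob_dir: &Path, layout: &[(String, u64, u64)]) -> io::Result<Self> {
        let mut chunks = Vec::with_capacity(layout.len());
        for (digest, blob_offset, len) in layout {
            chunks.push(OpenChunk {
                blob: File::open(blob_dir.join(digest))?,
                blob_offset: *blob_offset,
                len: *len,
            });
        }
        Ok(Self { chunks })
    }

    /// Read from the logical file using only the pre-opened handles.
    fn read_at(&self, mut offset: u64, buf: &mut [u8]) -> io::Result<usize> {
        let mut written = 0;
        for c in &self.chunks {
            if offset >= c.len {
                offset -= c.len; // this chunk lies entirely before the read
                continue;
            }
            let want = ((c.len - offset) as usize).min(buf.len() - written);
            let n = c.blob.read_at(&mut buf[written..written + want], c.blob_offset + offset)?;
            written += n;
            if n < want || written == buf.len() {
                break;
            }
            offset = 0;
        }
        Ok(written)
    }
}
```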
This is also a great point: if the goal is to optimize for low use of local storage, we should align the puzzlefs chunk size with the local filesystem block size. One of the major design goals of PuzzleFS is to share image chunks across images as much as possible, to minimize the delta needed to launch a new container image, so new container images can be launched as fast as possible. Ideally, we would only have to download chunks that are unique to the new container we are trying to launch. In this usage mode, the thing we are trying to optimize for is launch latency, not local disk usage (in general, disk is much cheaper than CPU). One of the things I had been struggling with was whether to make the chunking constants part of the image standard or administrator-configurable. Given that some orgs may want to optimize for local storage size, and others for launch latency, it seems like we should probably have recommended chunking parameters for each, but not hard-code them.
As a generic filesystem solution, I don't think we should make any special assumption.
In brief, IMHO, I don't think this model is suitable for a kernel fs in general. You can still ignore my suggestion (I've spent a lot of my spare time, outside my job, writing this down), but I promise that, although it would be trivial to implement in kernel EROFS, the complex nested-kernel-context risk makes it hard to convince kernel filesystem developers.
Yeah, I guess that's maybe my second point: different applications should tune their image chunk size differently. But […]
absolutely agreed there.
right, what per-file dedup can't do is share small edits across large files, e.g. in the ML models you cited above. we are trying to solve both problems.
With FastCDC we can get rid of the coupling to file size, but you still have to choose some size to distribute; it may as well be the chunk size, assuming the chunk size is chosen for optimal sharing.
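For readers unfamiliar with it, here is a rough sketch of gear-hash-based content-defined chunking in the spirit of FastCDC (simplified: no normalized chunking, and the gear table, mask and size limits are illustrative rather than PuzzleFS's actual parameters).

```rust
const MIN_SIZE: usize = 16 * 1024;
const MAX_SIZE: usize = 256 * 1024;
const AVG_MASK: u64 = (1 << 16) - 1; // cut when the low 16 bits are zero (~64 KiB average)

fn gear_table() -> [u64; 256] {
    // Deterministic pseudo-random table; real implementations ship a fixed table.
    let mut table = [0u64; 256];
    let mut x: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        x = x.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        *entry = x;
    }
    table
}

/// Returns the end offset of each chunk in `data`.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let table = gear_table();
    let mut boundaries = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + MAX_SIZE).min(data.len());
        let mut hash: u64 = 0;
        let mut i = start;
        while i < end {
            hash = (hash << 1).wrapping_add(table[data[i] as usize]);
            i += 1;
            if i - start >= MIN_SIZE && (hash & AVG_MASK) == 0 {
                break; // content-defined cut point
            }
        }
        boundaries.push(i);
        start = i;
    }
    boundaries
}
```

Because cut points depend only on local content, two similar inputs resynchronize at the next boundary after an insertion, which is what lets most chunks of two image versions deduplicate.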
...
I'm pretty skeptical that CDC performs well across different versions of the same ML model, but I've never tried it. I guess ML needs some special delta-compression approach to reduce duplication better.
Variable-sized chunks (e.g. generated by CDC) for uncompressed data will have worse read performance than fixed-sized chunks (e.g. aligned with the storage block size); that's one reason why uncompressed fses almost always use fixed-sized chunks. You could use FIO to benchmark a fast storage device to confirm that. For a compressed fs it doesn't matter, since some data transformation is needed anyway.
I suggest comparing your solution with other deduplication solutions, such as borg, casync, desync and rdedup. Compare not only the size of the deduplicated and compressed data, but also the speed of creating the deduplicated image and the speed of reading the data back. Here is my own benchmark, which compares many solutions: borgbackup/borg#7674 (comment) . I recommend reading the whole discussion; it contains many ideas. My conclusion is this: existing solutions are very inefficient, and every one of them has some weakness. Either it is not parallel, or it is written in Go instead of C/C++/Rust and thus slow, or it has some other problem. Based on my experience, I conclude that one can easily create a deduplicating solution which beats all the others in terms of both size and speed.
Here is a summary of my critique of the other solutions: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl .
I didn't study puzzlefs closely. But from what I see, I already see one inefficiency: puzzlefs uses sha256, which on many machines (including mine) is slower than blake2 and blake3.
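As a quick way to check this on a given machine, here is a rough timing sketch (not a rigorous benchmark; it assumes the sha2 and blake3 crates are available).

```rust
use sha2::{Digest, Sha256};
use std::time::Instant;

fn main() {
    // 256 MiB of test data; real measurements should use realistic input
    // and proper benchmarking (warmup, multiple runs, pinned CPU frequency).
    let data = vec![0xABu8; 256 * 1024 * 1024];

    let start = Instant::now();
    let sha = Sha256::digest(&data);
    println!("sha256: {:?} {:02x?}", start.elapsed(), sha.as_slice());

    let start = Instant::now();
    let b3 = blake3::hash(&data);
    println!("blake3: {:?} {}", start.elapsed(), b3.to_hex());
}
```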
I don't plan to add puzzlefs to my comparison borgbackup/borg#7674 (comment) ; I have already deleted the test data and test programs.
You may say that all this (i.e. speed) is not important. I disagree. I had one particular use case: "docker, but for VMs". I compared existing solutions, and all of them were unsatisfactory. So I developed my own: azwyon (in Rust): borgbackup/borg#7674 (comment) . Azwyon can store and extract 10 GiB virtual machines in several seconds on my machine. This is satisfactory for me, and this result is unachievable with the other solutions. (Keep in mind that azwyon uses fixed-size chunking.) Similarly, it is possible that someone will evaluate puzzlefs, conclude that it is slow, and reject it.
Please understand me correctly: I don't say that puzzlefs is slow! I didn't test it, so I don't know. I just want to share my experience. My message is this: "Many other solutions are slower than they could be, so it is possible that puzzlefs is slow too; I suggest properly testing it and, if it actually turns out to be slow, fixing it using ideas from these links."