I suggest doing speed/size comparison with other dedup. solutions #114

Open · safinaskar opened this issue Sep 26, 2023 · 18 comments
@safinaskar

I suggest comparing your solution with other deduplication solutions, such as borg, casync, desync, and rdedup. Compare not only the size of the deduplicated and compressed data, but also the speed of creating the deduplicated image and the speed of reading the data back. Here is my own benchmark, which compares many solutions: borgbackup/borg#7674 (comment) . I recommend reading the whole discussion; it contains many ideas. My conclusion is this: existing solutions are very inefficient, and every one of them has some inefficiency. Either it is not parallel, or it is written in Go instead of C/C++/Rust and is thus slow, or it has some other problem. Based on my experience, I conclude that one could easily create a deduplicating solution that beats all the others in terms of both size and speed.

Here is a summary of my critique of the other solutions: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl .

I haven't studied puzzlefs closely, but I can already see one inefficiency: puzzlefs uses sha256, which on many machines (including mine) is slower than blake2 and blake3.
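For anyone who wants to check this on their own machine, here is a rough single-threaded throughput sketch (my own illustration, assuming the `sha2` and `blake3` crates; the buffer size is arbitrary):

```rust
// Rough hash-throughput comparison (sketch, not a rigorous benchmark).
// Cargo dependencies assumed: sha2 = "0.10", blake3 = "1".
use sha2::{Digest, Sha256};
use std::time::Instant;

fn main() {
    let data = vec![0xabu8; 256 << 20]; // 256 MiB of filler data

    let t = Instant::now();
    let mut h = Sha256::new();
    h.update(&data);
    let d1 = h.finalize();
    println!("sha256: {:?} (first byte {:02x})", t.elapsed(), d1[0]);

    let t = Instant::now();
    let d2 = blake3::hash(&data);
    println!("blake3: {:?} ({})", t.elapsed(), d2.to_hex());
}
```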

I don't plan to add puzzlefs to my comparison borgbackup/borg#7674 (comment) ; I have already deleted the test data and test programs.

You may say that all this (i.e. speed) is not important. I disagree. I had one particular use case: "docker, but for VMs". I compared the existing solutions, and all of them were unsatisfactory. So I developed my own, azwyon (in Rust): borgbackup/borg#7674 (comment) . Azwyon can store and extract 10 GiB virtual machines in several seconds on my machine. This is satisfactory for me, and this result is unachievable with the other solutions. (Keep in mind that azwyon uses fixed-size chunking.) Similarly, it is possible that someone will evaluate puzzlefs, conclude that it is slow, and reject it.

Please understand me correctly: I am not saying that puzzlefs is slow! I didn't test it, so I don't know. I just want to share my experience. My message is this: "Many other solutions are slower than they could be, so it is possible that puzzlefs is slow, too. So I suggest properly testing it and, if it actually turns out to be slow, fixing it using the ideas from these links."

@ariel-miculas
Collaborator

Hey @safinaskar, thanks for your suggestion. Did you document the steps for reproducing your benchmark results?
How did you determine the values for Size, Total ctime, Total dtime, Last ctime, and Last dtime?

@safinaskar
Author

> Did you document the steps for reproducing your benchmark results?

Yes, everything is in that borg bug report. First you need to create the files 00, 01, etc.; here are the instructions: borgbackup/borg#7674 (comment) . But, of course, it is fine to use some other realistic data instead. Keep in mind that I use roughly 10 similar snapshots of 10 GiB of data. Or I can simply send them to you (I just realized that I can very easily re-generate them).

Then run runner/benchmark: https://paste.gg/p/anonymous/2bac07a20d5f48dbabd4d9e3ad2c4ab1 (from borgbackup/borg#7674 (comment) ). Keep in mind that runner uses sudo a lot, so run it on a throw-away host or carefully review the code before running it.

More details on reproducing are in that bug report ( borgbackup/borg#7674 ). If you have any questions, ask; I will happily answer.

> How did you determine the values for

In my terminology, "compression" (a.k.a. "store") is all the steps needed to place a source file into the "repo". It may include chunking, deduplication, and actual compression. For example, borg2 create or casync make is "compression". "Decompression" (a.k.a. "extract") is, again, all the steps needed to fully extract one file from the "repo" (usually this is actual decompression). For example, borg2 extract or casync extract is "decompression". So:

  • "Size" is size of "repo" after "compressing"/"storing" all source files. I. e. result of compressing and deduplicating
  • "Total ctime" is full time of "compressing" all files, i. e. full time of creating repo. Full time of converting of source files to single repo. "Compressing" of separate files is always invoked one-by-one (i. e. full time is sum of times for every file). But some deddupers may "compress" any single file in parallel
  • "Total dtime" is opposite of "Total ctime". Full time of "decompressing" of all files one-by-one from full repo
  • "Last ctime" is time of "compressing" of last file, i. e. "11", assuming all other are already stored in "repo"
  • "Last dtime" is time of "decompressing" last file from full repo

Typical output from runner/benchmark looks like this:

{"method":"Borg { chunker_params: \"buzhash,19,23,21,4095\", compression: \"zstd,3\", encryption: \"none\" }","size":2851383597,"compression":[18318222,11317438,21237556,45904208,44393790,40996476,48499568,45409089,46290323,44191118,46500109,34804936],"decompression":[8965775,8972195,9445975,35787014,36940309,36681542,37723561,37497282,38120893,37698965,37893902,38152120],"total_compression":447862839,"total_decompression":363879538}

"size" is size, "total_compression" is "total ctime" (in microseconds), last element of "compression" is "last ctime" (in microseconds)

I don't know whether puzzlefs can add a file to an existing image. If this is not possible (i.e. if the only way to create a puzzlefs image is to create everything at once), then "last ctime" makes no sense for puzzlefs. Also, if separate files are processed in parallel when creating a puzzlefs image, then "total ctime" has no direct puzzlefs counterpart either. (But comparing puzzlefs image creation time with the "total ctime" of other deduppers is still a worthwhile exercise; it is possible that even in this situation the other deduppers will still be faster at creating a repo.)

Also note that my whole benchmark is biased toward my particular original problem. I was searching for a dedupper for my "docker, but for VMs", which I call Jocker. Jocker always stores and extracts snapshots one by one, and the blobs themselves are snapshots of a single VM. This means that the offset of a particular file inside a snapshot doesn't change, and thus CDC has no advantage over fixed-size chunking. No surprise, then, that my dedupper azwyon, with its fixed-size chunking, "won".
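To make the fixed-size approach concrete, here is a minimal sketch of that style of dedup store (my own illustration, not azwyon's actual code; the chunk size and the in-memory index are arbitrary choices):

```rust
// Sketch: fixed-size chunking + content-addressed dedup (illustration only).
use std::collections::HashMap;

const CHUNK_SIZE: usize = 1 << 20; // 1 MiB, arbitrary for this sketch

struct Repo {
    chunks: HashMap<[u8; 32], Vec<u8>>, // digest -> chunk payload
}

impl Repo {
    // Store a snapshot; return its "recipe" (ordered list of chunk digests).
    fn store(&mut self, data: &[u8]) -> Vec<[u8; 32]> {
        data.chunks(CHUNK_SIZE)
            .map(|chunk| {
                let digest = *blake3::hash(chunk).as_bytes();
                // Identical chunks at identical offsets across snapshots dedup here.
                self.chunks.entry(digest).or_insert_with(|| chunk.to_vec());
                digest
            })
            .collect()
    }

    // Reassemble a snapshot from its recipe.
    fn extract(&self, recipe: &[[u8; 32]]) -> Vec<u8> {
        recipe.iter().flat_map(|d| self.chunks[d].clone()).collect()
    }
}

fn main() {
    let mut repo = Repo { chunks: HashMap::new() };
    // Two similar "snapshots": only the last chunk of the second one differs.
    let snap1: Vec<u8> = (0..3 * CHUNK_SIZE).map(|i| (i / CHUNK_SIZE) as u8).collect();
    let mut snap2 = snap1.clone();
    snap2[2 * CHUNK_SIZE] ^= 0xff;
    let (r1, r2) = (repo.store(&snap1), repo.store(&snap2));
    assert_eq!(repo.extract(&r2), snap2);
    // 3 + 3 chunk references, but only 4 unique chunks actually stored.
    println!("{} + {} refs, {} unique chunks", r1.len(), r2.len(), repo.chunks.len());
}
```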

When benchmarking, I used the instructions for stabilizing benchmark results from https://easyperf.net/blog/2019/08/02/Perf-measurement-environment-on-Linux and https://llvm.org/docs/Benchmarking.html .

Some other random notes:

@ariel-miculas
Collaborator

Thanks for the detailed reply!

A quick note: I did try to use rayon to parallelize the puzzlefs build, but the single-threaded version was faster. I didn't save the test results, but I still have the branch: #84

@safinaskar
Author

I suggest reading the current version of azwyon: https://paste.gg/p/anonymous/3c43cb97f37440839d19392ca7f8dff8 . Consider it public domain. It is roughly 150 lines of Rust. It differs from previous versions: I switched to pariter instead of my homegrown replacement.

Here are my comments on your code:

As far as I understand, pariter is not based on the rayon thread pool; it uses its own thread pool. Despite this, I still highly recommend pariter. I tried to add pariter-like functionality to rayon ( rayon-rs/rayon#1071 ), but the maintainers rejected my patch. Don't use that patch as-is; it is buggy.
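For readers unfamiliar with pariter, here is a minimal sketch of the order-preserving pipeline style it enables (my own illustration, assuming pariter's `IteratorExt::parallel_map`; the `compress` body is a stand-in, not azwyon's real code):

```rust
// Sketch: order-preserving parallel map over chunks with pariter.
// Cargo dependency assumed: pariter.
use pariter::IteratorExt as _;

// Stand-in for real per-chunk work (hashing, zstd compression, ...).
fn compress(chunk: Vec<u8>) -> Vec<u8> {
    chunk.iter().map(|b| b.wrapping_add(1)).collect()
}

fn main() {
    let chunks = (0u8..8).map(|i| vec![i; 4]);
    // parallel_map runs the closure on a thread pool but yields results
    // in the original input order, so the output stream stays deterministic.
    for out in chunks.parallel_map(compress) {
        println!("{:?}", out);
    }
}
```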

Also: when writing azwyon, I studied the borg source code in an attempt to understand why it is so fast, and I borrowed some ideas. For example, borg memoizes the hashes of sequences of zero bytes. Also, borg detects holes in input files (I didn't implement this). And borg drops input files from the OS cache (sometimes this boosts performance; see borgbackup/borg#7674 for details). See borgbackup/borg#7674 for other ideas.
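To make the zero-run trick concrete, here is a sketch of how such memoization might look (my reading of the idea, not borg's actual code; blake3 stands in for whatever hash the dedupper uses):

```rust
// Sketch: memoize digests of all-zero chunks (borg-inspired idea, illustration only).
use std::collections::HashMap;

#[derive(Default)]
struct ZeroHashCache {
    by_len: HashMap<usize, [u8; 32]>, // chunk length -> digest of that many zero bytes
}

impl ZeroHashCache {
    fn digest(&mut self, chunk: &[u8]) -> [u8; 32] {
        if chunk.iter().all(|&b| b == 0) {
            // All-zero chunks of a given length always hash the same, so the
            // (cheap) zero scan lets us pay for the hash only once per length.
            *self
                .by_len
                .entry(chunk.len())
                .or_insert_with(|| *blake3::hash(chunk).as_bytes())
        } else {
            *blake3::hash(chunk).as_bytes()
        }
    }
}

fn main() {
    let mut cache = ZeroHashCache::default();
    let zeros = vec![0u8; 1 << 20];
    assert_eq!(cache.digest(&zeros), cache.digest(&zeros)); // second call is a lookup
    println!("cached lengths: {}", cache.by_len.len());
}
```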

@safinaskar
Author

Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command "borg2 mount", which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than "borg2 mount". The same goes for casync mount.

@safinaskar
Author

The rayon authors have just confirmed that rayon::spawn never blocks. So, yes, your loop seems to read the whole file into memory, which is wrong. rayon-rs/rayon#1095 (comment)
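A common fix is bounded backpressure, so the reader blocks once a fixed number of chunks is in flight. Here is a minimal sketch with std's `sync_channel` (an illustration of the pattern, not puzzlefs code; the input path and sizes are placeholders):

```rust
// Sketch: bounded producer/consumer so the reader can't outrun the workers.
use std::io::Read;
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() -> std::io::Result<()> {
    // At most 4 chunks buffered; send() blocks when the queue is full,
    // which is exactly the backpressure rayon::spawn does not provide.
    let (tx, rx) = sync_channel::<Vec<u8>>(4);

    let reader = thread::spawn(move || -> std::io::Result<()> {
        let mut file = std::fs::File::open("/dev/zero")?; // placeholder input
        for _ in 0..16 {
            let mut chunk = vec![0u8; 1 << 20];
            file.read_exact(&mut chunk)?;
            tx.send(chunk).expect("consumer hung up"); // blocks while 4 chunks are queued
        }
        Ok(())
    });

    for chunk in rx {
        let _digest = blake3::hash(&chunk); // stand-in for per-chunk work
    }
    reader.join().unwrap()?;
    Ok(())
}
```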

@ariel-miculas
Collaborator

> Also: as far as I understand, a "slow puzzlefs" already exists: it is borg. Borg has the command "borg2 mount", which mounts a snapshot using FUSE. So it seems the only way for puzzlefs to succeed is to be faster than "borg2 mount". The same goes for casync mount.

I've looked at both casync and borg; here are some differences:

  • puzzlefs supports integrity checking with fsverity, similar to how composefs does it
  • I'm also implementing a Linux kernel driver for PuzzleFS; it's currently at the proof-of-concept stage
  • unlike casync, borg, or composefs, it's written in Rust (both the FUSE driver and the kernel driver): this gives us memory safety guarantees (which should not be underestimated) with performance comparable to C/C++ code

@safinaskar
Author

@ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced fs, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/[email protected]/ . EROFS is explicitly designed for containers. EROFS is already in the mainline kernel.

If you try to submit your fs to the kernel, it is quite possible that kernel developers will simply say: "We already have EROFS, why is there a need for puzzlefs?"

So it is possible that it would be a good idea not to spend your time on puzzlefs anymore. I'm not trying to insult you; I'm just trying to save your time. I think EROFS is what you need for your original production task.

But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition with EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I'm not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

@safinaskar
Author

@ariel-miculas , here is an article on how hard it is to get a filesystem into the kernel: https://lwn.net/Articles/933616/ .

Notable phrase:

> If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

@ariel-miculas
Collaborator

> @ariel-miculas , I just found this article: https://lwn.net/Articles/934047/ . It seems EROFS is a very advanced fs, and it will be very hard to compete with it. It supports CDC-based chunking using Rabin–Karp, deduplication, and compression, as described here: https://erofs.docs.kernel.org/en/latest/design.html . It supports DAX and many other things. It is one of the few filesystems that don't need buffer_head: https://lore.kernel.org/lkml/[email protected]/ . EROFS is explicitly designed for containers. EROFS is already in the mainline kernel.

> If you try to submit your fs to the kernel, it is quite possible that kernel developers will simply say: "We already have EROFS, why is there a need for puzzlefs?"

Yes, there is that possibility. Also, be aware that I've already submitted two versions of PuzzleFS to the LKML: here's the first one and here's the second one. The latest version is available on GitHub; I have not sent it to the mailing list because I expect some changes to the filesystem abstractions, since there were disagreements about the proposed APIs. You can read the entire thread here. If you want to take a look at the latest PuzzleFS version, I've compiled a list of resources on the PuzzleFS page of the Rust for Linux project, where you will find the latest PuzzleFS branch, as well as other resources such as the LWN article on PuzzleFS and my presentation at Open Source Summit Europe.

> So it is possible that it would be a good idea not to spend your time on puzzlefs anymore. I'm not trying to insult you; I'm just trying to save your time. I think EROFS is what you need for your original production task.

There are other container solutions which use EROFS, such as composefs and nydus. EROFS doesn't work for the problem that we are trying to solve, i.e. deduplicate the data using a content defined chunking algorithm so we can store multiple versions of the same image as efficiently as possible. EROFS uses deduplication, but only inside a single image, whereas we want to deduplicate multiple versions of the same image. With the help of content defined chunking, we can essentially only store the deltas between multiple versions of the same image instead of storing them separately, which is how archive formats do it (e.g. tar, squashfs, erofs). We want to redesign the container image format such that a more efficient representation based on variable sized chunks is possible. You can see the results in my PuzzleFS presentation, in the Results slide. I've also added them below.

Steps:

  • I've downloaded 10 versions of Jammy from hub.docker.com
  • These images have only one layer, which is in tar.gz format
  • I've built 10 equivalent puzzlefs images
  • Compute tarball_total_size by summing the sizes of every Jammy tarball (uncompressed) => 766 MB (use this as the baseline)
  • Sum the sizes of every oci/puzzlefs image => total_size
  • Compute the total size as if all the versions were stored in a single oci/puzzlefs repository => total_unified_size
  • Saved space = tarball_total_size - total_unified_size
| Type | Total size (MB) | Average layer size (MB) | Unified size (MB) | Saved (MB) / 766 MB |
| --- | --- | --- | --- | --- |
| OCI (uncompressed) | 766 | 77 | 766 | 0 (0%) |
| PuzzleFS (uncompressed) | 748 | 74 | 130 | 635 (83%) |
| OCI (compressed) | 282 | 28 | 282 | 484 (63%) |
| PuzzleFS (compressed) | 298 | 30 | 53 | 713 (93%) |

> But if you still think it is a good idea to submit puzzlefs to the kernel, then be prepared for competition with EROFS. Prepare benchmarks against EROFS, ideally on realistic data, i.e. on realistic containers. And work hard to make puzzlefs perform well compared to EROFS and .tar.zst.

I'm aware it will be difficult to convince the kernel maintainers to merge PuzzleFS; they've already complained that they get too many container filesystems submitted to them. Time will tell whether they consider PuzzleFS worthy of upstreaming.

> I'm not trying to say that puzzlefs is bad. I just want to say that I think it will be hard to convince kernel devs to accept puzzlefs. It may be a good idea to show the current state of the puzzlefs kernel driver to core kernel stakeholders (i.e. Torvalds, Greg KH, Al Viro, etc.) and get their opinion on whether there is a chance of getting this fs into mainline, i.e. whether it is worth spending time on the puzzlefs kernel driver.

As I've mentioned above, I've already submitted multiple versions of the PuzzleFS driver and presented it at multiple conferences, so at least I have raised awareness of the PuzzleFS project. And other people have written articles about it, such as Phoronix and LWN. Wedson Almeida is working on his own filesystem in Rust, tarfs, so I'm not the only one trying to upstream new filesystems.

> @ariel-miculas , here is an article on how hard it is to get a filesystem into the kernel: https://lwn.net/Articles/933616/ .

> Notable phrase:

> > If you ask a room full of kernel developers if a new filesystem is needed, "the answer is almost always going to be 'no'", Josef Bacik said to laughter

I think it's worth exploring what a new filesystem implemented in Rust can achieve, since kernel maintainers sometimes argue that they want to see new things implemented in Rust rather than having existing implementations duplicated in both C and Rust. Of course, it's understandable that they don't want multiple filesystems solving the same container-image problem, so pushback is to be expected.

@safinaskar thanks for presenting your concerns, we definitely have to make a strong case for PuzzleFS if we ever want it to be included in the Linux kernel.

@hsiangkao

hsiangkao commented Sep 23, 2024

> There are other container solutions which use EROFS, such as composefs and nydus. EROFS doesn't work for the problem that we are trying to solve, i.e. deduplicate the data using a content defined chunking algorithm so we can store multiple versions of the same image as efficiently as possible. EROFS uses deduplication, but only inside a single image, whereas we want to deduplicate multiple versions of the same image. With the help of content defined chunking, we can essentially only store the deltas between multiple versions of the same image instead of storing them separately, which is how archive formats do it (e.g. tar, squashfs, erofs). We want to redesign the container image format such that a more efficient representation based on variable sized chunks is possible. You can see the results in my PuzzleFS presentation, in the Results slide. I've also added them below.

Hi @ariel-miculas , I'm not sure why you have ignored all my replies on the mailing list.

First, EROFS supports deduplicating multiple versions of the same image by using external blobs (and Nydus also supports this, per the document you referenced many times [1]; Nydus was called the "Dragonfly image service" in that document).

Also, encoded EROFS already supports variable-sized chunks + CDC. I'm not sure you have correctly parsed all of EROFS's current features.

[1] https://hackmd.io/@cyphar/ociv2-brainstorm

> since kernel maintainers sometimes argue that they want to see new things implemented in Rust rather than having existing implementations duplicated in both C and Rust.

I think some experienced file system developers asked for an ext2 filesystem implementation first.

@ariel-miculas
Collaborator

Hey @hsiangkao

> Hi @ariel-miculas , I'm not sure why you have ignored all my replies on the mailing list.

Apologies for making you feel unheard; rest assured it is not my intention to ignore your feedback. Would you kindly point me to the messages that you've sent and I didn't reply to, so I can understand how I might have missed them?

Please also note that the open-source projects sponsored by Cisco are interested in integrating EROFS, e.g. stacker [8]. And if we can implement the features of PuzzleFS using EROFS instead of a custom filesystem representation, it would save us the headache of upstreaming another container-related FS into the Linux kernel, which is no small feat. At some point I would also like to see a reimplementation of EROFS in Rust, which I see there's already interest in [9]. That's because one of our design goals with PuzzleFS is memory safety (this is what led to the decision to implement it in Rust).

> First, EROFS supports deduplicating multiple versions of the same image by using external blobs (and Nydus also supports this, per the document you referenced many times [1]; Nydus was called the "Dragonfly image service" in that document).

I did look at Nydus and I was hoping to see the docs for Rafs v6 (which is compatible with the in-kernel EROFS filesystem) [1], but it seems they are still a work in progress [1]:
"Note that, the following details are all about Rafs v5 and meterials about Raft v6 are still working in progress."

> Also, encoded EROFS already supports variable-sized chunks + CDC. I'm not sure you have correctly parsed all of EROFS's current features.

It is quite possible that I don't understand all the EROFS features; I was under the impression that the erofs CDC only works with a block device. Could you help me with some documentation/examples about referencing external blobs? I can't tell how this works based on the erofs documentation [4]. What PuzzleFS does is store references to external files in its metadata (the filenames are a sha256 digest) and then recreate the original filesystem by reassembling the multiple chunks that each file is composed of.
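For concreteness, here is a minimal sketch of that reassembly step as described above (the `blobs/<digest>` layout, the demo filenames, and the function name are illustrative assumptions, not puzzlefs's actual code):

```rust
// Sketch: rebuild a file by concatenating content-addressed chunk blobs.
use std::fs;
use std::io::Write;
use std::path::Path;

fn reassemble(blob_dir: &Path, digests: &[String], out: &Path) -> std::io::Result<()> {
    let mut dst = fs::File::create(out)?;
    for digest in digests {
        // Each metadata entry names the next chunk by its digest,
        // which doubles as the blob's filename in this sketch.
        let chunk = fs::read(blob_dir.join(digest))?;
        dst.write_all(&chunk)?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Demo setup: write two tiny "chunks" under placeholder digest names.
    fs::create_dir_all("blobs")?;
    fs::write("blobs/aaaa", b"hello ")?;
    fs::write("blobs/bbbb", b"world")?;
    let digests = vec!["aaaa".to_string(), "bbbb".to_string()];
    reassemble(Path::new("blobs"), &digests, Path::new("restored.txt"))
}
```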

If I'm not mistaken, composefs [5] uses a feature of overlayfs to redirect file accesses to external blobs.

> I think some experienced file system developers asked for an ext2 filesystem implementation first.

Sure, Wedson Almeida Filho already started working on it [2][3]. As you probably know, he recently stepped down as a Rust for Linux maintainer, citing "nontechnical nonsense" and giving one such example [7].

I believe reimplementing the ext2 filesystem would only be a worthwhile effort if the kernel community could agree that, in the end, after doing this work, the filesystem abstractions would be merged. Otherwise, people will keep moving the goalposts and we'll never make progress with Rust filesystem abstractions.

[1] https://github.com/dragonflyoss/nydus/blob/master/docs/nydus-design.md#2-rafs
[2] https://github.com/wedsonaf/linux/tree/ext2fs
[3] https://lore.kernel.org/all/[email protected]/
[4] https://docs.kernel.org/filesystems/erofs.html
[5] https://github.com/containers/composefs
[6] https://lore.kernel.org/lkml/[email protected]/
[7] https://www.youtube.com/watch?v=WiPp9YEBV0Q&t=1529s
[8] project-stacker/stacker#626
[9] https://lore.kernel.org/all/[email protected]/

@hsiangkao

hsiangkao commented Sep 24, 2024

> Hey @hsiangkao
>
> > Hi @ariel-miculas , I'm not sure why you have ignored all my replies on the mailing list.
>
> Apologies for making you feel unheard; rest assured it is not my intention to ignore your feedback. Would you kindly point me to the messages that you've sent and I didn't reply to, so I can understand how I might have missed them?

I think I responded in the email thread and in the LWN comments; for example:
https://lore.kernel.org/linux-fsdevel/[email protected]/
It seems there was no reply to this one.

> Please also note that the open-source projects sponsored by Cisco are interested in integrating EROFS, e.g. stacker [8]. And if we can implement the features of PuzzleFS using EROFS instead of a custom filesystem representation, it would save us the headache of upstreaming another container-related FS into the Linux kernel, which is no small feat. At some point I would also like to see a reimplementation of EROFS in Rust, which I see there's already interest in [9]. That's because one of our design goals with PuzzleFS is memory safety (this is what led to the decision to implement it in Rust).

IMHO, a filesystem design is language-neutral; it shouldn't be hard to work out a C, Rust, or Go version. Yiyang has worked out a very preliminary Rust version of the core EROFS format, and I think it's a good start anyway.

But also note that, IMHO, even Rust cannot avoid quite complex livelock or deadlock issues (in some complex FS paths there can already be a couple of nested VFS/MM locks).
Besides, I'm not sure how it would implement some advanced kernel locking such as RCU for now.

> > First, EROFS supports deduplicating multiple versions of the same image by using external blobs (and Nydus also supports this, per the document you referenced many times [1]; Nydus was called the "Dragonfly image service" in that document).
>
> I did look at Nydus and I was hoping to see the docs for Rafs v6 (which is compatible with the in-kernel EROFS filesystem) [1], but it seems they are still a work in progress [1]: "Note that, the following details are all about Rafs v5 and meterials about Raft v6 are still working in progress."

Both RAFS v5 and v6 provide the same functionality (tiny metadata + external blobs); v5 was just a customized format for FUSE, while v6 uses the EROFS format.

> > Also, encoded EROFS already supports variable-sized chunks + CDC. I'm not sure you have correctly parsed all of EROFS's current features.
>
> It is quite possible that I don't understand all the EROFS features; I was under the impression that the erofs CDC only works with a block device.

Actually, previously all in-kernel filesystems were block-device-based. Moreover, historically the kernel side refused file-backed mounts; if you look at the comment on filp_open(), it clearly documents this:

 * This is the helper to open a file from kernelspace if you really
 * have to.  But in generally you should not do this, so please move
 * along, nothing to see here..

Even now, I still believe a file-backed mount feature for a generic read-write filesystem is totally broken (that is why loop devices are needed, so that workers provide another kernel context), but EROFS doesn't need to follow this restriction, since it is a completely immutable fs (and there seem to be enough container use cases for deploying EROFS in files, like the current composefs and other Android use cases, and maybe app sandboxes later). So as of Linux v6.12, EROFS supports file-backed mounts; see https://www.phoronix.com/news/EROFS-Linux-6.12 and https://lwn.net/SubscriberLink/990750/1715b5d6307ee41a/ .

Previously, EROFS supported "EROFS over fscache" to utilize a kernel caching framework (in a CAS way) to refer to external blobs, but that is somewhat complicated, and I think we can just ignore it since it tends to be deprecated.

> Could you help me with some documentation/examples about referencing external blobs? I can't tell how this works based on the erofs documentation [4]. What PuzzleFS does is store references to external files in its metadata (the filenames are a sha256 digest) and then recreate the original filesystem by reassembling the multiple chunks that each file is composed of.

Honestly, I think that is the only difference between the current EROFS and PuzzleFS. I will explain the current status and the reasoning:

For each EROFS filesystem, there is an on-disk external device/blob table (at most 65536 external blobs), and you can think of an EROFS filesystem as a linear address space, like:

| metadata | blob 0 | blob 1 | .... | blob n |

As you can see, the linear address space forms the filesystem's LBAs, and each variable-sized chunk can point to an arbitrary LBA in the linear address space, so it can refer to any part of the data in blob 0..n. In other words, each EROFS filesystem can reuse data from at most 65536 blobs.
I believe it also takes the same amount of storage. For example, even if some puzzlefs chunk blob takes 17.5 KiB, it will occupy 20 KiB (5 blocks of 4 KiB) in the underlying filesystem (ext4, xfs, btrfs, etc.). An EROFS chunk blob could also be 17.5 KiB; it's just mapped into the linear address space as 5 blocks.
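To illustrate how I read this layout, here is a minimal sketch of resolving a linear address to a blob (my own illustration of the idea, not EROFS's actual on-disk structures):

```rust
// Sketch: resolve an LBA in the "| metadata | blob 0 | ... | blob n |"
// linear address space to (blob index, offset within that blob).
struct Layout {
    metadata_len: u64,
    blob_lens: Vec<u64>, // stands in for the on-disk table (at most 65536 entries)
}

impl Layout {
    fn resolve(&self, lba: u64) -> Option<(usize, u64)> {
        let mut off = lba.checked_sub(self.metadata_len)?; // addresses below this are metadata
        for (i, &len) in self.blob_lens.iter().enumerate() {
            if off < len {
                return Some((i, off)); // the chunk's data lives in blob i at this offset
            }
            off -= len;
        }
        None // past the end of the address space
    }
}

fn main() {
    let layout = Layout { metadata_len: 4096, blob_lens: vec![8192, 4096] };
    assert_eq!(layout.resolve(4096), Some((0, 0)));    // first byte of blob 0
    assert_eq!(layout.resolve(13000), Some((1, 712))); // 13000 - 4096 - 8192 = 712
    println!("ok");
}
```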

The only difference is that PuzzleFS supports per-extent blobs (so in extreme cases there could be a million or more blobs in a puzzlefs filesystem). The reason EROFS doesn't add per-extent blob support is that (from a kernel developer's perspective) it's somewhat hard to reassure kernel folks about corner-case deadlock issues in the page fault path (most kernel people will just say "no"; also, it seems the puzzlefs kernel version doesn't implement mmap page faults at all). I just discussed this with an FS expert and XFS core developer, Dave Chinner; I'm posting it here for reference in case you don't believe me:

[screenshots of the conversation with Dave Chinner]

If you have more questions about opening files directly in the page fault path, I could also send a more formal email to the related mailing lists to ask about this use case for wider discussion, but I think there is no need to waste time on this.

> If I'm not mistaken, composefs [5] uses a feature of overlayfs to redirect file accesses to external blobs.

Yeah, but EROFS supports finer-grained references to chunk-deduplicated data by using external blobs; composefs just doesn't use this, since they'd also like to share the page cache among images.

> > I think some experienced file system developers asked for an ext2 filesystem implementation first.
>
> Sure, Wedson Almeida Filho already started working on it [2][3]. As you probably know, he recently stepped down as a Rust for Linux maintainer, citing "nontechnical nonsense" and giving one such example [7].

Having been a filesystem developer for years, I don't think the discussion was totally "nontechnical".
I've checked out his slides; there are clearly lifetime issues in his VFS abstraction, and some were pointed out by Al Viro in the original on-site discussion.

> I believe reimplementing the ext2 filesystem would only be a worthwhile effort if the kernel community could agree that, in the end,

I think all kernel developers were hoping for a full-featured Rust ext2 filesystem at the beginning.

@hsiangkao

hsiangkao commented Sep 24, 2024

> Please also note that the open-source projects sponsored by Cisco are interested in integrating EROFS, e.g. stacker [8]. And if we can implement the features of PuzzleFS using EROFS instead of a custom filesystem representation, it would save us the headache of upstreaming another container-related FS into the Linux kernel, which is no small feat. At some point I would also like to see a reimplementation of EROFS in Rust, which I see there's already interest in [9]. That's because one of our design goals with PuzzleFS is memory safety (this is what led to the decision to implement it in Rust).

I've explained in the previous reply why having each EROFS/PuzzleFS file backed by multiple backing files is controversial.

Another thing I wonder about is the real benefit of sub-file-based deduplication over file-based deduplication, because as far as I can see, most dynamic executable code (like dynamic libraries) won't benefit at all, as I mentioned in opencontainers/image-spec#1190 (comment).

So I guess ComposeFS may already meet your use cases? Or perhaps EROFS with external blobs (a.k.a. the Dragonfly Nydus way) meets them?

In short, if an idea is reasonable to implement in the kernel, I see no reason to avoid landing it in EROFS. But there are indeed many things that are considered unsafe in the kernel; for those cases, a userspace approach is preferred, and at least it won't hang the kernel in corner cases.

@tych0
Collaborator

tych0 commented Oct 1, 2024

hi @hsiangkao, thanks for the feedback.

> The reason EROFS doesn't add per-extent blob support is that (from a kernel developer's perspective) it's somewhat hard to reassure kernel folks about corner-case deadlock issues in the page fault path (most kernel people will just say "no"; also, it seems the puzzlefs kernel version doesn't implement mmap page faults at all).

Of course the goal would eventually be to implement mmap(), so I think the criticisms are valid. One simple solution is to open all of a file's member blobs during open(), rather than during fault() or read() or whatever. Of course, for very large files spanning many blobs this is ugly, but we generally do not expect those in container filesystems. Does that sound reasonable?

> For example, even if some puzzlefs chunk blob takes 17.5 KiB, it will occupy 20 KiB (5 blocks of 4 KiB) in the underlying filesystem (ext4, xfs, btrfs, etc.).

This is also a great point: if the goal is to optimize for low use of local storage, we should align the puzzlefs chunk size with the local filesystem block size.

One of the major design goals of PuzzleFS is to share image chunks across images as best it can, to minimize the delta needed to launch a new container image, so new container images can be launched as fast as possible. Ideally, we would only have to download chunks that are unique to the new container we are trying to launch. In this usage mode, the thing we are trying to optimize for is launch latency, not local disk size (in general, disk is much cheaper than CPU).

One of the things I had been struggling with was whether to make the chunking constants part of the image standard, or administrator configurable. Given that some orgs may want to optimize for local storage size, and others for launch latency, it seems like we should probably have recommended chunking parameters for each, but not hard code them.

@hsiangkao

hsiangkao commented Oct 2, 2024

> hi @hsiangkao, thanks for the feedback.
>
> > The reason EROFS doesn't add per-extent blob support is that (from a kernel developer's perspective) it's somewhat hard to reassure kernel folks about corner-case deadlock issues in the page fault path (most kernel people will just say "no"; also, it seems the puzzlefs kernel version doesn't implement mmap page faults at all).
>
> Of course the goal would eventually be to implement mmap(), so I think the criticisms are valid. One simple solution is to open all of a file's member blobs during open(), rather than during fault() or read() or whatever. Of course, for very large files spanning many blobs this is ugly, but we generally do not expect those in container filesystems. Does that sound reasonable?

As a generic filesystem solution, I don't think we should make any special assumptions.
Consider an AI model: there could be files of several GiB (or even more). I don't think opening all the chunks (of, e.g., several KiB each) or refusing unusual chunk sizes in mmap is reasonable; that doesn't sound like something a generic fs could do.

> > For example, even if some puzzlefs chunk blob takes 17.5 KiB, it will occupy 20 KiB (5 blocks of 4 KiB) in the underlying filesystem (ext4, xfs, btrfs, etc.).
>
> This is also a great point: if the goal is to optimize for low use of local storage, we should align the puzzlefs chunk size with the local filesystem block size.
>
> One of the major design goals of PuzzleFS is to share image chunks across images as best it can, to minimize the delta needed to launch a new container image, so new container images can be launched as fast as possible. Ideally, we would only have to download chunks that are unique to the new container we are trying to launch. In this usage mode, the thing we are trying to optimize for is launch latency, not local disk size (in general, disk is much cheaper than CPU).
>
> One of the things I had been struggling with was whether to make the chunking constants part of the image standard, or administrator configurable. Given that some orgs may want to optimize for local storage size, and others for launch latency, it seems like we should probably have recommended chunking parameters for each, but not hard code them.

In brief, IMHO, I don't think this model is suitable for a kernel fs in general. You can still ignore my suggestion (I've already spent too much of my spare time outside my job writing all this down), but I promise that although it would be trivial to implement in kernel EROFS, the risk of complex nested kernel contexts makes it hard to convince filesystem kernel developers.
There is also no comparison between this approach and per-file deduplication (a.k.a. ComposeFS); I don't think it would make a significant difference versus ComposeFS with 10 versions of ubuntu:jammy.
As for the distribution side, you could always use a size other than the file size or chunk size; why couple them?

@tych0
Collaborator

tych0 commented Oct 2, 2024

> I don't think opening all the chunks (of, e.g., several KiB each)

yeah, I guess that's maybe my second point: different applications should tune their image chunk size differently. but,

> or refusing unusual chunk sizes in mmap is reasonable; that doesn't sound like something a generic fs could do.

absolutely agreed there.

> There is also no comparison between this approach and per-file deduplication (a.k.a. ComposeFS); I don't think it would make a significant difference versus ComposeFS with 10 versions of ubuntu:jammy.

Right, what per-file dedup can't do is share small edits across large files, e.g. in the ML models you cited above. We are trying to solve both problems.

> As for the distribution side, you could always use a size other than the file size or chunk size; why couple them?

With FastCDC we can get rid of the coupling to file size, but you have to choose some size to distribute; it may as well be the chunk size, assuming the chunk size is chosen for optimal sharing.
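As an illustration of this kind of chunking, here is a minimal sketch using the `fastcdc` crate's v2020 implementation (the size parameters and the input file are illustrative, not puzzlefs's actual pipeline or settings):

```rust
// Sketch: content-defined chunking with FastCDC (parameters illustrative).
// Cargo dependencies assumed: fastcdc = "3", blake3 = "1".
fn main() -> std::io::Result<()> {
    let contents = std::fs::read("image.tar")?; // placeholder input file
    // min/avg/max chunk sizes in bytes; tuning these trades dedup ratio
    // against per-chunk overhead.
    let chunker = fastcdc::v2020::FastCDC::new(&contents, 16 * 1024, 64 * 1024, 256 * 1024);
    for chunk in chunker {
        // Boundaries depend on content, so an insertion early in the file
        // shifts offsets without invalidating the digests of later chunks.
        let data = &contents[chunk.offset..chunk.offset + chunk.length];
        println!("{:>10} +{:>7} {}", chunk.offset, chunk.length, blake3::hash(data).to_hex());
    }
    Ok(())
}
```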

@hsiangkao

hsiangkao commented Oct 3, 2024

> ...

> > There is also no comparison between this approach and per-file deduplication (a.k.a. ComposeFS); I don't think it would make a significant difference versus ComposeFS with 10 versions of ubuntu:jammy.
>
> Right, what per-file dedup can't do is share small edits across large files, e.g. in the ML models you cited above. We are trying to solve both problems.

I'm pretty skeptical that CDC performs well across different versions of the same ML model, but I've never tried it. I guess ML needs some special delta-compression approach to reduce duplication better.

> > As for the distribution side, you could always use a size other than the file size or chunk size; why couple them?
>
> With FastCDC we can get rid of the coupling to file size, but you have to choose some size to distribute; it may as well be the chunk size, assuming the chunk size is chosen for optimal sharing.

Variable-sized chunks (e.g. generated by CDC) for uncompressed data will have worse read performance than fixed-sized chunks (e.g. chunks aligned with the storage block size); that is one reason uncompressed filesystems almost always use fixed-sized chunks. You can use fio to benchmark fast storage to confirm this. For a compressed fs it doesn't matter, since some data transformation is needed anyway.
But the distribution size can simply be adjusted to the optimal HTTP request size, and blobs can be merged or split on demand, which I think is independent of the filesystem (image) content. That is just my own opinion on this.
