Overcommitting storage silently corrupts data #652

crackerjam · 2023-08-16T00:52:27Z

Trying to copy a 10GB file to a 2GB Stratis filesystem appears to succeed, but in reality causes the file to be corrupted. To reproduce:

1. Log in to a RHEL 9.2 system as root. (The system has a 20 GB root disk plus a blank 2 GB disk.)
2. Install, enable, and start stratisd and stratis-cli (version 3.4.1):
       yum -y install stratisd stratis-cli; systemctl enable --now stratisd
3. Create the pool, the filesystem, and the mount point:
       stratis pool create testpool /dev/sdb
       stratis filesystem create testpool testfilesystem
       mkdir -p /mnt/test
4. Add this line to /etc/fstab:
       /dev/stratis/testpool/testfilesystem /mnt/test xfs defaults,x-systemd.requires=stratisd.service 0 0
5. Mount the filesystem:
       mount -a
6. Download a large file to /root for testing:
       wget -P /root http://mirrors.rit.edu/centos/7.9.2009/isos/x86_64/CentOS-7-x86_64-Everything-2009.iso
7. Copy the file to the Stratis filesystem:
       cp /root/CentOS-7-x86_64-Everything-2009.iso /mnt/test
8. Hash the test file in both locations:
       md5sum /root/CentOS-7-x86_64-Everything-2009.iso
       md5sum /mnt/test/CentOS-7-x86_64-Everything-2009.iso

These checksums will come back different, indicating data corruption.
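
The final comparison can be scripted so that a mismatch shows up as a non-zero exit status instead of two hashes to compare by eye. This is a minimal sketch: `same_checksum` is a hypothetical helper name, and it uses sha256sum in place of the md5sum calls above.

```shell
#!/bin/sh
# same_checksum FILE1 FILE2
# Exit status: 0 if the digests match, 1 if they differ,
# 2 if either file cannot be read.
# The body runs in a subshell, so its variables do not leak to the caller.
same_checksum() (
    a=$(sha256sum -- "$1") || exit 2
    b=$(sha256sum -- "$2") || exit 2
    # sha256sum prints "<digest>  <name>"; keep only the digest.
    [ "${a%% *}" = "${b%% *}" ]
)
```

For the reproduction above, `same_checksum /root/CentOS-7-x86_64-Everything-2009.iso /mnt/test/CentOS-7-x86_64-Everything-2009.iso || echo "copy differs from source"` would flag the corrupted copy.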

mulkieran (Maintainer) · 2023-08-16T20:33:26Z

Thanks for the report. We believe that this problem has been addressed in subsequent stratisd releases, which are available in RHEL 9. Here is some documentation on the design changes that we made in Stratis 3 to make extension of Stratis filesystems more robust: https://stratis-storage.github.io/thin-provisioning-redesign/. If you can reproduce the same problem in a more recent version of Stratis, please let us know; we will be eager to investigate. Thanks!

crackerjam (Author) · 2023-08-17T18:40:26Z

@mulkieran This is NOT fixed on RHEL 9.2 with stratis 3.4.1. I just re-ran the steps in my original post on a fresh system and see the same behavior.

I've updated my original post to include those versions.

mulkieran (Maintainer) · 2023-08-21T20:27:39Z

@crackerjam Can you confirm that /dev/sdb is the 2 GiB device that you mention in your initial statement? I may have initially misread your statement and thought that the device you were using for the Stratis pool was the 20 GiB one that you mention first. If that is the case, i.e., that the pool itself is only 2 GiB, there is not much to do, because the cp command itself does not perform any sync operation or confirm that the write has completed.

The pool itself is supposed to maintain information about itself, and can give some warnings. When you first create the pool, it will assess its state and may issue a warning, which will be seen with the stratis pool list command. After you issue the cp command, the pool will try to accommodate the additional data by expanding the filesystem. If you list the pool after the failed cp command, you may see additional warnings.

crackerjam (Author) · 2023-08-21T22:14:10Z

@mulkieran Correct, /dev/sdb is a 2GiB disk.

> If that is the case, i.e., that the pool itself is only 2 GiB, there is not much to do, because the cp command itself does not perform any sync operation or confirm that the write has completed.

That doesn't seem to be correct. If I just format the raw disk with XFS directly and try to run the copy, I receive:

cp: error writing '/mnt/test/CentOS-7-x86_64-Everything-2009.iso': No space left on device

> If you list the pool after the failed cp command, you may see additional warnings.

stratis pool list does show an alert of "WS0001" next to the pool, which is something, but honestly not very helpful. Googling "stratis WS0001" doesn't return any results, so I'm not even able to identify what this error code means.

Think of a scenario where you have a stratis volume that receives and stores critical business documents. Sure, ideally you would have monitoring set up for disk usage, but let's say you don't. With any other filesystem, your application would start receiving errors when the disk is full. Users will see that something's going on; it will be clear that things aren't working properly. Things will grind to a halt, but you won't lose data.

With this issue, everything just appears to silently work. Files show up on disk and report correct sizes. However, when you read the files, they're junk. Depending on the use case you could go weeks or months without noticing that all of your data is just evaporating. This should be an extremely critical priority issue.

mulkieran (Maintainer) · 2023-08-22T01:48:37Z

@crackerjam In a case like this, you would be advised to create a pool with the no-overprovisioning mode set. In such a pool, the size of the XFS filesystem cannot be made any larger than the actual space that is available to it. In that case, the filesystem will behave like the one that you placed on the raw device.

For the raw device, check the filesystem size. You'll notice that it is significantly smaller than the default filesystem size, 1 TiB, that Stratis creates when overprovisioning is allowed.
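
One way to see the difference is to print the size the kernel reports for each mounted filesystem and compare it with the 2 GB disk. This is a rough sketch: `fs_size_bytes` is a hypothetical helper name, it relies on GNU df, and the stratis invocation in the comments is assumed stratis-cli syntax that should be checked against `stratis pool create --help`.

```shell
#!/bin/sh
# fs_size_bytes PATH
# Print the total size, in bytes, of the filesystem that contains PATH.
fs_size_bytes() {
    # GNU df prints a header line followed by the value; keep the value.
    df --output=size -B1 -- "$1" | tail -n 1 | tr -d ' '
}

# Assumed usage on the reproduction system from this thread:
#   fs_size_bytes /mnt/test    # logical XFS size as seen through the mount
#   stratis pool list          # physical pool size and any alert codes
#
# Assumed stratis-cli syntax for a pool that does not overprovision
# (requires root and a spare device):
#   stratis pool create --no-overprovision testpool /dev/sdb
```

On an overprovisioned pool the first number is expected to be far larger than the backing disk, which is why cp does not see the filesystem as full.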

crackerjam (Author) · 2023-08-22T03:28:59Z

@mulkieran I don't think that's really an appropriate stance. What is 'a case like this'? A case where you don't want your files to get corrupted? No filesystem, ever, should report back that a write succeeded when it did not, regardless of options used on creation. That's just a core tenet of how filesystems work. It is especially true here, when I'm using the default options.

I have to be honest, I only discovered Stratis as part of some RHEL training. Personally, and professionally, if I ever get a whiff of anyone thinking about using Stratis I will tell them to steer clear due to how this bug is being handled. In my opinion, this is a massive issue that needs to be resolved if you want this to be taken seriously as a replacement for LVM, and I can only imagine how many other data-destroying bugs there are that have been ignored because "well you should have just used the right command arguments".

If you'd like any more assistance testing scenarios around this, please feel free to let me know. Otherwise, it seems like you have a particular opinion here that I probably won't be able to change.

mulkieran (Maintainer) · 2023-08-22T12:42:33Z

@crackerjam I'm afraid that you are failing to understand how filesystems work. To produce an analogous situation with LVM, simply create a thin pool on your 2 GiB device, construct a thin device of, say, 1 TiB on that thin pool, create an XFS filesystem on it, and do your cp as before. The cp will return success, but the file will not be copied in toto.

It is surprising to some people that that is how filesystems behave, but it is a consequence of the original design choices made for filesystems in a simple world that did not include thinly provisioned devices supplied by kernel modules.

If you want to do a copy that is sure to fail if there is really no place to write the bytes, you will have to explicitly sync your data.
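
A minimal sketch of such a copy follows. `copy_synced` is a hypothetical helper name, and it assumes GNU coreutils, where `sync FILE` calls fsync on that file and exits non-zero if the flush reports an error. Whether an out-of-space condition on a thin device is surfaced at this point depends on the kernel and filesystem, so this only shows where the check belongs.

```shell
#!/bin/sh
# copy_synced SRC DST
# Copy SRC to DST, then force DST to stable storage and report any failure.
# DST must be the destination file path, not a directory: sync is applied
# to exactly the path given here.
copy_synced() {
    cp -- "$1" "$2" || return 1
    sync -- "$2" || {
        echo "copy_synced: flushing '$2' failed; do not trust the copy" >&2
        return 1
    }
}
```

For the reproduction in this thread, that would be `copy_synced /root/CentOS-7-x86_64-Everything-2009.iso /mnt/test/CentOS-7-x86_64-Everything-2009.iso`. An alternative that flushes through the same file descriptor used for writing is `dd if=SRC of=DST bs=1M conv=fsync`.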

2 replies

crackerjam (Author) · Aug 22, 2023

Okay, I was able to replicate that with LVM and see what you're saying. However, it took a bit to get there and I continue to have some gripes.

1. This is not default behavior with LVM. You need to create a thin volume inside a thin pool in order to get this to work, and specify that the volume is larger than the pool it goes into, which isn't really a standard configuration. It has uses, sure, but I wouldn't say that it's common.
2. When you create volumes like this, LVM produces a big warning message alerting you that this may break things.

I still think, ideally, stratis volumes should report write failures back up the stack when situations like this occur. If the goal is to be better than LVM, this would be a place open for improvement. However, if that isn't possible, there should at least be feedback when creating filesystems identifying this potential issue. The behavior is just so far outside of what 99% of Linux engineers will consider normal that it really should be called out.

mulkieran (Maintainer) · Nov 27, 2023

@crackerjam The reason the default size for the filesystem is 1 TiB is to avoid a performance penalty that results when an XFS filesystem is grown from a small size to a larger size. The general rule is that an XFS filesystem should not be increased in size by more than a factor of 8, in order to avoid this performance penalty. Stratis introduced filesystem size limits in the most recent release [1]. As an administrator, you can set such a limit; once the filesystem reaches it, it cannot be grown any further, and XFS will reject writes when the filesystem is out of space. However, the limit cannot be less than the current filesystem size, so this would require planning when the filesystem is created.
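
As a rough illustration of the factor-of-8 rule of thumb, the arithmetic can be written out as below. `max_grown_size` is a hypothetical helper, and the stratis command in the comments is assumed stratis-cli 3.6 syntax that should be verified with `stratis filesystem --help`.

```shell
#!/bin/sh
# max_grown_size INITIAL
# Print the largest size (in the same unit as INITIAL) that stays within
# the "no more than a factor of 8" rule of thumb described above.
max_grown_size() {
    echo $(( $1 * 8 ))
}

# Assumed stratis-cli 3.6 syntax for capping an existing filesystem.
# The limit must not be below the filesystem's current size, which is
# 1 TiB by default:
#   stratis filesystem set-size-limit testpool testfilesystem 2TiB
```

For example, `max_grown_size 128` prints 1024, so a filesystem created at 128 GiB would stay within the rule up to 1024 GiB (1 TiB).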

You're correct that it would be ideal for the out-of-space condition on the thin device to be tidily propagated through the filesystem layer to the user program. But that requires a level of communication between the filesystem and the block layer which has yet to be achieved.

[1] https://stratis-storage.github.io/stratis-release-notes-3-6-0/

yyshell · 2023-11-27T02:40:23Z

Now on RHEL 9.3, the problem is still not fixed, and if we use the "no-overprovision" option, the filesystem size cannot grow any more. This is a very damning point. It's almost impossible for me to use Stratis in a production environment.

1 reply

jbaublitz (Maintainer) · Nov 27, 2023

If you are seeing that overprovisioning is causing your filesystem to stop growing, that is a sign that you have exhausted your physical space and need to add more physical storage.

wushilin · 2024-09-08T02:26:52Z

Well I am surprised to see this, but it is indeed still an issue with 3.6.8 daemon and 3.6.2 CLI.

The filesystem itself is not corrupted; rather, the file that was apparently copied successfully is truncated or zero-sized.

1 reply

jbaublitz (Maintainer) · Sep 17, 2024

Currently, we are having discussions with the XFS and thin provisioning team about whether we would be able to get them to return ENOSPC on writes that would exhaust physical space. I'm leading the discussion, so I'll pass along that there's someone else expressing interest in this. This exact piece is out of Stratis's hands, and while we can advocate for different behavior, this is not something we can avoid short of providing the no-overprovisioning option that we already have to ensure that this doesn't happen. fsync is still the best way to see whether your write operations have succeeded, or whether they failed on writeback, on any filesystem in Linux, but I will let you know if there's any motion on the XFS/thin provisioning integration.
