
Attempt to write saved data in a resilient manner #1001

Merged
merged 9 commits into 1.21.x from feature/resillient-io on Jul 17, 2024

Conversation

FiniteReality
Member

I'm implementing a naive solution here for saving NBT data in a resilient manner, to prevent data corruption when a server/game crashes during saving. This should prevent data attachments (especially large ones) from being corrupted, as any unfinished writes are directed to a temp file instead. These temp files are automatically cleaned up on load, if detected, using a naive algorithm which likely has terrible performance characteristics.

This should resolve #775, at least partially. It would be better if we could guarantee the write goes through, but if the JVM hard crashes (e.g. access violation, power loss) I don't see a way around needing something like this.
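In other words: write everything to a sibling temp file, force it to disk, then atomically move it over the old file, so a crash mid-write can only ever lose the temp file. A minimal sketch of that idea (hypothetical names, not the actual patch):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Illustrative sketch of the temp-file-then-move approach described above.
public final class ResilientWrite {
    public static void write(Path target, byte[] data) throws IOException {
        Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileChannel channel = FileChannel.open(temp,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            channel.write(ByteBuffer.wrap(data));
            channel.force(true); // ensure the bytes actually reach the storage device
        }
        // Replace the old file only once the new contents are fully on disk;
        // a crash before this point leaves the old file untouched.
        Files.move(temp, target,
                StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
    }
}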

@neoforged-pr-publishing

neoforged-pr-publishing bot commented May 25, 2024

  • Publish PR to GitHub Packages

Last commit published: e8ac624b542217cac433f4f15517566bf5ab851a.

PR Publishing

The artifacts published by this PR:

Repository Declaration

In order to use the artifacts published by the PR, add the following repository to your buildscript:

repositories {
    maven {
        name 'Maven for PR #1001' // https://github.com/neoforged/NeoForge/pull/1001
        url 'https://prmaven.neoforged.net/NeoForge/pr1001'
        content {
            includeModule('net.neoforged', 'testframework')
            includeModule('net.neoforged', 'neoforge')
        }
    }
}

MDK installation

In order to set up an MDK using the latest PR version, run the following commands in a terminal.
The script works on both *nix and Windows, as long as you have the JDK bin folder on your PATH.
The script will clone the MDK into a folder named NeoForge-pr1001.
In PowerShell, you will need to remove the -L flag from the curl invocation.

mkdir NeoForge-pr1001
cd NeoForge-pr1001
curl -L https://prmaven.neoforged.net/NeoForge/pr1001/net/neoforged/neoforge/21.0.107-beta-pr-1001-feature-resillient-io/mdk-pr1001.zip -o mdk.zip
jar xf mdk.zip
rm mdk.zip || del mdk.zip

To test a production environment, you can download the installer from here.

@FiniteReality
Member Author

I'm not 100% sure if putting the patch in NbtIo is the best place, as there are certain SavedData implementations which themselves do this kind of resilient saving - the player's saved data is one example. It may be worth moving the implementation into SavedData directly, rather than NbtIo. Thoughts?

@Technici4n
Member

I wanted to look into how vanilla handles this for its main level data file. There is some convoluted logic to make it work in many corner cases.

@FiniteReality
Member Author

> I wanted to look into how vanilla handles this for its main level data file. There is some convoluted logic to make it work in many corner cases.

Looking at it, it seems that it uses the same trick (RegionFile#writeToExternalFile creates a temp file, writes to it, and then moves it into place) for oversize chunks, but the main region file is simply a FileChannel created with StandardOpenOption.DSYNC, which forces synchronous writes. According to the NIO docs:

The SYNC and DSYNC options are used when opening a file to require that updates to the file are written synchronously to the underlying storage device. In the case of the default provider, and the file resides on a local storage device, and the seekable channel is connected to a file that was opened with one of these options, then an invocation of the write method is only guaranteed to return when all changes made to the file by that invocation have been written to the device. These options are useful for ensuring that critical information is not lost in the event of a system crash. If the file does not reside on a local device then no such guarantee is made. Whether this guarantee is possible with other provider implementations is provider specific.

My guess is that internally this acts as a write-through for any caches and buffers for the region file that's open. Everything else on top of this is just shuffling I/O around so that it doesn't happen on the main thread, as they become blocking writes.
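For reference, opening such a channel looks something like this (a sketch, not vanilla's actual code):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: with DSYNC, each write() only returns once the data (though not
// necessarily the file metadata) has reached the underlying storage device,
// at least for files on local file systems.
static FileChannel openSynchronousChannel(Path file) throws IOException {
    return FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.DSYNC);
}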

Since we're writing to a temporary file and moving that into place, all buffers should be appropriately flushed before the move completes anyway. We could consider moving to an I/O pool but I don't think this is necessary, and would be a rather invasive patch in comparison.

@Technici4n
Member

Technici4n commented May 26, 2024

Doesn't MC also have some recovery system where it attempts to read files that are updated but haven't been moved yet? If it's just falling back to LevelDirectory#oldDataFile it's not as fancy as I remember...

The atomic move should be good enough.

@FiniteReality
Member Author

> Doesn't MC also have some recovery system where it attempts to read files that are updated but haven't been moved yet?

I couldn't find anything fancy like that. I assume they couldn't do that anyway, because it would risk data loss; a partially completed save could still have partially valid data and load successfully, even though the older file is more likely to be correct.

@FiniteReality force-pushed the feature/resillient-io branch from fbc0902 to 94148f2 on June 2, 2024 20:50
@FiniteReality marked this pull request as ready for review on June 2, 2024 20:51
@FiniteReality force-pushed the feature/resillient-io branch from 94148f2 to 4479ecb on June 5, 2024 18:53
@Matyrobbrt added the "enhancement: New (or improvement to existing) feature or request" label on Jun 11, 2024
@FiniteReality force-pushed the feature/resillient-io branch from 4479ecb to 36c2e54 on June 16, 2024 02:01
@FiniteReality changed the base branch from 1.20.x to 1.21.x on June 16, 2024 02:02
@FiniteReality added the "1.21: Targeted at Minecraft 1.21" label on Jun 16, 2024
@sciwhiz12 requested a review from Technici4n on July 5, 2024 02:02
@Technici4n previously approved these changes on Jul 15, 2024
@Technici4n requested a review from sciwhiz12 on July 15, 2024 08:40
@FiniteReality removed the request for review from sciwhiz12 on July 16, 2024 21:24
@shartte self-requested a review on July 16, 2024 21:26
@Technici4n
Member

Does closing the channel not flush it already?

@shartte
Contributor

shartte commented Jul 17, 2024

> Does closing the channel not flush it already?

No, not that we know of. You are thinking of a Java-level flush; we need an OS-level fsync.
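A minimal illustration of the difference (illustrative only, not the actual patch):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative: closing a channel releases the Java-side handle, but only
// force() performs an OS-level fsync that persists the bytes to the device.
static void writeAndSync(Path path, ByteBuffer data) throws IOException {
    try (FileChannel channel = FileChannel.open(path,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        channel.write(data);
        channel.force(true); // fsync file contents and metadata
    } // close() alone does not guarantee the data has hit the disk
}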

@Technici4n
Member

Ok. Just the extra parentheses and we're good then.

@shartte requested a review from Technici4n on July 17, 2024 19:54
@shartte enabled auto-merge (squash) on July 17, 2024 19:56
@shartte merged commit 2c72560 into neoforged:1.21.x on Jul 17, 2024
4 checks passed
raoulvdberge added a commit to refinedmods/refinedstorage2 that referenced this pull request on Jul 20, 2024:

We no longer need to implement this ourselves for NeoForge. For Fabric, keep the existing patch.

See: neoforged/NeoForge#1001
Labels
1.21: Targeted at Minecraft 1.21
enhancement: New (or improvement to existing) feature or request
Development
Successfully merging this pull request may close these issues:
Add more protections to SavedData for data corruption

5 participants