Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

epoc: cleanup #531

Merged
merged 19 commits into from
Oct 11, 2023
Merged

epoc: cleanup #531

merged 19 commits into from
Oct 11, 2023

Conversation

joemfb
Copy link
Member

@joemfb joemfb commented Oct 4, 2023

This PR cleans up the new epoch system from #459, fixing some small bugs, plugging leaks, and simplifying the interface to it.

It still needs a final round of crash recovery testing (killing the process at every stage of the intialization/migration, confirming that subsequent restarts proceed as they should).

Resolves #530.

@joemfb joemfb requested a review from a team as a code owner October 4, 2023 04:05
@joemfb joemfb changed the title Jb/epoc cleanup epoc: cleanup Oct 4, 2023
@matthew-levan
Copy link
Contributor

Spun up a fakezod with this, and probably unrelated, but getting a lot of:

http: fail (13, 504): I/O error (body; content-length)
http: fail (15, 504): I/O error (body; content-length)

matthew-levan
matthew-levan previously approved these changes Oct 6, 2023
Copy link
Contributor

@matthew-levan matthew-levan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching my errors and cleaning up the initialization and migration control flows. Happy to see those c3_unlink return code checks fixed, the removal of u3_Host.eve_d, a well-named and well purposed u3_disk_kindly function, and the reduction of exposure of disk functions in the rest of the codebase (like pier.c, for example). Approved.

pkg/vere/disk.c Outdated Show resolved Hide resolved
@joemfb
Copy link
Member Author

joemfb commented Oct 6, 2023

These last few commits reorder epoc creation and add fsyncs for atomicity (ie, if the epoch version file exists, we know the snapshot was fully copied). Then it modifies how the latest epoch is loaded, deleting it and falling back to the prior epoch if it was not completely initialized.

More granular error handling on load is possible (for instance, copying metadata from the prior epoch if it's the only thing that's missing, which would be more efficient), but these changes handle the base case of all errors where it's clear that we can recover.

@matthew-levan
Copy link
Contributor

matthew-levan commented Oct 7, 2023 via email

Copy link
Contributor

@matthew-levan matthew-levan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More good stuff. I booted a fresh fakezod with this build. I was able to then use it and crash ./urbit roll zod before it could complete. The contents of the second epoch 0i98, at the point, had only an incomplete copy of the [north|south].bin files. I then tested how both ./urbit roll zod and ./urbit zod behave afterwards, and both were able to recover gracefully (with helpful messages too).

Approved.

P.S. For my edification, why did you implement try_init as a goto?

@pkova pkova merged commit 9bdc1af into develop Oct 11, 2023
5 checks passed
@pkova pkova deleted the jb/epoc-cleanup branch October 11, 2023 15:06
@joemfb
Copy link
Member Author

joemfb commented Oct 11, 2023

@matthew-levan thanks for the review. it could've been a loop, but there's only one case that continues around a second time, and that case only retries once. that's kind of a strange pattern of repetition, and it seemed better to be explicit.

joemfb added a commit that referenced this pull request Nov 8, 2023
This PR resolves a bug that I introduced in #531. I added stronger
validation during event log initialization by checking for the presence
of metadata, but that was allocating on the loom (in the case of 32-bit
or greater ships and/or lives). There are various chicken-and-egg
problems with initializing old event logs for replay, so it seems best
to just remove this whole operation from the loom.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gracefully handle cases where urbit roll didn't complete successfully
3 participants