epoc: cleanup #531
Conversation
Spun up a fakezod with this, and probably unrelated, but getting a lot of: […]
Thanks for catching my errors and cleaning up the initialization and migration control flows. Happy to see those c3_unlink return code checks fixed, the removal of u3_Host.eve_d, a well-named and well-purposed u3_disk_kindly function, and the reduction of exposure of disk functions in the rest of the codebase (pier.c, for example). Approved.
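(For context, the kind of fix being described is just checking the return value of unlink-style calls instead of discarding it. A generic sketch using plain unlink(2), not the actual c3_unlink wrapper:)

```c
/* Illustrative only: check the return code of unlink(2) rather than
** ignoring it, warning on any failure other than "already gone".
*/
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void
_unlink_checked(const char* pax_c)
{
  if ( (0 != unlink(pax_c)) && (ENOENT != errno) ) {
    fprintf(stderr, "unlink %s: %s\r\n", pax_c, strerror(errno));
  }
}
```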
These last few commits reorder epoch creation and add fsyncs for atomicity (i.e., if the epoch version file exists, we know the snapshot was fully copied). Then they modify how the latest epoch is loaded, deleting it and falling back to the prior epoch if it was not completely initialized. More granular error handling on load is possible (for instance, copying metadata from the prior epoch if it's the only thing that's missing, which would be more efficient), but these changes handle the base case of all errors where it's clear that we can recover.
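A minimal sketch of that ordering, with illustrative names (the epoc.txt filename, the _epoch_seal function, and the version string are assumptions, not the actual vere code): flush the copied snapshot first, create and flush the version file second, then flush the directory so the version file's existence is itself durable.

```c
#include <fcntl.h>
#include <unistd.h>

/* seal an epoch: the version file is only created after the snapshot
** is fully on disk, so its presence implies a complete epoch.
*/
static int
_epoch_seal(int snap_fd, int dir_fd)
{
  /* 1: flush the fully-copied snapshot */
  if ( 0 != fsync(snap_fd) ) return -1;

  /* 2: only now create and flush the version file */
  int ver_fd = openat(dir_fd, "epoc.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if ( ver_fd < 0 ) return -1;
  if ( (2 != write(ver_fd, "1\n", 2)) || (0 != fsync(ver_fd)) ) {
    close(ver_fd);
    return -1;
  }
  close(ver_fd);

  /* 3: flush the directory so the new entry itself is durable */
  return fsync(dir_fd);
}
```

If the process crashes anywhere before step 3, the version file is either absent or not durably linked, so a subsequent load treats the epoch as incomplete and falls back, exactly the recovery path described above.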
Force-pushed from ca70966 to c7302ed.
Awesome, will review first thing Monday.
Force-pushed from c7302ed to d7ea982.
More good stuff. I booted a fresh fakezod with this build. I was then able to use it and crash ./urbit roll zod before it could complete. At that point, the contents of the second epoch, 0i98, had only an incomplete copy of the [north|south].bin files. I then tested how both ./urbit roll zod and ./urbit zod behave afterwards, and both were able to recover gracefully (with helpful messages, too).
Approved.
P.S. For my edification, why did you implement try_init as a goto?
@matthew-levan thanks for the review. it could've been a loop, but there's only one case that continues around a second time, and that case only retries once. that's kind of a strange pattern of repetition, and it seemed better to be explicit.
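For readers following along, the shape being described is roughly the following. This is a sketch with hypothetical helper names (_load_latest_epoch, _delete_latest_epoch), not the actual try_init:

```c
static int  _load_latest_epoch(void);    /* hypothetical helpers */
static void _delete_latest_epoch(void);

static int
_try_init(void)
{
  int try_i = 0;

try_init:
  if ( 0 != _load_latest_epoch() ) {
    if ( 0 == try_i++ ) {
      /* first failure: delete the bad epoch and retry exactly once,
      ** falling back to the prior epoch on the next load
      */
      _delete_latest_epoch();
      goto try_init;
    }
    return -1;                           /* second failure: give up */
  }
  return 0;
}
```

The goto makes the single bounded retry visible at a glance, where a while loop would suggest an unbounded number of attempts.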
This PR resolves a bug that I introduced in #531. I added stronger validation during event log initialization by checking for the presence of metadata, but that was allocating on the loom (in the case of 32-bit or greater ships and/or lives). There are various chicken-and-egg problems with initializing old event logs for replay, so it seems best to just remove this whole operation from the loom.
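The general shape of that fix, sketched with plain libc calls (the function name and the validation step are assumptions; this is not the vere implementation): read and check epoch metadata with ordinary heap allocation, so event-log initialization never allocates on the loom.

```c
#include <stdio.h>
#include <stdlib.h>

/* check that an epoch's metadata file exists and is readable,
** using malloc/free rather than any loom allocation.
*/
static int
_meta_ok(const char* pax_c)
{
  FILE* fil_u = fopen(pax_c, "rb");
  if ( !fil_u ) return 0;

  fseek(fil_u, 0, SEEK_END);
  long len_l = ftell(fil_u);
  rewind(fil_u);

  if ( len_l <= 0 ) { fclose(fil_u); return 0; }

  char* buf_c = malloc((size_t)len_l);
  int   ret_i = ( buf_c && (1 == fread(buf_c, (size_t)len_l, 1, fil_u)) );
  /* ... parse and validate the metadata in buf_c here ... */

  free(buf_c);
  fclose(fil_u);
  return ret_i;
}
```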
This PR cleans up the new epoch system from #459, fixing some small bugs, plugging leaks, and simplifying the interface to it.
It still needs a final round of crash recovery testing (killing the process at every stage of the initialization/migration, confirming that subsequent restarts proceed as they should).
Resolves #530.