Repozo incremental recover #403

Sebatyne · 2024-10-22T02:35:36Z

Like the --backup mode of repozo, this adds an incremental mode for recovering a backup to a Data.fs: only the latest backup increment(s) (.deltafs) are appended to a previously recovered Data.fs, instead of discarding it and restarting the recovery process from scratch.

This feature becomes the new default behavior, but the new flag "--full" allows to fall back to the previous behavior.

A few checks are done while trying to recover incrementally (i.e. on the size, or on the checksum of the latest increment), and the code automatically falls back to the full-recovery mode if they fail. This would happen, for example, if the production data has been packed after the previous recovery.
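The checks described above could look roughly like the following. This is a minimal sketch, not the code from this PR: `choose_recovery_mode`, `md5_of_range` and the `(start, end, md5)` chunk tuples are hypothetical names, and it assumes the byte offsets and MD5 of every backup file are already known to the caller.

```python
import hashlib
import os


def md5_of_range(path, start, end, blocksize=1 << 20):
    """Return the hex MD5 of bytes [start, end) of the file at `path`."""
    digest = hashlib.md5()
    with open(path, 'rb') as fp:
        fp.seek(start)
        remaining = end - start
        while remaining > 0:
            data = fp.read(min(blocksize, remaining))
            if not data:  # the file is shorter than expected
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def choose_recovery_mode(output_path, chunks, force_full=False):
    """Return ('full', 0) or ('incremental', index_of_first_chunk_to_append).

    `chunks` is a hypothetical list of (start, end, md5) tuples, one per
    backup file, in the order the backups were made.
    """
    if force_full or not os.path.exists(output_path):
        return ('full', 0)
    size = os.path.getsize(output_path)
    for index, (start, end, md5) in enumerate(chunks):
        if end == size:
            # Size check passed: verify the last restored increment.
            if md5_of_range(output_path, start, end) == md5:
                return ('incremental', index + 1)
            break
    # Size mismatch (for instance after a pack) or checksum mismatch.
    return ('full', 0)
```

The second element of the result is the index of the first backup file that still has to be appended; a full recovery always restarts from index 0.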

The need for such a feature arose from our own production use, where we create delta backups of a file storage every day, send them to a stand-by server, and rebuild the ZODB there (also every day). When the ZODB is bigger than 1 TB, the full recovery can take several hours, whereas the incremental recovery takes only a few minutes (often even less).


This allows recovering a ZODB FileStorage by appending only the chunks missing from the latest recovered file, instead of always recovering from zero. Based on the work of @vpelletier (incpozo).

Sebatyne · 2024-10-22T02:43:04Z

@vpelletier, maybe you would like to review, since you did the original work?

@perrinjerome, you're the last contributor to repozo, could you review this PR?

Thanks,

perrinjerome

Can you also add a change log entry?

I only looked at the diff this time; I will try running repozo with this patch later.

src/ZODB/scripts/repozo.py

Sebatyne · 2024-10-22T05:22:24Z

Hello,

I would like to explain my reasoning about the new behavior in more detail, as a change of default can be surprising.

A good practice with a backup/recovery plan is to check that the backed-up data can be restored on a remote service. That's why we recover the Data.fs every day, to check that the latest .deltafs increment (which is the only new backed-up file every day, as the other .deltafs files and the original .fs are already synchronised on the remote site) is valid.

Given this, when we import the new increment we already have the recovered Data.fs from the previous day, so it seems a waste of resources to delete it and rebuild it from zero. If we could simply recover the new increment onto the existing Data.fs, its checksum would be verified at that point, proving its validity once and for all. And we don't need to check its validity again every day, as data corruption is most likely to happen during the write process or the network copy.

Also, I believe the time saved by not restoring a full Data.fs is welcome, as it reduces the time-to-recovery if the disaster recovery plan is activated, or simply makes it possible to create backups more often, reducing the quantity of data lost in a production incident.

Please feel free to ask me more questions.

Regards,

Nicolas

Sebatyne · 2024-10-22T05:43:18Z

> Can you also add a change log entry?
>
> I only looked at the diff this time; I will try running repozo with this patch later.

I have added an entry. But I'm not sure about the wording.

mgedmin

Overall I like this a lot. I've two small fixes to suggest.

src/ZODB/scripts/repozo.py

vpelletier · 2024-10-22T07:54:03Z

src/ZODB/scripts/repozo.py

+        log('Target file smaller than full backup, '
+            'falling back to a full recover.')
+        return do_full_recover(options, repofiles)
+    check_startpos = int(previous_chunk[1])


About checking the already-restored file, which is a new concept in this tool (and the cornerstone of this new feature): should it be under the control of --with-verify? Should only the last chunk be checked when --quick is also provided?

IMHO repozo should always check the MD5, and only of the last chunk, except when the main action is --verify (which should then only be needed for full-output checks).

This is the kind of implementation decision I was free to make as long as my implementation was separate, but it becomes harder to decide once both tools are merged.


vpelletier · 2024-10-22T08:12:51Z

src/ZODB/scripts/repozo.py

+    with open(options.output, 'r+b') as outfp:
+        outfp.seek(0, 2)
+        initial_length = outfp.tell()
+    with open(datfile) as fp:


Doesn't reopening the index risk disobeying the find_files logic? Especially the --date option.


Sebatyne · 2024-10-23T06:57:44Z

Sorry for the long list of "fixup!" commits, I didn't think it would get that long...

To implement the feedback received in this PR, and to prevent an error on Windows caused by the same file being opened twice, I had to deeply rework the function do_incremental_recover. I hope it is not (too much of...) an issue for the review.

I have added more assertions in acadc7a, as well as a new step where I delete an old, already recovered .deltafs, to prove that the code is correct and that it doesn't silently fall back to the full-recovery mode. I hope it will help you trust the rewrite of do_incremental_recover that happened in the latest commits.

perrinjerome

I tried it and it really looks good.

I have a suggestion about the output during restore. If I have a repository with these files:

        backups/2024-10-29-14-16-48.fs
        backups/2024-10-29-14-17-29.deltafs
        backups/2024-10-29-14-17-44.deltafs

and I have a file recovered up to 2024-10-29-14-17-29.deltafs, then when I run the restore a second time, only 2024-10-29-14-17-44.deltafs should be restored incrementally, but the output is confusing:

$ repozo -v --recover --repository backups/ -o recovered/mydata.fs
looking for files between last full backup and 2024-10-29-14-17-46...
files needed to recover state as of 2024-10-29-14-17-46:
        backups/2024-10-29-14-16-48.fs
        backups/2024-10-29-14-17-29.deltafs
        backups/2024-10-29-14-17-44.deltafs
Recovering (incrementally) file to recovered/mydata.fs
Recovered 181 bytes, md5: 1cc0425a2866a20eed96571e1cdedc71
Restoring index file backups/2024-10-29-14-17-44.index to recovered/mydata.fs.index

files needed to recover state ... is slightly incorrect: these are the files needed for a full restore; for an incremental restore, only backups/2024-10-29-14-17-44.deltafs is needed.

What we could easily do is also list the files that will actually be restored, with a patch like this:

diff --git a/src/ZODB/scripts/repozo.py b/src/ZODB/scripts/repozo.py
index 21e38289..b3e86a36 100755
--- a/src/ZODB/scripts/repozo.py
+++ b/src/ZODB/scripts/repozo.py
@@ -801,6 +801,9 @@ def do_incremental_recover(options, repofiles):
     assert first_file_to_restore > 0, (
         first_file_to_restore, options.repository, fn, filename, repofiles)
 
+    log('remaining files needed to recover incrementally:')
+    for f in repofiles[first_file_to_restore:]:
+        log('\t%s', f)
     temporary_output_file = options.output + '.part'
     os.rename(options.output, temporary_output_file)
     with open(temporary_output_file, 'r+b') as outfp:

the output would become:

$ repozo -v --recover --repository backups/ -o recovered/mydata.fs
looking for files between last full backup and 2024-10-29-14-17-46...
files needed to recover state as of 2024-10-29-14-17-46:
        backups/2024-10-29-14-16-48.fs
        backups/2024-10-29-14-17-29.deltafs
        backups/2024-10-29-14-17-44.deltafs
Recovering (incrementally) file to recovered/mydata.fs
remaining files needed to recover incrementally:
        backups/2024-10-29-14-17-44.deltafs
Recovered 181 bytes, md5: 1cc0425a2866a20eed96571e1cdedc71
Restoring index file backups/2024-10-29-14-17-44.index to recovered/mydata.fs.index

We could maybe do better by changing find_files, but this seems good enough.
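For readers following the patch context above (rename the output to a `.part` file, then open it with `'r+b'`), here is a minimal sketch of that append pattern. It is an illustration under assumptions, not the code from this PR: `append_chunks` is a hypothetical helper, the chunks are passed as in-memory bytes for brevity, and the final rename back to the original name is assumed rather than taken from the diff.

```python
import os


def append_chunks(output_path, chunks):
    """Append each bytes object in `chunks` to `output_path` via a .part file.

    Returns the size of the file after appending.
    """
    temporary_path = output_path + '.part'
    # While data is being appended, the file is only visible as *.part,
    # and it is opened exactly once.
    os.rename(output_path, temporary_path)
    with open(temporary_path, 'r+b') as outfp:
        outfp.seek(0, os.SEEK_END)
        for data in chunks:
            outfp.write(data)
    # Expose the file under its final name once all writes succeeded.
    os.rename(temporary_path, output_path)
    return os.path.getsize(output_path)
```

If an exception interrupts the writes, the partially written data stays under the `.part` name, so a later run does not mistake it for a complete recovery.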

perrinjerome · 2024-10-29T13:08:41Z

CHANGES.rst

+- Support incremental recovery in repozo.
+  It makes it much faster in a day-to-day scenario.


What do you think of something a bit longer, similar to this?

Suggested change

-- Support incremental recovery in repozo.
-  It makes it much faster in a day-to-day scenario.
+- repozo: Change restoration to be incremental by default, unless ``--full`` is
+  provided.
+  Repozo now tries to append the new incremental deltafs on previously restored
+  filestorage, if the file sizes and the checksum of the last restored increment
+  match, otherwise it will fallback to a full recover.
+  For details see `#403 <https://github.com/zopefoundation/ZODB/pull/403>`_.

perrinjerome · 2024-10-29T13:14:16Z

src/ZODB/scripts/repozo.py

+    datfile = os.path.splitext(repofiles[0])[0] + '.dat'
+    log('Recovering (incrementally) file to %s', options.output)
+    with open(options.output, 'r+b') as outfp:
+        outfp.seek(0, 2)


Do you think this comment is correct? Here we don't use getSize like during backup, so it might look like a mistake, but it's not.
I also suggest changing the "magic" 2 to os.SEEK_END.

Suggested change

-        outfp.seek(0, 2)
+        # Note that we do not open the FileStorage to use getSize here,
+        # we really want the actual file size, even if there is invalid
+        # transaction data at the end.
+        outfp.seek(0, os.SEEK_END)
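As a small illustration of the point above: seeking to `os.SEEK_END` and calling `tell()` reports the raw size of the file on disk, whatever its content is. The helper name `raw_file_size` is hypothetical and only serves this sketch.

```python
import os


def raw_file_size(path):
    """Return the on-disk size of `path` by seeking to the end of the file."""
    with open(path, 'r+b') as fp:
        # os.SEEK_END is the named constant for the "magic" whence value 2.
        fp.seek(0, os.SEEK_END)
        return fp.tell()
```

The result is the same number `os.path.getsize(path)` returns, so trailing bytes that are not valid transaction data are still counted.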

perrinjerome · 2024-10-29T13:21:31Z

src/ZODB/scripts/repozo.py

+        log('Target file smaller than full backup, '
+            'falling back to a full recover.')
+        return do_full_recover(options, repofiles)
+    check_startpos = int(previous_chunk[1])


If I read correctly:

When doing a backup, the logic to decide whether an incremental backup is possible depends on --quick:

- with --quick, first compare sizes and, if the sizes match, compare checksums for the last increment
- without --quick, compare full checksums

If we don't like the current logic to decide whether an incremental restore is possible (compare sizes and, if the sizes match, compare checksums for the last restored increment), I feel it would make sense to base the logic on --quick as well, because this is very "symmetric". That said, the current approach of verifying the checksum of the last increment seems fine.

During restore, if --with-verify is passed, with this patch we verify only what is restored (i.e. everything during a full restore and only the increments during an incremental one). This also seems good to me.
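The "symmetric" alternative discussed above could be sketched as follows. This is not what the patch does (it always checks the last restored increment only); `restored_prefix_is_valid` and the `(start, end, md5)` tuples are hypothetical, and checking every chunk is used here as a stand-in for "compare full checksums".

```python
import hashlib


def restored_prefix_is_valid(path, chunks, quick=True):
    """Check an already-restored file against the backup checksums.

    `chunks` is a hypothetical list of (start, end, md5) tuples for the
    backup files that were already restored into `path`.
    With quick=True only the last chunk is checked, otherwise all of them.
    """
    to_check = chunks[-1:] if quick else chunks
    with open(path, 'rb') as fp:
        for start, end, md5 in to_check:
            fp.seek(start)
            # Reads the whole chunk in memory: fine for a sketch,
            # a real tool would hash block by block.
            data = fp.read(end - start)
            if hashlib.md5(data).hexdigest() != md5:
                return False
    return True
```

The trade-off is the one raised in the thread: the quick check reads only the last increment, while the full check costs a read of the whole restored file and so gives up part of the time saved by restoring incrementally.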

Sebatyne added 2 commits October 21, 2024 16:38

repozo: factorize code doing the actual recover (write), in preparati…

6543901

…on to the implementation of the incremental recover

repozo: support incremental recover

f62057c

Which allows to recover a zodb filestorage by only appending the missing chunks from the latest recovered file, instead of always recovering from zero. Based on the work of @vpelletier (incpozo).

Sebatyne force-pushed the repozo-incremental-recover branch from 4be516b to f62057c Compare October 22, 2024 02:40

perrinjerome reviewed Oct 22, 2024


Sebatyne added 2 commits October 22, 2024 14:37

fixup! repozo: support incremental recover

fb51978

fixup! repozo: support incremental recover

154c47c

fixup! repozo: support incremental recover

d441b83

mgedmin requested changes Oct 22, 2024


vpelletier reviewed Oct 22, 2024


Sebatyne added 2 commits October 22, 2024 18:13

fixup! repozo: factorize code doing the actual recover (write), in pr…

67ab5a6

…eparation to the implementation of the incremental recover

fixup! repozo: support incremental recover

c485afb

mgedmin self-requested a review October 22, 2024 11:51

Sebatyne added 5 commits October 23, 2024 11:40

fixup! fixup! repozo: support incremental recover

240f6bb

fixup! repozo: support incremental recover

7291213

fixup! repozo: support incremental recover

27d6296

fixup! repozo: support incremental recover

acadc7a

fixup! repozo: support incremental recover

d0adb00

perrinjerome reviewed Oct 29, 2024


Merge branch 'master' into repozo-incremental-recover

315c76b


