Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dolt gc --full: Implement a flag for GC which will collect everything, including the old gen. #8462

Merged
merged 12 commits into from
Oct 17, 2024

Conversation

reltuk
Copy link
Contributor

@reltuk reltuk commented Oct 17, 2024

Dolt GC is generational. By default, it moves all chunks reachable from commits into the old gen, and it leaves uncommitted data in the new gen. By default, it never revisits the chunks in the old gen. This means that if a branch with a lot of unique data exists during a GC, and then gets deleted, the storage taken up by that branch is never reclaimed by future garbage collection.

This PR implements dolt gc --full, which performs a collection on the entire database, ignoring the generational aspects of previous collections and visiting the entire set of live data. At the end of a successful dolt gc --full, only data which was reachable at the start of the run and data which has been written during the run will be present in the database.

Note: dolt gc --full visits every reachable chunk in the database and can be resource intensive. As a consequence of running dolt gc --full, any previously archived table files will be unarchived, and storage savings from the archiving will be lost. dolt gc --full involves copying chunks into new storage files, and fully materializing the new files before the old files are deleted; as a consequence, peak disk utilisation while running dolt gc --full can be up to 2x the existing size of the database.

…ns a finalizer.

Previously MarkAndSweepChunks would finalize itself by either appending the
table table files to the store or by swaping to the table files. Now it returns
a GCFinalizer instance which is used by the caller to effect the desired
change. This is necessary for implementing Full GC, where we actually want to
do both with the same GC result, but at different phases of the GC process.

This commit leaves the HasManyF return from NomsBlockStore.MarkAndSweepChunks
as a TODO.
…added to the store as part of the GC can be checked for their contents.
This does not yet work because the source ChunkStore is currently newGen. Still
reworking so that the generational store itself is the source.
Change ValueStore's handling of generational chunk stores, so that it
sweeps chunks from the generational store itself, and into new gen
and old gen. This will support the full gc access pattern where we need
to copy chunks from old gen into a new table file which will appear
in old gen itself.
… that dolt gc --full removes data from old gen which is now unreachable.
@coffeegoddd
Copy link
Contributor

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000
version result total
88a9a34 ok 5937457
version total_tests
88a9a34 5937457
correctness_percentage
100.0

@coffeegoddd
Copy link
Contributor

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000
version result total
077294f ok 5937457
version total_tests
077294f 5937457
correctness_percentage
100.0

@macneale4 macneale4 self-requested a review October 17, 2024 15:49
…sed by just fetching the chunks and re-compressing them.

This makes `dolt gc` work while sourcing the chunks from the generational store
and makes `dolt gc --full` work against databases which have been previously
archived. `dolt gc --full` will still completely undo the archiving in the
process of running the garbage collection.
@coffeegoddd
Copy link
Contributor

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000
version result total
6cfac1b ok 5937457
version total_tests
6cfac1b 5937457
correctness_percentage
100.0

Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit comments and questions. No show stoppers.

@@ -152,6 +152,32 @@ type LoggingChunkStore interface {

var ErrAddChunkMustBlock = errors.New("chunk keeper: add chunk must block")

type HasManyF func(ctx context.Context, hashes hash.HashSet) (absent hash.HashSet, err error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HasManyFunc would better, IMO. It's just 24bits man. come on.

Also, gDoc. Dangling function defs like this end up being abused if you don't.

if err != nil {
return err
}

root, err := lvs.Root(ctx)
if err != nil {
newGen.EndGC()
collector.EndGC()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason you aren't simplifying this and doing a defer collector.EndGC() in both places where you BeginGC() ?

Looks like we always want to EndGC() so it would be more readable if you did

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a lot of nesting...but in converting it I found a place where we forgot the EndGC. So...this was definitely good feedback. Thanks for it.

@@ -139,7 +139,7 @@ func testGarbageCollection(t *testing.T, test gcTest) {
}
}

err := dEnv.DoltDB.GC(ctx, nil)
err := dEnv.DoltDB.GC(ctx, types.GCModeDefault, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are weird. Just an observation. I was thinking that I was going to ask for a test for the Full flag here, but it looks like these tests are basically just running Exec. Maybe bats is better than this?

sqlengine tests for this instead? Just sayin. These tests are weird.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't think these tests are worth much compared to a bats test which does the same. Obviously they're supposed to target the DoltDB.GC invocation itself, and they're much easy to attach a debugger to or whatever, but AFAICT they're not providing much in the way of protection compared to existing layers of cheese.

@@ -343,7 +343,24 @@ func (ms *MemoryStoreView) EndGC() {
ms.transitionToNoGC()
}

func (ms *MemoryStoreView) MarkAndSweepChunks(ctx context.Context, hashes <-chan []hash.Hash, dest ChunkStore) error {
type msvGcFinalizer struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After GC, what does the memory store contain? This refactor looks like it preserves previous behavior, but I'm not sure why the previous behavior is the way it is. Educate me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, memory store is an in-memory chunk store implementation. After GC, memory store just contains all the visited chunks – that is, all the chunks whose address came through on the keepers channel.

go/store/types/value_store.go Show resolved Hide resolved
@@ -1037,6 +1037,34 @@ func (nbs *NomsBlockStore) HasMany(ctx context.Context, hashes hash.HashSet) (ha
return nbs.hasMany(toHasRecords(hashes))
}

func (nbs *NomsBlockStore) hasManyInSources(ctx context.Context, srcs []hash.Hash, hashes hash.HashSet) (hash.HashSet, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the ctx future proofing for when we might need it? No used currently. If you think the interface needs it, I'd drop the name and make it '_' so people know that this is intentional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just didn't realize it was unneeded. Removed! Thanks for that :).

go/store/nbs/table_set.go Show resolved Hide resolved
Copy link
Contributor

@macneale4 macneale4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                __/___            
          _____/______|           
  _______/_____\_______\_____     
  \              < < <       |    
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@coffeegoddd
Copy link
Contributor

@reltuk DOLT

comparing_percentages
100.000000 to 100.000000
version result total
0a53805 ok 5937457
version total_tests
0a53805 5937457
correctness_percentage
100.0

@reltuk reltuk merged commit d58a69a into main Oct 17, 2024
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants