
borg2: build_chunkindex_from_repo is slow #8397

Closed
ThomasWaldmann opened this issue Sep 19, 2024 · 6 comments

ThomasWaldmann (Member) commented Sep 19, 2024

Problem:

That function does a repository.list(), listing all the object IDs in the repo to build an in-memory chunkindex.

Because all objects are stored separately in a two-levels-deep directory structure, that amounts to (1 +) 256 + 65536 listdir() calls in the worst case. Depending on store speed, connection latency, etc., that can take quite a while.

The in-memory chunkindex is currently not persisted to local cache.
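
To illustrate where those calls come from, here is a minimal sketch of such a scan in plain Python (illustrative only: borg's actual store access goes through borgstore, and the function and layout names here are assumptions):

import os

def scan_object_ids(base):
    # Collect all object IDs from a data/ tree laid out as
    # data/<xx>/<yy>/<object-id> (two levels of hex dirs).
    ids = set()
    for lvl1 in os.listdir(base):          # 1 listdir() for data/ itself
        d1 = os.path.join(base, lvl1)
        for lvl2 in os.listdir(d1):        # up to 256 listdir() calls
            d2 = os.path.join(d1, lvl2)
            ids.update(os.listdir(d2))     # up to 256*256 = 65536 calls
    return ids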

ThomasWaldmann (Member, Author) commented Sep 19, 2024

Analysis:

There are only a few borg2 commands that remove objects from data/ in the store:

  • borg compact (deletes unused objects)
  • borg check --repair (deletes corrupted objects)
  • borg debug obj-delete (expert-only, rarely used)
  • borg repo-delete (deletes the complete repository)

Notably, these commands do NOT delete objects from data/:

  • borg delete (just kills the entry in archives/)
  • borg prune (basically calls borg delete internally)

So, the set of objects in data/ is always increasing until compact/check is run (we can ignore borg debug and borg repo-delete).

borg create must not assume a chunk is in the repo when it in fact isn't anymore; that would create a corrupt archive referencing a non-existing object.

OTOH, storing a chunk into the repo that already exists there (but we did not know it) is only a performance issue, not a correctness problem.
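
This asymmetry can be expressed directly in the dedup decision. A minimal sketch, assuming the chunk index is a plain set of IDs and repo.put() stores an object (both assumptions, not borg's real API):

def maybe_store_chunk(repo, chunk_index, chunk_id, data):
    # Safe failure mode: the index is MISSING a chunk that exists ->
    # we store it again; some wasted work/space until compact, no harm.
    # Unsafe failure mode: the index CLAIMS a chunk that is gone ->
    # the new archive would reference a missing object (corruption).
    if chunk_id in chunk_index:
        return
    repo.put(chunk_id, data)
    chunk_index.add(chunk_id)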

ThomasWaldmann (Member, Author) commented Sep 19, 2024

Implementation idea:

  • uptodate = TBD
  • load chunkindex from cache if uptodate else (use build_chunkindex_from_repo + save to cache immediately)
  • update in-memory chunkindex (borg create adds new chunks)
  • save updated chunkindex to cache

The uptodate check and lockless operation (even if multiple borg processes of the same user on the same machine use the same repository) need more thought.
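
Roughly, as a Python sketch (all helper names are hypothetical except build_chunkindex_from_repo):

def get_chunkindex(repo, cache_path):
    if cache_is_uptodate(cache_path, repo):      # the TBD "uptodate" check
        return load_chunkindex(cache_path)
    index = build_chunkindex_from_repo(repo)     # slow repository.list() path
    save_chunkindex(index, cache_path)           # save to cache immediately
    return index

# borg create then adds new chunks to the in-memory index
# and saves the updated index back to the cache when done.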

ThomasWaldmann modified the milestones: 2.0.0rc1, 2.0.0b11 Sep 19, 2024
ThomasWaldmann (Member, Author) commented:

Another idea:

  • borg compact needs to use repository.list() anyway (and holds an exclusive lock), so it could build the list of object IDs (after it has deleted objects due to garbage collection) and store it into the repository (cache/* maybe?) and also into the local cache.
  • format could be a compacted ChunkIndex file
  • clients could load that ChunkIndex from local cache or repo (see the sketch after this list)
  • borg check --repair would either have to update these caches or invalidate them IF it has deleted objects.
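
Client-side, the lookup order could look like this sketch (store_load, deserialize_chunkindex, and the local cache API are hypothetical names):

def fetch_chunkindex(repo, local_cache):
    index = local_cache.get("chunks")             # 1. local cache
    if index is not None:
        return index
    data = repo.store_load("cache/chunks")        # 2. written by borg compact
    if data is not None:
        index = deserialize_chunkindex(data)
    else:
        index = build_chunkindex_from_repo(repo)  # 3. slow fallback
    local_cache.put("chunks", index)
    return index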

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 21, 2024
Uses ChunkIndex (a specialized, memory-efficient data
structure), so borg compact needs less memory now.

Also, compact builds an up-to-date chunks index now,
which is saved to cache/chunks in the store, so every
borg can find it and fetch it from there instead of always
building the chunks index ad hoc via repository.list(),
which can be rather slow.
ThomasWaldmann self-assigned this Sep 21, 2024
ThomasWaldmann added further commits to ThomasWaldmann/borg that referenced this issue on Sep 21, Sep 22, and Sep 24, 2024 (same commit message as above).
SpiritInAShell commented Sep 25, 2024

(I do not have any deep understanding of the internal structures, so with the assumptions below I am only poking around.)

(borg2 beta10. OK, I see that this issue is closed; I will try the next beta as soon as possible.)

This was tested on a Hetzner storage box, accessed over a 100/50 MBit Telekom fibre connection.

If I understand correctly, the process in this issue is logged as follows?

{
  "type": "log_message",
  "time": 1727224246.2930982,
  "message": "[chan 0] listdir(b'/repo.borg2beta10/data/a3/9e')",
  "levelname": "DEBUG",
  "name": "paramiko.transport.sftp"
}

As far as I know (I asked ChatGPT), the SFTP protocol does not have a "get me all subdirs" operation. This process is painfully slow.

The repo size is about 11GiB.

I mounted the remote directory with sshfs and did a time find, which I aborted after 10 minutes; a second run (I do not know whether sshfs used a cache for it) took 451 seconds.

I did a time rsync -avPi [user@server]:/[remote path] (same connection, output similar to a find -ls listing), and that took about 8 seconds.

As every directory read has significant overhead: what would happen if the data chunks were put into numbered directories, each filled until it is "full"/"saturated" at a defined number of files? Would reading a few directories with many files speed the process up?
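
To make the overhead concrete, a quick back-of-the-envelope in Python (assuming one SFTP round trip per listdir(); the numbers are illustrative only):

def listdir_calls(levels, fanout):
    # 1 call for data/ itself, plus fanout**k calls at each deeper level k
    return sum(fanout ** k for k in range(levels + 1))

print(listdir_calls(2, 256))  # two hex-char levels: 1 + 256 + 65536 = 65793
print(listdir_calls(1, 256))  # one level: 1 + 256 = 257
print(listdir_calls(0, 256))  # flat data/ dir: 1 call, but one huge listing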

I see that the chunks are sorted into subdirs based on the first two characters of their filenames, and that will be for a reason. But is it worth the read overhead of the SFTP protocol? In the end, you create a local client-side index anyway, so you know which chunk is in which directory.

Maybe other access protocols like rclone (which I do not know at all) do not have these limitations, and therefore the subdir structure will be an advantage.

ThomasWaldmann (Member, Author) commented:
@SpiritInAShell I just merged some improvements, so please re-test with the current master branch (ideally with a freshly created repo).

ThomasWaldmann (Member, Author) commented:
The problem with sftp is described here: borgbackup/borgstore#44

The only way to speed this up is to make fewer requests, which is what the current master branch / next beta does.
