Add option to ignore hardlinks in dsync, dcmp and dwalk #565

Open: wants to merge 6 commits into base: main

rezib
Contributor

@rezib rezib commented Nov 17, 2023

This PR proposes adding a -H, --nohardlink option to dsync, dcmp and dwalk that ignores hardlinks when walking a file tree.

The rationale is to avoid producing multiple copies of the same inode, which could make the synchronized file tree consume much more storage than the source.

The corresponding manpages are updated accordingly.

rezib and others added 6 commits November 17, 2023 15:31
Add flag option nohardlink in mfu_walk_opts_t structure to add the
possibility to ignore hardlinks in walk_stat_process()

Signed-off-by: Gaël Delbary <[email protected]>
Co-authored-by: Gaël Delbary <[email protected]>
Signed-off-by: Rémi Palancher <[email protected]>
Add --nohardlink option to ignore hardlink when walking in files tree.

This option adds the possibility to avoid copying hardlinks as regular
files which could cause significant increase in storage consumption.
Users then have the possibility to redefine ignored hardlinks using
another mechanism.

Signed-off-by: Gaël Delbary <[email protected]>
Co-authored-by: Gaël Delbary <[email protected]>
Signed-off-by: Rémi Palancher <[email protected]>
Add --nohardlink option to ignore hardlink when walking in files tree.

Signed-off-by: Gaël Delbary <[email protected]>
Co-authored-by: Gaël Delbary <[email protected]>
Signed-off-by: Rémi Palancher <[email protected]>
Add --nohardlink option to ignore hardlink when walking in files tree.

Signed-off-by: Gaël Delbary <[email protected]>
Co-authored-by: Gaël Delbary <[email protected]>
Signed-off-by: Rémi Palancher <[email protected]>
Mention new -H, --nohardlink option in dsync, dcmp and dwalk manpages.

Signed-off-by: Rémi Palancher <[email protected]>
Signed-off-by: Rémi Palancher <[email protected]>
@adilger
Contributor

adilger commented Nov 20, 2023

Note that if the source filesystem is Lustre and the client is mounted with user_fid2path or as root, you can use lfs path2fid --parents (or the llapi_path2parent() equivalent) to generate a list of up to 100 parent directory FIDs for a hard-linked file, and/or lfs fid2path (or the llapi_fid2path() equivalent) to generate the pathnames for the hard links to a file.

This would allow efficiently maintaining the hard links in the target filesystem without having to make a full separate copy of the file, or scan the source tree trying to find the links.

Alternately, I believe tar will keep an in-memory list of inode numbers with hard links and if they are encountered again during tree traversal it will store a hard link instead of the full file.

@cedeyn

cedeyn commented Nov 28, 2023

Hi @adilger,
This patch comes from CEA, where it is used with the Lustre filesystem.
I totally agree, and that is what we did. But we also need to exclude hardlinks, so that we can first make a full copy of the filesystem without hardlinks and then apply your method.
This is a simple patch. The hard way would be to keep track of each inode in the mpifileutils tools and check whether that inode has already been copied: if it has, make a hardlink; otherwise, make a copy.
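
To make the "hard way" concrete, here is a minimal single-process sketch in C. It is only an assumption about how such tracking might look, not mpifileutils code: `seen_inodes`, `copy_action` and `decide_action` are hypothetical names, the table has a fixed capacity, and a real tool would also need to remember the destination path of the first copy and coordinate across MPI ranks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_SEEN 1024 /* fixed capacity keeps the sketch short */

/* Table of (device, inode) pairs that have already been copied. */
typedef struct {
    dev_t dev[MAX_SEEN];
    ino_t ino[MAX_SEEN];
    size_t count;
} seen_inodes;

typedef enum { ACTION_COPY, ACTION_LINK } copy_action;

/* First sighting of a (device, inode) pair: copy the data.
 * Any later sighting: create a hardlink to the earlier copy. */
copy_action decide_action(seen_inodes *seen, dev_t dev, ino_t ino)
{
    for (size_t i = 0; i < seen->count; i++) {
        if (seen->dev[i] == dev && seen->ino[i] == ino) {
            return ACTION_LINK;
        }
    }
    if (seen->count < MAX_SEEN) {
        seen->dev[seen->count] = dev;
        seen->ino[seen->count] = ino;
        seen->count++;
    }
    /* If the table is full, the pair is not recorded and later
     * sightings are copied again rather than linked. */
    return ACTION_COPY;
}
```

The linear scan is fine for a sketch; a hash table keyed on the same pair would be the obvious replacement for large trees.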

@adammoody
Member

adammoody commented Nov 29, 2023

Thanks for the patch, @rezib, and thanks for the tip on the Lustre calls for hardlinks, @adilger.

Yes, this looks simple enough to add, and I understand the need.

We could also look to add hardlink support for the copy. I need to think about this more, but I suspect we could support this across file systems in general by using DTCMP_Rankv(). For each file that has hardlinks (st.st_nlink > 1), I think during the walk we could add the path to a local list on each process as a (inode, path) pair. After the walk completes, we could then identify all paths that map to the same inode value using DTCMP_Rankv where we use inode as the sort key. The process that has rank 0 for a particular inode value after that operation would be responsible for copying the file, while those entries that are assigned rank > 0 could create hardlinks.
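
The grouping idea can be sketched serially in C. This is a stand-in under stated assumptions, not the proposed implementation: it uses `qsort` on one process instead of calling DTCMP, and `link_entry`, `cmp_entry` and `rank_by_inode` are hypothetical names. It sorts (inode, path) pairs by inode and numbers the entries within each inode group, so that rank 0 would copy the file and ranks above 0 would create hardlinks.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

typedef struct {
    ino_t ino;        /* sort key */
    const char *path; /* path recorded during the walk */
    int rank;         /* position within the group sharing this inode */
} link_entry;

/* Order by inode; break ties by path so the result is deterministic. */
static int cmp_entry(const void *a, const void *b)
{
    const link_entry *x = a;
    const link_entry *y = b;
    if (x->ino != y->ino) {
        return (x->ino < y->ino) ? -1 : 1;
    }
    return strcmp(x->path, y->path);
}

/* Sort by inode, then number the entries inside each inode group from 0. */
void rank_by_inode(link_entry *entries, size_t n)
{
    qsort(entries, n, sizeof(entries[0]), cmp_entry);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && entries[i].ino == entries[i - 1].ino) {
            entries[i].rank = entries[i - 1].rank + 1;
        } else {
            entries[i].rank = 0; /* first path seen for this inode */
        }
    }
}
```

In the distributed version the entries would be spread across processes, which is where a parallel ranking operation comes in; the sketch also ignores the device number, which would matter if the walk spans more than one filesystem.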

We could perhaps use the Lustre calls as an optimization.

This needs to be fleshed out more...

Assuming we do that, would this option still be useful in other cases?
