Commit 73ee1de (0 parents), showing 19 changed files with 2,082 additions and 0 deletions.

@@ -0,0 +1,12 @@
group
*.out
*.trash
*.err
*.lock
.nfs*
current_state

bsdmainutils/
du.sh
disk_usage.sh
update_logs/

@@ -0,0 +1,204 @@
# Introduction

Utilities that help find duplicate directories or files on large
filesystems, using Slurm if present. All files are hashed with SHA-256
in the process, and the results are stored in plain-text tables to
enable easy access and prevent code injection.

## Files

We calculate hashes for all files with `sha256sum` and sort the result.
The output is in the form `size,sha256,file-path`. Duplicate entries
end up next to each other, and large files sort to the bottom.
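
As a minimal sketch (assuming GNU coreutils; the shipped scripts add
escaping, parallelism and Slurm handling, and `$dir` is a placeholder),
such a table could be built like this:
```{bash}
# emit "size,sha256,path" for every file below $dir, then sort
# numerically by size so duplicates line up and large files sink down
find "$dir" -type f -print0 |
    while IFS= read -r -d '' f; do
        printf '%s,%s,%s\n' "$(stat -c %s "$f")" \
            "$(sha256sum "$f" | cut -d ' ' -f 1)" "$f"
    done | sort -t , -k 1,1n -k 2,2 > file_hashes.out
```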

## Directories

In order to identify duplicate directories, we concatenate the sorted
size, sha256 and name of all contained files and directories and apply
sha256 to the resulting string. That leaves us with one hash sum per
directory. The result is sorted and filtered for duplicated hashes,
which identify potentially duplicate directories and exclude all pairs
of directories that are not exactly identical.
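
As an illustration only (not the shipped code), a directory hash over
plain files could be computed as below; the real scheme also folds in
the entries of sub-directories:
```{bash}
# hash a directory by hashing the sorted "size,sha256,name" lines of
# its direct plain files; sub-directories are ignored in this sketch
dir_hash() (
    cd "$1" || return 1
    for f in *; do
        [[ -f "$f" ]] || continue
        printf '%s,%s,%s\n' "$(stat -c %s "$f")" \
            "$(sha256sum "$f" | cut -d ' ' -f 1)" "$f"
    done | sort | sha256sum | cut -d ' ' -f 1
)
```
Two directories then collide in their hash exactly when their sorted
entry lists agree byte for byte.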

# Instructions

Run `./update_file_hashes <dir1> <dir2> ...` to create or update the
file hash tables. This can be run by all users of the group (see Group
Usage) from anywhere on the server. If you run the script without
arguments, it reports the state of any currently running hash update.

In order to use the tables to find duplicate files or directories, run
`./update_dupes`. This utility should only be run by the maintainer.

If Slurm is available, the scripts queue a Slurm job that waits for
other jobs of the same repo. You can run jobs locally with the option
`--local` or `-l`, as in the example below.
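
For instance (the path is hypothetical):
```{bash}
# hash one tree on the local machine instead of queueing a Slurm job
./update_file_hashes --local /data/projects
```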

Directories listed in `blacklist` are not included.

The result tables are explained below.
## Group Usage | ||
|
||
If you want all members of a given group to be able to update the hash table | ||
with `update_file_hashes`, write the group name to a file `group` in the | ||
same directory as `update_file_hashes` and protect it from manipulation. | ||
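
A hypothetical setup, where the group name `analysis` and the ownership
scheme are assumptions:
```{bash}
# allow group "analysis" to run updates; root ownership protects the
# file itself from manipulation
echo 'analysis' | sudo tee group > /dev/null
sudo chown root:root group && sudo chmod 644 group
```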

## Freeable Space

After finding the duplicates, an estimate of the total disk space that
could be freed by removing all duplicate files can be calculated with
`./sum_duplicate_size`.
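
A back-of-the-envelope equivalent, assuming `dupes.out` lines have the
`<size>,<hash>,<path>` format described below: every copy beyond the
first in a size-and-hash group is freeable.
```{bash}
# count the files sharing each (size,hash) pair and sum the bytes that
# deleting all but one copy per group would release
awk -F , '{ key = $1 "," $2; n[key]++; size[key] = $1 }
          END { for (k in n) if (n[k] > 1) total += size[k] * (n[k] - 1)
                print total, "bytes freeable" }' dupes.out
```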

## Missing Duplicates

By default, all sub-directories of directories that have a duplicate
are removed from `dupes.out` because they are duplicates as well
anyway. E.g., if `A` is a duplicate of `B`, then all sub-dirs, e.g.,
`A/a` and `B/a`, are duplicates of one another and will not be listed
separately.

However, sometimes there is an independent duplicate of the
sub-directory, e.g., `C/a` is a duplicate of `A/a` and `B/a` but `C` is
**not** a duplicate of `A`. Then only `C/a` will be listed with no
visible duplicate in `dupes.out`.

Worse, if `C` has another duplicate `D`, the independent duplications
`A/a`=`B/a`=`C/a`=`D/a` will not be listed at all. But the duplication
of the super-directories `A`=`B` and `C`=`D` will be listed. We
consider this scenario to be a very rare case.

If you want to make sure a directory or file has no duplicates you are
not aware of, use `dupes_with_subs.out`!

## Disk Usage Utility

When all files are hashed, one can use `./du` to quickly calculate the
byte size of any directory or file among the searched ones with:
```{bash}
./du <dir1> <dir2> <dir3>/* ...
```
If `./update_dupes` was run recently, you can use it with the option
`-q` or `--quick`. This works faster for large directories.

## Removing Directories

If you want to remove selected directories from the tables, you can
also use the partial update scripts by setting the environment variable
`PURGE`. This will remove the entries for `<dir1> <dir2> ...`:
```{bash}
PURGE=yes ./update_file_hashes <dir1> <dir2> ...
```

## Difference

Sometimes it is surprising that two very similar directories do not
show up in `dupes.out` and also have different hashes in
`dir_hashes.out`. For such cases, you can use `diff` to find out why
they differ. The utility gives you all files that are unique in a set
of directories. To get all files that occur only once in either
`<dir1>` or `<dir2>`, use
```{bash}
./diff <dir1> <dir2>
```
If two files with the same name are listed, they probably differ in
their sub-directory position, size or hash sum.

## Show Duplicates

One way is to browse the `human_dupes.out` result table, where the
largest duplicates are listed first. If you want to list all duplicates
of a given path, you can use the `./dupes` utility. It returns the
duplicate files in the format `<size>,<hash>,<path>`.
- `./dupes <path1> <path2> ...` returns one line per duplicate file and
  the input path as a descriptive title for each set of duplicates. The
  returned duplicates do not include the given paths themselves.
- `./dupes <tag1> <tag2> ...` returns all files with the given tag of
  the format `<size>,<hash>` and the tag as a descriptive title. This
  is much faster than using paths.
- `./dupes <tag1>,<path1> <tag2>,<path2> ...` returns all duplicates of
  the given files. This does not include the given paths themselves and
  is as fast as using tags.
- `./dupes -r <dir1> <dir2> ...` returns all files inside the given
  directories that have a hashed duplicate somewhere in the file
  system.

The listed formats of the arguments can also be mixed.

Instead of giving arguments, you can also pipe them in. One use case is
to look for all duplicates **of** the non-unique files in the given
directories with
```{bash}
./dupes -r <dir1> <dir2> ... | ./dupes
```
and if you want to list all the duplicates including those inside the
given directories, you can pass only the tags with
```{bash}
./dupes -r <dir1> <dir2> ... | cut -d , -f 1,2 | ./dupes
```

## Emulate sha256deep

The utility `./hashdeep <dir>` uses the table of hashed files to
quickly emulate the output of
[sha256deep](http://md5deep.sourceforge.net/start-md5deep.html)
`sha256deep <dir>`.
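
A rough sketch of the idea (not the shipped script; `$dir` is a
placeholder): pull the rows below `$dir` from the path-first table and
reorder them into sha256deep's `<hash>  <path>` layout.
```{bash}
# sorted_file_hashes.out holds "<path>,<size>,<sha256>" lines; print
# the matching rows as "<hash>  <path>" like sha256deep would
dir=/data/projects # hypothetical
sed -n "s|^\($dir/.*\),[0-9][0-9]*,\([0-9a-f]*\)\$|\2  \1|p" \
    sorted_file_hashes.out
```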

## Logs

All calls of `update_file_hashes` and `update_dupes` are logged in
`./update_logs/`.

# Result Tables

The results are used in some of the utilities above. Featured tables
are:
- `file_hashes.out` All hashed files in the format
  `<size in bytes>,<sha256sum>,<path>`.
- `sorted_file_hashes.out` Hashed files sorted by path in the format
  `<path>,<size in bytes>,<sha256sum>`.
- `dir_hashes.out` All directory hashes (available only after
  `update_dupes`).
- `dupes.out` Duplicates without entries inside of duplicated
  directories, sorted from large to small.
- `human_dupes.out` As above with human-readable sizes.
- `dupes_with_subs.out` As `dupes.out` but with **all** duplicate files
  and directories, sorted with `LC_COLLATE=C` for fast lookup with the
  `./look` utility (see the example below).
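
As an illustration of such a `look` lookup (the path is hypothetical;
the `./diff` utility performs the same kind of prefix search
internally):
```{bash}
# look(1) binary-searches the C-collated, path-first table, so this
# answers quickly even for multi-gigabyte tables
LC_ALL=C ./look '/data/projects/old_backup/' sorted_file_hashes.out
```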

# Dependencies

All dependencies come with most Linux distributions, but the shipped
version of `look` from
[bsdmainutils](https://packages.debian.org/de/sid/bsdmainutils) often
has a bug that prevents it from working with files larger than 2 GB.
This repo comes with a patched version that was compiled on Ubuntu
18.04 x86_64. If you have issues running it, please compile your own
patched
[bsdmainutils-look](https://github.com/stuartraetaylor/bsdmainutils-look)
and replace the file `./look` with your binary or a link to it.

Other dependencies and the versions we tested with are
- bash 4.4.20
- bc 1.07.1
- GNU Awk 4.1.4
- GNU coreutils 8.28
- GNU parallel 20161222
- GNU sed 4.4
- util-linux 2.31.1

If available, we also support
- slurm-wlm 17.11.2

# Note

We use GNU parallel:
*O. Tange (2018): GNU Parallel 2018, March 2018,
https://doi.org/10.5281/zenodo.1146014.*

# Code Maintainer

The maintainer must be aware that filenames in Linux can contain any
character except the null byte `\x00` and the forward slash `/`. This
can complicate the processing of the hash tables and breaks many common
text-processing solutions in rare cases. Since there are many users
with many files on the system, those cases tend to exist.
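
In practice this means that only the null byte can safely delimit
arbitrary paths, e.g.:
```{bash}
# paths may contain newlines, so use NUL-delimited streams end to end
find . -type f -print0 | xargs -0 sha256sum
```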

Another curiosity is the output of `sha256sum` for weird filenames. You
can test this with `touch "a\b"; sha256sum "a\b"`. The backslash in the
filename is escaped and, strangely, the hash sum starts with a
backslash, which is not part of the correct sum for files of size 0.
This behavior is explained
[here](https://unix.stackexchange.com/questions/313733/various-checksums-utilities-precede-hash-with-backslash).
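
With GNU coreutils the session looks roughly like this (the hash shown
is the well-known SHA-256 of the empty file):
```{bash}
$ touch 'a\b'; sha256sum 'a\b'
\e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  a\\b
```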

@@ -0,0 +1,20 @@
#! /bin/bash -

# Skip the cronjob if there is nothing new to process or if an
# update_dupes instance is already running.
dir="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
cd "$dir" || exit 1
source utilities.sh

# name of the most recent log file (without its extension)
last=$(find update_logs 2> /dev/null | sort -t. -k2 -n | tail -n1 | \
    sed 's|.*/\([^.]*\)\..*|\1|')
running=$(get_running --name="finding duplicates")

if [[ "$last" == "update_dupes" ]]; then
    >&2 printf 'There are no logs after the last run of update_dupes.\n'
    >&2 printf 'The cronjob will be skipped.\n'
elif [[ -n "$running" ]]; then
    >&2 printf 'An instance of update_dupes with slurm id '
    >&2 printf '%s is already running.\n' "$running"
    >&2 printf 'The cronjob will be skipped.\n'
else
    ./update_dupes
fi

@@ -0,0 +1,52 @@
#! /bin/bash -

# Returns the files that are unique among the given dirs
if [[ "$1" == "-h" ]] || [[ "$1" == "--help" ]]; then
    >&2 printf 'Usage: %s <dir1> <dir2> ...\n' "$0"
    exit 0
fi

dir="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
export LC_ALL=C # byte-wise sorting
export GREP_COLOR='32'
export look="$dir/look"
export sorted_filehashes="$dir/sorted_file_hashes.out"

tempHashes=$(mktemp -p /dev/shm)
tempOut=$(mktemp -p /dev/shm)
trap 'rm "$tempHashes" "$tempOut"' EXIT

# Print the "path,size,hash" lines of all files below the given path,
# with the searched prefix terminated by a doubled slash for later
# stripping.
getFileHashes()(
    path="$(readlink -f "$1")"
    # escape backslashes and newlines the same way the tables do
    path="${path//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    escaped=$(printf '%s' "$path" | \
        sed -r 's/([\$\.\*\/\[\\^])/\\\1/g' | \
        sed 's/[]]/\[]]/g')
    replace=$(printf '%s' "$path" | \
        sed -r 's/(["&\/\\])/\\\1/g')
    modLine(){ sed "s|^$escaped|$replace/|g"; }
    tempFile=$(mktemp -p /dev/shm)
    # the body runs in a subshell, so EXIT (not RETURN) fires at its end
    trap 'rm "$tempFile"' EXIT
    "$look" "$path/" "$sorted_filehashes" | \
        modLine > "$tempFile"
    if [[ $? -ne 0 || ! -s "$tempFile" ]]; then
        # fallback since look cannot deal with special chars
        text='\e[33mWarning:\e[0m No matches -> using slower grep for %s\n'
        >&2 printf "$text" "$path"
        grep -aF "$path" "$sorted_filehashes" | \
            modLine > "$tempFile"
    fi
    if [[ ! -s "$tempFile" ]]; then
        >&2 printf '\e[33mWarning:\e[0m No entries found for %s\n' "$path"
    fi
    cat "$tempFile"
)
export -f getFileHashes

# Keep only the part after the doubled slash (path relative to the given
# dir plus size and hash) and select lines occurring exactly once: those
# belong to files unique to one of the given dirs.
printf '%s\0' "$@" | parallel --will-cite -0 -k getFileHashes | tee "$tempOut" | \
    sed 's|^.*//||g' | sort | uniq -u > "$tempHashes"
# highlight the unique part and strip the ",size,hash" tail for display
ColorEsc=$(printf '\e[0m')
grep -aFf "$tempHashes" --color=always "$tempOut" | \
    sed "s|//|/|g; s/,[^,]*,[^,]*$/$ColorEsc/g"

@@ -0,0 +1,73 @@
#! /bin/bash -

# This script uses sorted_file_hashes.out to quickly estimate the file
# and directory sizes of all passed paths. It also accepts globs.
# Usage: ./du <path1> <path2> ...

export LC_ALL=C # byte-wise sorting
export OPTERR=0 # silent getopts
dir="$(dirname "$(readlink -f "${BASH_SOURCE[0]}")")"
export look="$dir/look"
export sorted_filehashes="$dir/sorted_file_hashes.out"
export dirhashes="$dir/dir_hashes.out"

read -d '' help_message << EOF
Usage $0 [-q, --quick] <path1> <path2> ...
    -q, --quick : Quick mode (requires dir_hashes, e.g. by update_dupes).
EOF

disp_help(){
    printf '%s\n' "$help_message"
}

get_func="slow_getsize"
while getopts "h?q-:" opt; do
    case "$opt" in
        -) case "${OPTARG}" in
               quick) get_func="quick_getsize";;
               help) disp_help
                     exit 0;;
               *) disp_help
                  exit 1;;
           esac ;;
        q) get_func="quick_getsize";;
        h) disp_help
           exit 0;;
        *) disp_help
           exit 1;;
    esac
done

# Sum the sizes of all table entries below the given path.
slow_getsize(){
    f="$(realpath "$*")"
    path="${f//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    size=$(cat <("$look" "$path" "$sorted_filehashes") <(printf "0,0,0") | \
        parallel --will-cite --pipe \
        "sed -e 's/.*,\([0-9]*\),[^,]*$/\1/g' -e '/[^0-9]/d'" | paste -sd+ | bc)
    size=$(numfmt --to=iec "$size")
    printf '%s\t%s\n' "$size" "$*"
}
# Read the precomputed size from dir_hashes.out or the file table.
quick_getsize(){
    f="$(realpath "$*")"
    path="${f//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    if [[ -d "$f" ]]; then
        SIZE="$(grep -F ",$path/" "$dirhashes" | cut -d, -f1 | \
            awk 'BEGIN{a=0}{if ($1>0+a) a=$1} END{print a}')"
    elif [[ -f "$f" ]]; then
        SIZE=$("$look" "$path," "$sorted_filehashes" | head -n1 | \
            sed -n 's/^.*,\([0-9]*\),[^,]*$/\1/p')
    else
        m="Error: Type of %s cannot be determined since it does not exist.\\n"
        >&2 printf "$m" "$*"
        return
    fi
    [[ -z "$SIZE" ]] && SIZE="0" || SIZE="$(numfmt --to=iec "$SIZE")"
    printf '%s\t%s\n' "$SIZE" "$*"
}
export -f slow_getsize quick_getsize

printf '%s\0' "${@:$OPTIND}" | parallel --will-cite -k -0 "$get_func"