Commit 73ee1de: flattened history
katosh committed Oct 23, 2019 (0 parents)
Showing 19 changed files with 2,082 additions and 0 deletions.

12 changes: 12 additions & 0 deletions .gitignore
group
*.out
*.trash
*.err
*.lock
.nfs*
current_state

bsdmainutils/
du.sh
disk_usage.sh
update_logs/
674 changes: 674 additions & 0 deletions LICENSE


204 changes: 204 additions & 0 deletions README.md
# Introduction

Utilities that help find duplicate directories or files
on large filesystems, using slurm if it is available. All files
are hashed with sha256 in the process, and the results
are stored in plain text files to allow easy access and prevent
code injection.

## Files

We calculate hashes for all files with `sha256sum` and sort the result.
The output has the form `size,sha256,file-path`.
Sorting places duplicate entries next to each other and
large files at the bottom.
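
The actual pipeline lives in the update scripts; as a minimal sketch of the idea
(with `<dir>` as a placeholder, not the repository's exact commands), such a table
could be produced with:
```{bash}
find "<dir>" -type f -print0 |
    while IFS= read -r -d '' f; do
        printf '%s,%s,%s\n' "$(stat -c %s "$f")" \
            "$(sha256sum "$f" | cut -d ' ' -f 1)" "$f"
    done | sort -t , -k 1,1n -k 2,2
```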

## Directories

In order to identify duplicate directories, we concatenate the sorted
size, sha256 and name of all contained files and directories and apply sha256 to
the resulting string. That leaves us with one hash sum per directory. The result
is sorted and filtered for duplicated hashes, which identify
potentially duplicate directories and exclude all pairs of directories that
are not exactly identical.
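
The tables themselves are built by the update scripts; a rough sketch of the idea
is the hypothetical helper below (here sub-directories contribute only their hash
and name, while the real tables also carry sizes):
```{bash}
# print one sha256 per directory, derived from the sorted listing of its children
dir_hash() {
    local entry line listing=""
    for entry in "$1"/*; do
        if [[ -f "$entry" ]]; then
            line="$(stat -c %s "$entry"),$(sha256sum "$entry" | cut -d ' ' -f 1),${entry##*/}"
        elif [[ -d "$entry" ]]; then
            line="$(dir_hash "$entry"),${entry##*/}"
        else
            continue
        fi
        listing+="$line"$'\n'
    done
    printf '%s' "$listing" | sort | sha256sum | cut -d ' ' -f 1
}
```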

# Instructions

Run `./update_file_hashes <dir1> <dir2> ...` to create or update the file hash
tables. This can be run by all users of the group (see Group Usage) from
anywhere on the server. If you run the script without arguments, it reports the
state of any currently running hash update.

In order to use the tables to find duplicate
files or directories you can run `./update_dupes`.
This utility should only be run by the maintainer.

If slurm is available, the scripts queue a slurm job
that waits for other jobs of the same repo.
You can run jobs locally with the option `--local` or `-l`.
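
For orientation, typical calls could look like this (the directory names are placeholders):
```{bash}
./update_file_hashes /data/projects /data/archive   # build or refresh the file hash tables
./update_file_hashes                                 # report the state of a running update
./update_dupes --local                               # maintainer only: rebuild the duplicate tables without slurm
```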

Directories listed in `blacklist` are not included.

The result tables are explained below.

## Group Usage

If you want all members of a given group to be able to update the hash table
with `update_file_hashes`, write the group name to a file `group` in the
same directory as `update_file_hashes` and protect it from manipulation.
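
For example, with a hypothetical group name `datalab`:
```{bash}
echo datalab > group
chmod 444 group   # one possible way to protect the file from accidental changes
```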

## Freeable Space

After finding the duplicates, an estimate of the total disk space that
could be freed by removing all duplicate files can be calculated with
`./sum_duplicate_size`.

## Missing Duplicates

By default, all sub-directories of directories that have a duplicate are
omitted from `dupes.out` because they are duplicates as well. E.g., if `A` is
a duplicate of `B`, then corresponding sub-directories such as `A/a` and `B/a`
are duplicates of one another and are not listed separately.

However, sometimes there is an independent duplicate of a sub-directory,
e.g., `C/a` is a duplicate of `A/a` and `B/a` but `C` is **not** a duplicate
of `A`. Then only `C/a` will be listed, with no visible duplicate in
`dupes.out`.

Worse, if `C` has another duplicate `D`, the independent duplication
`A/a`=`B/a`=`C/a`=`D/a` will not be listed at all, although the duplication of
the super-directories `A`=`B` and `C`=`D` will be. We consider this scenario
to be a very rare case.

If you want to make sure a directory or file has no further duplicates you
are not aware of, use `dupes_with_subs.out`!
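
For concreteness, a minimal layout reproducing this scenario can be created like this
(hypothetical directory names):
```{bash}
mkdir -p A/a B C/a C/extra D
echo same  > A/a/f
echo other > C/extra/g
cp A/a/f C/a/f   # A/a and C/a now have identical content
cp -r A/. B/     # B duplicates A
cp -r C/. D/     # D duplicates C, but C does not duplicate A
# dupes.out then lists A=B and C=D; the cross duplication A/a=C/a
# only shows up in dupes_with_subs.out
```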

## Disk Usage Utility

Once all files are hashed, you can use `./du` to
quickly calculate the byte size of any directory or file
among the searched ones with:
```{bash}
./du <dir1> <dir2> <dir3>/* ...
```
If `./update_dupes` has been run recently, you can use the option `-q` or
`--quick`, which works faster for large directories.
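
For example (placeholder paths):
```{bash}
./du -q /data/projects /data/archive/*
```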

## Removing Directories

If you want to remove selected directories from the tables, you can also
use the partial update scripts by setting the environment variable
`PURGE`. This removes the entries for `<dir1> <dir2> ...`:
```{bash}
PURGE=yes ./update_file_hashes <dir1> <dir2> ...
```

## Difference

Sometimes it is surprising that two very similar directories do not show up in
`dupes.out` and also have different hashes in `dir_hashes.out`. For such
cases, you can use `./diff` to find out why they differ. The utility gives you
all files that are unique within a set of directories. To get all files that
occur only once in either `<dir1>` or `<dir2>` use
```{bash}
./diff <dir1> <dir2>
```
If two files with the same name are listed, they probably differ in
their sub-directory position, size or hash sum.

## Show Duplicates

One way to show duplicates is to browse the `human_dupes.out` result table,
where the largest duplicates are listed first. If you want to list all
duplicates of a given path, you can use the `./dupes` utility. It returns the
duplicate files in the format `<size>,<hash>,<path>`.
- `./dupes <path1> <path2> ...` returns one line per duplicate file and the
input path as a descriptive title to each set of duplicates. The returned
duplicates do not include the given paths themselves.
- `./dupes <tag1> <tag2> ...` returns all files with the given tag of the
format `<size>,<hash>` and the tag as a descriptive title. This is much
faster than using paths.
- `./dupes <tag1>,<path1> <tag2>,<path2> ...` returns all duplicates of the
given files. That does not include the given paths themselves and is as fast
as using tags.
- `./dupes -r <dir1> <dir2> ...` returns all files inside the given
directories that have a hashed duplicate somewhere in the file system.

The listed formats of the arguments can also be mixed.
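
For illustration, the argument forms above correspond to calls like these (all placeholders):
```{bash}
./dupes <path1>                # duplicates of one file or directory, excluding <path1> itself
./dupes <size>,<hash>          # all files carrying this tag (faster than a path)
./dupes <size>,<hash>,<path1>  # duplicates of a tagged file, as fast as a tag
./dupes -r <dir1>              # files below <dir1> that are duplicated somewhere on the file system
```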

Instead of giving arguments, you can also pipe them in. One use case is to
look for all duplicates **of** the non-unique files in the given directories with
```{bash}
./dupes -r <dir1> <dir2> ... | ./dupes
```
and if you want to list all the duplicates, including those inside the given
directories, you can pass only the tags with
```{bash}
./dupes -r <dir1> <dir2> ... | cut -d , -f 1,2 | ./dupes
```

## Emulate sha256deep

The utility `./hashdeep <dir>` uses the table of hashed files to quickly emulate
the output of `sha256deep <dir>` from
[sha256deep](http://md5deep.sourceforge.net/start-md5deep.html).

## Logs

All calls of `update_file_hashes` and `update_dupes` are logged in `./update_logs/`.

# Result Tables

The results are used in some of the utilities above. Featured tables are:
- `file_hashes.out` All hashed files in the format `<size in bytes>,<sha256sum>,<path>`.
- `sorted_file_hashes.out` Hashed files sorted by path in the format `<path>,<size in bytes>,<sha256sum>`
  (see the lookup example below).
- `dir_hashes.out` All directory hashes (available only after `update_dupes`).
- `dupes.out` Duplicates without entries inside of duplicated directories.
  Sorted from large to small.
- `human_dupes.out` As above with human-readable sizes.
- `dupes_with_subs.out` As `dupes.out` but with **all** duplicate files and directories,
  sorted with `LC_COLLATE=C` for fast lookup with the `./look` utility.
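
For manual inspection, `./look` performs a binary-search prefix lookup on these
tables, the same way the `./diff` and `./du` utilities use it. For example, to
list every hashed file below a path (placeholder path, assuming the table was
built with the same C collation the scripts use):
```{bash}
LC_ALL=C ./look "/data/projects/" sorted_file_hashes.out
```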


# Dependencies

All dependencies come with most Linux distributions, but the version of
`look` shipped with [bsdmainutils](https://packages.debian.org/de/sid/bsdmainutils)
often has a bug that prevents it from working with files larger than 2 GB.
This repo comes with a patched version that was compiled on Ubuntu 18.04 x86_64.
If you have issues running it, please compile your own patched
[bsdmainutils-look](https://github.com/stuartraetaylor/bsdmainutils-look)
and replace the file `./look` with your binary or a link to it.
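
For example, if you have built a patched binary yourself (the path is hypothetical):
```{bash}
ln -sf /usr/local/bin/look-patched ./look
```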

Other dependencies and the versions we tested with are
- bash 4.4.20
- bc 1.07.1
- GNU Awk 4.1.4
- GNU coreutils 8.28
- GNU parallel 20161222
- GNU sed 4.4
- util-linux 2.31.1

If available, we also support
- slurm-wlm 17.11.2

# Note

We use GNU parallel:
*O. Tange (2018): GNU Parallel 2018, March 2018, https://doi.org/10.5281/zenodo.1146014.*

# Code Maintainer

The maintainer must be aware that filenames in Linux can contain any character
except the null byte `\x00` and the forward slash `/`. This can complicate the
processing of the hash tables and breaks many common text-processing solutions
in rare cases. Since there are many users with many files on the system, those
cases do occur.
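
As a small illustration, a filename containing a newline breaks any
line-oriented pipeline, which is why the scripts pass paths null-delimited
(`printf '%s\0' ... | parallel -0`). A quick test:
```{bash}
touch $'new\nline'                              # a legal Linux filename with a newline in it
printf '%s\0' $'new\nline' | xargs -0 sha256sum # survives only because the path is null-delimited
```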

Another curiosity is the output of `sha256sum` for weird filenames. You can test
this with `touch "a\b"; sha256sum "a\b"`. The backslash in the filename is escaped
and, strangely, the output line starts with a backslash that is not part of the
correct hash sum for files of size 0. This behavior is explained
[here](https://unix.stackexchange.com/questions/313733/various-checksums-utilities-precede-hash-with-backslash).
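
A quick reproduction with GNU coreutils:
```{bash}
touch "a\b"
sha256sum "a\b"
# prints: \e3b0c442...  a\\b
# the leading backslash only marks the escaped filename; the real hash of the
# empty file starts with e3b0c442, without a backslash
```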

20 changes: 20 additions & 0 deletions cronjob
#! /bin/bash -

dir="$(dirname $(readlink -f ${BASH_SOURCE[0]}))"
cd "$dir"
source utilities.sh

last=$(find update_logs | sort -t. -k2 -n | tail -n1 | \
sed 's|.*/\([^.]*\)\..*|\1|') 2> /dev/null
running=$(get_running --name="finding duplicates")

if [[ "$last" == "update_dupes" ]]; then
>&2 printf 'There are no logs after the last run of update_dupes.\n'
>&2 printf 'The cronjob will be skipped.\n'
elif [[ ! -z "$running" ]]; then
>&2 printf 'An instance of update_dupes with slurm id '
>&2 printf '%s is already runnign.\n' "$running"
>&2 printf 'The cronjob will be skipped.\n'
else
./update_dupes
fi
52 changes: 52 additions & 0 deletions diff
#! /bin/bash -

# Returns the files that are unique among the dirs
if [[ "$1" == "-h" ]] || [[ "$1" == "--help" ]]; then
>&2 printf "Usage: $0 <dir1> <dir2> ...\n"
exit 0
fi

dir="$(dirname $(readlink -f ${BASH_SOURCE[0]}))"
export LC_ALL=C # byte-wise sorting
export GREP_COLOR='32'
export look="$dir/look"
export sorted_filehashes="$dir/sorted_file_hashes.out"

tempHashes=$(mktemp -p /dev/shm)
tempOut=$(mktemp -p /dev/shm)
trap "rm $tempHashes $tempOut" EXIT

getFileHashes()(
    path="$(readlink -f "$1")"
    path="${path//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    escaped=$(printf '%s' "$path" | \
        sed -r 's/([\$\.\*\/\[\\^])/\\\1/g' | \
        sed 's/[]]/\[]]/g')
    replace=$(printf '%s' "$path" | \
        sed -r 's/(["&\/\\])/\\\1/g')
    modLine(){ sed "s|^$escaped|$replace/|g"; }
    tempFile=$(mktemp -p /dev/shm)
    trap "rm $tempFile" RETURN
    "$look" "$path/" "$sorted_filehashes" | \
        modLine > $tempFile
    if [[ $? -ne 0 || ! -s $tempFile ]]; then
        # fallback since look cannot deal with special chars
        text='\e[33mWarning:\e[0m No matches -> using slower grep for %s\n'
        >&2 printf "$text" "$path"
        grep -aF "$path" "$sorted_filehashes" | \
            modLine > $tempFile
    fi
    if [[ ! -s $tempFile ]]; then
        >&2 printf '\e[33mWarning:\e[0m No entries found for %s\n' "$path"
    fi
    cat $tempFile
)
export -f getFileHashes

printf '%s\0' "$@" | parallel --will-cite -0 -k getFileHashes | tee $tempOut | \
sed 's|^.*//||g' | sort | uniq -u > $tempHashes
ColorEsc=$(printf '\e[0m')
grep -aFf $tempHashes --color=always $tempOut | \
sed "s|//|/|g; s/,[^,]*,[^,]*$/$ColorEsc/g"

73 changes: 73 additions & 0 deletions du
#! /bin/bash -

# This script uses sorted_file_hashes.out to
# quickly estimate file and directory sizes of
# all passed paths. It also accepts globs.
# Usage: ./du <path1> <path2> ...

export LC_ALL=C # byte-wise sorting
export OPTERR=0 # silent getopts
dir="$(dirname $(readlink -f ${BASH_SOURCE[0]}))"
export look="$dir/look"
export sorted_filehashes="$dir/sorted_file_hashes.out"
export dirhashes="$dir/dir_hashes.out"

read -d '' help_message << EOF
Usage: $0 [-q|--quick] <path1> <path2> ...
-q, --quick : Quick mode (requires dir_hashes.out, e.g. created by update_dupes).
EOF

disp_help(){
    printf '%s\n' "$help_message"
}

get_func="slow_getsize"
while getopts "h?q-:" opt; do
    case "$opt" in
        -) case "${OPTARG}" in
            quick) get_func="quick_getsize";; # same as -q
            help) disp_help
                exit 0;;
            *) disp_help
                exit 1;;
            esac ;;
        q) get_func="quick_getsize";;
        h) disp_help
            exit 0;;
        *) disp_help
            exit 1;;
    esac
done

slow_getsize(){
    f="$(realpath "$*")"
    path="${f//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    size=$(cat <("$look" "$path" "$sorted_filehashes") <(printf "0,0,0") | \
        parallel --will-cite --pipe \
            "sed -e 's/.*,\([0-9]*\),[^,]*$/\1/g' -e '/[^0-9]/d'" | paste -sd+ | bc)
    size=$(numfmt --to=iec $size)
    printf '%s\t%s\n' "$size" "$*"
}
quick_getsize(){
    f="$(realpath "$*")"
    path="${f//\\/\\\\}"
    path="${path//$'\n'/\\n}"
    if [[ -d "$f" ]]; then
        SIZE="$(grep -F ",$path/" "$dirhashes" | cut -d, -f1 | \
            awk 'BEGIN{a=0}{if ($1>0+a) a=$1} END{print a}')"
    elif [[ -f "$f" ]]; then
        SIZE=$("$look" "$path," "$sorted_filehashes" | head -n1 | \
            sed -n 's/^.*,\([0-9]*\),[^,]*$/\1/p')
    else
        m="Error: Type of %s cannot be determined since it does not exist.\\n"
        >&2 printf "$m" "$*"
        return
    fi
    [[ -z "$SIZE" ]] && SIZE="0" || SIZE="$(numfmt --to=iec "$SIZE")"
    printf '%s\t%s\n' "$SIZE" "$*"
}
export -f slow_getsize quick_getsize

printf '%s\0' "${@:$OPTIND}" | parallel --will-cite -k -0 "$get_func"