Issue70 #74

Merged
merged 23 commits on Feb 8, 2024
Commits
4266ae5
feat: has 3 new arguments , and
kopardev Feb 5, 2024
37f6301
feat: adding '--outfile' argument
kopardev Feb 5, 2024
7b32b7f
chore: improving spacesavers2_catalog cli help
kopardev Feb 5, 2024
f2e7ccc
fix: logic update, new arguments fix#71
kopardev Feb 5, 2024
d0c5c52
fix: changes from mimeo trickling down
kopardev Feb 5, 2024
57ce249
chore: finddup replaced with mimeo etc.
kopardev Feb 5, 2024
c404e96
fix: updates for mimeo, blamematrix improvements
kopardev Feb 5, 2024
c149505
fix: improvements, new logic required for fixing #71
kopardev Feb 5, 2024
a8b7905
chore: help improvement
kopardev Feb 5, 2024
ef2ba91
feat: e2e complete overhaul
kopardev Feb 5, 2024
e5286f7
docs: updated docs to reflect new changes
kopardev Feb 5, 2024
a08c04e
docs: changelog updates
kopardev Feb 5, 2024
cf5744c
chore: adding dev prefix to version
kopardev Feb 5, 2024
6d7a2a2
chore:updating version
kopardev Feb 5, 2024
99b3c49
chore: updates to CHANGELOG
kopardev Feb 5, 2024
ccef883
updates
kopardev Feb 7, 2024
fe89fac
more fixes to not count files with the same inode as duplicates logic…
kopardev Feb 7, 2024
bdee6d3
docs: INFOLDER argument is called FOLDER in e2e CLI
kelly-sovacool Feb 7, 2024
56b8cad
docs: improve changelog notes for #74
kelly-sovacool Feb 7, 2024
1e57b5d
docs: usurp's change is due to grubbers output change for hardlinks
kelly-sovacool Feb 7, 2024
20ef4c2
Merge branch 'issue70' of https://github.com/CCBR/spacesavers2 into i…
kopardev Feb 7, 2024
f72923a
docs: more updates/corrections to e2e.md
kopardev Feb 7, 2024
615f1ca
refac: adding prefix to blamematrix output file
kopardev Feb 7, 2024
24 changes: 21 additions & 3 deletions CHANGELOG.md
@@ -2,11 +2,29 @@

### New features

- adding `requirements.txt` for easy creation of environment in "spacesavers2" docker (#68, @kopardev)
### Bug fixes

## Bug fixes
## spacesavers2 0.11.0

-
### New features

- Add `requirements.txt` for easy creation of environment in "spacesavers2" docker (#68, @kopardev)
- `grubbers` has new `--outfile` argument.
- `blamematrix` has now been moved into `mimeo`.
- `mimeo` files.gz always includes the original file as the first one in the filelist.
- `mimeo` now has kronatools-compatible output. ktImportText is also run, if it is in PATH, to generate an HTML report for duplicates only. (#46, @kopardev)
- Update documentation.

### Bug fixes

- `e2e` overhauled, improved and well commented.
- `grubbers` `--limit` can be < 1 GiB (float) (#70, @kopardev)
- `grubbers` output file format changed. New original file column added. Original file is required by `usurp`.
- `mimeo` `--duplicateonly` now correctly handles duplicates owned by different UIDs. (#71, @kopardev)
- Update `blamematrix` to account for corrected duplicate handling in `mimeo`.
- `usurp` now uses the new "original file" column from `grubbers` while creating hard-links.
- Total size now closely resembles `df` results (fix #75, @kopardev)
- Files with future timestamps are handled correctly (fix #76, @kopardev)

## spacesavers2 0.10.2

8 changes: 1 addition & 7 deletions README.md
@@ -20,19 +20,13 @@ Welcome! `spacesavers2`:

> New improved parallel implementation of [`spacesavers`](https://github.com/CCBR/spacesavers). `spacesavers` is soon to be decommissioned!

> Note: `spacesavers2` requires [python version 3.11](https://www.python.org/downloads/release/python-3110/) or later and the [xxhash](https://pypi.org/project/xxhash/) library. These dependencies are already installed on biowulf (as a conda env). The environment for running `spacesavers2` can get set up using:
>
> ```bash
> . "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh" && \
> conda activate py311
> ```
> Note: `spacesavers2` requires [python version 3.11](https://www.python.org/downloads/release/python-3110/) or later and the [xxhash](https://pypi.org/project/xxhash/) library. These dependencies are already installed on biowulf (as a conda env).

## `spacesavers2` has the following Basic commands:

- spacesavers2_catalog
- spacesavers2_mimeo
- spacesavers2_grubbers
- spacesavers2_blamematrix
- spacesavers2_e2e
- spacesavers2_usurp

Binary file modified docs/assets/images/spacesavers2.png
38 changes: 0 additions & 38 deletions docs/blamematrix.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/catalog.md
@@ -58,7 +58,7 @@ Example:
`spacesavers2_catalog` creates one semicolon-separated output line per input file. Here is an example line:

```bash
% head -n1 test.ls_out
% head -n1 test.catalog
"/data/CBLCCBR/kopardevn_tmp/spacesavers2_testing/_data_CCBR_Pipeliner_db_PipeDB_Indices.ls.old";False;1653453;47;372851499;1;1;5;5;37513;57886;4707e661a1f3beca1861b9e0e0177461;52e5038016c3dce5b6cdab635765cc79;
```
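
As a quick illustration (not part of this PR's diff), here is a minimal Python sketch that splits such a line into its 13 fields. It assumes only the trailing ";" shown above and that the last two fields are the xxHash values:

```python
# Illustrative sketch only: split one spacesavers2_catalog output line.
# Assumes ";"-separated fields with a trailing ";"; field meanings beyond the
# path (first) and the two xxHash values (last) are not spelled out here.
def parse_catalog_line(line: str) -> dict:
    fields = line.rstrip("\n").rstrip(";").split(";")
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, got {len(fields)}")
    return {
        "path": fields[0].strip('"'),
        "top_hash": fields[-2],
        "bottom_hash": fields[-1],
        "raw_fields": fields,
    }

with open("test.catalog") as fh:  # file name taken from the example above
    first = parse_catalog_line(next(fh))
    print(first["path"], first["top_hash"], first["bottom_hash"])
```
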
The 13 items in the line are as follows:
10 changes: 7 additions & 3 deletions docs/e2e.md
@@ -18,12 +18,16 @@ End-to-end run of spacesavers2

options:
-h, --help show this help message and exit
-i INFOLDER, --infolder INFOLDER
Folder to run spacesavers_ls on.
-f FOLDER, --folder FOLDER
Folder to run spacesavers_catalog on.
-p THREADS, --threads THREADS
number of threads to use
-d MAXDEPTH, --maxdepth MAXDEPTH
maxdepth for mimeo
-l LIMIT, --limit LIMIT
limit for running spacesavers_grubbers
-q QUOTA, --quota QUOTA
total size of the volume (default = 200 for /data/CCBR)
-o OUTFOLDER, --outfolder OUTFOLDER
Folder where all spacesavers_finddup output files will be saved
Folder where all spacesavers_e2e output files will be saved
```
37 changes: 19 additions & 18 deletions docs/grubbers.md
@@ -1,35 +1,38 @@
## spacesavers2_grubbers

This takes in the `.files.gz` generated by `spacesavers2_mimeo` and processes it to:
This takes in the `mimeo.files.gz` generated by `spacesavers2_mimeo` and processes it to:

- sort duplicates by total size
- reports the "high-value" duplicates.

Deleting these high-value duplicates first will have the biggest impact on the users overall digital footprint
Deleting these high-value duplicates first will have the biggest impact on the user's overall digital footprint.

### Inputs

- `--filesgz` output file from `spacesavers2_mimeo`.
- `--limit` lower cut-off for output display (default 5 GiB). This means that duplicates with overall size of less than 5 GiB will not be displayed.
- `--limit` lower cut-off for output display (default 5 GiB). This means that duplicates with overall size of less than 5 GiB will not be displayed. Set 0 to report all.

```bash
% spacesavers2_grubbers --help
spacesavers2_grubbers:00000.01s:version: v0.5
usage: spacesavers2_grubbers [-h] -f FILESGZ [-l LIMIT]
╰─○ spacesavers2_grubbers --help
spacesavers2_grubbers:00000.00s:version: v0.10.2-dev
usage: spacesavers2_grubbers [-h] -f FILESGZ [-l LIMIT] [-o OUTFILE] [-v]

spacesavers2_grubbers: get list of large duplicates sorted by total size

options:
-h, --help show this help message and exit
-f FILESGZ, --filesgz FILESGZ
spacesavers2_mimeo prefix.<user>.files.gz file
spacesavers2_mimeo prefix.<user>.mimeo.files.gz file
-l LIMIT, --limit LIMIT
stop showing duplicates with total size smaller then (5 default) GiB
stop showing duplicates with total size smaller than (5 default) GiB. Set 0 for unlimited.
-o OUTFILE, --outfile OUTFILE
output tab-delimited file (default STDOUT)
-v, --version show program's version number and exit

Version:
v0.5
v0.10.2-dev
Example:
> spacesavers2_grubbers -f /output/from/spacesavers2_mimeo/prefix.files.gz
> spacesavers2_grubbers -f /output/from/spacesavers2_finddup/prefix.files.gz
```

### Outputs
@@ -40,18 +43,16 @@ The output is displayed on STDOUT and is tab-delimited with these columns:
| ------ | ------------------------------------- |
| 1 | combined hash |
| 2 | number of duplicates found |
| 3 | total size of all duplicates |
| 4 | size of each duplicate |
| 5 | ";"-separated list of duplicates |
| 6 | duplicate files |
| 3 | total size of all duplicates (human readable) |
| 4 | size of each duplicate (human readable) |
| 5 | original file |
| 6 | ";"-separated list of duplicate files |

Here is an example output line:

```bash
ca269c980de3f0d8e6668b88d9065c8f#5003f92f52d71437741e4e79c4339a66 3 21.99 GiB 7.33 GiB "/data/CCBR/ccbr754_Yoshimi/ccbr754/workdir_170403_postinitialrnaseq2/0h_1_S25.p2.Aligned.toTranscriptome.sorted.bam";"/data/CCBR/ccbr754_Yoshimi/ccbr754targz/data/CCBR/projects/ccbr754/workdir_170403_postinitialrnaseq2/0h_1_S25.p2.Aligned.toTranscriptome.sorted.bam";"/data/CCBR/ccbr754_Yoshimi/ccbr754targz/data/CCBR/projects/ccbr754/workdir_170403_postinitialrnaseq2/0h_1_S25.p2.Aligned.toTranscriptome.sorted.sorted.bam"
183e9dc341073d9b75c817f5ed07b9ac#183e9dc341073d9b75c817f5ed07b9ac 5 0.07 KiB 0.01 KiB "/data/CCBR/abdelmaksoudaa/test/a" "/data/CCBR/abdelmaksoudaa/test/b";"/data/CCBR/abdelmaksoudaa/test/c";"/data/CCBR/abdelmaksoudaa/test/d";"/data/CCBR/abdelmaksoudaa/test/e";"/data/CCBR/abdelmaksoudaa/test/f"
```

> `spacesavers2_grubbers` is typically used to find the "low-hanging" fruits ... aka ... the "high-value" duplicates which need to be deleted first to quickly have the biggest impact on the user's overall digital footprint.
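
As a quick illustration (not part of this PR's diff), a minimal Python sketch that parses one such tab-delimited line into the six columns listed above; the handling of quotes and of the ";"-separated duplicate list follows the table and example but is otherwise an assumption:

```python
# Illustrative sketch only: parse one spacesavers2_grubbers output line
# (tab-delimited, columns as described in the table above).
def parse_grubbers_line(line: str) -> dict:
    cols = line.rstrip("\n").split("\t")
    combined_hash, n_dups, total_size, each_size, original, dup_list = cols[:6]
    return {
        "hash": combined_hash,
        "n_duplicates": int(n_dups),
        "total_size_human": total_size,  # e.g. "21.99 GiB"
        "size_each_human": each_size,    # e.g. "7.33 GiB"
        "original": original.strip('"'),
        "duplicates": [p.strip('"') for p in dup_list.split(";") if p],
    }
```
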
44 changes: 30 additions & 14 deletions docs/mimeo.md
@@ -1,13 +1,13 @@
## spacesavers2_mimeo

This takes in the `ls_out` generated by `spacesavers2_catalog` and processes it to:
This takes in the `catalog` file generated by `spacesavers2_catalog` and processes it to:

- find duplicates
- create per-user summary reports for each user (and all users).

### Inputs

- `--lsout` is the output file from `spacesavers2_catalog`. Thus, `spacesavers2_catalog` needs to be run before running `spacesavers2_mimeo`.
- `--catalog` is the output file from `spacesavers2_catalog`. Thus, `spacesavers2_catalog` needs to be run before running `spacesavers2_mimeo`.
- `--maxdepth` maximum folder depth up to which reports are aggregated
- `--outdir` path to the output folder
- `--prefix` prefix to be added to the output file names eg. date etc.
@@ -16,17 +16,16 @@ This takes in the `ls_out` generated by `spacesavers2_catalog` and processes it

```bash
% spacesavers2_mimeo --help
spacesavers2_mimeo:00000.02s:version: v0.5
usage: spacesavers2_mimeo [-h] -f LSOUT [-d MAXDEPTH] [-o OUTDIR] [-p PREFIX] [-q QUOTA] [-z | --duplicatesonly | --no-duplicatesonly]
usage: spacesavers2_mimeo [-h] -f CATALOG [-d MAXDEPTH] [-o OUTDIR] [-p PREFIX] [-q QUOTA] [-z | --duplicatesonly | --no-duplicatesonly] [-k | --kronaplot | --no-kronaplot] [-v]

spacesavers2_mimeo: find duplicates

options:
-h, --help show this help message and exit
-f LSOUT, --catalog LSOUT
-f CATALOG, --catalog CATALOG
spacesavers2_catalog output from STDIN or from catalog file
-d MAXDEPTH, --maxdepth MAXDEPTH
folder max. depth upto which reports are aggregated
folder max. depth upto which reports are aggregated ... absolute path is used to calculate depth (Default: 10)
-o OUTDIR, --outdir OUTDIR
output folder
-p PREFIX, --prefix PREFIX
@@ -35,16 +34,21 @@ options:
total quota of the mount eg. 200 TB for /data/CCBR
-z, --duplicatesonly, --no-duplicatesonly
Print only duplicates to per user output file.
-k, --kronaplot, --no-kronaplot
Make kronaplots for duplicates.(ktImportText must be in PATH!)
-v, --version show program's version number and exit

Version:
v0.5
v0.10.2-dev
Example:
> spacesavers2_mimeo -f /output/from/spacesavers2_catalog -o /path/to/output/folder -d 7 -q 10
> spacesavers2_mimeo -f /output/from/spacesavers2_catalog -o /path/to/output/folder -d 7 -q 10 -k
```

### Outputs

After completion of run, `spacesavers2_mimeo` creates `.files.gz` (list of duplicate files) and `.summary.txt` (overall stats at various depths) files in the provided output folder. Here are the details:
After completion of a run, `spacesavers2_mimeo` creates `*.mimeo.files.gz` (list of files per user + one "allusers" file) and `.summary.txt` (overall stats at various depths) files in the provided output folder. If `-k` is provided (and ktImportText from [kronatools](https://github.com/marbl/Krona/wiki/KronaTools) is in PATH), then krona-specific TSV and HTML pages are also generated. It also generates a `blamematrix.tsv` file with folders on rows and users on columns, giving duplicate bytes per folder per user. This file can be used to create a "heatmap" to pinpoint the folders with the highest duplicates overall as well as on a per-user basis.

Here are the details:

#### Duplicates

@@ -54,7 +58,7 @@
- Check if each bin has unique sized files. If a bin has more than 1 size, then it needs to be binned further. Sometimes, xxHash of top and bottom chunks also gives the same combination of hash for differing files. These files will have different sizes. Hence, re-bin them accordingly.
- If same size, then check inodes. If all files in the same bin have the same inode, then these are just hard-links. But, if there are multiple inodes, then we have **duplicates**!
- If we have duplicates, then `spacesavers2_mimeo` keeps track of number of duplicates per bin. Number of duplicates is equal to number of inodes in each bin minus one.
- If we have duplicates, then the oldest find is identified and considered to be the original file. All other files are marked _duplicate_, irrespective of user id.
- If we have duplicates, then the oldest file is identified and considered to be the original file. All other files are marked _duplicate_, irrespective of user id.
- duplicate files are reported in gzip format with the following columns, for all users and on a per-user basis (a conceptual sketch of this binning logic follows the column description below)

Here is what the `.files.gz` file columns (space-separated) represent:
@@ -63,17 +67,19 @@ Here is what the `.files.gz` file columns (space-separated) represent:
| ------ | ------------------------------------------------ |
| 1 | top chunk and bottom chunk hashes separated by "#" |
| 2 | separator ":" |
| 3 | Number of duplicates |
| 3 | Number of duplicate files (not duplicate inodes) |
| 4 | Size of each file |
| 5 | List of users' duplicates separated by "##" |

Each file in the last column above is ":" separated with the same 13 items as described in the `ls_out` file. The only difference is that the user id and group id are now replaced by user name and group name.
> NOTE: Number of duplicate files can be greater than the number of duplicate inodes as each file can have multiple hard links already. Hence, while calculating total duplicate bytes we use (total_number_of_unique_inodes_per_group_of_duplicate_files - 1) X size_of_each_file. The "minus 1" is to not count the size of the original file.

Each file in the last column above is ";"-separated with the same 13 items as described in the `catalog` file. The only difference is that the username and group name are now appended to each file entry.
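
To make the binning logic described above concrete, here is a rough conceptual sketch in Python (illustrative only, not `mimeo`'s actual source); the entry fields (`hashes`, `size`, `inode`, `mtime`, `path`) are assumed names:

```python
# Conceptual sketch of the duplicate-detection logic described above; the
# field names are illustrative and do not reflect mimeo's internal code.
from collections import defaultdict

def find_duplicates(entries):
    """entries: iterable of dicts with 'hashes', 'size', 'inode', 'mtime', 'path'."""
    bins = defaultdict(list)
    for e in entries:
        # Bin by the "tophash#bottomhash" pair *and* the size, since differing
        # files can occasionally share the same hash pair.
        bins[(e["hashes"], e["size"])].append(e)

    for (_, size), group in bins.items():
        inodes = {e["inode"] for e in group}
        if len(inodes) <= 1:
            continue  # a single inode means hard-links only, not duplicates
        original = min(group, key=lambda e: e["mtime"])  # oldest file is the "original"
        duplicates = [e for e in group if e is not original]
        # Per the NOTE above: duplicate bytes are counted per unique inode,
        # minus one for the original file.
        dup_bytes = (len(inodes) - 1) * size
        yield original, duplicates, dup_bytes
```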

Along with creating one `.files.gz` and `.summary.txt` file per user encountered, `spacesavers2_mimeo` also generates a `allusers.files.gz` file for all users combined. This file is later used by `spacesavers2_blamematrix` as input.
Along with creating one `.mimeo.files.gz` and `.mimeo.summary.txt` file per user encountered, `spacesavers2_mimeo` also generates a `allusers.mimeo.files.gz` file for all users combined. This file is later used by `spacesavers2_blamematrix` as input.

#### Summaries

Summaries, files ending with `.summary.txt` are collected and reported for all users (`allusers.summary.txt`) and per-user (`USERNAME.summary.txt`) basis for user-defined depth (and beyond). The columns (tab-delimited) in the summary file:
Summaries, files ending with `.mimeo.summary.txt`, are collected and reported for all users (`allusers.mimeo.summary.txt`) and on a per-user (`USERNAME.mimeo.summary.txt`) basis for the user-defined depth (and beyond). The columns (tab-delimited) in the summary file:

| Column | Description |
| ------ | ------------------------------------- |
@@ -93,3 +99,13 @@

For columns 10 through 13, the same logic is used as [spacesavers](https://ccbr.github.io/spacesavers/usage/df/).

#### KronaTSV and KronaHTML

- KronaTSV is tab-delimited, with the first column showing the number of duplicate bytes and every subsequent column giving the folder-depth levels.
- ktImportText is then used to convert the KronaTSV to KronaHTML, which can be shared easily and only needs an HTML5-capable browser for viewing (a small sketch follows below).
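
As a small illustration (not part of this PR), here is a sketch that writes a Krona-style TSV in the layout described above and renders it with ktImportText when the tool is available in PATH; the file names and example rows are hypothetical:

```python
# Illustrative sketch only: write a Krona-style TSV (duplicate bytes, then one
# column per folder-depth level) and render it to HTML with ktImportText.
import shutil
import subprocess

rows = [
    (1024**3, ["data", "CCBR", "projectA"]),             # hypothetical example rows
    (5 * 1024**2, ["data", "CCBR", "projectB", "bam"]),
]

with open("duplicates.krona.tsv", "w") as out:
    for dup_bytes, folders in rows:
        out.write("\t".join([str(dup_bytes), *folders]) + "\n")

if shutil.which("ktImportText"):  # only run if KronaTools is in PATH
    subprocess.run(
        ["ktImportText", "duplicates.krona.tsv", "-o", "duplicates.krona.html"],
        check=True,
    )
```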

#### Blamematrix

- rows are folders at one level deeper than the "mindepth"
- columns are all individual usernames, plus an "allusers" column
- only duplicate bytes are reported (a plotting sketch follows below)
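
As an illustration of the "heatmap" use mentioned earlier (not part of this PR), a sketch that reads such a matrix and plots it; the `prefix.blamematrix.tsv` file name and the exact header layout are assumptions:

```python
# Illustrative sketch only: plot the blamematrix (folders x users, duplicate
# bytes) as a heatmap using pandas and matplotlib.
import matplotlib.pyplot as plt
import pandas as pd

bm = pd.read_csv("prefix.blamematrix.tsv", sep="\t", index_col=0)  # assumed name/layout

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(bm.values, aspect="auto", cmap="viridis")
ax.set_xticks(range(len(bm.columns)))
ax.set_xticklabels(bm.columns, rotation=90)
ax.set_yticks(range(len(bm.index)))
ax.set_yticklabels(bm.index)
fig.colorbar(im, ax=ax, label="duplicate bytes")
fig.tight_layout()
fig.savefig("blamematrix_heatmap.png", dpi=150)
```
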
8 changes: 4 additions & 4 deletions docs/usurp.md
@@ -19,10 +19,10 @@ The GRUBBER file has the following columns:
| ------ | ------------------------------------- |
| 1 | combined hash |
| 2 | number of duplicates found |
| 3 | total size of all duplicates |
| 4 | size of each duplicate |
| 5 | ";"-separated list of duplicates |
| 6 | duplicate files |
| 3 | total size of all duplicates (human readable) |
| 4 | size of each duplicate (human readable) |
| 5 | original file |
| 6 | ";"-separated list of duplicate files |
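
For background only, and not `spacesavers2_usurp`'s actual implementation, here is a minimal sketch of the general idea of replacing a duplicate with a hard-link to its original:

```python
# Illustrative sketch only: replace a duplicate file with a hard-link to its
# original, so both paths share one inode and the duplicate bytes are freed.
import os

def link_duplicate_to_original(original: str, duplicate: str) -> None:
    if os.path.samefile(original, duplicate):
        return  # already the same inode; nothing to do
    tmp = duplicate + ".usurp_tmp"   # hypothetical temporary name
    os.link(original, tmp)           # new hard-link pointing at the original's inode
    os.replace(tmp, duplicate)       # atomically swap it in place of the duplicate
```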

```bash
usage: spacesavers2_usurp [-h] -g GRUBBER -x HASH [-f | --force | --no-force]
```
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -102,6 +102,5 @@ nav:
- catalog: catalog.md
- mimeo: mimeo.md
- grubbers: grubbers.md
- blamematrix: blamematrix.md
- usurp: usurp.md
- e2e: e2e.md