updates
kopardev committed Feb 7, 2024
1 parent 99b3c49 commit ccef883
Showing 15 changed files with 241 additions and 486 deletions.
4 changes: 3 additions & 1 deletion CHANGELOG.md
@@ -10,7 +10,7 @@

 - adding `requirements.txt` for easy creation of environment in "spacesavers2" docker (#68, @kopardev)
 - `grubbers` has new `--outfile` argument.
-- `blamematrix` has 3 new arguments `--humanreable`, `--includezeros` and `--outfile`.
+- `blamematrix` has now been moved into `mimeo`.
 - `mimeo` files.gz always includes the original file as the first one in the filelist.
 - `mimeo` now has kronatools compatible output. ktImportText is also run if in PATH to generate HTML report for duplicates only. (#46, @kopardev)
 - documentation updated.
@@ -23,6 +23,8 @@
 - `blamematrix` fixed to account for changes due to #71
 - `usurp` fixed to account for changes due to #71. Now using the new "original file" column while creating hard-links.
 - `e2e` overhauled, improved and well commented.
+- total size now closely resembles `df` results (fix #75, @kopardev)
+- files with future timestamps are handled correctly (fix #76, @kopardev)

 ## spacesavers2 0.10.2
8 changes: 1 addition & 7 deletions README.md
@@ -20,19 +20,13 @@ Welcome! `spacesavers2`:

 > New improved parallel implementation of [`spacesavers`](https://github.com/CCBR/spacesavers). `spacesavers` is soon to be decommissioned!
-> Note: `spacesavers2` requires [python version 3.11](https://www.python.org/downloads/release/python-3110/) or later and the [xxhash](https://pypi.org/project/xxhash/) library. These dependencies are already installed on biowulf (as a conda env). The environment for running `spacesavers2` can get set up using:
->
-> ```bash
-> . "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh" && \
-> conda activate py311
-> ```
+> Note: `spacesavers2` requires [python version 3.11](https://www.python.org/downloads/release/python-3110/) or later and the [xxhash](https://pypi.org/project/xxhash/) library. These dependencies are already installed on biowulf (as a conda env).
 ## `spacesavers2` has the following Basic commands:

 - spacesavers2_catalog
 - spacesavers2_mimeo
 - spacesavers2_grubbers
-- spacesavers2_blamematrix
 - spacesavers2_e2e
 - spacesavers2_usurp
Binary file modified docs/assets/images/spacesavers2.png
49 changes: 0 additions & 49 deletions docs/blamematrix.md

This file was deleted.

1 change: 0 additions & 1 deletion mkdocs.yml
@@ -102,6 +102,5 @@ nav:
 - catalog: catalog.md
 - mimeo: mimeo.md
 - grubbers: grubbers.md
-- blamematrix: blamematrix.md
 - usurp: usurp.md
 - e2e: e2e.md
144 changes: 0 additions & 144 deletions spacesavers2_blamematrix

This file was deleted.

39 changes: 24 additions & 15 deletions spacesavers2_catalog
@@ -17,19 +17,17 @@ from pathlib import Path


 def task(f):
-    if not os.path.isfile(f):
-        return ""
-    else:
-        fd = FileDetails()
-        fd.initialize(
-            f,
-            buffersize=args.buffersize,
-            thresholdsize=args.ignoreheadersize,
-            tb=args.buffersize,
-            sed=sed,
-            bottomhash=args.bottomhash,
-        )
-        return "%s" % (fd)
+    fd = FileDetails()
+    fd.initialize(
+        f,
+        buffersize=args.buffersize,
+        thresholdsize=args.ignoreheadersize,
+        tb=args.buffersize,
+        sed=sed,
+        bottomhash=args.bottomhash,
+        st_block_byte_size=args.st_block_byte_size,
+    )
+    return "%s" % (fd)


 def main():
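The `st_block_byte_size` value now passed to `fd.initialize` ties in with the commit note about tracking the "actual size occupied by file on disk (based on blocks used)". The sketch below shows one way such a size can be computed from `os.lstat`. It is an illustration only: `size_on_disk` is a hypothetical helper and not the `FileDetails` implementation, and it assumes a POSIX platform where `st_blocks` is available.

```python
import os


def size_on_disk(path, st_block_byte_size=512):
    """Return (apparent_size, allocated_size) in bytes for path.

    Hypothetical helper, not part of spacesavers2.
    st_size is the apparent length of the file; st_blocks is the number of
    allocated blocks, which POSIX systems commonly report in 512-byte units.
    """
    st = os.lstat(path)  # lstat so that symlinks are not followed
    return st.st_size, st.st_blocks * st_block_byte_size
```

Calling `size_on_disk("/some/file")` returns both numbers, so sparse files (allocated smaller than apparent) and small files (allocated rounded up to whole blocks) can be told apart.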
@@ -84,14 +82,23 @@ def main():
         help="this sized header of the file is ignored before extracting buffer of buffersize for xhash creation (only for special extensions files) default = 1024 * 1024 * 1024 bytes",
     )
     parser.add_argument(
-        "-s",
+        "-x",
         "--se",
         dest="se",
         required=False,
         type=str,
         default="bam,bai,bigwig,bw,csi",
         help="comma separated list of special extensions (default=bam,bai,bigwig,bw,csi)",
     )
+    parser.add_argument(
+        "-s",
+        "--st_block_byte_size",
+        dest="st_block_byte_size",
+        required=False,
+        default=512,
+        type=int,
+        help="st_block_byte_size on current filesystem (default 512)",
+    )
     parser.add_argument(
         "-o",
         "--outfile",
@@ -120,7 +127,9 @@ def main():

     folder = args.folder
     p = Path(folder)
-    files = p.glob("**/*")
+    files = [p]
+    files2 = p.glob("**/*")
+    files.extend(files2)

     if args.outfile:
         outfh = open(args.outfile, "w")
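The last hunk prepends the root folder to the list of paths because `Path.glob("**/*")` yields only descendants, never the starting directory itself. A minimal standalone sketch of the same pattern follows; `list_entries` is a hypothetical name used only for illustration.

```python
from pathlib import Path


def list_entries(folder):
    """Return the folder itself followed by every file and directory below it.

    Hypothetical helper mirroring the pattern in the hunk above:
    glob("**/*") does not include the root, so it is added explicitly.
    """
    root = Path(folder)
    entries = [root]
    entries.extend(root.glob("**/*"))
    return entries
```

With the root included, its own directory entry can be catalogued alongside its contents, which matches the commit note that folders are now reported in the catalog.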
21 changes: 8 additions & 13 deletions spacesavers2_e2e
@@ -13,8 +13,8 @@ source ${SCRIPT_DIR}/resources/argparse.bash || exit 1
 argparse "$@" <<EOF || exit 1
 parser.add_argument('-f','--folder',required=True, help='Folder to run spacesavers_catalog on.')
 parser.add_argument('-p','--threads',required=False, help='number of threads to use', default=4)
+parser.add_argument('-d','--maxdepth',required=False, help='maxdepth for mimeo', default=4)
 parser.add_argument('-l','--limit',required=False, help='limit for running spacesavers_grubbers', default=5)
-parser.add_argument('-v','--level',required=False, help='level for running spacesavers_blamematrix', default=3)
 parser.add_argument('-q','--quota',required=False, help='total size of the volume (default = 200 for /data/CCBR)', default=200)
 parser.add_argument('-o','--outfolder',required=True, help='Folder where all spacesavers_e2e output files will be saved')
 EOF
@@ -54,21 +54,25 @@ spacesavers2_catalog \
 > ${outfile_catalog_log} 2> ${outfile_catalog_err}
 fi

+sleep 60
+
 # spacesavers2_mimeo
 if [ "$?" == "0" ];then
 echo "Running spacesavers2_mimeo"
-command -V ktImportText 2>/dev/null || module load kronatools || (>&2 echo "module kronatools could not be loaded"; exit 1)
+command -V ktImportText 2>/dev/null || module load kronatools || (>&2 echo "module kronatools could not be loaded")
 spacesavers2_mimeo \
 --catalog ${outfile_catalog} \
 --outdir ${OUTFOLDER} \
 --quota $QUOTA \
 --duplicatesonly \
---maxdepth 3 \
+--maxdepth $MAXDEPTH \
 --p $prefix \
 --kronaplot \
 > ${outfile_mimeo_log} 2> ${outfile_mimeo_err}
 fi

+sleep 60
+
 # spacesavers2_grubbers
 if [ "$?" == "0" ];then
 echo "Running spacesavers2_grubbers" && \
@@ -84,13 +88,4 @@ for filegz in `ls ${OUTFOLDER}/${prefix}*files.gz`;do
 done
 fi

-# spacesavers2_blamematrix
-if [ "$?" == "0" ];then
-echo "Running spacesavers2_blamematrix" && \
-spacesavers2_blamematrix \
---filesgz ${OUTFOLDER}/${prefix}.allusers.mimeo.files.gz \
---level $LEVEL \
---outfile ${outfile_blamematrix} \
-> ${outfile_blamematrix_log} 2> ${outfile_blamematrix_err}
-fi
-echo "Done!"
+echo "Done!"
6 changes: 3 additions & 3 deletions spacesavers2_grubbers
@@ -92,19 +92,19 @@ def main():
         of = sys.stdout

     for fgitem in dups:
-        if fgitem.totalsize < top_limit:
+        if fgitem.totalsize <= top_limit:
             break
         saved += fgitem.totalsize
         of.write("%s\n"%(fgitem))

     if args.outfile:
         of.close()

-    saved = get_human_readable_size(saved)
+    hrsaved = get_human_readable_size(saved)
     print_with_timestamp(
         start=start,
         scriptname=scriptname,
-        string="Deleting top grubbers will save {}!".format(saved),
+        string="Deleting top grubbers will save {} [ {} Bytes ] !".format(hrsaved,saved),
     )
     print_with_timestamp(start=start, scriptname=scriptname, string="Done!")
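The grubbers change keeps the numeric byte count in `saved` and stores the formatted string separately in `hrsaved`, so the summary can print both. The sketch below reproduces that message format. The `human_readable` function is a hypothetical stand-in that assumes binary (1024-based) units; the real `get_human_readable_size` is not shown in this commit and may format sizes differently.

```python
def human_readable(nbytes):
    """Hypothetical stand-in for get_human_readable_size (assumes binary units)."""
    size = float(nbytes)
    units = ("B", "KiB", "MiB", "GiB", "TiB", "PiB")
    for unit in units:
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit available.
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


def summary_line(saved_bytes):
    """Build the summary string in the format used by the updated grubbers."""
    return "Deleting top grubbers will save {} [ {} Bytes ] !".format(
        human_readable(saved_bytes), saved_bytes
    )
```

Keeping the integer and the formatted string in separate variables avoids overwriting the running total with text, which is what the previous `saved = get_human_readable_size(saved)` line did.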

1 comment on commit ccef883

@kopardev

  • get_file_depth moved to a FileDetails method.
  • folder:
    • now reported in catalog (updates to the parallel task function)
    • size counted and added to "non-dup" bytes
    • correctly included in the Summary class via the folder_Bytes class variable
    • partially fixes "Total sizes of folders do not add up" #75
  • age is set to zero for all future timestamps. Fixes "Bug with files with future timestamp" #76
  • dfUnit changes:
    • calculated_size_list class variable added to track "calculated_size", the actual size occupied by a file on disk (based on blocks used)
    • file list separator changed from ";" to "##" as some filenames contain ";"
    • sizes retained in bytes and no longer converted to human readable formats
    • FileDetails2 class no longer in use, so deleted
    • fgzblamer class no longer in use, so deleted
  • Flowchart updated for v0.11.0
  • README updated (with new flowchart!)
  • blamematrix code and doc file deleted. mkdocs.yml updated accordingly
  • new -s option to catalog for specifying st_block_byte_size, which defaults to 512 bytes
  • e2e updated to remove the explicit blamematrix call, as it is now part of mimeo
  • grubbers reports sizes in bytes in the output file, and in both bytes and human readable format for the summary on the terminal.
  • kronachart logic updated to derive from mimeo intermediate dictionaries; faster and more accurate.
  • FileDetails class replaces is_symlink with fld, i.e. file or link or directory (or unknown)
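The note on future timestamps ("age is set to zero") can be illustrated with a small clamp. This is a sketch under assumptions rather than the project's code: `age_in_days` is a hypothetical function, and measuring age in whole days from the modification time is an assumption not confirmed by the commit.

```python
import time

SECONDS_PER_DAY = 86400


def age_in_days(mtime, now=None):
    """Age of a timestamp in whole days, clamped to zero for future timestamps.

    Hypothetical helper: the unit (days) and the use of mtime are assumptions.
    """
    if now is None:
        now = time.time()
    # A timestamp later than `now` gives a negative difference; clamp it to 0.
    return max(0, int((now - mtime) // SECONDS_PER_DAY))
```

Without the clamp, a file whose clock-skewed timestamp lies in the future would report a negative age, which can distort any age-based binning downstream.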
