DataHog is a web application for analyzing how your storage space is being used. The app builds a database of files stored in iRODS collections (such as the CyVerse data store), Amazon S3 buckets, or directories on your device, and allows you to search, sort, and compare them. It provides information about file sizes, types, and duplicated files.
DataHog is available as an app on the CyVerse Discovery Environment. Simply click "Launch Analysis" to start up a new instance.
The latest DataHog image is hosted on DockerHub.
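If you'd rather pull the prebuilt image than build it yourself, something like the following should work (the repository name below is a placeholder; check DockerHub for the actual image name):

```
# Image name is a placeholder; substitute the actual DockerHub repository
docker pull <dockerhub-user>/datahog
```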
Alternatively, you can build it yourself by running `docker build . -t datahog` in the root directory.
The app runs on port 8000 of the container, so you'll want to publish the port using something like this:

```
docker run -it -p 8000:8000 <name:tag>
```
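For example, if you built the image locally with the `datahog` tag from the step above:

```
# Publish container port 8000 on the host so the app is reachable at http://localhost:8000
docker run -it -p 8000:8000 datahog
```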
If you want to set up DataHog locally for development, follow these steps (a combined command sketch follows the list):

- Install SQLite 3
- Install Python 3.6.6
- Install RabbitMQ
- Install the pip packages in `django/requirements.txt`
- Run `python manage.py migrate` inside the `django` directory to populate your database.
- Run `python manage.py runserver` to start the server.
- In another terminal, run `celery -A celery_app worker` to start a task worker process.
- Install Node.js 8.12.0
- Install the npm packages using `npm install` inside the `react` directory.
- Run `npm run js` to build the JS files (the build will auto-refresh if you keep it running).
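Putting the steps together, a typical first run might look like the sketch below. It assumes you start from the repository root and that the Celery worker is launched from the `django` directory (where `celery_app` lives); adjust paths as needed:

```
# Terminal 1: install Python dependencies, set up the database, start the server
pip install -r django/requirements.txt
cd django
python manage.py migrate
python manage.py runserver

# Terminal 2: start a Celery task worker (also from the django directory)
cd django
celery -A celery_app worker

# Terminal 3: install JS dependencies and build the front-end (keeps watching for changes)
cd react
npm install
npm run js
```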
The launch page offers five options for importing file data into DataHog:
- iRODS: Use the iRODS API to import data from a specific collection. The options for importing files from the CyVerse data store are prefilled.
- .datahog File: Upload a .datahog file containing file data. These can be generated by a Python script which you can download and run on any machine (see: Crawler Script).
- CyVerse: Use the CyVerse file search API to import any data stored in the data store. This method currently does not support exact duplicate matching, and may be slower than iRODS in some cases.
- S3 Bucket: Use your AWS access keys to import an S3 bucket, or a specific directory from one.
- Restore Database: If you previously backed up a DataHog database, you can upload it to restore your data.
Depending on how many files are being scanned, the import process can take a few minutes to complete. Some extremely large directories (millions of files) may take much longer; feel free to close the tab and check on it later if you wish.
Once the import process for your first file source is complete, you will have access to 4 tabs:
- Summary: View a summary of each of your file sources, including various file rankings and visualizations.
- Browse Files: Explore the folder structure for each of your file sources, or search your files by name, regular expression, or date and size filters. Each column header can be clicked to sort the table by that value.
- Duplicated Files: View a list of files with identical contents. By default, this page uses checksums to compare files, but file sizes or names can also be used. Each column header can be clicked to sort the table by that value.
- Manage File Sources: Import a new file source, remove an existing one, or download a backup of the current file database.
The DataHog Crawler Script is a small Python 3 program used to scan a directory and generate a `.datahog` file, which can be imported directly into DataHog. You can run it like so:

```
python3 datahog_crawler.py <root path> [<options>]
```
The script calculates MD5 checksums for each file it scans in order to detect duplicated files. This can be slow for large directories, so you can use the `-n` or `--no-checksums` option to disable this.

By default, the script creates a file called `<directory name>.datahog`, but this can be overridden with the `-o` or `--output` option.
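As an example, assuming `--output` takes the destination file name as its argument, a run that skips checksums and writes to a custom file might look like this (the scanned path is a placeholder):

```
# Scan /path/to/projects without computing MD5 checksums and write the result to projects.datahog
python3 datahog_crawler.py /path/to/projects --no-checksums --output projects.datahog
```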