Skip to content

DEDISbench: A disk I/O block-based benchmark for deduplication systems. Unlike other existing benchmarks, written content is generated in a realistic fashion that mimics the content found on real storage distributions.

License

GPL-2.0, GPL-2.0 licenses found

Licenses found

GPL-2.0
LICENSE
GPL-2.0
COPYING
Notifications You must be signed in to change notification settings

haslab/dedisbench

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

/* DEDISbench and DEDISgen

  • (c) 2010 2010 U. Minho. Written by J. Paulo */

DEDISbench and DEDISgen

DEDISbench is implemented in C and allows performing read and write block disk I/O tests or in a raw device. DEDISbench main contribution is the ability to process, as input, a file that specifies a distribution of duplicated content and using this information for generating a synthetic workload that follows such distribution. This file can be populated by the users or can be generated automatically with DEDISgen, an analysis tool that can be used for processing a specific real dataset and extracting from it the duplicates information.

I/O operations can be executed concurrently by several processes with independent files. Moreover, the benchmark can be configured to stop the evaluation when a certain amount of data has been written or when a pre-defined period of time has elapsed. Another novel feature is the possibility of performing I/O operations with different load intensities. In addition to a stress load where the benchmark issues I/O operations as fast as possible to stress the system, DEDISbench supports performing operations at a nominal load specified by the user.

For more information regarding DEDISbench algorithm you may read these two published papers:

  • "DEDISbench: A Benchmark for Deduplicated Storage Systems"
  • "Towards an Accurate Evaluation of Deduplicated Storage Systems"

I/O benchmark:

The benchmark is written in C and can simulate several processes writing/reading fixed size blocks concurrently into multiple files. The location for the write operations is given by NURand function from TPC-C benchmark that generates hotspots for I/O operations. A realistic distribution extracted from real storage systems is used to generate the blocks' content. Moreover, it is possible to use other distributions and load them from a custom file using DEDISgen. DEDISbench provides two workload modes, one reproduces a fully or peak loaded system, as Bonnie++, that performs as maximum write I/O operations per second as possible. The second reproduces a system under a reasonable nominal load, and can be useful for understanding the behavior of aliasing operations in a stable system with a limited I/O throughput.

DEDISbench real content distributions:

DEDISbench comes with three distinct distributions extracted from real storage systems with different requirements and assumptions:

  • An Archival storage where most files have a write-once policy, with non-significant updates, but with sporadic data deletion. This distribution is called dist_archival.
  • A Personal Files storage where some files are updated and deleted frequently and the I/O requests latency is expected to be lower than the one found in archival storages. This distributions is called dist_personalfiles.
  • A High Performance storage where most files are updated and deleted frequently and I/O latency is expected to be as minimal as possible. This distribution is called dist_highperf.

All these distributions are available and can be simulated by DEDISbench with the option -g or it is possible to input a distribution file.

NOTE: By default DEDISbench uses the Personal Files Storage distribution.

Duplicate distribution generator: DEDISgen

The binary DEDISgen generates an output file that describes the distribution of duplicates found at a specific folder and subfolders or disk device. This program can generate (if specified in the arguments) an output distribution file (with option -o) that can be consumed by DEDISbench and used for simulating that duplicate distribution.

With DEDISgen we also pack DEDISgenutils that may be needed for performing more complex analysis of duplicated data. For instance, if several datasets must be analysed separately and the merged toguether for generating the output distribution file for DEDISbench it is possible to use these tools as explained below.

Instalation.

The only libs required should be libc6-dev and libdb-dev libssl-dev

Run ./configure, make and make install in DEDISbench folder

Running the Benchmark

Required flags

-p or -n Peak or Nominal Load with throughput rate of N operations per second. If mixed benchmark of read and writes is defined then use -nr and -nw for nominal rate of reads and writes respectively.

-w or -r or -m Write or Read Benchmark or a mix of write and read operations.

-t or -s Benchmark duration (-t) in Minutes or amount of data to write (-s) in MB

Optional Flags -a Access pattern for I/O operations: 0-Sequential, 1-Random Uniform, 2-TPCC random default:2

-c Number of concurrent processes default:4. Each process has an independent file associated or a common device if -i flag is used

-f Size of the file of each process in MB. If -i flag is used, this parameter defines the size of the raw device. default:2048 MB

-d choose the directory where DEDISbench writes data

-i Processes write/read from a raw device instead of having an independent file. If more than one process is defined, each process is assingned with an independet of the raw device, dependent on the raw device size. By default, if this flag is not set each process writes to an individual file.

-l I/O latency results are written to a log file to extract additional statistics. Each process writes these values in a file called result(processid) and each line, corresponds to a single I/O operations and presents: (latency of I/O operation in microseconds) (current time in seconds) \n

-e or -y Enable or disable the population of process files before running DEDISbench. test(processid files in DEDISbench directory are the files used by each process. For write testes and by default, empty files are created, for read tests and by default the files are populated with content. It is possible to enable this feature for write tests and acess rewrite operations or to disable and use a custom file that respects the namings.

-b Size of blocks for I/O operations in Bytes default: 4096

-g Input File with duplicate distribution default:internal file called dist_personalfiles DEDISbench can simulate three real distributions extracted respectively from an Archival, Personal Files and High Performance Storage; For choosing these distributions the must be dist_archival, dist_personalfiles or dist_highperf respectively. The input file details the amount of blocks with a certain number of duplicates and the format is: <number_duplicates> <number_blocks> See below for more info for customizing distribution files and above for info on the default distributions.

-v Seed for random generator default:current time. Usefull for repeating

-o generate an output log with the distribution actually generated by the benchmark

-k generate an output log with the access pattern generated by the benchmark

-z Disable the destruction of process temporary files generated by the benchmark

-j Write to file path the output of DEDISbench. This feature also writes two additional files with the same name as given in and a snaplat and snapthr suffix that shows the throughput and lattency average values for 30 seconds intervals

-o The output of DEDIS gives the exact number of unique and duplicate blocks written. (This option requires using additional shared memory)

-x I/O Operations synchronization: 0-without fsync and O_DIRECT, 1-O_DIRECT, 2-fsync, 3-both. default: 0

Examples:

run write benchmark in peak mode for 5 minutes

./DEDISbench -p -w -t5

run read benchmark in nominal mode (300 reads/second) for 10 minutes

./DEDISbench -n300 -r -t10

run write benchmark in peak mode for 5 minutes. Enable the population of process files before actually running the benchmark load and enable log output for results. Use files with 4GB for each process and 8 processes.

./DEDISbench -p -w -t5 -e -l -f4096 -c8

run read benchmark in nominal mode (300 reads/second) for 8 minutes. Disable the pre-population of files. Carefull because files test(processid) must be present and have content or no content will be available for reading resulting in I/O errors.

./DEDISbench -n300 -r -t8 -d

run 8 minutes mixed benchmark with a nominal load (100reads/s and 50 writes/s) in a raw device with 4GB.

./DEDISbench -m -nr100 -nw50 -t8 -ipath/to/rawdevice

Running the generator DEDISgen:

-f or -d Find duplicates in folder -f or in a Disk Device -d

-p Path for the folder or disk device

-o Path for the output distribution file. This is only necessary if distribution file of duplicates is going to be generated

Optional Parameters

-z Path for the folder where duplicates databases are created default: ./gendbs/ . duplicate_db is the database with duplicates information hash->number_dups

-b Size of blocks for I/O operations in Bytes default: 4096

-o Path for the output distribution file. This is only necessary if distribution file of duplicates is going to be generated

Examples

generate duplicate distribution for files in folder /dir/duplicates/ and its subfolders

./DEDISgen -f -p/dir/duplicates/

generate duplicate distribution for device /path/device. Use a block size of 8192 Bytes (8KB)

./DEDISgen -d -p/path/device -b8192

Deduplication distribution FILE:

This file describes the amount of blocks with a specific number of duplicates and has the following format:

Example file:

0 1000 1 500 5 10

There are 1000 blocks without any duplicate, 500 blocks with one duplicate and 10 blocks with 5 duplicates

NEW! DEDISgenutils

UTILS for DEDISgen:

-m and -o can be used together or individually if only one operation is intended

-m Merges hashes and duplicates of db1 into db2.

-o Path for generating the output distribution file of duplicates db.

WARNING: these are supposed to be used in duplicates databases generated by dedisgen that map (hash->number-duplicates). Not intended for distribution dbs that generate the distribution outputs. These databases are on the database folder, which can be specified by the user as parameter with the option -z and by default are on ./gendbs/duplicatedb

Examples:

Merging the duplicates found separately, with DEDISgen, for dataset 1 (in db1) and 2 (in dbw) db1 with db2

Merge the two duplicates databases. Merge db1 into db2 - WARNING db2 info will be updated!!!! ./DEDISgenutils -mdb1 db2

generate output with distribution for db1+db2 ./DEDISgenutils -odb2 output_dist

For more information please contact: Joao Paulo jtpaulo at di.uminho.pt

About

DEDISbench: A disk I/O block-based benchmark for deduplication systems. Unlike other existing benchmarks, written content is generated in a realistic fashion that mimics the content found on real storage distributions.

Resources

License

GPL-2.0, GPL-2.0 licenses found

Licenses found

GPL-2.0
LICENSE
GPL-2.0
COPYING

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published