Dataset Description

| File | Length | Vocabulary Size | Brief Description |
| --- | --- | --- | --- |
| **Real Data** | | | |
| webster | 41.1M | 98 | HTML data of the 1913 Webster Dictionary, from the Silesia corpus |
| text8 | 100M | 27 | First 100M of English text (only) extracted from enwiki9 |
| enwiki9 | 500M | 206 | First 500M of the 2006 English Wikipedia dump |
| mozilla | 51.2M | 256 | Tarred executables of Mozilla 1.0, from the Silesia corpus |
| h. chr20 | 64.4M | 5 | Chromosome 20 of the H. sapiens GRCh38 reference sequence |
| h. chr1 | 100M | 5 | First 100M bases of chromosome 1 of the H. sapiens GRCh38 reference sequence |
| c.e. genome | 100M | 4 | C. elegans whole genome sequence |
| ill-quality | 100M | 4 | 100MB of quality scores for PhiX virus reads sequenced with Illumina |
| np-bases | 300M | 5 | Nanopore sequenced reads (only bases) of a human sample (first 300M symbols) |
| np-quality | 300M | 91 | Quality scores for a nanopore sequenced human sample (first 300M symbols) |
| num-control | 159.5M | 256 | Control vector output between two minimization steps in weather-satellite data assimilation |
| obs-spitzer | 198.2M | 256 | Data from the Spitzer Space Telescope showing a slight darkening |
| msg-bt | 266.4M | 256 | NPB computational fluid dynamics pseudo-application bt |
| audio | 264.6M | 256 | First 600 files (combined) in the ESC dataset for environmental sound classification |
| **Synthetic Data** | | | |
| XOR-k | 10M | 2 | Pseudorandom sequence. Entropy rate 0 bpc. |
| HMM-k | 10M | 2 | Hidden Markov sequence. Entropy rate 0.46899 bpc. |
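
The table does not spell out how the two synthetic sources are constructed. The following is a minimal sketch only, under two loudly stated assumptions: that XOR-k obeys the recurrence s[n] = s[n-1] XOR s[n-1-k], and that HMM-k is the same sequence with each symbol flipped independently with probability 0.1. The function names, the `flip_prob` parameter, and the seeding scheme are hypothetical; `generate_data.py` remains the authoritative definition.

```python
import math
import random


def xor_k(length, k, seed=0):
    """Binary sequence with s[n] = s[n-1] XOR s[n-1-k] (assumed recurrence)."""
    rng = random.Random(seed)
    # The first k + 1 symbols are random; every later symbol is deterministic.
    seq = [rng.randint(0, 1) for _ in range(k + 1)]
    for n in range(k + 1, length):
        seq.append(seq[n - 1] ^ seq[n - 1 - k])
    return seq[:length]


def hmm_k(length, k, flip_prob=0.1, seed=0):
    """XOR-k sequence observed through independent bit flips (assumed model)."""
    hidden = xor_k(length, k, seed)
    noise_rng = random.Random(seed + 1)
    return [x ^ int(noise_rng.random() < flip_prob) for x in hidden]


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable, for 0 < p < 1."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


if __name__ == "__main__":
    sample = hmm_k(length=32, k=10)
    print("".join(map(str, sample)))
    print(f"{binary_entropy(0.1):.6f}")
```

Under these assumptions the noiseless sequence is fully determined by its first k + 1 symbols, which fits the listed entropy rate of 0 bpc. The binary entropy of 0.1 is about 0.468996 bits, in line with the 0.46899 bpc listed for HMM-k, which is why 0.1 was picked as the flip probability in this sketch.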

Links to the Datasets and Trained Bootstrap Models

| File | Link | Bootstrap Model |
| --- | --- | --- |
| webster | http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia | webster |
| mozilla | http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia | mozilla |
| h. chr20 | ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr20.fa.gz | chr20 |
| h. chr1 | ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr1.fa.gz | chr1 |
| c.e. genome | ftp://ftp.ensembl.org/pub/release-97/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz | celegchr |
| ill-quality | http://bix.ucsd.edu/projects/singlecell/nbt_data.html | phixq |
| text8 | http://www.mattmahoney.net/dc/textdata.html | text8 |
| enwiki9 | http://www.mattmahoney.net/dc/textdata.html | enwiki9 |
| np-bases | https://github.com/nanopore-wgs-consortium/NA12878 | npbases |
| np-quality | https://github.com/nanopore-wgs-consortium/NA12878 | npquals |
| num-control | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| obs-spitzer | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| msg-bt | https://userweb.cs.txstate.edu/~burtscher/research/datasets/FPdouble/ | model |
| audio | https://github.com/karolpiczak/ESC-50 | model |

Synthetic Dataset Generation Example

  1. Go to the Datasets directory
  2. For real datasets, run
bash get_data.sh
  3. For synthetic datasets, run
# For generating XOR-10 dataset
python generate_data.py --data_type 0entropy --markovity 10 --file_name files_to_be_compressed/xor10.txt
# For generating HMM-10 dataset
python generate_data.py --data_type HMM --markovity 10 --file_name files_to_be_compressed/hmm10.txt
  4. These commands create a folder named files_to_be_compressed, which contains the parsed files needed to reproduce the results in our paper.
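
Once the files exist, a quick sanity check is to compare their length and vocabulary size against the Dataset Description table. The sketch below is illustrative and not part of the repository: `file_stats` is a hypothetical helper, and it assumes that each byte counts as one symbol, which may differ from how the repository's own parsing counts symbols (for example, a trailing newline would add one to the vocabulary).

```python
import sys


def file_stats(path, chunk_size=1 << 20):
    """Return (length in bytes, number of distinct byte values) for a file."""
    seen = set()
    length = 0
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        while chunk := f.read(chunk_size):
            length += len(chunk)
            seen.update(chunk)
    return length, len(seen)


if __name__ == "__main__":
    # Paths are taken from the command line; nothing is printed without them.
    for path in sys.argv[1:]:
        length, vocab = file_stats(path)
        print(f"{path}: length={length} vocabulary_size={vocab}")
```

For example, saving the block as a script and passing it files_to_be_compressed/xor10.txt should report a vocabulary size close to the 2 listed for XOR-k, provided the byte-per-symbol assumption holds for that file.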