The Datasets API
Ovation provides classes for accessing several datasets through the same API. The dataset files are expected to be located in `/data/datasets/<dataset-name>`. Each dataset is divided into three splits: `train`, `validation` and `test`.
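For example, for a hypothetical dataset directory named `acner`, you could check that the expected location exists before instantiating the class. This is only a sketch: the directory name is an illustration, and the split file names vary per dataset, so only the top-level path is checked here.

```python
import os

# The directory name `acner` below is only an example; each dataset
# lives in its own folder under /data/datasets/.
dataset_dir = '/data/datasets/acner'

if not os.path.isdir(dataset_dir):
    print('Dataset not found, expected it at: {}'.format(dataset_dir))
```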
The table below shows the datasets supported by Ovation. It may still be incomplete, and we are working on supporting more datasets in the future. The descriptions of the datasets are generally excerpts from the official websites.
In the table below:
- IC: Intent Classification
- NER: Named Entity Recognition
- SA: Sentiment Analysis
Dataset | Class | Task | Licence |
---|---|---|---|
STS | STS, STSLarge | IC | Multiple? |
SICK | Sick | IC | (couldn't find) |
Microsoft Paraphrase Dataset | MSPD | IC | (couldn't find) |
PPDB: The Paraphrase Database | PPDB | IC | CC-BY 3.0 |
Quora Similar questions dataset | Quora | IC | https://www.quora.com/about/tos |
SemEval | SemEval | IC | CC-BY-SA 3.0 |
Stack Exchange | StackExchange | IC | CC-BY-SA 3.0 |
Annotated Corpus for NER | Acner | NER | (couldn't find) |
GermEval 2014 NER | Germeval | NER | CC-BY 4.0 |
GerSEN | Gersen | SA | GerSEN's license (basically: the dataset is private, non-commercial, and publications must cite it) |
Hotel Review Dataset | HotelReviews | SA | (couldn't find) |
CrowdFlower Twitter Emotion Dataset | TwitterEmotion | SA | CC Public Domain |
Amazon Review Dataset DE | AmazonReviewsGerman | SA | Private (scraped by Insiders) |
Below is a description of each of the datasets:
Dataset | Description |
---|---|
STS | This is a merge of all the IC datasets |
SICK | |
Microsoft Paraphrase Dataset | |
PPDB: The Paraphrase Database | |
Quora Similar questions dataset | |
SemEval | |
Stack Exchange | |
Annotated Corpus for NER | |
GermEval 2014 NER | The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties: (1) The data was sampled from German Wikipedia and News Corpora as a collection of citations. (2) The dataset covers over 31,000 sentences corresponding to over 590,000 tokens. (3) The NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]]. |
GerSEN | The dataset consists of 2,369 sentiment-annotated sentences. The sentences are extracted from German news articles related to Berlin's universities. |
Hotel Review Dataset | This dataset consists of 878,561 reviews (1.3 GB) from 4,333 hotels crawled from TripAdvisor. |
CrowdFlower Twitter Emotion Dataset | In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data is used in an experiment we uploaded to Microsoft's Cortana Intelligence Gallery. Added: July 15, 2016 by CrowdFlower |
Amazon Review Dataset DE | German-language reviews crawled from Amazon |
The code below initializes an object of the type `Acner`. `Acner` is one of the 14 datasets supported by Ovation (you can get an overview of the supported datasets in the tables above).
```python
# The import path below is an assumption; adjust it to wherever the
# Ovation dataset classes live in your checkout.
# from datasets import Acner

# Instantiate a new object of the dataset class
acner = Acner()

# You can use the variable `epochs_completed` to keep track of how
# many passes over the training data have been made so far
while acner.train.epochs_completed < 10:
    # You can get a new batch with `next_batch()`
    train_batch = acner.train.next_batch(
        # By default, the batch size is always 64
        batch_size=64,
        # `pad` makes sense for sequences. Here we pad the sequences
        # with an invalid character so that all instances of the
        # batch have 40 elements
        pad=40,
        # If `one_hot` is not True, we get only a sequence of numbers
        # with the index of each word of the sequence in the
        # vocabulary.
        one_hot=True)

    # Do something with the batch, e.g. feed it to your own
    # training function:
    train_step(train_batch.sentences,
               train_batch.pos,
               train_batch.ner)
```
There are several other parameters accepted by `next_batch()`. We recommend exploring them.
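For instance, you can inspect the accepted parameters directly from Python. This is a minimal sketch using the standard `inspect` module and the `acner` object instantiated above; the exact parameter list depends on the dataset class.

```python
import inspect

# Print the signature of `next_batch()` to see which keyword
# arguments it accepts and what their default values are.
print(inspect.signature(acner.train.next_batch))
```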
When you instantiate an object of one of the supported datasets, a vocabulary is loaded (or created, if it doesn't exist yet). There may be several reasons why the provided vocabulary is not suited to your purposes. You can, therefore, create a new vocabulary with:
```python
# Instantiate a new object of the dataset STS
sts = STS()

# Creates a new vocabulary. This will create new files for the
# vocabulary with names related to the parameter `name`. It will
# also change internal variables in `sts` to point to these new
# files.
sts.create_vocabulary(
    # Only tokens appearing at least `min_frequency` times in the
    # dataset will be inserted in the vocabulary.
    min_frequency=10,
    # Chooses the tokenizer to be used to create the vocabulary.
    # Possible options are 'spacy', 'nltk', 'split' and 'other'.
    # The 'split' tokenizer is just a call to `str.split(' ')`.
    tokenizer='split',
    # If `downcase` is True, then the counting of words in the
    # dataset is not case sensitive. While this may be useful for
    # English, notice that this may be dangerous for German, where
    # capitalized words are very common.
    downcase=False,
    # After all words have been counted, we still discard some of
    # the less frequent words so that the size of the vocabulary
    # does not exceed `max_vocab_size`.
    max_vocab_size=10000,
    # The `name` is used to name the new files to be created and
    # stored on the hard disk.
    name='my_new_vocab',
    # If `load_w2v` is True, then it will create a new vector
    # representation for each one of the words in the vocabulary
    # (which is what we want). If it is False, then it won't change
    # the current vector representations.
    load_w2v=True)

# With the new vocabulary created, we can, for example, train a new
# model, e.g.,
train(sts, sts.metadata_path, sts.w2v)
# (see the code templates for an idea of what the call to `train`
# above means)
```
For example, you could consider that your vocabulary is too big or too small. One way to change the size of the vocabulary is to only consider tokens that appear at least a certain number n of times in the dataset. For all our datasets, this number n is 5. This may not be suited for very small datasets, where actually relevant tokens might appear only once or twice. On the other hand, if a dataset is very big, then even very irrelevant tokens might appear many more than 5 times.
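As a hypothetical illustration (the values and names below are made up and should be tuned to your data, and the remaining `create_vocabulary()` parameters are assumed to keep their defaults), a large corpus might use a higher `min_frequency`, while a very small one might keep every token:

```python
# Large corpus: drop rare tokens aggressively so the vocabulary
# stays manageable.
sts.create_vocabulary(min_frequency=20, name='large_corpus_vocab')

# Very small corpus: keep even tokens that appear only once.
sts.create_vocabulary(min_frequency=1, name='tiny_corpus_vocab')
```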
These dataset objects also carry additional information about the dataset. The following code snippet shows all of it:
```python
# Instantiate a new object of the dataset STS
sts = STS()

print(sts.dataset_path)
# Path to the directory where the dataset is located
print(sts.train_path)
# Path to the training file
print(sts.validation_path)
# Path to the validation file
print(sts.test_path)
# Path to the test file
print(sts.vocab_path)
# Path to the vocabulary file. This is a tab-separated file in which
# every line contains a term and its frequency
print(sts.metadata_path)
# Exactly the same as the vocab file, but with a header at the top.
# This file is used for the TensorFlow Embedding visualization
print(sts.w2v_path)
# Path to the preloaded word2vec matrix that corresponds to the
# vocabulary. It is a .npy file and can be loaded using numpy
print(sts.w2i)
# A dictionary with the terms of the vocabulary as keys and their
# corresponding ids as values
print(sts.i2w)
# A dictionary with the ids of the terms in the vocabulary as keys
# and the terms themselves as values
print(sts.w2v[:10])
# Prints the preloaded word embeddings for the first 10 terms in the
# vocabulary
print(sts.vocab_size)
# Size of the vocabulary

sts.train
# A helper object for fetching preprocessed training batches
sts.validation
# A helper object for fetching preprocessed validation batches
sts.test
# A helper object for fetching preprocessed test batches
```
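As a minimal sketch of how these attributes fit together (the token `'the'` is just an example and may not be present in your vocabulary):

```python
import numpy as np

# The word2vec matrix can also be loaded directly from disk;
# this is the same data as `sts.w2v`.
w2v = np.load(sts.w2v_path)

term = 'the'  # example token; replace it with one from your vocabulary
if term in sts.w2i:
    term_id = sts.w2i[term]
    # Map the term to its id and back, then fetch its embedding.
    print(term, '->', term_id, '->', sts.i2w[term_id])
    print('embedding shape:', w2v[term_id].shape)
```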
To explore all the datasets, take a look at the code here.