
# The Datasets API


## The Dataset Classes

Ovation provides classes for accessing several datasets through the same API. The dataset files are expected to be located in `/data/datasets/<dataset-name>`. Each dataset is divided into three splits: train, validation, and test.
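Each split is exposed as an attribute of the dataset object. As a minimal sketch (`train` appears in the examples below; `validation` and `test` are assumed here by analogy with the split names, and the import path may differ in your setup):

```python
# Hypothetical import path; adjust to your installation of Ovation
from datasets import Acner

acner = Acner()

# Each split is assumed to be reachable as an attribute, and each
# split tracks its own epoch counter
for split in (acner.train, acner.validation, acner.test):
    print(split.epochs_completed)
```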

## Supported Datasets

The table below shows the datasets supported by Ovation. It may still be incomplete; we are working on supporting more datasets in the future. The descriptions are generally excerpts from the datasets' official websites.

| Dataset | Class | Task | Licence | Description |
|---|---|---|---|---|
| STS | STS, STSLarge | IC | Multiple? | This is a merge of all the IC datasets. |
| SICK | Sick | IC | (couldn't find) | |
| Microsoft Paraphrase Dataset | MSPD | IC | (couldn't find) | |
| PPDB: The Paraphrase Database | PPDB | IC | CC-BY 3.0 | |
| Quora Similar questions dataset | Quora | IC | https://www.quora.com/about/tos | |
| SemEval | SemEval | IC | CC-BY-SA 3.0 | |
| Stack Exchange | StackExchange | IC | CC-BY-SA 3.0 | |
| Annotated Corpus for NER | Acner | NER | (couldn't find) | |
| GermEval 2014 NER | Germeval | NER | CC-BY 4.0 | The GermEval 2014 NER Shared Task builds on a new dataset with German Named Entity annotation with the following properties: (1) the data was sampled from German Wikipedia and news corpora as a collection of citations; (2) the dataset covers over 31,000 sentences corresponding to over 590,000 tokens; (3) the NER annotation uses the NoSta-D guidelines, which extend the Tübingen Treebank guidelines, using four main NER categories with sub-structure, and annotating embeddings among NEs such as [ORG FC Kickers [LOC Darmstadt]]. |
| GerSEN | Gersen | SA | GerSEN's license (basically: the dataset is private and non-commercial, and publications must cite it) | The dataset consists of 2,369 sentiment-annotated sentences, extracted from German news articles related to Berlin's universities. |
| Hotel Review Dataset | HotelReviews | SA | (couldn't find) | This dataset consists of 878,561 reviews (1.3 GB) from 4,333 hotels crawled from TripAdvisor. |
| CrowdFlower Twitter Emotion Dataset | TwitterEmotion | SA | CC Public Domain | In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. Hundreds to thousands of examples across 13 labels. A subset of this data was used in an experiment uploaded to Microsoft's Cortana Intelligence Gallery. Added July 15, 2016 by CrowdFlower. |
| Amazon Review Dataset DE | AmazonReviewsGerman | SA | Private (scraped by Insiders) | Crawled data from German Amazon reviews. |

## Examples

### Getting a batch

The code below initializes an object of type `Acner`, one of the 14 dataset classes supported by Ovation (you can get an overview in the Supported Datasets section above).

```python
# Hypothetical import path; adjust to your installation of Ovation
from datasets import Acner

# Instantiate a new object of the dataset class
acner = Acner()

# `epochs_completed` tracks how many full passes over the split have
# been made; here we loop until 10 epochs are completed
while acner.train.epochs_completed < 10:

    # You can get a new batch with `next_batch()`
    train_batch = acner.train.next_batch(
                             # The default batch size is 64
                             batch_size=64,
                             # `pad` makes sense for sequences. Here
                             # we pad the sequences with an invalid
                             # character so that all instances of the
                             # batch have 40 elements
                             pad=40,
                             # If `one_hot` is not True, we get only
                             # a sequence of numbers with the index
                             # of each word of the sequence in the
                             # vocabulary
                             one_hot=True)

    # Do something with the batch, e.g.,
    train_step(train_batch.sentences,
               train_batch.pos,
               train_batch.ner)
```

`next_batch()` accepts several other parameters; we recommend exploring them (see the sketch below).
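One way to discover which keyword arguments your version of `next_batch()` accepts is to inspect its signature with Python's standard `inspect` module. A minimal sketch, reusing the `acner` object from above:

```python
import inspect

# Lists every parameter of `next_batch()` together with its default
print(inspect.signature(acner.train.next_batch))
```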

### Creating a new vocabulary

When you instantiate an object of one of the supported datasets, a vocabulary is loaded (or created, if it doesn't exist yet). There are several reasons why the provided vocabulary might not suit your purposes; you can therefore create a new one with:

```python
# Instantiate a new object of the dataset STS
sts = STS()

# Build a new vocabulary for the dataset (assuming the variables
# below were defined beforehand with the settings you want)
sts.create_vocabulary(
                  # Discard tokens that appear fewer than
                  # `min_frequency` times in the dataset
                  min_frequency=min_frequency,
                  # Function used to split the text into tokens
                  tokenizer=tokenizer,
                  # Whether to lowercase the tokens
                  downcase=downcase,
                  # Maximum number of tokens to keep
                  max_vocab_size=max_vocab_size,
                  # Name under which the new vocabulary is stored
                  name=name)
```

For example, you might consider your vocabulary too big or too small. One way to change its size is to only consider tokens that appear at least a certain number n of times in the dataset. For all our datasets, this number n is 5. This may not be suited for very small datasets, where genuinely relevant tokens might appear only once or twice. On the other hand, if a dataset is very big, even irrelevant tokens might appear many more than 5 times.
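For instance, to keep rare but relevant tokens in a small dataset, you could lower `min_frequency`. A minimal sketch, reusing the `sts` object from above and assuming the remaining parameters of `create_vocabulary()` have usable defaults:

```python
# Keep every token that appears at least once; with the default
# threshold of 5, rare but relevant tokens would be discarded
sts.create_vocabulary(min_frequency=1)
```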