Sentiment analysis is the task of classifying the polarity of a given text.
The IMDb dataset is a binary sentiment analysis dataset consisting of 50,000 reviews from the Internet Movie Database (IMDb) labeled as positive or negative. Models are evaluated based on accuracy.
Model | Accuracy | Paper / Source |
---|---|---|
ULMFiT (Howard and Ruder, 2018) | 95.4 | Universal Language Model Fine-tuning for Text Classification |
Block-sparse LSTM (Gray et al., 2017) | 94.99 | GPU Kernels for Block-Sparse Weights |
oh-LSTM (Johnson and Zhang, 2016) | 94.1 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Virtual adversarial training (Miyato et al., 2016) | 94.1 | Adversarial Training Methods for Semi-Supervised Text Classification |
BCN+Char+CoVe (McCann et al., 2017) | 91.8 | Learned in Translation: Contextualized Word Vectors |
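As a minimal sketch of the accuracy metric used for the IMDb results above: accuracy is the fraction of reviews whose predicted label matches the gold label. The function name and the toy labels below are illustrative assumptions, not part of any benchmark tooling.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of examples whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, invented for illustration only.
gold = ["pos", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(f"{100 * accuracy(gold, predicted):.1f}")  # 75.0
```

The scores in the table are this quantity expressed as a percentage.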
The Stanford Sentiment Treebank contains 215,154 phrases with fine-grained sentiment labels in the parse trees of 11,855 sentences from movie reviews. Models are evaluated on either fine-grained (five-way) or binary classification, based on accuracy.
Fine-grained classification:
Model | Accuracy | Paper / Source |
---|---|---|
BCN+ELMo (Peters et al., 2018) | 54.7 | Deep contextualized word representations |
BCN+Char+CoVe (McCann et al., 2017) | 53.7 | Learned in Translation: Contextualized Word Vectors |
Binary classification:
Model | Accuracy | Paper / Source |
---|---|---|
Block-sparse LSTM (Gray et al., 2017) | 93.2 | GPU Kernels for Block-Sparse Weights |
bmLSTM (Radford et al., 2017) | 91.8 | Learning to Generate Reviews and Discovering Sentiment |
BCN+Char+CoVe (McCann et al., 2017) | 90.3 | Learned in Translation: Contextualized Word Vectors |
Neural Semantic Encoder (Munkhdalai and Yu, 2017) | 89.7 | Neural Semantic Encoders |
BLSTM-2DCNN (Zhou et al., 2017) | 89.5 | Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling |
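The binary task is commonly derived from the five-way labels by discarding neutral examples and merging the two negative and the two positive classes. The sketch below illustrates that mapping. It assumes the labels are encoded as integers 0 (very negative) to 4 (very positive); that encoding, the function name, and the toy sentences are assumptions for illustration, and the distributed dataset files may differ.

```python
from typing import Iterable, List, Tuple

NEGATIVE, POSITIVE = 0, 1


def to_binary(examples: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Collapse five-way labels (0..4) to binary, dropping neutral (2)."""
    binary = []
    for text, label in examples:
        if not 0 <= label <= 4:
            raise ValueError(f"unexpected label: {label}")
        if label == 2:
            continue  # neutral examples are not part of the binary task
        binary.append((text, NEGATIVE if label < 2 else POSITIVE))
    return binary


# Toy examples, invented for illustration only.
five_way = [
    ("a dreadful film", 0),
    ("somewhat dull", 1),
    ("it is a film", 2),
    ("quite enjoyable", 3),
    ("a masterpiece", 4),
]
print(to_binary(five_way))
```

One consequence of dropping the neutral class is that the binary test set is smaller than the fine-grained one, so the two accuracy columns are not computed over the same examples.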
The Yelp Review dataset consists of more than 500,000 Yelp reviews. It comes in both a binary and a fine-grained (five-class) version. Models are evaluated based on error (1 - accuracy; lower is better), reported as a percentage.
Fine-grained classification:
Model | Error | Paper / Source |
---|---|---|
ULMFiT (Howard and Ruder, 2018) | 29.98 | Universal Language Model Fine-tuning for Text Classification |
DPCNN (Johnson and Zhang, 2017) | 30.58 | Deep Pyramid Convolutional Neural Networks for Text Categorization |
CNN (Johnson and Zhang, 2016) | 32.39 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Char-level CNN (Zhang et al., 2015) | 37.95 | Character-level Convolutional Networks for Text Classification |
Binary classification:
Model | Error | Paper / Source |
---|---|---|
ULMFiT (Howard and Ruder, 2018) | 2.16 | Universal Language Model Fine-tuning for Text Classification |
DPCNN (Johnson and Zhang, 2017) | 2.64 | Deep Pyramid Convolutional Neural Networks for Text Categorization |
CNN (Johnson and Zhang, 2016) | 2.90 | Supervised and Semi-Supervised Text Categorization using LSTM for Region Embeddings |
Char-level CNN (Zhang et al., 2015) | 4.88 | Character-level Convolutional Networks for Text Classification |
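Because the Yelp tables report error rather than accuracy, a small conversion helps when comparing them with the accuracy-based tables elsewhere on this page. The sketch assumes the error values are percentages, which is the only reading consistent with their magnitudes; the function name is hypothetical.

```python
def error_to_accuracy(error_pct: float) -> float:
    """Convert an error rate in percent to an accuracy in percent."""
    if not 0.0 <= error_pct <= 100.0:
        raise ValueError("error must be a percentage between 0 and 100")
    return 100.0 - error_pct


# ULMFiT errors taken from the binary and fine-grained tables above.
print(f"{error_to_accuracy(2.16):.2f}")   # 97.84
print(f"{error_to_accuracy(29.98):.2f}")  # 70.02
```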
SemEval (International Workshop on Semantic Evaluation) has a dedicated task for sentiment analysis. An overview of the most recent edition of this task (Task 4) is available at: http://www.aclweb.org/anthology/S17-2088
SemEval-2017 Task 4 consists of five subtasks, each offered for both Arabic and English:
- Subtask A: Given a tweet, decide whether it expresses POSITIVE, NEGATIVE or NEUTRAL sentiment.
- Subtask B: Given a tweet and a topic, classify the sentiment conveyed towards that topic on a two-point scale: POSITIVE vs. NEGATIVE.
- Subtask C: Given a tweet and a topic, classify the sentiment conveyed in the tweet towards that topic on a five-point scale: STRONGLYPOSITIVE, WEAKLYPOSITIVE, NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.
- Subtask D: Given a set of tweets about a topic, estimate the distribution of tweets across the POSITIVE and NEGATIVE classes.
- Subtask E: Given a set of tweets about a topic, estimate the distribution of tweets across the five classes: STRONGLYPOSITIVE, WEAKLYPOSITIVE, NEUTRAL, WEAKLYNEGATIVE, and STRONGLYNEGATIVE.
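Subtasks D and E differ from the others in that the output is a class distribution for a whole set of tweets rather than a label per tweet. A naive way to produce such an estimate is "classify and count": label each tweet with any classifier and report the relative frequency of each class. The sketch below shows only that baseline; it is not the method of any participating system, and the function name and toy predictions are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def classify_and_count(
    predicted: Sequence[str], classes: Sequence[str]
) -> Dict[str, float]:
    """Estimate a class distribution from per-tweet predicted labels."""
    if not predicted:
        raise ValueError("need at least one prediction")
    counts = Counter(predicted)
    unknown = set(counts) - set(classes)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    total = len(predicted)
    # Classes that were never predicted get a proportion of 0.0.
    return {c: counts[c] / total for c in classes}


# Toy per-tweet predictions for one topic, invented for illustration only.
predictions = ["POSITIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]
print(classify_and_count(predictions, ["POSITIVE", "NEGATIVE"]))
# {'POSITIVE': 0.75, 'NEGATIVE': 0.25}
```

The same function covers the five-class setting of Subtask E by passing the five class names instead of two.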
Subtask A results:
Model | F1-score | Paper / Source |
---|---|---|
LSTMs+CNNs ensemble with multiple conv. ops (Cliche, 2017) | 0.685 | BB_twtr at SemEval-2017 Task 4: Twitter Sentiment Analysis with CNNs and LSTMs |
Deep Bi-LSTM+attention (Baziotis et al., 2017) | 0.677 | DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis |
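The table does not say how the F1-score is averaged over the three classes. Assuming it is the F1 averaged over the POSITIVE and NEGATIVE classes only (with NEUTRAL still affecting precision and recall through misclassifications), it could be computed as in the sketch below. The function names and toy labels are illustrative assumptions, not the official scorer.

```python
from typing import Sequence


def f1_for_class(gold: Sequence[str], predicted: Sequence[str], target: str) -> float:
    """F1 of a single class, treating every other label as 'not target'."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when the class is never matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_pn(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Mean of the POSITIVE and NEGATIVE per-class F1 scores."""
    return (
        f1_for_class(gold, predicted, "POSITIVE")
        + f1_for_class(gold, predicted, "NEGATIVE")
    ) / 2


# Toy labels, invented for illustration only.
gold = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEUTRAL", "NEGATIVE", "NEUTRAL"]
pred = ["POSITIVE", "NEUTRAL", "NEGATIVE", "NEUTRAL", "POSITIVE", "NEUTRAL"]
print(f"{f1_pn(gold, pred):.3f}")  # 0.583
```

In the toy example the POSITIVE class has an F1 of 0.5 and the NEGATIVE class an F1 of about 0.667, giving a mean of about 0.583.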
Sentihood is a dataset for targeted aspect-based sentiment analysis (TABSA), which aims to identify fine-grained polarity towards a specific aspect. The dataset consists of 5,215 sentences, 3,862 of which contain a single target, while the remainder contain multiple targets. F1 is used as the evaluation metric for aspect detection and accuracy as the evaluation metric for sentiment analysis.
Model | Aspect (F1) | Sentiment (acc) | Paper / Source |
---|---|---|---|
Liu et al. (2018) | 78.5 | 91.0 | Recurrent Entity Networks with Delayed Memory Update for Targeted Aspect-based Sentiment Analysis |
SenticLSTM (Ma et al., 2018) | 78.2 | 89.3 | Targeted Aspect-Based Sentiment Analysis via Embedding Commonsense Knowledge into an Attentive LSTM |
LSTM-LOC (Saeidi et al., 2016) | 69.3 | 81.9 | Sentihood: Targeted aspect based sentiment analysis dataset for urban neighbourhoods |