\section{Impact of annotations on supervised polarity classification}
\label{sect:classifier}
This section evaluates the impact of AMT-generated annotations on the polarity classification task.
To this end, a comparative evaluation between two groups of polarity classification systems is conducted:
baseline or reference classifiers trained with the noisy available metadata are compared with
contrastive classifiers trained with the AMT-generated annotations.
Although more sophisticated classification schemes could be conceived for this task, a simple SVM-based binary supervised classification approach is adopted here.
\subsection{Description of datasets}
For conducting the experimental evaluation, three different datasets were considered:
\begin{enumerate}
\item Baseline: constitutes the dataset used for training the baseline or reference classifiers.
Automatic annotation for this dataset was obtained by using the following naive approach: sentences extracted from
comments with a rating of 5 were assigned to class ``positive'', sentences extracted from comments with a rating
of 3 were assigned to ``neutral'', and sentences extracted from comments with a rating of 1 were assigned to
``negative''. This dataset contains a total of 5570 sentences, with a vocabulary coverage of 11797 words.
\item Annotated: constitutes the dataset that was manually annotated by AMT workers.
This dataset is used for training the contrastive classifiers, which are to be compared with the baseline systems.
The three independent annotations generated by AMT workers for each sentence within this dataset were consolidated into a single annotation
by using the following criterion: if the three provided annotations happened to be
different\footnote{Actually, this kind of total disagreement among annotators occurred in only 13 sentences out of 1000.},
the sentence was assigned to class ``neutral''; otherwise, the sentence was assigned to the class on which at least
two annotators agreed (a small illustrative sketch of this rule is given after this list). This dataset contains a total of 1000 sentences, with a vocabulary coverage
of 3022 words.
\item Evaluation: constitutes the gold standard used for evaluating the performance of the classifiers.
This dataset was manually annotated by three experts in an independent manner. The gold standard annotation
was consolidated by using the same criterion used for the previous dataset\footnote{In this case,
inter-annotator agreement was above 80\%, and total disagreement among annotators occurred in only 1 sentence
out of 500.}. This dataset contains a total of 500 sentences, with a vocabulary coverage of 2004 words.
\end{enumerate}
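For clarity, the consolidation rule described above can be summarized with the following minimal
Python sketch (illustrative only; the function name is ours, and no particular implementation is implied):
\begin{verbatim}
# Majority-vote consolidation with "neutral" as the fallback
# when the three annotators fully disagree.
from collections import Counter

def consolidate(labels):
    """Reduce three per-sentence annotations to a single label."""
    assert len(labels) == 3
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else "neutral"

print(consolidate(["positive", "positive", "neutral"]))  # -> positive
print(consolidate(["positive", "negative", "neutral"]))  # -> neutral
\end{verbatim}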
These three datasets were constructed by randomly extracting sample sentences from an original corpus
of over 25000 comments containing more than 1000000 sentences in total. The sampling was conducted
under the following constraints: the three resulting datasets must not overlap, only sentences
containing more than 3 tokens could be extracted, and each resulting dataset must be balanced, as much
as possible, in terms of the number of sentences per class. Table \ref{tc_corpus} presents the
distribution of sentences per class for each of the three considered datasets.
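One possible way of implementing these sampling constraints is sketched below in Python. The data layout
(parallel lists of sentences and naive rating-based labels used as the balancing criterion), the toy corpus,
and the per-class sizes are all assumptions for illustration; they only approximate the figures in Table \ref{tc_corpus}.
\begin{verbatim}
# Balanced, non-overlapping sampling of sentences with more than 3 tokens.
import random

def draw_balanced(sentences, labels, available, per_class):
    """Pick up to per_class indices per label from `available`,
    remove them from the pool, and return the selection."""
    selection = []
    for label in ("positive", "negative", "neutral"):
        candidates = [i for i in available
                      if labels[i] == label and len(sentences[i].split()) > 3]
        picked = random.sample(candidates, min(per_class, len(candidates)))
        selection.extend(picked)
        available.difference_update(picked)
    return selection

# Toy stand-in for the original corpus and its rating-based labels.
sentences = ["the room was clean and very quiet",
             "the staff was not helpful at all",
             "we stayed there for two nights"] * 3000
labels = ["positive", "negative", "neutral"] * 3000
available = set(range(len(sentences)))
baseline_idx   = draw_balanced(sentences, labels, available, 1850)
annotated_idx  = draw_balanced(sentences, labels, available, 335)
evaluation_idx = draw_balanced(sentences, labels, available, 170)
\end{verbatim}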
\begin{table}
\begin{tabular}{|l|l|l|l|}
\hline
&Baseline &Annotated &Evaluation \\
\hline
Positive &1882 &341 &200 \\
\hline
Negative &1876 &323 &137 \\
\hline
Neutral &1812 &336 &161 \\
\hline
Totals &5570 &1000 &500 \\
\hline
\end{tabular}
\caption{Sentence-per-class distributions for baseline, annotated and evaluation datasets.}
\label{tc_corpus}
\end{table}
\subsection{Experimental settings}
As mentioned above, a simple SVM-based supervised classification approach was adopted for the
polarity detection task. Accordingly, two different groups of classifiers were
considered: a baseline or reference group, and a contrastive group. Classifiers within these two groups were
trained with data samples extracted from the baseline and annotated datasets, respectively. Within each group
of classifiers, three different binary classification subtasks were considered: positive/not\_positive,
negative/not\_negative and neutral/not\_neutral. All trained binary classifiers were evaluated by computing
precision and recall for each considered class, as well as overall classification accuracy, over the
evaluation dataset.
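Each binary subtask can be derived from the three-class labels in a one-vs-rest fashion, as in the following
sketch (illustrative Python; no particular toolkit is implied by the paper):
\begin{verbatim}
# One-vs-rest binarization of the three-class labels.
def binarize(labels, target):
    """Map the target class to 1 and every other class to 0."""
    return [1 if lab == target else 0 for lab in labels]

labels = ["positive", "neutral", "negative", "positive"]
y_pos = binarize(labels, "positive")   # positive / not_positive
y_neg = binarize(labels, "negative")   # negative / not_negative
y_neu = binarize(labels, "neutral")    # neutral  / not_neutral
\end{verbatim}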
A feature space representation of the data was constructed by using the standard bag-of-words approach.
In this way, a sparse vector was obtained for each sentence in the datasets. Stop-word removal was not
conducted before computing the vector models, and standard normalization and TF-IDF weighting schemes were used.
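As an illustration of this representation, the following sketch uses scikit-learn's \texttt{TfidfVectorizer}
(the choice of toolkit is ours; the paper does not specify one), keeping stop words and applying L2 normalization:
\begin{verbatim}
# Sparse bag-of-words vectors with TF-IDF weighting and L2 normalization;
# stop words are deliberately kept (stop_words=None is the default).
from sklearn.feature_extraction.text import TfidfVectorizer

train_sentences = ["the room was clean and very quiet",
                   "the staff was not helpful at all"]
eval_sentences  = ["the room was not clean"]

vectorizer = TfidfVectorizer(stop_words=None, norm="l2")
X_train = vectorizer.fit_transform(train_sentences)  # fit vocabulary on training data
X_eval  = vectorizer.transform(eval_sentences)       # reuse the same vocabulary
\end{verbatim}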
Multiple-fold cross-validation was used in all conducted experiments to cope with the statistical variability of the
data. In this sense, twenty independent realizations were conducted for each experiment presented and,
instead of individual output results, mean values and standard deviations of the evaluation metrics are reported.
Each binary classifier realization was trained with a random subsample set of 600 sentences extracted from
the training dataset corresponding to the classifier group, i.e. the baseline dataset for reference systems
and the annotated dataset for contrastive systems. Training subsample sets were always balanced with respect to
the original three categories: ``positive'', ``negative'' and ``neutral''.
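The overall protocol for one binary subtask is sketched below under stated assumptions: a linear SVM is used
(the paper only specifies an SVM-based classifier), scikit-learn is assumed as the toolkit, and the 600-sentence
balanced subsample is taken as 200 sentences per class.
\begin{verbatim}
# 20 independent realizations of one binary subtask: balanced subsampling,
# TF-IDF vectorization, linear SVM training, and evaluation on the gold standard.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

def run_subtask(train_sents, train_labels, eval_sents, eval_labels,
                target, n_runs=20, per_class=200, seed=0):
    rng = np.random.default_rng(seed)
    train_labels = np.asarray(train_labels)
    y_eval = (np.asarray(eval_labels) == target).astype(int)
    acc, pre, rec = [], [], []
    for _ in range(n_runs):
        # Balanced random subsample: per_class sentences from each class.
        idx = np.concatenate([
            rng.choice(np.where(train_labels == c)[0], per_class, replace=False)
            for c in ("positive", "negative", "neutral")])
        vec = TfidfVectorizer()
        X_train = vec.fit_transform([train_sents[i] for i in idx])
        y_train = (train_labels[idx] == target).astype(int)
        X_eval = vec.transform(eval_sents)
        y_pred = LinearSVC().fit(X_train, y_train).predict(X_eval)
        acc.append(accuracy_score(y_eval, y_pred))
        pre.append(precision_score(y_eval, y_pred, zero_division=0))
        rec.append(recall_score(y_eval, y_pred, zero_division=0))
    # Report mean and standard deviation over the n_runs realizations.
    return {m: (float(np.mean(v)), float(np.std(v)))
            for m, v in (("accuracy", acc), ("precision", pre), ("recall", rec))}
\end{verbatim}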
\subsection{Results and discussion}
Table \ref{tc_pre_rec} presents the resulting mean values of precision and recall for each considered class
in classifiers trained with either the baseline or the annotated dataset. As observed in the table, with the
exception of recall for class ``negative'' and precision for class ``not\_negative'', both metrics are substantially
improved when the annotated dataset is used for training the classifiers. The largest improvements
are observed for ``neutral'' precision and recall.
\begin{table}
\begin{tabular}{|l|l|l|}
\hline
Class &Precision &Recall \\
\hline
positive &50.10 (3.79) &62.00 (7.47) \\
&60.21 (2.07) &71.00 (2.18) \\
\hline
not\_positive &69.64 (2.70) &58.05 (7.54) \\
&77.95 (1.32) &68.54 (2.75) \\
\hline
negative &35.25 (2.63) &53.46 (10.55) \\
&39.07 (1.78) &55.52 (3.26) \\
\hline
not\_negative &78.04 (2.19) &62.62 (6.76) \\
&79.73 (1.10) &66.87 (2.31) \\
\hline
neutral &32.51 (3.02) &48.03 (7.33) \\
&44.72 (2.00) &67.12 (2.96) \\
\hline
not\_neutral &68.17 (2.65) &52.81 (3.84) \\
&79.41 (1.58) &60.40 (2.96) \\
\hline
\end{tabular}
\caption{Mean precision and recall over 20 independent simulations (with standard deviations given in parentheses)
for each considered class in classifiers trained with either the baseline dataset (upper values) or the annotated dataset (lower values).}
\label{tc_pre_rec}
\end{table}
Table \ref{tc_accu} presents the resulting mean values of accuracy for each considered subtask
in classifiers trained with either the baseline or the annotated dataset. As observed in the table,
all subtasks benefit from using the annotated dataset for training the classifiers; however, it is
important to mention that while similar absolute gains are observed for the ``positive/not\_positive''
and ``neutral/not\_neutral'' subtasks, this is not the case for the ``negative/not\_negative'' subtask,
which gains much less than the other two.
\begin{table}
\begin{tabular}{|l|l|l|}
\hline
Classifier &Baseline &Annotated \\
\hline
positive/not\_positive &59.63 (3.04) &69.53 (1.70) \\
\hline
negative/not\_negative &60.09 (2.90) &63.73 (1.60) \\
\hline
neutral/not\_neutral &51.27 (2.49) &62.57 (2.08) \\
\hline
\end{tabular}
\caption{Mean accuracy over 20 independent simulations (with standard deviations given in parentheses)
for each classification subtask trained with either the baseline or the annotated dataset.}
\label{tc_accu}
\end{table}
After considering all evaluation metrics, the benefit provided by the availability of human-annotated data is
evident for classes ``neutral'' and ``positive''. However, in the case of class ``negative'', although some
gain is also observed, the benefit of human-annotated data does not seem to be as large as for the two other
classes. This, along with the fact that the ``negative/not\_negative'' subtask is actually the best performing
one (in terms of accuracy) when baseline training data is used, might suggest that low-rating comments contain
a better representation of sentences belonging to class ``negative'' than medium- and high-rating comments do with
respect to classes ``neutral'' and ``positive''.
In any case, this experimental work verifies the feasibility of using AMT to construct training datasets for
opinionated content analysis, and it provides an approximate idea of the costs involved in generating
this type of resource.