The output dataset is an extension of the existing input dataset retrieved from the SMS Spam Collection Dataset.
This repo stores the input dataset, the dataset with the embeddings and the code used to generate this dataset.
The original dataset contains 5574 English messages, each labelled as spam or ham. This dataset contains 4 columns:
v1 -> Target column specifying if the message is spam or ham
v2 -> The original unprocessed messages
Unnamed_col_1 & Unnamed_col_2 -> Columns with mostly missing values (around 99%) that are discarded
The output encoded dataset contains the same information as the input dataset plus the additional DistilBERT classification embeddings. This results in a dataset with 770 columns:
spam -> Target column specifying if the message is spam or ham
original_message -> The original unprocessed messages
0 up to 768 -> Columns containing the DistilBERT classification embeddings for the message after it has been processed
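The column count can be checked with a short sketch. It assumes the embedding columns are named by zero-based position, so the 768 dimensions of a DistilBERT base embedding would run from 0 to 767; the helper name `output_columns` is hypothetical and is not taken from the repo's code.

```python
EMBEDDING_DIM = 768  # hidden size of the DistilBERT base model


def output_columns(embedding_dim=EMBEDDING_DIM):
    """Return the column names of the encoded dataset (hypothetical helper)."""
    # Assumption: embedding columns are named "0", "1", ... by zero-based position.
    embedding_columns = [str(i) for i in range(embedding_dim)]
    return ["spam", "original_message"] + embedding_columns


columns = output_columns()
# Two metadata columns plus one column per embedding dimension.
print(len(columns))
```

Under these assumptions the total is 2 + 768 = 770 columns, which matches the count stated above.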
The DistilBERT implementation from HuggingFace's transformers package is used.
Jay Alammar's tutorial is followed to encode the messages using DistilBERT.
For memory efficiency, all messages are first stripped of punctuation and then English stopwords are removed. Only the first 30 tokens are then kept.
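The preprocessing steps described above could be sketched roughly as follows. This is a minimal illustration and not the repo's actual code: the stopword list here is a tiny placeholder (the text does not say which English stopword list was used), tokens are assumed to be whitespace-separated words, and the function name `preprocess` is hypothetical.

```python
import string
from typing import List

# Assumption: a small illustrative stopword set; a full English stopword list
# (for example the one shipped with NLTK) would normally be used instead.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "and", "you", "i", "it", "for", "on"}

MAX_TOKENS = 30  # from the text: only the first 30 tokens are kept


def preprocess(message: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Strip punctuation, drop stopwords, and keep the first `max_tokens` words."""
    # Remove every ASCII punctuation character from the message.
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    # Assumption: tokens are whitespace-separated words, matched case-insensitively.
    words = [w for w in no_punct.split() if w.lower() not in STOPWORDS]
    # Truncate to bound the sequence length passed on to the encoder.
    return words[:max_tokens]


print(preprocess("Free entry in 2 a wkly comp to win FA Cup final tkts!"))
```

Truncating after stopword removal, rather than before, means the 30 kept tokens are all content words, so fewer informative words are lost to the cut-off.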
As my analysis of this dataset on Kaggle shows, most ham messages have around 10 words and most spam messages around 29 words once stopwords are removed. Keeping only the first 30 tokens after stopword removal may therefore lose some information, but not a critical amount. (In fact, my analysis shows that encoding only the first 10 tokens of each processed message is enough to produce a good encoding, capable of achieving 88.1 ROC-AUC with a baseline random forest.)
The original dataset is part of the UCI Machine Learning repository and can be found here.
UCI Machine Learning asks that, if you find the original dataset useful, you cite the original authors, found here.
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011