Weibo senti 100k is very likely labelled by the emoticons #1
Comments
lol, the findings are really interesting! @ThiagoSousa
@ThiagoSousa Yeah. Thank you for your comments.
Thanks for your work.
Could I use it with BERT, and how should I preprocess the data? Are the emoticons out of vocabulary?
I downloaded this dataset (ChineseNlpCorpus/datasets/weibo_senti_100k) to train a model for Chinese sentiment analysis. While processing it, I observed that 100% of the posts contain emoticons. Here is the distribution of the top 10 emoticons overall and by positive and negative polarity:
1013 emoticons in total. They are: [('泪', 44489), ('哈哈', 40510), ('嘻嘻', 22370), ('抓狂', 17262), ('鼓掌', 15923), ('爱你', 12685), ('怒', 12011), ('衰', 10466), ('晕', 9440), ('偷笑', 8375)]
710 emoticons in the positive set. They are: [('哈哈', 35764), ('嘻嘻', 20115), ('鼓掌', 14836), ('爱你', 11349), ('偷笑', 5223), ('太开心', 3820), ('可爱', 3809), ('心', 2122), ('赞', 1991), ('给力', 1976)]
695 emoticons in the negative set. They are: [('泪', 43248), ('抓狂', 16643), ('怒', 11830), ('衰', 10202), ('晕', 9022), ('哈哈', 4746), ('偷笑', 3152), ('蜡烛', 2887), ('汗', 2456), ('嘻嘻', 2255)]
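For anyone who wants to reproduce counts like these, here is a minimal sketch. It assumes the emoticons show up in the post text as short bracketed tokens such as `[泪]` or `[哈哈]`, and that each row is a `(label, text)` pair with 1 for positive and 0 for negative. The regex, the `count_emoticons` helper, and the sample rows are all made up for illustration; they are not the script used for the numbers above.

```python
import re
from collections import Counter

# Assumption: an emoticon is a bracketed token of 1 to 8 characters, e.g. [泪].
EMOTICON_RE = re.compile(r"\[([^\[\]]{1,8})\]")

def count_emoticons(rows):
    """Count emoticons overall and per label.

    rows: iterable of (label, text) pairs, label 1 = positive, 0 = negative.
    """
    total = Counter()
    by_label = {0: Counter(), 1: Counter()}
    for label, text in rows:
        found = EMOTICON_RE.findall(text)
        total.update(found)
        by_label[label].update(found)
    return total, by_label

# Hypothetical sample rows, not taken from the dataset.
sample = [
    (1, "今天天气真好[哈哈][哈哈]"),
    (0, "又加班了[泪]"),
    (1, "谢谢大家[鼓掌]"),
]

total, by_label = count_emoticons(sample)
print(len(total), "emoticons in total:", total.most_common(10))
print(len(by_label[1]), "in the positive set:", by_label[1].most_common(10))
print(len(by_label[0]), "in the negative set:", by_label[0].most_common(10))
```

On the real data you would feed in the rows read from the CSV instead of `sample`; the column layout is something to check in the file itself.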
I trained a very simple model on it and obtained 98% accuracy in 2 epochs, so the emoticons strongly bias the classification. This led me to conclude that the dataset was not manually annotated. Most likely, whoever built the dataset manually classified some frequent emoticons and used them to tag the posts. Just a heads-up for anyone who wants to use this data: you'd probably want to clean the emoticons out of it to avoid this bias.
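A rough sketch of that cleaning step, again assuming the emoticons are short bracketed tokens like `[泪]`. The `strip_emoticons` helper and its regex are hypothetical, and the pattern will also remove any other short bracketed text, so check it against a few real posts before relying on it.

```python
import re

# Assumption: an emoticon is a bracketed token of 1 to 8 characters, e.g. [泪].
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")

def strip_emoticons(text):
    """Remove bracketed emoticon tokens and trim surrounding whitespace."""
    return EMOTICON_RE.sub("", text).strip()

# Hypothetical examples, not taken from the dataset.
print(strip_emoticons("又加班了[泪]"))
print(strip_emoticons("[哈哈]好开心[嘻嘻]"))
```

Some posts may end up empty once the emoticons are gone, so it is probably worth dropping those rows before training.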
Peace!