Weibo senti 100k is very likely labelled by the emoticons #1

Open
ThiagoSousa opened this issue Aug 6, 2018 · 4 comments

Comments

@ThiagoSousa

I downloaded this dataset (ChineseNlpCorpus/datasets/weibo_senti_100k) to train a model for Chinese sentiment analysis. While preprocessing it, I observed that 100% of the posts contain emoticons. Here are the 10 most frequent emoticons overall, and then separately for the positive and negative polarity:

1013 distinct emoticons in total. The top 10 are: [('泪', 44489), ('哈哈', 40510), ('嘻嘻', 22370), ('抓狂', 17262), ('鼓掌', 15923), ('爱你', 12685), ('怒', 12011), ('衰', 10466), ('晕', 9440), ('偷笑', 8375)]

710 distinct emoticons in the positive set. The top 10 are: [('哈哈', 35764), ('嘻嘻', 20115), ('鼓掌', 14836), ('爱你', 11349), ('偷笑', 5223), ('太开心', 3820), ('可爱', 3809), ('心', 2122), ('赞', 1991), ('给力', 1976)]

695 distinct emoticons in the negative set. The top 10 are: [('泪', 43248), ('抓狂', 16643), ('怒', 11830), ('衰', 10202), ('晕', 9022), ('哈哈', 4746), ('偷笑', 3152), ('蜡烛', 2887), ('汗', 2456), ('嘻嘻', 2255)]
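For anyone who wants to reproduce counts like these, here is a minimal sketch of one way to do it. It is not the script used for the numbers above. It assumes that emoticons appear in the text as short bracketed tokens such as `[哈哈]` and that each row is a `(label, text)` pair with label 1 for positive and 0 for negative. The sample rows are made up for illustration and are not taken from the dataset.

```python
import re
from collections import Counter

# Assumption: an emoticon is a short bracketed token like [哈哈] (1 to 8 characters, no spaces).
EMOTICON_RE = re.compile(r"\[([^\[\]\s]{1,8})\]")

def emoticon_counts(rows):
    """Count emoticons overall (key "all") and per label.

    rows: iterable of (label, text) pairs.
    """
    counts = {"all": Counter()}
    for label, text in rows:
        found = EMOTICON_RE.findall(text)
        counts["all"].update(found)
        counts.setdefault(label, Counter()).update(found)
    return counts

# Made-up sample rows (hypothetical, not from weibo_senti_100k).
sample = [
    (1, "今天天气真好[哈哈][哈哈]"),
    (1, "谢谢大家[爱你]"),
    (0, "又加班了[泪][抓狂]"),
    (0, "没赶上车[泪]"),
]

counts = emoticon_counts(sample)
print(len(counts["all"]), "distinct emoticons in total")
print("positive:", counts[1].most_common(10))
print("negative:", counts[0].most_common(10))
```

On the real data, the rows would come from the dataset file instead of the inline list; the column layout of that file is not described in this thread, so check it before adapting the loader.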

I trained a very simple classifier and obtained 98% accuracy in 2 epochs, so the emoticons strongly bias the classification. This led me to conclude that the posts were not annotated manually one by one. Most likely, whoever built the dataset manually classified some frequent emoticons and then used them to tag the posts. A note for anyone who wants to use this data: you will probably want to strip the emoticons out of it to avoid this bias.
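A minimal sketch of that cleaning step, under the same assumption that emoticons are short bracketed tokens such as `[泪]`. The helper names `strip_emoticons` and `clean_rows` are hypothetical. Posts that consist only of emoticons become empty after stripping, so the sketch drops them.

```python
import re

# Assumption: an emoticon is a short bracketed token like [泪] (1 to 8 characters, no spaces).
EMOTICON_RE = re.compile(r"\[[^\[\]\s]{1,8}\]")

def strip_emoticons(text):
    """Remove bracketed emoticon tokens and surrounding whitespace."""
    return EMOTICON_RE.sub("", text).strip()

def clean_rows(rows):
    """Strip emoticons from (label, text) pairs and drop posts left empty."""
    cleaned = []
    for label, text in rows:
        stripped = strip_emoticons(text)
        if stripped:
            cleaned.append((label, stripped))
    return cleaned

# Made-up sample rows (hypothetical, not from weibo_senti_100k).
rows = [(0, "又加班了[泪][抓狂]"), (1, "[哈哈]"), (1, "谢谢大家[爱你]")]
cleaned_rows = clean_rows(rows)
print(cleaned_rows)
```

Retraining on the cleaned text and comparing accuracy against the original run is a quick way to see how much of the 98% came from the emoticons alone.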

Peace!

@OYE93

OYE93 commented Jan 4, 2019

lol, the findings are really interesting! @ThiagoSousa

@jinhuakst
Contributor

@ThiagoSousa Yeah. Thank you for your comments.

@arsentiii

thx for your work.

@easywaytodo

Could I use it with BERT, and how should I preprocess the data? Are the emoticons out of vocabulary?
