📚 Sentence-level Fact-checked Annotated dataset

Annotated dataset release for the fine-grained explainability evaluation of Hyphen! 💿

This is the first-ever sentence-level fact-checked dataset.

Abstract: Fake news 📰 is often generated by manipulating only a small part of the true information, i.e., entities, relations, parts of a sentence, or a paragraph. Some true information may also be present in the news piece to make it more appealing to the public, so it is crucial to distinguish between the true and fake parts of a piece of information. We therefore utilise and release a sentence-level fact-checked annotated dataset. We annotate the Politifact dataset with ground-truth evidence corresponding to different parts of the news text by referring to the fact-checking websites Politifact and Gossipcop, and to other trustworthy online sources.

Annotation Process 📝: To evaluate the efficacy of Hyphen in producing suitable explanations, we fact-check and annotate the Politifact dataset at the sentence level. Each sentence is assigned one of the following labels: true, fake, quote, unverified, non_checkworthy, or noise.

  • The annotators were also asked to arrange the fact-checked sentences in order of their check-worthiness. We take the help of four expert annotators in the 25-30 age group. The final label for each sentence was decided by majority voting amongst the four annotators.
  • Since different annotators might have different opinions about the check-worthiness of the sentences, the fourth annotator compiled the final rank list by referring to the rank lists produced by the first three annotators, comparing them using Kendall’s $\tau$ and Spearman’s $\rho$ rank correlation coefficients and manually observing the similarities between the three rank lists (see the sketch after this list).
  • The compiled list is then cross-checked and re-evaluated by the first three annotators for consistency.
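
As a purely illustrative sketch (not the authors' annotation tooling), pairwise agreement between annotators' check-worthiness rank lists could be measured as follows; the rank lists and annotator names are hypothetical, and `scipy` is assumed to be available.

```python
# Illustrative sketch: pairwise rank agreement between hypothetical
# annotator rank lists, using Kendall's tau and Spearman's rho.
from itertools import combinations

from scipy.stats import kendalltau, spearmanr

# Hypothetical example: the rank assigned to each of five sentences of one
# article by three annotators (1 = most check-worthy).
rank_lists = {
    "annotator_1": [1, 2, 3, 4, 5],
    "annotator_2": [1, 3, 2, 4, 5],
    "annotator_3": [2, 1, 3, 5, 4],
}

for (name_a, ranks_a), (name_b, ranks_b) in combinations(rank_lists.items(), 2):
    tau, _ = kendalltau(ranks_a, ranks_b)   # Kendall's tau rank correlation
    rho, _ = spearmanr(ranks_a, ranks_b)    # Spearman's rho rank correlation
    print(f"{name_a} vs {name_b}: tau = {tau:.2f}, rho = {rho:.2f}")
```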

Dataset format ⚅

Every CSV file, `politifact-annotation/politifact{news_id}.csv`, contains the sentences from the news article `politifact{news_id}`. Every CSV file follows the schema `#sample, #sent_id, #sentences, #label`.
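
For instance, one annotated article could be loaded with pandas roughly as follows. This is a minimal sketch: the `news_id` value is hypothetical, and we assume the column headers match the schema names listed above.

```python
# Minimal sketch: load the sentence-level annotations for one article.
# Assumes the CSV headers match the schema names given above.
import pandas as pd

news_id = 1234  # hypothetical news_id
df = pd.read_csv(f"politifact-annotation/politifact{news_id}.csv")

print(df.columns.tolist())                 # expected: ['#sample', '#sent_id', '#sentences', '#label']
print(df[["#sent_id", "#label"]].head())   # first few sentences with their labels
```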

Label interpretations

The following table lays down the meanings of each label in the annotated dataset:

| Label | Explanation |
| --- | --- |
| true | After verification from online sources, if it can be deduced that the claim introduced by a sentence is true, we label it true. |
| fake | After verification from online sources, if it can be deduced that the claim introduced by a sentence is false, we label it fake. |
| non_checkworthy | If a given sentence is not check-worthy for fake news detection, we label it as non_checkworthy. |
| quote | If a given sentence is a quote from someone’s speech, tweet, etc., we label it as a quote. |
| unverified | If we are unable to arrive at any conclusion regarding the veracity of a particular sentence after consulting all online resources, we label it as unverified. |
| noise | Owing to scraping errors, a sentence in the original scraped dataset may be noisy; in such cases, we label it as noise. |
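
Under the same column-name assumption as the loading sketch above, the label distribution across all annotated articles could be tallied like this (illustrative only):

```python
# Illustrative sketch: tally the label distribution across all annotated
# articles, assuming the column headers match the schema above.
import glob
from collections import Counter

import pandas as pd

label_counts = Counter()
for path in glob.glob("politifact-annotation/politifact*.csv"):
    labels = pd.read_csv(path)["#label"].astype(str)
    label_counts.update(labels.str.strip().str.lower())

# Expected keys: true, fake, non_checkworthy, quote, unverified, noise
print(label_counts.most_common())
```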

What counts as a reliable source for fact-checking?

  • Fact-checking websites like Politifact, Gossipcop and Snopes.
  • Trusted news providers like BBC News, The Indian Express, Economic Times, Hindustan Times, The Hindu, CNN, The New York Times, reports from Reuters, etc.
  • We don’t count Wikipedia as a reliable source.
  • We don’t count social media platforms like Twitter, Reddit, Facebook, etc., as reliable. However, any tweet/post from a verified social media account (i.e. one with a blue tick) is counted as reliable.
  • Comments on social media posts by verified accounts count as a reliable source of information.

📞 Contact

If you have any questions or issues, please feel free to reach out to Karish Grover at [email protected].

✏️ Citation

If you find this annotated dataset helpful, please feel free to leave a star ⭐️ and cite our paper:

```bibtex
@article{grover2022public,
  title={Public Wisdom Matters! Discourse-Aware Hyperbolic Fourier Co-Attention for Social-Text Classification},
  author={Grover, Karish and Angara, SM and Akhtar, Md and Chakraborty, Tanmoy and others},
  journal={arXiv preprint arXiv:2209.13017},
  year={2022}
}
```