cltl/InappropriateLanguageDetection

 
 

Task description:

This repository contains annotated data on inappropriate language in online discussions, generated through a combination of expert annotation, crowd-sourcing, and ChatGPT-based methods.

annotations:

ChatGPT_explicit: This subfolder contains annotations of explicit inappropriate language identified by ChatGPT.
ExplicitlyInappropriateLanguageInContext: This subfolder contains crowd and expert annotations marking instances of explicitly inappropriate language.

codes:

Contains the scripts used for tasks such as data processing and analysis.

data:

Holds the raw and processed data used for annotation and analysis. This includes input data in various formats and intermediate data sets generated during processing.

LingoTurk files:

Contains files related to the LingoTurk platform, which was used for collecting annotations. This includes task configurations and instructions.

statistics:

Includes statistical reports and summaries derived from the data set.

the analysis of annotations:

Contains detailed analyses of annotation results, including comparisons between different annotation methods, inter-annotator agreements, error analysis, and insights into annotation discrepancies.
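As an illustration of what an inter-annotator agreement comparison involves, here is a minimal sketch of Cohen's kappa for two annotators. It is not the repository's own analysis code: the function name `cohen_kappa` and the example labels are assumptions made up for this sketch, and the measure actually used for this data set may differ.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels, invented for illustration (not taken from the data set).
expert = ["inappropriate", "appropriate", "inappropriate", "appropriate"]
crowd = ["inappropriate", "appropriate", "appropriate", "appropriate"]
print(cohen_kappa(expert, crowd))  # 0.5
```

Here the two annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.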

Usage:

Researchers and developers interested in content moderation, natural language processing, and online discourse analysis can benefit from this data set and associated resources.

Citation:

If you use this data set or findings from this repository in your research or projects, please consider citing this repository and our paper.
Citing the paper: https://aclanthology.org/2024.trac-1.11/

@inproceedings{barbarestani-etal-2024-content,
    title = "Content Moderation in Online Platforms: A Study of Annotation Methods for Inappropriate Language",
    author = "Barbarestani, Baran and Maks, Isa and Vossen, Piek T.J.M.",
    editor = "Kumar, Ritesh and Ojha, Atul Kr. and Malmasi, Shervin and Chakravarthi, Bharathi Raja and Lahiri, Bornini and Singh, Siddharth and Ratan, Shyam",
    booktitle = "Proceedings of the Fourth Workshop on Threat, Aggression {&} Cyberbullying @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.trac-1.11",
    pages = "96--104"
}


Citing the repository: https://github.com/cltl/InappropriateLanguageDetection

Contact:

Please feel free to ask any questions you may have by contacting me via b[dot]barbarestani[at]vu[dot]nl.
