hateRep: Hate Speech Data Repeating Annotation (without and with background knowledge)

preyero/hateRep

Data Repository: ORDO - 10.21954/ou.rd.26212604.v1

Paper Preprint: ORO-198676

Semantic-Enhanced Crowdsourcing Study for Target Group Identification - Code

This is the source code to reproduce the results of the paper Enhancing Hate Speech Annotations with Background Semantics (ECAI 2024): https://oro.open.ac.uk/98676/

The data repository is available in Open Research Data Online (ORDO).

Repo structure

The raw data is organised in the following folders:

  • Annotators: anonymised demographic tables from Prolific. Each participant appears in exactly one file, according to whether they are (i) heterosexual cis men (M_MH), (ii) heterosexual cis women (W_WH), or members of the LGBTQ+ community because of their (iii) gender (trans, G_T, or non-binary, G_NB) or (iv) sexuality (non-heterosexual, S_H).

  • Data: contains semantic and crowdsourcing annotations. Crowdsourcing annotations were collected as shown in the example figure and full documentation.

  • Semantic_annotation: Jupyter notebooks that add background knowledge to the hate speech sample using a knowledge graph, i.e., the GSSO (pruned_concepts.csv), and other linguistic resources (missing_concepts.csv).

  • Documentation: contains the approved Ethics Application Form and Participant Information Sheet.

Source code is in scripts, specifically in the Python files:

  • dataCollect.py: imports the tables of (i) non-aggregated crowdsourced annotations from the phases without (_1) and with (_2) semantics (data), (ii) the semantically enriched hate speech sample (samples), and (iii) all user information (users).

  • agreement.py: contains functions to compute inter-annotator agreement (Krippendorff's Alpha and Fleiss' Kappa, computed on the 87% of posts that have 6 annotations).

  • helper.py: helper functions to analyse alignment (Pearson's correlation) and change after semantics (categorisation by agreement and decision made on target groups).

  • utils.py: plotting functions for the table plot (agreement and correlation, Figure 2), the horizontal bar chart and Sankey diagram (frequency and shifts, Figure 3), and the heatmap (category overlap, Figure 4).
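As an illustration of the agreement computation described for agreement.py, here is a minimal sketch of Fleiss' Kappa restricted to posts with exactly 6 annotations. It is not the code in this repository: the function names (count_matrix, fleiss_kappa), the input layout (a dict from post id to a list of labels), and the "yes"/"no" labels are assumptions made only for the example.

```python
from collections import Counter

import numpy as np


def count_matrix(annotations, categories, n_raters=6):
    """Build a posts x categories count matrix (hypothetical helper).

    Only posts with exactly `n_raters` labels are kept, since Fleiss' Kappa
    needs the same number of ratings for every item.
    """
    rows = []
    for labels in annotations.values():
        if len(labels) != n_raters:
            continue  # skip posts with a different number of annotations
        tally = Counter(labels)
        rows.append([tally[c] for c in categories])
    return np.array(rows, dtype=float)


def fleiss_kappa(counts):
    """Fleiss' Kappa for a posts x categories matrix of rating counts."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if not np.all(counts.sum(axis=1) == n_raters):
        raise ValueError("every post must have the same number of ratings")

    # Observed agreement: share of agreeing rater pairs, averaged over posts.
    p_i = (np.sum(counts**2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j**2)

    if p_e == 1.0:
        return float("nan")  # undefined when all ratings fall in one category
    return float((p_bar - p_e) / (1.0 - p_e))


# Toy data (invented for illustration): post_c has only 3 labels and is dropped.
toy = {
    "post_a": ["yes"] * 6,
    "post_b": ["no"] * 6,
    "post_c": ["yes", "no", "yes"],
}
kappa = fleiss_kappa(count_matrix(toy, categories=["yes", "no"]))
print(f"Fleiss' Kappa on posts with 6 annotations: {kappa:.3f}")
```

Filtering before counting keeps the number of raters per item constant, which the standard Fleiss' Kappa formula requires; Krippendorff's Alpha does not have this restriction and is not covered by this sketch.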

All files used for evaluation in the paper are in the results folder.

Run files

The code runs on Python 3.12 with the packages listed in requirements.txt:

    hateRep <user-login>$ python main.py

Phase 2 Annotation Example (with semantics)

There is a PDF showing the full annotation study with examples provided by participants.

Texts in Phase 2 were annotated as shown below:

[Figure: Phase 2 annotation example]

In Phase 1, the same layout is presented but without underlined terms in the post and with an empty column on the left.