TomMoeras/parallel-pygamma

Parallel Gamma Agreement

Python scripts for computing gamma agreement scores across multiple annotators and sentences in parallel. This is particularly useful for large-scale annotation projects in which multiple annotators have labeled spans of text: when enough compute is available, parallel processing avoids the long runtimes of computing gamma agreement one sentence at a time. The scripts build on the pygamma-agreement package and add only batch processing and some useful logging. The plan is to integrate this into pygamma directly. This is still a work in progress.

Features

  • Parallel processing of multiple sentences
  • Support for both JSON and CSV input formats

Installation

  1. Clone the repository:
git clone https://github.com/TomMoeras/parallel-pygamma
cd parallel-pygamma
  2. Install required packages:
pip install -r requirements.txt

Usage

Quick Start

Run the demo script:

python src/demo.py

The demo allows you to try both input formats with example data.

Input Formats

1. JSON Format (Multiple Annotators)

Each annotator's file should be named mapped_annotations_<annotator_id>.json and contain:

[
    {
        "id": 0,
        "text": "Example sentence text",
        "word": "target_word",
        "label": [
            {
                "text": "annotated span",
                "start": 0,
                "end": 14,
                "labels": ["label_category"]
            }
        ]
    }
]
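
As an illustration of how this nested structure can be consumed, the following minimal sketch reads one annotator file and flattens it into one row per span and label, using the column names from the CSV format described in this README. The helper names (`annotator_id_from_filename`, `flatten_annotations`, `load_annotator_file`) are hypothetical and are not part of this repository's API; taking the annotator ID from the file name is an assumption based on the naming convention above.

```python
import json
import re
from pathlib import Path

# Hypothetical helpers for illustration; not part of the repository's API.
FILENAME_PATTERN = re.compile(r"^mapped_annotations_(?P<annotator_id>.+)\.json$")


def annotator_id_from_filename(path):
    """Extract <annotator_id> from mapped_annotations_<annotator_id>.json."""
    match = FILENAME_PATTERN.match(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected annotation file name: {path}")
    return match.group("annotator_id")


def flatten_annotations(annotator_id, sentences):
    """Turn the nested JSON structure into one flat row per (span, label)."""
    rows = []
    for sentence in sentences:
        for span in sentence.get("label", []):
            for label in span["labels"]:
                rows.append({
                    "Annotator": annotator_id,
                    "Sentence": sentence["text"],
                    "Annotated Text": span["text"],
                    "Start": span["start"],
                    "End": span["end"],
                    "Label": label,
                })
    return rows


def load_annotator_file(path):
    """Read one annotator file and return its flattened rows."""
    with open(path, encoding="utf-8") as handle:
        sentences = json.load(handle)
    return flatten_annotations(annotator_id_from_filename(path), sentences)
```

Calling `load_annotator_file("mapped_annotations_A1.json")` on a file shaped like the example above would yield a single row for annotator `A1` covering characters 0 to 14.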

2. CSV Format (Per Sentence)

Each CSV file should contain columns:

Annotator,Sentence,Annotated Text,Start,End,Label
A1,"Example text","annotated span",0,14,category
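
To make the column layout concrete, here is a small sketch that reads such a file with the standard `csv` module and groups the spans by annotator. It assumes that `Start` and `End` are integer character offsets, as in the example row; the function name `read_spans_by_annotator` is hypothetical and not taken from this repository.

```python
import csv
import io
from collections import defaultdict


def read_spans_by_annotator(csv_file):
    """Group (start, end, label) spans by annotator from an open CSV file."""
    spans = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Assumption: Start and End are integer character offsets.
        start, end = int(row["Start"]), int(row["End"])
        if start >= end:
            raise ValueError(f"Empty or reversed span in row: {row}")
        spans[row["Annotator"]].append((start, end, row["Label"]))
    return dict(spans)


# In-memory stand-in for a per-sentence CSV file, using the example row above.
example = io.StringIO(
    "Annotator,Sentence,Annotated Text,Start,End,Label\n"
    'A1,"Example text","annotated span",0,14,category\n'
)
spans = read_spans_by_annotator(example)
```

When reading from disk, open the file with `newline=""` as the `csv` module documentation recommends, so that quoted fields containing line breaks are parsed correctly.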

Using in Your Project

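from datetime import datetime

from core.gamma import GammaAgreementProcessor
# setup_logging and the parallel processor also come from this repository;
# their import paths are not shown in this snippet.

log_dir = "logs"

# Set up logging with a shared timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
logger = setup_logging(log_dir, timestamp)

# Initialize components
processor = GammaAgreementProcessor(
    output_dir="output",
    log_dir=log_dir,
    timestamp=timestamp,
)

# Process annotations
# parallel_processor is assumed to have been constructed beforehand (not shown here)
parallel_processor.run(
    input_dir="your_input_dir",
    batch_size=4,
    max_workers=None,  # Will use default based on CPU count
    max_annotators=None,  # Optional limit on number of annotators per sentence
)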
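
To give a rough idea of what `batch_size` and `max_workers` control, the sketch below shows one common way to batch work and distribute it over worker processes with the standard library's `concurrent.futures`. It is not this repository's implementation: `make_batches`, `process_batch`, and `run_in_parallel` are hypothetical names, and `process_batch` is a placeholder that only counts spans where a real pipeline would compute gamma per sentence.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Placeholder work: count spans per sentence instead of computing gamma."""
    return [(sentence_id, len(spans)) for sentence_id, spans in batch]


def run_in_parallel(items, batch_size=4, max_workers=None,
                    executor_cls=ProcessPoolExecutor):
    """Process batches concurrently and return results in input order.

    max_workers=None lets the executor pick a default based on the CPU count.
    ThreadPoolExecutor can be passed as executor_cls for quick experiments.
    """
    batches = make_batches(items, batch_size)
    results = []
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map yields batch results in the order the batches were submitted.
        for batch_result in executor.map(process_batch, batches):
            results.extend(batch_result)
    return results
```

Larger batches reduce the overhead of sending work to the workers, while smaller batches balance the load better when sentences differ widely in their number of annotations.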

Output

The tool generates:

  1. CSV files containing processed annotations
  2. Final results file with gamma scores

Results are saved in:

  • output/csv/: Intermediate CSV files
  • output/results/: Final gamma scores
  • logs/: Processing logs
