This repository contains a script for fine-tuning the GLiNER model to recognize and classify location mentions in text. The script uses the GLiNER library (built on Hugging Face Transformers) and includes customizations for data processing, model training, and evaluation. It supports fine-tuning large, pre-trained GLiNER models tailored for Named Entity Recognition (NER) tasks such as location recognition.
- Python 3.8+
- CUDA-enabled GPU for model training
- Required Python packages (install via `requirements.txt`; see the command after this list):
  - `transformers`
  - `torch`
  - `datasets`
  - `wandb`
  - `gliner`
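Assuming the usual pip workflow, the dependencies can be installed with:

```
pip install -r requirements.txt
```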
This script requires several environment variables for managing resources and logging during model training (an example of setting them follows the list):

- `TOKENIZERS_PARALLELISM`: Controls parallelism in tokenization to improve performance.
- `CUDA_DEVICE_ORDER`: Ensures that CUDA devices are initialized in a consistent order.
- `CUDA_VISIBLE_DEVICES`: Specifies which GPU to use for training.
- `WANDB_PROJECT`: Defines the project name for Weights & Biases logging.
- `WANDB_WATCH` & `WANDB_NOTEBOOK_NAME`: Used for logging with Weights & Biases.
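A minimal sketch, assuming the variables are set in Python via `os.environ` (the values below are placeholders, not the script's actual settings):

```python
import os

# Example values only; adjust the GPU index and W&B names for your setup.
os.environ["TOKENIZERS_PARALLELISM"] = "true"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"               # use the first GPU
os.environ["WANDB_PROJECT"] = "gliner-location"        # hypothetical project name
os.environ["WANDB_WATCH"] = "false"
os.environ["WANDB_NOTEBOOK_NAME"] = "finetune_gliner"  # hypothetical notebook name
```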
The training and test datasets should be JSON files containing pre-processed location mention data. The script expects the following format:
```json
[
  {
    "tokens": [...],
    "ner": [[start_index, end_index, "LOC"]],
    "label": ["location"]
  }
]
```
- Training Data: `data/acceptedted_data/TrainCleaned.json`
- Test Data: `data/accepted_data/TestCleaned.json`
The script automatically updates the NER labels to "location".
The Zindi dataset used to train this model can be found here.
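A minimal sketch of loading the files and normalizing the labels, assuming the JSON layout shown above (`load_and_normalize` is a hypothetical helper, not part of the script):

```python
import json

def load_and_normalize(path):
    """Load a dataset file and force every NER span label to 'location'."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    for example in data:
        example["ner"] = [[start, end, "location"] for start, end, _ in example["ner"]]
        example["label"] = ["location"]
    return data

train_data = load_and_normalize("data/accepted_data/TrainCleaned.json")
test_data = load_and_normalize("data/accepted_data/TestCleaned.json")
```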
The script can load several pre-trained GLiNER models for fine-tuning, including:

- `urchade/gliner_large-v2.1` (default)
- `urchade/gliner_small-v2.1`
- `urchade/gliner_medium-v2.1`
You can select a different model by changing the model name passed to `GLiNER.from_pretrained()` in the code.
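For example, loading the default checkpoint:

```python
from gliner import GLiNER

# Default checkpoint; swap in "urchade/gliner_small-v2.1" or
# "urchade/gliner_medium-v2.1" for a lighter model.
model = GLiNER.from_pretrained("urchade/gliner_large-v2.1")
```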
The script fine-tunes the selected model on the provided dataset using the `Trainer` class from the GLiNER library. Several training parameters, such as batch size, learning rate, and number of epochs, can be configured. Key parameters include (a sketch of the corresponding arguments follows the list):
- Batch Size: 8 by default.
- Learning Rate: `1e-6` by default, with weight decay for regularization.
- Epochs: 5.
- Evaluation Strategy: Evaluation runs periodically, based on the number of steps.
- Save Strategy: Checkpoints are saved every `save_steps` steps, with the best model loaded at the end.
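A rough sketch of how these defaults might map onto Hugging Face-style training arguments (the exact import depends on your `gliner` version; the output directory, weight-decay value, and step intervals below are assumptions):

```python
from gliner.training import TrainingArguments  # in some versions: from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/gliner-location",  # assumed output path
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-6,
    weight_decay=0.01,                    # assumed weight-decay value
    num_train_epochs=5,
    evaluation_strategy="steps",
    eval_steps=500,                       # assumed evaluation interval
    save_strategy="steps",
    save_steps=500,                       # assumed save interval
    load_best_model_at_end=True,
    report_to="wandb",
)
```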
To start training, run:

```bash
python <script_name>.py
```
The script uses Weights and Biases for logging training progress, loss metrics, and saving checkpoints.
To use Weights & Biases, make sure you have set the `WANDB_PROJECT` environment variable and logged in via `wandb login`.
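In a notebook or script, the same login step can be done in Python (it prompts for an API key on first use):

```python
import wandb

# Equivalent to running `wandb login` on the command line.
wandb.login()
```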
- Data Collator: The data collator is configured to prepare labels for location mentions. You can modify it if you need to handle different entity types (see the sketch after this list).
- Trainer Arguments: You can tune other hyperparameters in the `TrainingArguments` section, such as gradient accumulation steps, the learning-rate scheduler, and the evaluation strategy.
- Model: Various GLiNER models are available for specific use cases. To try a different pre-trained model, change the model path passed to `from_pretrained()`.
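A sketch of the collator setup, based on the pattern used in GLiNER's fine-tuning examples (the exact import path and arguments may differ across `gliner` versions):

```python
from gliner.data_processing.collator import DataCollator

# Prepares span labels ("location") alongside the tokenized inputs for each batch.
data_collator = DataCollator(
    model.config,
    data_processor=model.data_processor,
    prepare_labels=True,
)
```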
The script will output:
- Model summary.
- Dataset size information.
- Sample dataset entries.
- Training logs with loss metrics.
- Model evaluation results.
- This script is optimized for large datasets and uses a constant learning rate for fine-tuning. Adjust the learning rate or switch to a different scheduler (e.g., cosine, linear) depending on your dataset and hardware, as shown in the sketch after this list.
- The script is configured to load the best model at the end of training based on evaluation loss.
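For example, switching from the constant schedule to cosine only requires changing the scheduler-related arguments (parameter names follow the Hugging Face-style `TrainingArguments` shown earlier; the warm-up value is an assumption):

```python
from gliner.training import TrainingArguments  # or transformers, depending on version

args = TrainingArguments(
    output_dir="models/gliner-location",  # assumed path, as above
    learning_rate=1e-6,
    lr_scheduler_type="cosine",  # or "linear"; the script's default is a constant schedule
    warmup_ratio=0.1,            # assumed warm-up fraction
)
```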