The goal of this competition is to develop a model that detects personally identifiable information (PII) in student writing. Automating the detection and removal of PII from educational data will reduce the cost of releasing educational datasets, supporting learning science research and the development of educational tools.
In today’s era of abundant educational data, PII acts as a barrier to analyzing and creating open datasets that advance education because releasing the data publicly puts students at risk. To mitigate these risks, it’s crucial to screen and cleanse educational data for PII before public release, a process that data science can streamline.
Currently, manually reviewing datasets for PII is the most reliable method, but it results in significant costs and restricts the scalability of educational datasets. Automatic PII detection techniques, primarily based on named entity recognition (NER), exist but work best for PII with common formatting, such as emails and phone numbers. These systems struggle to correctly label names and distinguish between sensitive names (e.g., a student's name) and non-sensitive names (e.g., a cited author).
The goal of this Kaggle challenge is to develop a model that automates the detection of PII and distinguishes between sensitive and non-sensitive PII in student writing, supporting learning science research and the development of educational tools.
The competition asks competitors to assign labels to the following seven types of PII:
- NAME_STUDENT: The full or partial name of a student that is not necessarily the author of the essay, excluding instructors, authors, and other person names.
- EMAIL: A student’s email address.
- USERNAME: A student's username on any platform.
- ID_NUM: A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
- PHONE_NUM: A phone number associated with a student.
- URL_PERSONAL: A URL that might be used to identify a student.
- STREET_ADDRESS: A full or partial street address associated with the student, such as their home address.
Token labels are presented in BIO (Beginning, Inner, Outer) format. The PII type is prefixed with “B-” when it is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with “I-”. Tokens that are not PII are labeled “O”.
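For illustration, here is a minimal example of the scheme using an invented sentence (the tokens and values are not from the dataset):

# Illustrative example with invented tokens: BIO labels aligned to tokens
tokens = ["My", "name", "is", "Jane", "Doe", "and", "my",
          "email", "is", "jane.doe@example.com", "."]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O",
          "O", "O", "B-EMAIL", "O"]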
The competition dataset comprises approximately 22,000 essays written by students enrolled in a massively open online course. Competitors can access about 30% of the essays for training, while the remaining 70% are reserved for the hidden test set.
Understanding the distribution of PII elements in the training dataset is important for evaluating the performance of trained models and identifying potential causes of overfitting. The PII label counts in the training dataset are as follows:
Counter({
'O': 4989794,
'B-NAME_STUDENT': 1365,
'I-NAME_STUDENT': 1096,
'B-URL_PERSONAL': 110,
'B-ID_NUM': 78,
'B-EMAIL': 39,
'I-STREET_ADDRESS': 20,
'I-PHONE_NUM': 15,
'B-USERNAME': 6,
'B-PHONE_NUM': 6,
'B-STREET_ADDRESS': 2,
'I-URL_PERSONAL': 1,
'I-ID_NUM': 1
})
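These counts can be reproduced with a short script. The sketch below is a minimal example that assumes the competition's train.json layout, where each document carries a labels list aligned with its tokens:

import json
from collections import Counter

# Assumption: train.json is a list of documents, each with a "labels" field
with open("train.json") as f:
    documents = json.load(f)

label_counts = Counter(label for doc in documents for label in doc["labels"])
print(label_counts)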
To address the imbalance in PII types, a synthetic dataset generated by a GPT model was used, providing a more balanced PII distribution:
Counter({
'O': 1333514,
'B-NAME_STUDENT': 11104,
'I-STREET_ADDRESS': 8577,
'I-NAME_STUDENT': 5667,
'B-EMAIL': 3794,
'B-STREET_ADDRESS': 3543,
'I-PHONE_NUM': 3389,
'B-PHONE_NUM': 2419,
'B-USERNAME': 718,
'B-URL_PERSONAL': 620
})
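A minimal sketch of combining the original and synthetic data before training, assuming both files share the same document layout (the synthetic file name is a hypothetical placeholder):

import json

# Assumption: both files are lists of documents with "tokens" and "labels" fields;
# the synthetic file name below is a placeholder
with open("train.json") as f:
    original = json.load(f)
with open("synthetic_gpt_pii.json") as f:
    synthetic = json.load(f)

# Concatenate the two sources so rare PII classes appear more often during training
combined = original + synthetic
with open("train_combined.json", "w") as f:
    json.dump(combined, f)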
- preprocess_csv_to_json_token-label.py: Loops through all PII elements in a dataset to preview the ground truth labeling and validate the labeling accuracy.
- clean_csv_dataset_tokens.py: Removes stray characters (e.g., '—', '“‹', '\u200b') from the dataset to ensure it is clean before it is fed to the model.
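A minimal sketch of the kind of cleanup this script performs; the character set and helper name are assumptions, not the exact implementation:

# Hypothetical helper illustrating the cleanup step; the exact character set
# used in clean_csv_dataset_tokens.py may differ
UNWANTED_CHARS = ["—", "“", "‹", "\u200b"]

def clean_tokens(tokens):
    """Strip unwanted characters from each token while keeping token/label alignment."""
    cleaned = []
    for token in tokens:
        for ch in UNWANTED_CHARS:
            token = token.replace(ch, "")
        cleaned.append(token)
    return cleaned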
Transfer learning was selected for this challenge because publicly available pretrained models have robust NER capabilities but are not focused on sensitive PII detection. The approach involves selecting a relevant language model and retraining it on the new dataset with adjusted hyperparameters. The candidate models considered were:
- Presidio: A rule-based system by Microsoft for PII detection using predefined patterns, suitable for straightforward tasks but not optimized for student PII detection.
- flairNLP: Labels ORG, NAME, LOC, and MISC entities using a RoBERTa-large backbone. Covers a limited set of PII classes and is memory-intensive. F1 score: 94.36.
- DeBERTa-v3-base: A transformer model trained on a larger dataset that represents token content and position separately in its token vectors, improving PII detection accuracy. Optimized for NER, with a high F1 score of 91.37 across multiple public datasets.
- Bert-base-multilingual-cased: No reported performance data; focuses on 10 predefined labels related to financial details.
- BERT-base-NER: Similar to flairNLP; developed with TensorFlow and identifies LOC, ORG, PER, and MISC entities. F1 score: 91.3.
DeBERTa-v3-base was chosen because it has been validated on multiple public datasets and is a state-of-the-art transformer model optimized for NER tasks. There is high confidence that DeBERTa-v3-base can capture the context and semantics needed for accurate student PII detection. The goal is to develop a student PII detection model that achieves at least a 91% accuracy score with DeBERTa-v3-base.
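A minimal sketch of the transfer-learning setup with Hugging Face Transformers is shown below. The checkpoint name microsoft/deberta-v3-base is the public Hugging Face model; the label inventory assumes full B-/I- pairs for all seven PII types plus O and should be adjusted to match the dataset:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed label inventory: B-/I- pairs for all seven PII types plus "O";
# the actual dataset may omit I- labels that never occur
pii_types = ["NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
             "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in pii_types for prefix in ("B", "I")]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)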
- Initial training and prediction with max_length: 2048 resulted in an accuracy of 75%.
- Training only on the synthetic dataset generated by GPT achieved a worse accuracy of 68%.
- Training with max_length: 3500 was initially unachievable with batch_size = 16 due to local device memory issues. By reducing the batch size and increasing the gradient accumulation steps, the memory constraint was resolved and accuracy increased significantly to 86%:
TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=8
)
# effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# effective_batch_size = 2 * 8 = 16, unchanged from the original setting, so training can be completed
- The maximum text length in the training dataset was 3298. The test dataset is expected to have similar text lengths, giving high confidence that increasing the training and prediction max_length will yield higher accuracy on the hidden test set (70% of the dataset).
- Applying regex post-processing to the URL and email predictions improved accuracy by 3%, from 89% to 92% (a sketch of this kind of post-processing appears after this list).
- Prediction max_length was increased, but only up to 2200 due to memory limitations on Kaggle, improving accuracy to 94%.
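The regex post-processing mentioned above could look roughly like the following sketch; the patterns and function name are illustrative assumptions, not the exact ones used:

import re

# Illustrative patterns; the exact regexes used in the solution may differ
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
URL_RE = re.compile(r"^https?://\S+$")

def regex_postprocess(tokens, predicted_labels):
    """Relabel tokens that clearly look like emails or URLs but were predicted as 'O'."""
    fixed = list(predicted_labels)
    for i, token in enumerate(tokens):
        if fixed[i] != "O":
            continue
        if EMAIL_RE.match(token):
            fixed[i] = "B-EMAIL"
        elif URL_RE.match(token):
            # In practice the URL rule would need extra filtering to avoid
            # flagging non-personal links
            fixed[i] = "B-URL_PERSONAL"
    return fixed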
Exploring the Longformer model, which reduces memory usage during prediction, is a potential approach to achieving higher accuracy with a larger prediction max_length without being constrained by memory limitations.
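As a rough illustration, swapping in a Longformer checkpoint for token classification could look like the sketch below; allenai/longformer-base-4096 is a public Hugging Face checkpoint supporting sequences up to 4096 tokens, and the label inventory repeats the assumption made in the DeBERTa sketch above:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Same assumed label inventory as in the DeBERTa sketch above
pii_types = ["NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
             "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in pii_types for prefix in ("B", "I")]

# allenai/longformer-base-4096 handles sequences up to 4096 tokens, enough to
# cover the longest training essays (max length 3298) without truncation
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)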