The goal of this competition is to develop a model that detects personally identifiable information (PII) in student writing. Automating the detection and removal of PII from educational data will reduce the cost of releasing educational datasets, supporting learning science research and the development of educational tools.
In today’s era of abundant educational data, PII acts as a barrier to analyzing and creating open datasets that advance education because releasing the data publicly puts students at risk. To mitigate these risks, it’s crucial to screen and cleanse educational data for PII before public release, a process that data science can streamline.
Currently, manually reviewing datasets for PII is the most reliable method, but it results in significant costs and restricts the scalability of educational datasets. Automatic PII detection techniques, primarily based on named entity recognition (NER), exist but work best for PII with common formatting, such as emails and phone numbers. These systems struggle to correctly label names and distinguish between sensitive names (e.g., a student's name) and non-sensitive names (e.g., a cited author).
The goal of this Kaggle challenge is to develop a model that automates the detection of PII and distinguishes between sensitive and non-sensitive PII in student writing, supporting learning science research and the development of educational tools.
The competition asks competitors to assign labels to the following seven types of PII:
- NAME_STUDENT: The full or partial name of a student that is not necessarily the author of the essay, excluding instructors, authors, and other person names.
- EMAIL: A student’s email address.
- USERNAME: A student's username on any platform.
- ID_NUM: A number or sequence of characters that could be used to identify a student, such as a student ID or a social security number.
- PHONE_NUM: A phone number associated with a student.
- URL_PERSONAL: A URL that might be used to identify a student.
- STREET_ADDRESS: A full or partial street address associated with the student, such as their home address.
Token labels are presented in BIO (Beginning, Inner, Outer) format. The PII type is prefixed with “B-” when it is the beginning of an entity. If the token is a continuation of an entity, it is prefixed with “I-”. Tokens that are not PII are labeled “O”.
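For illustration, here is a minimal example of the scheme using an invented sentence (the tokens and values are not from the dataset):

# Illustrative example with invented tokens: BIO labels aligned to tokens
tokens = ["My", "name", "is", "Jane", "Doe", "and", "my",
          "email", "is", "jane.doe@example.com", "."]
labels = ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "O",
          "O", "O", "B-EMAIL", "O"]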
The competition dataset comprises approximately 22,000 essays written by students enrolled in a massively open online course. Competitors can access about 30% of the essays for training, while the remaining 70% are reserved for the hidden test set.
Understanding the distribution of PII elements in the training dataset is important for evaluating the performance of trained models and identifying potential causes of overfitting. The PII label counts in the training dataset are as follows:
Counter({
'O': 4989794,
'B-NAME_STUDENT': 1365,
'I-NAME_STUDENT': 1096,
'B-URL_PERSONAL': 110,
'B-ID_NUM': 78,
'B-EMAIL': 39,
'I-STREET_ADDRESS': 20,
'I-PHONE_NUM': 15,
'B-USERNAME': 6,
'B-PHONE_NUM': 6,
'B-STREET_ADDRESS': 2,
'I-URL_PERSONAL': 1,
'I-ID_NUM': 1
})
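These counts can be reproduced with a short script. The sketch below is a minimal example that assumes the competition's train.json layout, where each document carries a labels list aligned with its tokens:

import json
from collections import Counter

# Assumption: train.json is a list of documents, each with a "labels" field
with open("train.json") as f:
    documents = json.load(f)

label_counts = Counter(label for doc in documents for label in doc["labels"])
print(label_counts)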
To address the imbalance in PII types, a synthetic dataset generated by a GPT model was used, providing a more balanced PII distribution:
Counter({
'O': 1333514,
'B-NAME_STUDENT': 11104,
'I-STREET_ADDRESS': 8577,
'I-NAME_STUDENT': 5667,
'B-EMAIL': 3794,
'B-STREET_ADDRESS': 3543,
'I-PHONE_NUM': 3389,
'B-PHONE_NUM': 2419,
'B-USERNAME': 718,
'B-URL_PERSONAL': 620
})
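A minimal sketch of combining the original and synthetic data before training, assuming both files share the same document layout (the synthetic file name is a hypothetical placeholder):

import json

# Assumption: both files are lists of documents with "tokens" and "labels" fields;
# the synthetic file name below is a placeholder
with open("train.json") as f:
    original = json.load(f)
with open("synthetic_gpt_pii.json") as f:
    synthetic = json.load(f)

# Concatenate the two sources so rare PII classes appear more often during training
combined = original + synthetic
with open("train_combined.json", "w") as f:
    json.dump(combined, f)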
- preprocess_csv_to_json_token-label.py: Loops through all PII elements in a dataset to preview the ground truth labeling and validate the labeling accuracy.
- clean_csv_dataset_tokens.py: Removes stray characters (e.g., '—', '“‹', '\u200b') from the dataset to ensure it is clean before it is fed to the model.
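A minimal sketch of the kind of cleanup this script performs; the character set and helper name are assumptions, not the exact implementation:

# Hypothetical helper illustrating the cleanup step; the exact character set
# used in clean_csv_dataset_tokens.py may differ
UNWANTED_CHARS = ["—", "“", "‹", "\u200b"]

def clean_tokens(tokens):
    """Strip unwanted characters from each token while keeping token/label alignment."""
    cleaned = []
    for token in tokens:
        for ch in UNWANTED_CHARS:
            token = token.replace(ch, "")
        cleaned.append(token)
    return cleaned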
Transfer learning was selected for this challenge because publicly available pretrained models have robust NER capabilities but are not focused on sensitive PII detection. The approach involves selecting a relevant language model and retraining it on the new dataset with adjusted hyperparameters. The candidate models considered were:
- Presidio: A rule-based system by Microsoft for PII detection using predefined patterns, suitable for straightforward tasks but not optimized for student PII detection.
- flairNLP: Labels ORG, NAME, LOC, and MISC entities using a RoBERTa-large backbone. Covers a limited set of PII classes and is memory-intensive. F1 score: 94.36.
- DeBERTa-v3-base: A transformer model trained on a larger dataset that represents token content and position separately in its token vectors, improving PII detection accuracy. Optimized for NER, with a high F1 score of 91.37 across multiple public datasets.
- Bert-base-multilingual-cased: No reported performance data; focuses on 10 predefined labels related to financial details.
- BERT-base-NER: Similar to flairNLP; developed with TensorFlow and identifies LOC, ORG, PER, and MISC entities. F1 score: 91.3.
DeBERTa-v3-base was chosen because it has been validated on multiple public datasets and is a state-of-the-art transformer model optimized for NER tasks. There is high confidence that DeBERTa-v3-base can capture the context and semantics needed for accurate student PII detection. The goal is to develop a student PII detection model that achieves at least a 91% accuracy score with DeBERTa-v3-base.
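A minimal sketch of the transfer-learning setup with Hugging Face Transformers is shown below. The checkpoint name microsoft/deberta-v3-base is the public Hugging Face model; the label inventory assumes full B-/I- pairs for all seven PII types plus O and should be adjusted to match the dataset:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed label inventory: B-/I- pairs for all seven PII types plus "O";
# the actual dataset may omit I- labels that never occur
pii_types = ["NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
             "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in pii_types for prefix in ("B", "I")]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)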
- Initial training and prediction with max_length: 2048 resulted in an accuracy of 75%.
- Training only on the synthetic dataset generated by GPT achieved a worse accuracy of 68%.
- Training with max_length: 3500 was initially unachievable with batch_size = 16 due to local device memory issues. By reducing the batch size and increasing the gradient accumulation steps, the memory constraint was resolved and accuracy increased significantly to 86%:
TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=8
)
# effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
# effective_batch_size = 2 * 8 = 16, unchanged from the original setting, so training can be completed
- The maximum text length in the training dataset was 3298. The test dataset is expected to have similar text lengths, giving high confidence that increasing the training and prediction max_length will yield higher accuracy on the hidden test set (70% of the dataset).
- Applying regex post-processing to the URL and email predictions improved accuracy by 3%, from 89% to 92% (a sketch of this kind of post-processing appears after this list).
- Prediction max_length was increased, but only up to 2200 due to memory limitations on Kaggle, improving accuracy to 94%.
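The regex post-processing mentioned above could look roughly like the following sketch; the patterns and function name are illustrative assumptions, not the exact ones used:

import re

# Illustrative patterns; the exact regexes used in the solution may differ
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
URL_RE = re.compile(r"^https?://\S+$")

def regex_postprocess(tokens, predicted_labels):
    """Relabel tokens that clearly look like emails or URLs but were predicted as 'O'."""
    fixed = list(predicted_labels)
    for i, token in enumerate(tokens):
        if fixed[i] != "O":
            continue
        if EMAIL_RE.match(token):
            fixed[i] = "B-EMAIL"
        elif URL_RE.match(token):
            # In practice the URL rule would need extra filtering to avoid
            # flagging non-personal links
            fixed[i] = "B-URL_PERSONAL"
    return fixed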
Exploring the Longformer model, which reduces memory usage during prediction, is a potential approach to achieving higher accuracy with a larger prediction max_length without being constrained by memory limitations.
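As a rough illustration, swapping in a Longformer checkpoint for token classification could look like the sketch below; allenai/longformer-base-4096 is a public Hugging Face checkpoint supporting sequences up to 4096 tokens, and the label inventory repeats the assumption made in the DeBERTa sketch above:

from transformers import AutoTokenizer, AutoModelForTokenClassification

# Same assumed label inventory as in the DeBERTa sketch above
pii_types = ["NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
             "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in pii_types for prefix in ("B", "I")]

# allenai/longformer-base-4096 handles sequences up to 4096 tokens, enough to
# cover the longest training essays (max length 3298) without truncation
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
)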