Abstract—This project explores the use of BERT-based synthetic time series generation to enhance classification tasks in educational contexts. By generating synthetic data to represent minority groups, the study aims to improve the fairness of machine learning models for real-time student performance assessment while also improving predictive performance. Although synthetic oversampling with BERT improved fairness and performance metrics over multiple baselines, it cannot yet capture and compensate for the underlying behavioural differences between demographic groups.
Behavioural Data Synthesis for Better Fairness Performance
Team: LosModelosMatemagicos
Team members: Yannick Detrois, David Friou, Adrien Vauthey
The files to run can be found in the `src` directory. `script_oversample.py` is the main file of the project. Run it with the following command from the `src` directory:

```
python script_oversample.py --mode [baseline | labels | augmentation]
```
The `--mode` argument lets you choose the method used to oversample the data.

The configuration parameters are in `src/config.yaml`. You can change the parameters in this file according to your needs.
Make sure to change the `root_name` according to what you are testing in the run. For example, if you make a simple run using the `baseline` mode, you should set `root_name` to `baseline`.
When choosing `augmentation`, there are 4 different strategies (see Fig. 2):

| Type | Description | Example |
|---|---|---|
| 1 | Balanced demographics with 50% original data and 50% synthetic data | `[oo] [---] -> [oooOOO] [---...]` |
| 2 | Balanced demographics with 100% synthetic data | `[oo] [---] -> [OOO] [...]` |
| 3 | Original demographics with 100% synthetic data | `[oo] [---] -> [OO] [...]` |
| 4 | Original demographics which are rebalanced with synthetic data | `[oo] [---] -> [ooO] [---]` |

`o`: sequence of demographic 1, `O`: SYNTHETIC sequence of demographic 1
`-`: sequence of demographic 2, `.`: SYNTHETIC sequence of demographic 2
The strategy can be selected in the `config.yaml` file by setting the `type` value under `experiment` to a value between 1 and 4 (defaults to 1).
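For reference, a minimal sketch of how these parameters could be read with `pyyaml`; the key layout (`root_name` at the top level, `type` under `experiment`) is assumed from the descriptions above, so check `src/config.yaml` for the real structure:

```python
import yaml

# Load the experiment configuration (key names assumed from the README,
# not taken verbatim from the repository's config.yaml).
with open("config.yaml") as f:
    config = yaml.safe_load(f)

root_name = config["root_name"]          # e.g. "baseline"
strategy = config["experiment"]["type"]  # augmentation strategy, 1-4
print(f"Run '{root_name}' with augmentation strategy {strategy}")
```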
To change the parameters of the BERT model, head to the `Config.py` file. You can change the parameters in this file according to your needs.
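As an illustration only, a centralised configuration class of this kind often looks like the sketch below; the attribute names and values are hypothetical placeholders, not the actual fields of `Config.py`:

```python
class Config:
    """Centralised management of model parameters (illustrative sketch only)."""
    # Hypothetical BERT hyperparameters; the real names live in Config.py.
    max_len = 128      # maximum sequence length
    embed_dim = 64     # token embedding dimension
    num_heads = 4      # attention heads per transformer block
    ff_dim = 128       # feed-forward layer size
    num_layers = 2     # number of transformer blocks
    lr = 1e-3          # learning rate
    batch_size = 32
    epochs = 10
```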
The following libraries are required to run our project:

- imbalanced-learn
- imblearn
- keras
- matplotlib
- numpy
- pandas
- pyyaml
- seaborn
- scikit-learn
- tensorflow

Install them using the following command:

```
pip install -r requirements.txt
```
List of all the files we implemented or modified for the scope of this project (files modified from the original pipeline are marked with a *):
- `src/`: contains the source code of the project.
- `script_oversample.py`: script used to run all experiments using the BERT synthetic oversampler.
- Contains our implementation of the BERT model.
- `BERT.py`: functions that create the BERT model.
- Functions to train and predict with the BERT model so that it can be used in the main pipeline.
- `Config.py`: configuration class for centralised management of simulation parameters.
- `MaskedLanguageModel.py`: a masked language model that extends the `tf.keras.Model` class.
- `MaskedTextGenerator.py`: callback class for generating masked-text predictions during training.
- Manages the vectorisation encoding and decoding of state-action sequences. Every state-action sequence is transformed into a unique token, taking care of special tokens (padding, breaks). Encoding and decoding use either a dictionary or a numpy array.
- `masking.py`: generates masked inputs and the corresponding labels for masked language modelling (see the sketch after this list).
- Contains files and notebooks used to fine-tune the model and plot the results.
- 10-fold cross-validation to find the best hyperparameters for the BERT model. Can be run to test hyperparameters individually or via a grid search.
- Notebook to load and plot the results of the hyperparameter tuning.
- Notebook to load and plot the results of the experiments.
- Contains the configuration file.
- `config.yaml`: configuration parameters to be loaded in `script_oversample.py`.
- Contains notebooks used for testing some implementations.
- Notebook to test the implementation of the cross-validation for BERT hyperparameter tuning.
- Notebook to test the implementation of the `BERTPipeline` class.
- Notebook to run and test the BERT model.
- Notebook to test the implementation of the Vectorisation and masking classes.
- Contains the different oversampling methods.
- This class oversamples the minority class to rebalance the distribution to 50/50: it keeps all of the minority samples and then randomly picks additional minority samples until the 50/50 criterion is fulfilled (see the oversampling sketch below).
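To make the masking step concrete, here is a minimal, self-contained sketch of masked-language-model input preparation in the spirit of `masking.py`; the 15% masking rate and the token-id conventions are assumptions in line with the original Keras MLM example, not necessarily the values used in this repository:

```python
import numpy as np

MASK_ID = 1  # hypothetical id of the [MASK] token
PAD_ID = 0   # hypothetical id of the padding token

def mask_tokens(token_ids, mask_rate=0.15, rng=np.random.default_rng()):
    """Return (masked_inputs, labels) for masked language modelling.

    Labels are -1 everywhere except at masked positions, where they keep
    the original token id, so the loss is computed only on masked tokens.
    """
    token_ids = np.asarray(token_ids)
    inputs = token_ids.copy()
    labels = np.full_like(token_ids, -1)

    # Candidate positions: everything that is not padding.
    candidates = np.flatnonzero(token_ids != PAD_ID)
    n_mask = max(1, int(len(candidates) * mask_rate))
    picked = rng.choice(candidates, size=n_mask, replace=False)

    labels[picked] = token_ids[picked]
    inputs[picked] = MASK_ID
    return inputs, labels

masked, labels = mask_tokens([7, 12, 5, 9, 3, 0, 0])  # trailing padding
```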
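Likewise, a minimal sketch of the 50/50 random oversampling described above, written from the description alone; the function below is a hypothetical stand-in, not the repository's actual class:

```python
import numpy as np

def oversample_to_balance(X, y, rng=np.random.default_rng(0)):
    """Duplicate random minority samples until both classes are 50/50."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()

    # Keep every original sample, then draw `deficit` extra minority
    # samples (with replacement) to even out the class distribution.
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    return np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])

X_bal, y_bal = oversample_to_balance([[1], [2], [3], [4], [5]], [0, 0, 0, 1, 1])
```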
This project uses modified code from the "End-to-end Masked Language Modeling with BERT" example (`BERT.py`, `MaskedLanguageModel.py`, `MaskedTextGenerator.py` and `masking.py`), originally authored by Ankur Singh, available at https://github.com/keras-team/keras-io/blob/master/examples/nlp/masked_language_modeling.py and licensed under the Apache License, Version 2.0.