In our paper, we also test the performance of the CRF and BiLSTM-CRF on two English
datasets: nursing notes and i2b2/UTHealth 2014.
Both datasets can be obtained after signing a data use agreement with the corresponding research institutes. Below, we show how to convert those datasets to the standoff format used throughout this project. The datasets are placed in data/corpus/nursing and data/corpus/i2b2. Afterwards, they can be used to train and evaluate models.
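To make the target format concrete, here is a minimal sketch of brat-style standoff annotation: the note text stays untouched, and each PHI span is recorded on its own line as an ID, a tag with character offsets, and the covered text, separated by tabs. The `Span` class, the `to_brat_ann` helper, the tag names `Name` and `Date`, and the example sentence are all made up for illustration and are not taken from this project's code or from either corpus.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Span:
    """One annotated PHI span (character offsets, end exclusive)."""
    tag: str
    start: int
    end: int


def to_brat_ann(text: str, spans: Sequence[Span]) -> str:
    """Render spans as brat text-bound annotation lines.

    Each line has three tab-separated fields: the ID (T1, T2, ...),
    "TAG start end", and the text covered by the span.
    """
    lines: List[str] = []
    for i, span in enumerate(spans, start=1):
        covered = text[span.start:span.end]
        lines.append(f"T{i}\t{span.tag} {span.start} {span.end}\t{covered}")
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical example note and labels (assumptions, not corpus data).
note = "Patient seen by Dr. Smith on 2014-03-01."
spans = [Span("Name", 20, 25), Span("Date", 29, 39)]
ann = to_brat_ann(note, spans)
print(ann, end="")
```

In brat, the text and the annotations normally live in two files that share a base name, one ending in .txt and one in .ann.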
Download raw notes and PHI annotations:
id.text from https://physionet.org/content/deidentifiedmedicaltext/1.0/
id-phi.phrase from https://physionet.org/content/deid/1.1/
Assuming the id.text and id-phi.phrase files are located in data/raw/nursing-notes/, the nursing notes corpus can be generated as follows:
# Convert nursing notes corpus to brat format
NN_DATA=data/raw/nursing-notes
python deidentify/dataset/nursing2brat.py \
$NN_DATA/id.text \
$NN_DATA/id-phi.phrase \
$NN_DATA/brat/
# Split nursing notes into 60/20/20 train/dev/test sets
python deidentify/dataset/brat2corpus.py nursing $NN_DATA/brat
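For readers who want to see what a 60/20/20 split amounts to, the following is a rough sketch of one way to do it: shuffle the document IDs with a fixed seed, then cut the list at 60% and 80%. This is not the logic of brat2corpus.py, which may split differently; the function name `split_60_20_20`, the seed, and the `note-N` IDs are assumptions chosen for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_60_20_20(
    doc_ids: Sequence[str], seed: int = 42
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle document IDs deterministically and cut them 60/20/20."""
    ids = sorted(doc_ids)  # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)
    n_train = len(ids) * 6 // 10  # integer arithmetic avoids float rounding
    n_dev = len(ids) * 2 // 10
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]  # remainder, so no document is dropped
    return train, dev, test


# Hypothetical document IDs, for illustration only.
train, dev, test = split_60_20_20([f"note-{i}" for i in range(10)])
print(len(train), len(dev), len(test))
```

Splitting by document ID rather than by sentence keeps all text from one note in the same partition.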
For i2b2/UTHealth 2014, the uthealth2corpus.py script assumes that training-PHI-Gold-Set1, training-PHI-Gold-Set2, and testing-PHI-Gold-fixed are located in data/raw/i2b2/. The corpus can then be generated as follows:
python deidentify/dataset/uthealth2corpus.py
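Because the script relies on a fixed input layout, it can help to check that layout before running the conversion. The sketch below lists which of the three expected entries are missing under data/raw/i2b2/. The helper `missing_inputs` and the constant `I2B2_INPUTS` are hypothetical and are not part of the deidentify package.

```python
from pathlib import Path
from typing import Iterable, List

# Entry names taken from the instructions above.
I2B2_INPUTS = (
    "training-PHI-Gold-Set1",
    "training-PHI-Gold-Set2",
    "testing-PHI-Gold-fixed",
)


def missing_inputs(root: str, names: Iterable[str] = I2B2_INPUTS) -> List[str]:
    """Return the expected entries that do not exist under root."""
    base = Path(root)
    return [name for name in names if not (base / name).exists()]


missing = missing_inputs("data/raw/i2b2")
if missing:
    print("Missing under data/raw/i2b2:", ", ".join(missing))
else:
    print("All expected i2b2 inputs found.")
```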