Benchmark on English Datasets

In our paper, we also evaluate the CRF and BiLSTM-CRF models on two English datasets: nursing notes and i2b2/UTHealth 2014. Both datasets can be obtained after signing a data use agreement with the respective research institutes. Below, we show how to convert them to the standoff format used throughout this project. The converted corpora are placed in data/corpus/nursing and data/corpus/i2b2 and can then be used to train and evaluate models.
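For readers unfamiliar with standoff annotation, the following is a minimal sketch of how one text-bound annotation refers back to the note text by character offsets. It assumes the brat convention of a `.txt` file paired with an `.ann` file whose lines look like `ID<TAB>TYPE START END<TAB>TEXT`. The note text, the `Name` tag, and the `parse_ann_line` helper are made up for illustration and are not part of this project's code.

```python
from typing import NamedTuple


class TextBound(NamedTuple):
    """One contiguous text-bound annotation from a brat .ann file."""
    ann_id: str
    tag: str
    start: int
    end: int
    text: str


def parse_ann_line(line: str) -> TextBound:
    """Parse a line of the form 'T1<TAB>Tag start end<TAB>covered text'.

    Hypothetical helper: handles only contiguous spans, not
    discontinuous ones (which brat separates with ';').
    """
    ann_id, type_and_span, text = line.rstrip("\n").split("\t")
    tag, start, end = type_and_span.split(" ")
    return TextBound(ann_id, tag, int(start), int(end), text)


# Made-up note and annotation; the 'Name' tag is an assumption.
note = "Pt seen by Dr. Smith on 3/14."
ann = parse_ann_line("T1\tName 15 20\tSmith")

# The offsets index into the raw text, so the annotation can be verified.
assert note[ann.start:ann.end] == ann.text
```

Because the annotations live outside the text, the original note stays byte-for-byte unchanged, which is what makes offset checks like the final assertion possible.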

Nursing Notes

First, download the raw notes and their PHI annotations.

Assuming the id.text and id-phi.phrase files are located in data/raw/nursing-notes/, the nursing notes corpus can be generated as follows:

# Convert nursing notes corpus to brat format
NN_DATA=data/raw/nursing-notes
python deidentify/dataset/nursing2brat.py \
    $NN_DATA/id.text \
    $NN_DATA/id-phi.phrase \
    $NN_DATA/brat/

# Split nursing notes into 60/20/20 train/dev/test sets
python deidentify/dataset/brat2corpus.py nursing $NN_DATA/brat
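To illustrate what a 60/20/20 split at the document level can look like, here is a minimal sketch. It is not the implementation in brat2corpus.py: the `split_documents` function, the fixed seed, and the document ids are all assumptions chosen for the example.

```python
import random


def split_documents(doc_ids, seed=42, train=0.6, dev=0.2):
    """Shuffle document ids reproducibly and cut them into train/dev/test.

    Hypothetical helper: the remainder after train and dev goes to test.
    """
    ids = sorted(doc_ids)               # fixed order before shuffling
    random.Random(seed).shuffle(ids)    # local RNG, global state untouched
    n_train = int(len(ids) * train)
    n_dev = int(len(ids) * dev)
    return (
        ids[:n_train],
        ids[n_train:n_train + n_dev],
        ids[n_train + n_dev:],
    )


# Made-up document ids for illustration.
train_ids, dev_ids, test_ids = split_documents(f"note-{i}" for i in range(10))
print(len(train_ids), len(dev_ids), len(test_ids))
```

Splitting whole documents rather than individual sentences keeps all text from one note in a single partition, which avoids leaking near-duplicate context between training and evaluation data.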

i2b2/UTHealth

The script assumes that training-PHI-Gold-Set1, training-PHI-Gold-Set2, and testing-PHI-Gold-fixed are located in data/raw/i2b2/. The corpus can then be generated as follows:

python deidentify/dataset/uthealth2corpus.py
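Since the script relies on a fixed input layout, it can help to check that layout before running the conversion. The sketch below tests whether the three expected entries exist under data/raw/i2b2/. The `missing_inputs` helper is hypothetical and not part of this project; it only checks for existence and makes no assumption about what the entries contain.

```python
from pathlib import Path

# Names taken from the text above; the helper itself is an illustration.
EXPECTED_INPUTS = (
    "training-PHI-Gold-Set1",
    "training-PHI-Gold-Set2",
    "testing-PHI-Gold-fixed",
)


def missing_inputs(raw_dir, expected=EXPECTED_INPUTS):
    """Return the expected entries that do not exist under raw_dir."""
    raw = Path(raw_dir)
    return [name for name in expected if not (raw / name).exists()]


missing = missing_inputs("data/raw/i2b2")
if missing:
    print("Missing under data/raw/i2b2/:", ", ".join(missing))
else:
    print("All expected i2b2 inputs found.")
```

Run from the repository root, this reports which of the three inputs still need to be placed in data/raw/i2b2/ before calling uthealth2corpus.py.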