This repository contains the Python scripts used to produce the models evaluated in the 2018 University of Utah study on detecting bleeding with NLP. It is organized into three sections corresponding to the three top-level folders of the repository: machine learning code, rule-based code, and code for conducting McNemar's test to compare the performance of the RB and ED-DS models. The `MachineLearning` directory contains the scripts that were used to train the models, along with the models themselves. The `RuleBased` directory contains the script that was used to run ConText. That script makes use of a Python package called eHostess, which we created to facilitate the annotation process. Among other things, eHostess provides a wrapper for ConText as implemented in another Python package, pyConTextNLP.
A requirements file listing the main Python dependencies has been included. It is also important to note that when the `TfidfVectorizer` instance used in the SVM training script was serialized with `pickle`, it stored a reference to the tokenizer function rather than the function's definition. This means that when the SVM model is deserialized, it expects to find a function named `__main__.tokenize`. This is not a problem as long as the model is unpickled from within the SVM training script included in this repository. However, if the model is deserialized in another script, then `__main__.tokenize` must be defined, either by copying the `tokenize` function from the SVM training script into the main script, or by importing the `tokenize.pyc` module included in the `TrainingScripts` directory.
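
The by-reference behavior described above can be illustrated with plain `pickle`, independent of the SVM model itself. The dictionary and the one-line `tokenize` body below are illustrative stand-ins, not the study's actual tokenizer or model:

```python
import pickle

# Hypothetical stand-in for the study's tokenizer; the real
# implementation lives in the SVM training script.
def tokenize(text):
    return text.lower().split()

# pickle records only a reference to the function (its module and
# name, e.g. "__main__.tokenize"), not the function body itself.
payload = pickle.dumps({"tokenizer": tokenize})

# Deserializing succeeds here because tokenize is defined in the
# current module. In a different script, pickle.loads would raise
# an AttributeError unless a tokenize function is defined (or
# imported) there first.
restored = pickle.loads(payload)
tokens = restored["tokenizer"]("Bleeding Noted")  # ['bleeding', 'noted']
```

The same logic applies to the pickled `TfidfVectorizer`: define or import `tokenize` before calling `pickle.load`, and deserialization proceeds normally.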