Virtual environment for replicating experiments from the paper "A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain," appearing at AMIA CRI 2016. [Conference Slides]
- Sign up for dataset access via the Oxford Text Archive.
- You will receive an email with a download link.
- Click the link to download the `2554.zip` file, and save it to `data/bnc`.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make bnc
  ```

Note that the plaintext extraction process for BNC takes several hours.
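Before kicking off the long extraction, it can help to confirm the archive is staged where `make bnc` expects it. This is an illustrative helper, not part of the repository; it only uses the path and file name from the steps above.

```python
from pathlib import Path

def check_bnc_archive(path="data/bnc/2554.zip"):
    """Report whether the downloaded BNC archive is staged for `make bnc`."""
    archive = Path(path)
    if archive.is_file():
        print(f"found {archive}")
        return True
    print(f"missing: {archive} -- download it via the Oxford Text Archive first")
    return False

check_bnc_archive()
```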
- Sign up for an account with the Linguistic Data Consortium here. Note that this requires being part of an institution with LDC access.
- If necessary, request access to Treebank 3, then download it.
- Copy the `LDC99T42.tgz` file to the `data/swb` directory.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make swb
  ```
- Sign up for i2b2 data access here (requires submitting a signed Data Use Agreement).
- After your data access is approved and you have a working login, go to the download page and download the files labeled "Concept assertion relation training data" and "Test data" from the 2010 Relations Challenge.
- Place the two downloaded `.tar.gz` files in the `data/i2b2` directory.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make i2b2
  ```
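A quick way to confirm both archives are staged before running the make target; `count_i2b2_archives` is an illustrative helper, not one of the repository's scripts.

```python
from pathlib import Path

def count_i2b2_archives(directory="data/i2b2"):
    """Count the .tar.gz archives staged in the i2b2 data directory."""
    return len(list(Path(directory).glob("*.tar.gz")))

n = count_i2b2_archives()
print(f"{n} of 2 expected archives present")
```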
- The GENIA data files are automatically sourced, in XML format, from the Treebank portion of the GENIA project.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make genia
  ```
Version 3.2.2 of Apache cTAKES automatically installs in `install/ctakes`. The sentence chunking experiments use three components:

- `FilesInDirectoryCollectionReader` - Handles iterating over files in a directory.
- `ChunkerAggregate` - Part of the core pipeline; handles chunking text (sentence segmentation, phrase segmentation, POS tagging, etc.).
- `FileWriterCasConsumer` - Handles writing CAS results to XML files.

Configuration files for using cTAKES on each corpus are located in `code/ctakes`. To process each corpus, execute the following commands:

```
cd code/ctakes
make [i2b2|bnc|genia|swb]
```

This will run cTAKES and extract detected sentence boundaries from the output; bounds are written to `data/[CORPUS]/ctakes-output/bounds`.
Version 3.5.2 of the Stanford CoreNLP suite and version 3.5.2 of the Stanford Parser automatically install in `install/stanford-corenlp`.

Code for executing Stanford CoreNLP on each corpus is located in `code/stanford-corenlp`. To process each corpus, execute the following commands:

```
cd code/stanford-corenlp
make [i2b2|bnc|genia|swb]
```

This will run Stanford CoreNLP and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to `data/[CORPUS]/stanford-output/bounds/clean/fixed`.
Version 1.03 of the Splitta sentence segmenter automatically installs in `install/splitta`.

Scripts for executing Splitta on each corpus are located in `code/splitta`. To process each corpus, execute the following commands:

```
cd code/splitta
make [i2b2|bnc|genia|swb]
```

This will run Splitta and extract detected sentence boundaries from the output. Bounds are written to `data/[CORPUS]/splitta-output/[nb|svm]/bounds`. (BNC bounds require further adjustment and are placed in `bounds/fixed`.)
Version 4.1.0 of the LingPipe Core software is present in `install/lingpipe` by default. If the `.jar` file is missing, please go to alias-i.com to download it (the AGPL version).

Code for executing LingPipe on each corpus is located in `code/lingpipe`. To process each corpus, execute the following commands:

```
cd code/lingpipe
make [i2b2|bnc|genia|swb]
```

This will run LingPipe and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to `data/[CORPUS]/lingpipe-output/[ie|me]/bounds/clean`. (i2b2 bounds require further adjustment and are placed in `bounds/clean/fixed`.)
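With bounds extracted from each toolkit, scoring against the gold standard reduces to comparing two sets of boundary positions. The sketch below is illustrative (it is not one of the repository's scripts) and assumes boundaries can be represented as collections of offsets; the actual bounds file format may differ.

```python
def boundary_prf(gold, predicted):
    """Precision, recall, and F1 over two collections of boundary offsets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries the toolkit got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two of three predicted boundaries match the gold standard.
print(boundary_prf([10, 25, 40], [10, 25, 50]))
```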
Several analysis utilities are included for describing the corpora. The scripts are found in `code/analysis`:

- `calculatelen`: calculates the average sentence length (in tokens) for each corpus.
- `calculateends`: determines the set of sentence-terminal characters for each corpus, with their frequency.
- `markbounds`: identifies the sentence-terminal characters for each sentence in each corpus, processing both gold-standard sentence bounds and predictions from each toolkit.
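The first two utilities can be sketched as follows, assuming each sentence is a list of tokens; the function names here are illustrative and the repository's scripts may read the corpora differently.

```python
from collections import Counter

def average_sentence_length(sentences):
    """Mean sentence length in tokens (cf. calculatelen)."""
    return sum(len(s) for s in sentences) / len(sentences)

def terminal_char_counts(sentences):
    """Frequency of each sentence-terminal character (cf. calculateends)."""
    return Counter(s[-1][-1] for s in sentences if s and s[-1])

corpus = [["The", "patient", "is", "stable", "."], ["Discharged", "home", "."]]
print(average_sentence_length(corpus))   # 4.0
print(terminal_char_counts(corpus))      # Counter({'.': 2})
```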