Virtual environment for replicating experiments from the paper "A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain," appearing at AMIA CRI 2016. [Conference Slides]
- Sign up for dataset access via the Oxford Text Archive.
- You will receive an email with a download link.
- Click the link to download the `2554.zip` file, and save it to `data/bnc`.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make bnc
  ```

Note that the plaintext extraction process for BNC takes several hours.
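Before kicking off the long extraction, it can help to confirm the archive is staged where `make bnc` expects it. This is an illustrative helper, not part of the repository; it only uses the path and file name from the steps above.

```python
from pathlib import Path

def check_bnc_archive(path="data/bnc/2554.zip"):
    """Report whether the downloaded BNC archive is staged for `make bnc`."""
    archive = Path(path)
    if archive.is_file():
        print(f"found {archive}")
        return True
    print(f"missing: {archive} -- download it via the Oxford Text Archive first")
    return False

check_bnc_archive()
```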
- Sign up for an account with the Linguistic Data Consortium here. Note that this requires being part of an institution with LDC access.
- If necessary, request access to Treebank 3, then download it.
- Copy the `LDC99T42.tgz` file to the `data/swb` directory.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make swb
  ```
- Sign up for i2b2 data access here (requires submitting a signed Data Use Agreement).
- After your data access is approved and you have a working login, go to the download page and download the files labeled "Concept assertion relation training data" and "Test data" from the 2010 Relations Challenge.
- Place the two downloaded `.tar.gz` files in the `data/i2b2` directory.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make i2b2
  ```
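A quick way to confirm both archives are staged before running the make target; `count_i2b2_archives` is an illustrative helper, not one of the repository's scripts.

```python
from pathlib import Path

def count_i2b2_archives(directory="data/i2b2"):
    """Count the .tar.gz archives staged in the i2b2 data directory."""
    return len(list(Path(directory).glob("*.tar.gz")))

n = count_i2b2_archives()
print(f"{n} of 2 expected archives present")
```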
- The GENIA data files are automatically sourced, in XML format, from the Treebank portion of the GENIA project.
- Execute the following commands to unpack the data:

  ```
  cd data/
  make genia
  ```
Version 3.2.2 of Apache cTAKES automatically installs in `install/ctakes`. The sentence chunking experiments use three components:

- `FilesInDirectoryCollectionReader` - Handles iterating over files in a directory.
- `ChunkerAggregate` - Part of the core pipeline; handles chunking text (sentence segmentation, phrase segmentation, POS tagging, etc.).
- `FileWriterCasConsumer` - Handles writing CAS results to XML files.

Configuration files for using cTAKES on each corpus are located in `code/ctakes`. To process each corpus, execute the following commands:

```
cd code/ctakes
make [i2b2|bnc|genia|swb]
```

This will run cTAKES and extract detected sentence boundaries from the output; bounds are written to `data/[CORPUS]/ctakes-output/bounds`.
Version 3.5.2 of the Stanford CoreNLP suite and version 3.5.2 of the Stanford Parser automatically install in `install/stanford-corenlp`.

Code for executing Stanford CoreNLP on each corpus is located in `code/stanford-corenlp`. To process each corpus, execute the following commands:

```
cd code/stanford-corenlp
make [i2b2|bnc|genia|swb]
```

This will run Stanford CoreNLP and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to `data/[CORPUS]/stanford-output/bounds/clean/fixed`.
Version 1.03 of the Splitta sentence segmenter automatically installs in `install/splitta`.

Scripts for executing Splitta on each corpus are located in `code/splitta`. To process each corpus, execute the following commands:

```
cd code/splitta
make [i2b2|bnc|genia|swb]
```

This will run Splitta and extract detected sentence boundaries from the output. Bounds are written to `data/[CORPUS]/splitta-output/[nb|svm]/bounds`. (BNC bounds require further adjustment and are placed in `bounds/fixed`.)
Version 4.1.0 of the LingPipe Core software is present in `install/lingpipe` by default. If the `.jar` file is missing, please go to alias-i.com to download it (the AGPL version).

Code for executing LingPipe on each corpus is located in `code/lingpipe`. To process each corpus, execute the following commands:

```
cd code/lingpipe
make [i2b2|bnc|genia|swb]
```

This will run LingPipe and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to `data/[CORPUS]/lingpipe-output/[ie|me]/bounds/clean`. (i2b2 bounds require further adjustment and are placed in `bounds/clean/fixed`.)
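With bounds extracted from each toolkit, scoring against the gold standard reduces to comparing two sets of boundary positions. The sketch below is illustrative (it is not one of the repository's scripts) and assumes boundaries can be represented as collections of offsets; the actual bounds file format may differ.

```python
def boundary_prf(gold, predicted):
    """Precision, recall, and F1 over two collections of boundary offsets."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries the toolkit got right
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two of three predicted boundaries match the gold standard.
print(boundary_prf([10, 25, 40], [10, 25, 50]))
```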
Several analysis utilities are included for describing the corpora. The scripts are found in `code/analysis`:

- `calculatelen`: calculates the average sentence length (in tokens) for each corpus.
- `calculateends`: determines the set of sentence-terminal characters for each corpus, with their frequency.
- `markbounds`: identifies the sentence-terminal characters for each sentence in each corpus, processing both gold-standard sentence bounds and predictions from each toolkit.
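The first two utilities can be sketched as follows, assuming each sentence is a list of tokens; the function names here are illustrative and the repository's scripts may read the corpora differently.

```python
from collections import Counter

def average_sentence_length(sentences):
    """Mean sentence length in tokens (cf. calculatelen)."""
    return sum(len(s) for s in sentences) / len(sentences)

def terminal_char_counts(sentences):
    """Frequency of each sentence-terminal character (cf. calculateends)."""
    return Counter(s[-1][-1] for s in sentences if s and s[-1])

corpus = [["The", "patient", "is", "stable", "."], ["Discharged", "home", "."]]
print(average_sentence_length(corpus))   # 4.0
print(terminal_char_counts(corpus))      # Counter({'.': 2})
```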