SBD-Evaluation

Virtual environment for replicating experiments from the paper "A Quantitative and Qualitative Evaluation of Sentence Boundary Detection for the Clinical Domain," appearing at AMIA CRI 2016. [Conference Slides]

Data

The experiments use four corpora, each unpacked into its own subdirectory of data/.

BNC

  1. Sign up for dataset access via the Oxford Text Archive.

  2. You will receive an email with a download link.

  3. Click the link to download the 2554.zip file, and save it to data/bnc.

  4. Execute the following commands to unpack the data:

     cd data/
     make bnc

     Note that the plaintext extraction process for BNC takes several hours.

Switchboard

  1. Sign up for an account with the Linguistic Data Consortium here. Note that this requires being part of an institution with LDC access.

  2. If necessary, request access to Treebank 3, then download it.

  3. Copy the LDC99T42.tgz file to the data/swb directory.

  4. Execute the following commands to unpack the data:

     cd data/
     make swb
    
i2b2

  1. Sign up for i2b2 data access here (requires submitting a signed Data Use Agreement).

  2. After your data access is approved and you have a working login, go to the download page and download the files labeled "Concept assertion relation training data" and "Test data" from the 2010 Relations Challenge.

  3. Place the two downloaded .tar.gz files in the data/i2b2 directory.

  4. Execute the following commands to unpack the data:

     cd data/
     make i2b2
    
GENIA

  1. The GENIA data files are automatically sourced, in XML format, from the Treebank portion of the GENIA project.

  2. Execute the following commands to unpack the data:

     cd data/
     make genia
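
Since the first three corpora depend on manually downloaded archives, a quick sanity check before running the make targets can save a failed build. This is an illustrative sketch, not part of the repository; the i2b2 archive filenames are not fixed by the instructions above, so it only checks that two .tar.gz files are present in data/i2b2.

```python
import glob
import os

def check_downloads(data_dir="data"):
    """Return a list of problems with the manually downloaded archives
    (empty list means everything described above is in place)."""
    problems = []
    # BNC: the 2554.zip file from the Oxford Text Archive
    if not os.path.isfile(os.path.join(data_dir, "bnc", "2554.zip")):
        problems.append("missing data/bnc/2554.zip (BNC)")
    # Switchboard: Treebank 3 from the LDC
    if not os.path.isfile(os.path.join(data_dir, "swb", "LDC99T42.tgz")):
        problems.append("missing data/swb/LDC99T42.tgz (Treebank 3)")
    # i2b2: two .tar.gz archives (training and test data); exact names vary
    if len(glob.glob(os.path.join(data_dir, "i2b2", "*.tar.gz"))) < 2:
        problems.append("expected two .tar.gz files in data/i2b2")
    return problems
```

GENIA needs no check here, since its files are fetched automatically.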
    

Toolkits

cTAKES

Version 3.2.2 of Apache cTAKES automatically installs in install/ctakes. The sentence chunking experiments use three components:

  1. FilesInDirectoryCollectionReader - Handles iterating over the files in a directory.
  2. ChunkerAggregate - Part of the core pipeline; handles chunking text (sentence segmentation, phrase segmentation, POS tagging, etc.).
  3. FileWriterCasConsumer - Handles writing CAS results to XML files.

Configuration files for using cTAKES on each corpus are located in code/ctakes. To process each corpus, execute the following commands:

    cd code/ctakes
    make [i2b2|bnc|genia|swb]

This will run cTAKES and extract detected sentence boundaries from the output: bounds are written to data/[CORPUS]/ctakes-output/bounds.

Stanford CoreNLP

Version 3.5.2 of the Stanford CoreNLP suite and version 3.5.2 of the Stanford Parser automatically install in install/stanford-corenlp.

Code for executing Stanford CoreNLP on each corpus is located in code/stanford-corenlp. To process each corpus, execute the following commands:

    cd code/stanford-corenlp
    make [i2b2|bnc|genia|swb]

This will run Stanford CoreNLP and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to data/[CORPUS]/stanford-output/bounds/clean/fixed.

Splitta

Version 1.03 of the Splitta sentence segmenter automatically installs in install/splitta.

Scripts for executing Splitta on each corpus are located in code/splitta. To process each corpus, execute the following commands:

    cd code/splitta
    make [i2b2|bnc|genia|swb]

This will run Splitta and extract detected sentence boundaries from the output. Bounds are written to data/[CORPUS]/splitta-output/[nb|svm]/bounds. (BNC bounds require further adjustment and are placed in bounds/fixed.)

LingPipe

Version 4.1.0 of the LingPipe Core software is present in install/lingpipe by default.

If the .jar file is missing, please go to alias-i.com to download it (the AGPL version).

Code for executing LingPipe on each corpus is located in code/lingpipe. To process each corpus, execute the following commands:

    cd code/lingpipe
    make [i2b2|bnc|genia|swb]

This will run LingPipe and extract detected sentence boundaries from the output, with some cleaning. Bounds are written to data/[CORPUS]/lingpipe-output/[ie|me]/bounds/clean. (i2b2 bounds require further adjustment and are placed in bounds/clean/fixed.)
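
Each toolkit's output is thus reduced to a bounds file of detected sentence boundaries per corpus. As a rough illustration of how such predictions could be scored against gold-standard boundaries, here is a minimal boundary-level scorer. This is not the repository's own evaluation code, and the bounds representation (a collection of boundary offsets) is an assumption for the sake of the example.

```python
def score_bounds(gold, predicted):
    """Return (precision, recall, F1) for two collections of sentence
    boundary offsets, scoring each predicted boundary as correct only
    if it matches a gold boundary exactly."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # boundaries found in both
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```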

Corpus analysis

Several analysis utilities are included for describing the corpora. The scripts are found in code/analysis:

  • calculatelen: calculates the average sentence length (in tokens) for each corpus
  • calculateends: determines the set of sentence-terminal characters for each corpus, with their frequency
  • markbounds: identifies the sentence-terminal characters for each sentence in each corpus, processing both gold standard sentence bounds and predictions from each toolkit
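
As a sketch of the kind of statistic calculatelen reports, the following computes an average sentence length in tokens. The actual script in code/analysis may differ; this assumes one sentence per line and whitespace tokenization.

```python
def average_sentence_length(sentences):
    """Average number of whitespace-delimited tokens per non-empty sentence."""
    lengths = [len(s.split()) for s in sentences if s.strip()]
    return sum(lengths) / len(lengths) if lengths else 0.0
```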
