REx: Relation Extraction. A modernized rewrite of the code from the master's thesis "Relation Extraction using Distant Supervision, SVMs, and Probabilistic First-Order Logic".
This project uses sbt for build management. If you're unfamiliar with sbt, see the last section for some pointers.
To download all dependencies and compile the code, run sbt compile.
To run all tests, execute sbt test.
To also measure code coverage, run coverage first and then test. The coverage report is written as an HTML file.
To produce bash scripts that run each command-line application in this codebase, execute sbt pack.
This project includes data for distantly supervising relation mentions in text.
The files are located under data/; a README in that directory further explains the data's content, format, and purpose.
These files are large and are stored using git-lfs. Be sure to follow the git-lfs installation instructions and ensure that you have set up this git plugin (i.e., have run git lfs install once).
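As a quick sanity check before working with the data files, the following sketch reports whether the git-lfs plugin is available on this machine. The variable name LFS_STATUS is only an illustrative choice; the check relies on the standard `git lfs version` subcommand.

```shell
# Check whether the git-lfs plugin is available before fetching the data files.
# LFS_STATUS is a hypothetical variable name used only for this illustration.
if git lfs version >/dev/null 2>&1; then
  LFS_STATUS="installed"
else
  LFS_STATUS="missing"
fi
echo "git-lfs: $LFS_STATUS"
```

If the plugin is reported as installed, running `git lfs install` once and then `git lfs pull` inside the repository should fetch the large files under data/.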
To evaluate relation extraction performance on the UIUC relation dataset using 3-fold cross-validation, first build the executable scripts with sbt pack, then execute:
./target/pack/bin/relation-extraction-learning-main \
learn_eval \
-li data/uiuc_cog_comp_group-entity_and_relation_recognition_corpora/all.corp \
--input_format uiuc \
-cg true \
--cost 1 \
--epsilon 0.003 \
--n_cv_folds 3
Where:
learn_eval is the command for the script
-li specifies where the labeled relation data lives
--input_format tells the program how to interpret the file at -li
    uiuc means to use the UIUC relation classification data format
-cg true means that candidate generation is performed
--cost indicates the cost-sensitive learning parameter for the SVM
--epsilon controls weight convergence: training stops when weight updates are less than this value
--n_cv_folds indicates the number of folds to perform for cross-validation
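To show how these flags fit together, here is a minimal sketch of a shell helper that assembles the invocation for a given cost and fold count. The function name rex_eval_cmd is hypothetical and not part of this repository; the script path, flags, and corpus path are the ones shown above. The helper only prints the command so it can be inspected before anything runs.

```shell
# Hypothetical helper (not part of this repository): assembles the
# learn_eval invocation documented above for a given cost and fold count.
rex_eval_cmd() {
  local cost="$1"
  local folds="$2"
  local bin="./target/pack/bin/relation-extraction-learning-main"
  local corpus="data/uiuc_cog_comp_group-entity_and_relation_recognition_corpora/all.corp"
  # Print the full command on one line; nothing is executed here.
  echo "$bin learn_eval -li $corpus --input_format uiuc -cg true --cost $cost --epsilon 0.003 --n_cv_folds $folds"
}

# Print the command for cost 1 and 3-fold cross-validation.
rex_eval_cmd 1 3
```

Once sbt pack has produced the script, `eval "$(rex_eval_cmd 1 3)"` would run the printed command from the repository root.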
Invoking this program with the --help flag, or with no arguments, will output a detailed help message to stdout.
Everything within this repository is copyright (2015-) by Malcolm Greaves.
Use of this code is permitted according to the stipulations of the Apache 2 license.
When using sbt, it is best to start it in interactive shell mode. To do this, execute from the command line:
$ sbt
After starting up (give it a few seconds), you can execute the following commands:
compile // compiles code
pack // creates executable scripts
test // runs tests
coverage // initializes the code-coverage system; use right before 'test'
reload // re-loads the sbt build definition, including plugin definitions
update // grabs all dependencies
There are many more sbt commands, as well as a large number of community plugins that extend sbt's functionality.
Not necessary! Just a few suggestions...
We recommend using the following configuration for sbt:
sbt -J-XX:MaxPermSize=768m -J-Xmx2g -J-XX:+UseConcMarkSweepGC -J-XX:+CMSClassUnloadingEnabled
This gives sbt more memory (a 2 GB heap and a larger permanent generation), selects the concurrent mark-sweep garbage collector, and enables class unloading under that collector.
Also, to limit the logging output of the Spark framework, export this environment variable before running tests:
export SPARK_CONF_DIR="<YOUR_PATH_TO_THIS_REPO>/src/main/resources"
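Rather than typing the absolute path by hand, the placeholder can be filled in from the current directory. This is a small sketch that assumes the shell is sitting at the root of this repository; REPO_ROOT is a hypothetical variable name introduced only for this example.

```shell
# Assumes the current directory is the root of this repository.
# REPO_ROOT is a hypothetical helper variable, not something the project defines.
REPO_ROOT="$(pwd)"
export SPARK_CONF_DIR="$REPO_ROOT/src/main/resources"
echo "SPARK_CONF_DIR=$SPARK_CONF_DIR"
```

Inside a git checkout, `git rev-parse --show-toplevel` prints the repository root and could replace `pwd`, so the export works from any subdirectory.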