hpca is a C++ toolkit providing an efficient implementation of the Hellinger PCA for computing word embeddings. See the EACL 2014 paper for more details.
This project requires:
- Cross-platform Make (CMake) v2.6+
- GNU Make or equivalent.
- GCC or an alternative, reasonably conformant C++ compiler.
- Zlib v1.2.5
- OpenMP API (optional)
- Doxygen (optional, only needed to build the documentation)
This project uses the Cross-platform Make (CMake) build system. However, we have conveniently provided a wrapper configure script and Makefile so that the typical build invocation of ./configure followed by make will work.
For a list of all possible build targets, use the command make help.
NOTE: Users of CMake may believe that the top-level Makefile has been generated by CMake; it hasn't, so please do not delete that file.
Once the project has been built (see "BUILDING"), execute sudo make install.
See Install for more details.
This package includes 9 different tools: preprocess, vocab, stats, cooccurrence, pca, embeddings, inference, eval and neighbors.
Converting the corpus to lowercase and/or replacing all numbers with a special token ('0').
The corpus needs to be a tokenized plain text file containing only the sentences of the corpus.
Before running the preprocess tool, the authors strongly recommend running a tokenizer, e.g. the Stanford Tokenizer.
java -cp stanford-parser.jar edu.stanford.nlp.process.PTBTokenizer -preserveLines corpus-sentences.txt > corpus-token.txt
preprocess options:
- -lower <int>: Lowercased? 0=off or 1=on (default)
- -digit <int>: Replace all digits with a special token? 0=off or 1=on (default)
- -input-file <file>: Input file to preprocess (gzip format is allowed)
- -output-file <file>: Output file to save preprocessed data
- -gzip <int>: Save in gzip format? 0=off (default) or 1=on
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
preprocess -input-file corpus-token.txt -output-file corpus-clean.txt -lower 1 -digit 1 -verbose 1 -threads 8 -gzip 0
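To make the two transformations concrete, here is a minimal Python sketch of what lowercasing and digit masking do to one tokenized sentence. It is an illustration only, not the tool's C++ code: the function name `preprocess_line` is hypothetical, and replacing every digit character individually (so that "1995" becomes "0000") is an assumption about how the special token is applied.

```python
import re

DIGIT_TOKEN = "0"  # the special token named in the description above


def preprocess_line(line, lower=True, digit=True):
    """Lowercase a tokenized sentence and/or mask its digits."""
    if lower:
        line = line.lower()
    if digit:
        # Assumption: each digit character is replaced on its own,
        # so "1995" becomes "0000" rather than a single "0".
        line = re.sub(r"\d", DIGIT_TOKEN, line)
    return line


print(preprocess_line("The Dow rose 1.5 % in 1995 ."))
```

Masking digits collapses the many distinct numbers in a corpus into a handful of patterns, which keeps the vocabulary smaller.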
Extracting words with their respective frequency.
vocab options:
- -input-file <file>: Input file from which to extract the vocabulary (gzip format is allowed)
- -vocab-file <file>: Output file to save the vocabulary
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
vocab -input-file corpus-clean.txt -vocab-file vocab.txt -threads 8 -verbose 1
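Conceptually, extracting the vocabulary amounts to counting whitespace-separated tokens. The sketch below shows this in Python; the function name `build_vocab` and the sort order (decreasing frequency, ties broken alphabetically) are assumptions made for the illustration, and the actual layout of vocab.txt may differ.

```python
from collections import Counter


def build_vocab(sentences):
    """Count whitespace-separated tokens and sort by decreasing frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


corpus = ["the cat sat on the mat", "the dog sat"]
for word, freq in build_vocab(corpus):
    print(word, freq)
```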
Outputting descriptive statistics about the corpus, such as the number of word types and their probability of occurrence. This tool is helpful to define the context vocabulary before constructing the co-occurrence matrix.
stats options:
- -vocab-file <file>: Vocabulary file
Example:
stats -vocab-file vocab.txt
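The following Python sketch illustrates how occurrence probabilities can guide the choice of a context vocabulary. It assumes that a word's probability is its count divided by the total token count and that frequency bounds are inclusive; both are assumptions for the example, and the helper names are hypothetical.

```python
def occurrence_probabilities(vocab):
    """Map each word to count / total number of tokens."""
    total = sum(count for _, count in vocab)
    return {word: count / total for word, count in vocab}


def context_candidates(vocab, lower_bound, upper_bound):
    """Words whose relative frequency lies within [lower_bound, upper_bound]."""
    probs = occurrence_probabilities(vocab)
    return [word for word, _ in vocab if lower_bound <= probs[word] <= upper_bound]


vocab = [("the", 50), ("cat", 30), ("sat", 15), ("mat", 5)]
print(len(vocab), "word types")
print(context_candidates(vocab, lower_bound=0.10, upper_bound=0.40))
```

Tightening the upper bound removes very frequent words such as "the", while raising the lower bound removes rare words.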
Constructing word-word cooccurrence statistics from the corpus.
The user should supply a vocabulary file, as produced by vocab.
The context vocabulary can be defined either using bounds on word appearance frequencies or using a predefined context vocabulary.
cooccurrence options:
- -input-file <file>: Input file containing the tokenized and cleaned corpus text (gzip format is allowed)
- -vocab-file <file>: Vocabulary file
- -cxt-file <file>: Predefined context vocabulary file
- -output-dir <dir>: Output directory name to save files
- -min-freq <int>: Discard all words with a lower appearance frequency (default is 100)
- -upper-bound <float>: Discard words from the context vocabulary whose appearance frequency is above this bound (default is 1.0)
- -lower-bound <float>: Discard words from the context vocabulary whose appearance frequency is below this bound (default is 0.00001)
- -cxt-size <int>: Symmetric context size around words (default is 5)
- -dyn-cxt <int>: Dynamic context window, i.e. weighting by distance from the focus word: 0=off (default) or 1=on
- -memory <float>: Soft limit for memory consumption in GB; default 4.0
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
cooccurrence -input-file corpus-clean.txt -vocab-file vocab.txt -output-dir path_to_dir -min-freq 100 -cxt-size 5 -dyn-cxt 1 -memory 4.0 -upper-bound 1.0 -lower-bound 0.00001 -verbose 1 -threads 8
cooccurrence will create the following files in the directory specified by the -output-dir option:
- cooccurrence.bin: binary file containing the counts
- target_words.txt: vocabulary of words from which embeddings will be generated (rows of the cooccurrence matrix)
- context_words.txt: vocabulary of context words (columns of the cooccurrence matrix)
- options.txt: file reporting the options chosen for getting word cooccurrence statistics
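As a rough illustration of how a symmetric context window and distance weighting interact, the Python sketch below accumulates (target, context) counts for a toy sentence. It is not the tool's implementation: the function name is hypothetical, and weighting a context word by 1/distance when the dynamic window is on is an assumption, since the exact weighting scheme is not stated here.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, targets, contexts, cxt_size=5, dyn_cxt=False):
    """Accumulate (target, context) weights over a symmetric window."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            if word not in targets:
                continue
            start = max(0, i - cxt_size)
            stop = min(len(tokens), i + cxt_size + 1)
            for j in range(start, stop):
                if j == i or tokens[j] not in contexts:
                    continue
                # Assumption: the dynamic window weights a context word by 1/distance.
                weight = 1.0 / abs(i - j) if dyn_cxt else 1.0
                counts[(word, tokens[j])] += weight
    return dict(counts)


counts = cooccurrence_counts(
    ["the cat sat on the mat"],
    targets={"cat", "mat"},
    contexts={"the", "sat", "on"},
    cxt_size=2,
    dyn_cxt=True,
)
for pair in sorted(counts):
    print(pair, counts[pair])
```

Each distinct target word corresponds to a row of the cooccurrence matrix and each context word to a column.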
Randomized SVD with respect to the Hellinger distance.
The user should supply the directory containing the files produced by cooccurrence.
Let A be a sparse matrix to be analyzed with n rows and m columns, and r be the rank of a truncated SVD (with r < min(n,m)).
Formally, the SVD of A is a factorization of the form A = U S Vᵀ.
Unfortunately, computing the SVD can be extremely time-consuming as A is often a very large matrix. Thus, we turn to randomized methods which offer significant speedups over classical methods.
This tool uses modern randomized matrix approximation techniques, developed in (amongst others) Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions, a 2009 paper by Nathan Halko, Per-Gunnar Martinsson and Joel A. Tropp.
This tool uses the external redsvd library, which implements this randomized SVD using Eigen3.
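For orientation, the basic randomized scheme described by Halko, Martinsson and Tropp can be summarized as follows. This is a generic outline, not a description of redsvd's exact variant, which may differ in its details; the oversampling parameter p (a small integer) is a symbol introduced here for the illustration.

```latex
\begin{align*}
Y &= A\,\Omega, & \Omega &\in \mathbb{R}^{m \times (r+p)} \text{ a Gaussian random matrix},\\
Y &= Q R, & Q &\in \mathbb{R}^{n \times (r+p)} \text{ with orthonormal columns},\\
B &= Q^{\top} A, & B &\in \mathbb{R}^{(r+p) \times m},\\
B &= \tilde{U} S V^{\top}, & &\text{an exact SVD of a small matrix},\\
A &\approx Q Q^{\top} A = (Q \tilde{U})\, S V^{\top} = U S V^{\top}. &&
\end{align*}
```

Only the first r singular values and vectors are kept. The expensive operations are two passes over A (the products AΩ and QᵀA), while the exact SVD is computed on the much smaller matrix B.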
pca options:
- -input-dir <dir>: Directory where to find the cooccurrence.bin file
- -rank <int>: Number of components to keep; default 300
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
pca -input-dir path_to_cooccurence_file -rank 300
pca will create the following files in the directory specified by the -input-dir option:
- svd.U: orthonormal matrix U
- svd.S: diagonal matrix S whose entries are singular values
- svd.V: orthonormal matrix V
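The Hellinger distance mentioned in the description of pca compares probability distributions through their element-wise square roots. The Python sketch below shows the idea on toy counts: each row of cooccurrence counts is normalized into a distribution and square-rooted, after which ordinary Euclidean geometry (and hence PCA) on the transformed rows reflects Hellinger distances. The helper names are hypothetical, and where exactly the tool applies this transformation is not stated here.

```python
import math


def hellinger_rows(count_rows):
    """Turn each row of counts into the square root of its probability distribution."""
    transformed = []
    for row in count_rows:
        total = sum(row)
        transformed.append([math.sqrt(c / total) if total else 0.0 for c in row])
    return transformed


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions p and q."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


rows = hellinger_rows([[9, 16, 0], [1, 0, 3]])
for row in rows:
    # Every transformed row has unit Euclidean length.
    print(row, sum(x * x for x in row))
```

Up to the constant factor 1/√2, the Euclidean distance between two transformed rows is the Hellinger distance between the original distributions.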
Generating word embeddings from the Hellinger PCA.
The user should supply the directory containing the files produced by pca.
embeddings options:
- -input-dir <dir>: Directory where to find files from the randomized SVD
- -output-name <string>: Filename for embeddings file which will be placed in (default is words.txt)
- -dim <int>: Word vector dimension; default 100
- -eig <float>: Eigenvalue weighting (0.0, 0.5 or 1.0); default is 0.0
- -norm <int>: Are vectors normalized to unit length? 0=off (default) or 1=on
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
embeddings -input-dir path_to_svd_files -output-name words.txt -eig 0.0 -dim 100 -norm 0
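One plausible reading of the -dim, -eig and -norm options is sketched below in Python: keep the first dim columns of U, scale column k by the k-th singular value raised to the power eig, and optionally rescale each row to unit length. This is an assumption about what the options mean, not a description of the tool's code, and the function name `make_embeddings` is hypothetical.

```python
import math


def make_embeddings(U, S, dim, eig=0.0, norm=False):
    """Keep the first `dim` columns of U, scale column k by S[k] ** eig,
    and optionally normalize each row to unit length."""
    vectors = []
    for row in U:
        vec = [row[k] * (S[k] ** eig) for k in range(dim)]
        if norm:
            length = math.sqrt(sum(x * x for x in vec))
            if length > 0.0:
                vec = [x / length for x in vec]
        vectors.append(vec)
    return vectors


# Toy inputs: 2 words, 3 components, singular values in decreasing order.
U = [[0.6, 0.8, 0.0], [0.0, 0.6, 0.8]]
S = [4.0, 1.0, 0.25]
print(make_embeddings(U, S, dim=2, eig=0.0))
print(make_embeddings(U, S, dim=2, eig=0.5))
print(make_embeddings(U, S, dim=2, eig=0.0, norm=True))
```

Under this reading, eig=0.0 uses U unchanged, eig=1.0 weights each component by its full singular value, and eig=0.5 is a compromise between the two.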
Inferring new word embeddings from an existing Hellinger PCA.
The user should supply the directories where to find cooccurrence statistics of the new words and files produced by pca.
inference options:
- -cooc-dir <dir>: Directory where to find the cooccurrence.bin file of the new words
- -pca-dir <dir>: Directory where to find files from the randomized SVD
- -output-name <string>: Filename for embeddings file which will be placed in (default is inference_words.txt)
- -dim <int>: Word vector dimension; default 100
- -eig <float>: Eigenvalue weighting (0.0, 0.5 or 1.0); default is 0.0
- -norm <int>: Are vectors normalized to unit length? 0=off (default) or 1=on
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
inference -cooc-dir path_to_cooccurrence_files -pca-dir path_to_svd_files -output-name inference_words.txt -eig 0.0 -dim 100 -norm 0
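A standard way to embed a new row without recomputing the decomposition follows directly from the factorization A = U S Vᵀ. The derivation below is a general linear-algebra sketch, not a confirmed description of what inference does internally; it assumes the new word's row a_new is expressed over the same context words (columns) as A, and α stands for the eigenvalue weighting.

```latex
\begin{align*}
A = U S V^{\top},\; V^{\top} V = I
  \;&\Longrightarrow\; A V = U S
  \;\Longrightarrow\; U = A V S^{-1},\\
U S^{\alpha} &= A V S^{\alpha - 1},\\
e_{\text{new}} &= a_{\text{new}}\, V_{d}\, S_{d}^{\alpha - 1},
\end{align*}
```

Here V_d and S_d keep only the first d components. With α = 0 the new row is projected onto V and divided by the singular values, and with α = 1 it is only projected.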
This tool provides a quick evaluation of the word embeddings produced by embeddings for an English corpus.
Console output can be redirected to a file.
It contains the following evaluation datasets:
- The WordSimilarity-353 Test Collection
- The Rubenstein and Goodenough dataset
- The Stanford Rare Word (RW) Similarity Dataset
- The Microsoft Research Syntactic Analogies Dataset
- The Google Semantic Analogies Dataset
eval options:
- -word-file <file>: File containing word embeddings to evaluate
- -vocab-file <file>: File containing word vocabulary
- -lower <int>: Lowercased datasets? 0=off or 1=on (default)
- -threads <int>: Number of threads; default 8
- -ws353 <int>: Do WordSim-353 evaluation: 0=off or 1=on (default)
- -rg65 <int>: Do Rubenstein and Goodenough 1965 evaluation: 0=off or 1=on (default)
- -rw <int>: Do Stanford Rare Word evaluation: 0=off or 1=on (default)
- -syn <int>: Do Microsoft Research Syntactic Analogies: 0=off or 1=on (default)
- -sem <int>: Do Google Semantic Analogies: 0=off or 1=on (default)
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
eval -word-file words.txt -vocab-file target_vocab.txt -ws353 1 -rg65 1 -rw 1 -syn 1 -sem 1 -verbose 1 > words-eval.txt
NOTE: To speed up the implementation of the analogies, candidate solutions come from a closed vocabulary.
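To illustrate why a closed candidate vocabulary speeds up analogy solving, the Python sketch below answers "a is to b as c is to ?" with the common vector-offset method, scoring only a restricted list of candidates by cosine similarity. Whether eval uses this exact scoring rule, and how it builds its closed vocabulary, are not stated here; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def solve_analogy(a, b, c, vectors, candidates):
    """Answer "a is to b as c is to ?" by ranking candidates against b - a + c."""
    if any(word not in vectors for word in (a, b, c)):
        return None
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    best, best_score = None, -2.0
    for word in candidates:
        # The question words themselves are excluded from the answers.
        if word in (a, b, c) or word not in vectors:
            continue
        score = cosine(query, vectors[word])
        if score > best_score:
            best, best_score = word, score
    return best


vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.1],
    "queen": [3.0, 1.1],
    "apple": [-1.0, 0.5],
}
print(solve_analogy("man", "woman", "king", vectors, candidates=["queen", "apple"]))
```

Scanning a short candidate list instead of the full vocabulary reduces the cost of each question from the vocabulary size to the size of the list.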
An exploratory tool to evaluate word embeddings quality. The user should supply the file containing the word embeddings and its corresponding vocabulary. By default, this tool runs in interactive mode. Otherwise, a file containing a list of words can be provided.
neighbors options:
- -word-file <file>: File containing word embeddings to evaluate
- -vocab-file <file>: File containing word vocabulary
- -list-file <file>: File containing a list of words from which the nearest neighbors will be computed, otherwise interactive mode
- -top <int>: Number of nearest neighbors; default 10
- -threads <int>: Number of threads; default 8
- -verbose <int>: Set verbosity: 0=off or 1=on (default)
Example:
neighbors -word-file words.txt -vocab-file target_vocab.txt -list-file words_list.txt -top 5 > nearest_neighbors.txt
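A nearest-neighbor query over word embeddings can be sketched in a few lines of Python. The example assumes cosine similarity as the ranking measure, which is a common choice but is not confirmed here for this tool; the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def nearest_neighbors(word, vectors, top=10):
    """Return the `top` (neighbor, similarity) pairs for `word`, best first."""
    if word not in vectors:
        return []
    scored = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top]


vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "car": [0.0, 1.0],
    "truck": [0.1, 0.9],
}
for neighbor, score in nearest_neighbors("cat", vectors, top=2):
    print(neighbor, round(score, 3))
```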
For a full demo example, run:
./demo.sh
This script will download a tokenized version of the Reuters Corpus Volume I (RCV1) and compute word embeddings out of it.
- Rémi Lebret: [email protected]
- Eigen3 -- http://eigen.tuxfamily.org
Eigen 3 is a lightweight C++ template library for vector and matrix math, a.k.a. linear algebra.
- redsvd -- https://code.google.com/p/redsvd/
redsvd is a library for solving several matrix decompositions including singular value decomposition (SVD), principal component analysis (PCA), and eigen value decomposition. redsvd uses Eigen3.