This is a modification of Percy Liang’s implementation (version 1.3) of the Brown hierarchical word clustering algorithm. The modified version is based on a dependency language model (DLM) instead of a bigram language model.
Note that the parts of the original code that are not relevant for dependency clustering have not been revised. The modification should be seen as the minimal working extension of the original code for dependency-based clustering.
Input: one dependency instance per line, given as a tab-separated sequence of “head”, “dependent” and “count” (see input.txt for an example). Space-separated multiword sequences are treated as one token. The program expects that dependency instances and their counts have already been extracted.
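Since the counts have to be prepared beforehand, the following is a minimal sketch of how such a file could be produced. It is not part of this code base: the function names `count_dependencies` and `format_instances` and the example pairs are hypothetical, and it assumes that (head, dependent) pairs have already been obtained from a parsed corpus.

```python
from collections import Counter


def count_dependencies(pairs):
    """Count how often each (head, dependent) pair occurs."""
    return Counter(pairs)


def format_instances(counts):
    """Render counts as 'head<TAB>dependent<TAB>count' lines, sorted by pair."""
    lines = []
    for (head, dependent), count in sorted(counts.items()):
        # Fields are tab-separated, so a token may contain spaces but not tabs.
        if "\t" in head or "\t" in dependent:
            raise ValueError("tokens must not contain tab characters")
        lines.append(f"{head}\t{dependent}\t{count}")
    return lines


# Hypothetical pairs for illustration only; real pairs come from a parser.
pairs = [
    ("eats", "dog"),
    ("eats", "bone"),
    ("eats", "dog"),
    ("visited", "New York"),
]
print("\n".join(format_instances(count_dependencies(pairs))))
```

Writing the resulting lines to a file gives something in the shape described above; input.txt remains the reference for the exact format.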
Output: for each word type, its cluster (see output.txt for an example). Each line has the form:
[cluster bit id] [word] [number of times word occurs in input]
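As an illustration of how this output might be consumed, here is a minimal sketch that parses such lines and groups words by a prefix of their cluster bit id, which gives coarser clusters from the hierarchy. It assumes that the three fields are tab-separated (check output.txt to confirm); the function names `parse_paths` and `group_by_prefix` and the sample lines are hypothetical and not real program output.

```python
from collections import defaultdict


def parse_paths(lines):
    """Parse 'bits<TAB>word<TAB>count' lines into (bits, word, count) tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumption: fields are tab-separated, so words may contain spaces.
        bits, word, count = line.split("\t")
        entries.append((bits, word, int(count)))
    return entries


def group_by_prefix(entries, prefix_len):
    """Group words whose cluster bit ids share the first prefix_len bits."""
    groups = defaultdict(list)
    for bits, word, _count in entries:
        groups[bits[:prefix_len]].append(word)
    return dict(groups)


# Made-up sample lines for illustration only.
sample = ["0010\tdog\t12", "0011\tcat\t9", "110\tNew York\t3"]
print(group_by_prefix(parse_paths(sample), 2))
```

Shorter prefixes merge more clusters together, so `prefix_len` acts as a granularity setting in this sketch.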
If you use this code, please cite:
- Simon Šuster and Gertjan van Noord (2014) From neighborhood to parenthood: the advantages of dependency representation over bigrams in Brown clustering. COLING. See also induced clusters and experimental details.
Other references:
- Brown, et al.: Class-Based n-gram Models of Natural Language
- Liang: Semi-supervised learning for natural language processing
- On dependency language models:
  - Chen et al. (2012) Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
  - Popel and Mareček (2010) Perplexity of n-Gram and Dependency Language Models
  - Shen et al. (2008) A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model
Compile with:
make
Cluster input.txt into 50 clusters (--max-ind-level controls the amount of verbose output):
./wcluster --text input.txt --c 50 --max-ind-level 3
# Output in input-c50-p1.out/paths
Changes to the original code were made in the following files/functions:
- wcluster.cc
  - read_text_process_word()
  - read_text()
  - incorporate_new_phrase()
  - create_initial_clusters()
  - compute_cluster_distribs()
  - main()
- strdb.cc
  - read_text()
All modifications in the source code are marked as comments beginning with “dlm”.
Thanks to Percy Liang for clarifications about parts of the original code.
(C) Copyright 2007-2012, Percy Liang
(C) Copyright Simon Šuster
Permission is granted for anyone to copy, use, or modify these programs and accompanying documents for purposes of research or education, provided this copyright notice is retained, and note is made of any changes that have been made.
These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user’s own risk.