Dependency Brown clustering

Syntactic extension of the Brown et al. (1992) clustering algorithm

This is a modification of Percy Liang’s implementation (version 1.3) of the Brown hierarchical word clustering algorithm. The modified version is based on a dependency language model (DLM) instead of the bigram language model.
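For orientation, the difference between the two models can be sketched as follows. The first equation is the standard class-based bigram model of Brown et al. (1992). The second is an assumed dependency analogue in which the preceding word is replaced by the syntactic head; see the paper cited under References for the exact formulation. The symbols c(·) for the cluster of a word and h(i) for the position of the head of word i are notation introduced here, not names taken from the code.

```latex
% Class-based bigram model (Brown et al. 1992)
P(w_i \mid w_{i-1}) = P\bigl(c(w_i) \mid c(w_{i-1})\bigr)\, P\bigl(w_i \mid c(w_i)\bigr)

% Assumed dependency analogue: condition on the head, not on the previous word
P(w_i \mid w_{h(i)}) = P\bigl(c(w_i) \mid c(w_{h(i)})\bigr)\, P\bigl(w_i \mid c(w_i)\bigr)
```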

Note that the parts of the original code that are not relevant to dependency clustering have not been revised. The modification should be seen as the minimal working extension of the original code for dependency-based clustering.

Input

Each line holds one dependency instance as three tab-separated fields: “head”, “dependent” and “count” (see input.txt for an example). Space-separated multiword sequences are treated as one token. The program therefore expects that dependency instances have already been extracted and counted.
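If the dependency pairs come from your own parser output, the counting step could look roughly like the Python sketch below. It is not part of this repository; the function name `format_dependency_counts` and the sample pairs are hypothetical, and only the three-field tab-separated layout is taken from the description above.

```python
from collections import Counter


def format_dependency_counts(pairs):
    """Count (head, dependent) pairs and return 'head<TAB>dependent<TAB>count' lines."""
    counts = Counter(pairs)
    # Sort only to make the output order reproducible.
    return [
        f"{head}\t{dependent}\t{count}"
        for (head, dependent), count in sorted(counts.items())
    ]


# Hypothetical pairs, e.g. read off the arcs of a parsed corpus.
pairs = [
    ("eats", "dog"),
    ("eats", "bone"),
    ("eats", "dog"),
    ("dog", "the"),
    ("lives", "New York"),  # a multiword token keeps its internal space
]

lines = format_dependency_counts(pairs)
for line in lines:
    print(line)
```

Writing `lines` to a file, one per line, gives a file in the same three-field layout as input.txt.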

Output

For each word type, its cluster (see output.txt for an example). In particular, each line is:

[cluster bit id] [word] [number of times word occurs in input]
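A small Python sketch for reading this output is given below. It assumes the three fields are separated by whitespace and that the cluster bit id is a string of 0s and 1s describing a path in the cluster hierarchy, so that truncating it to a prefix yields a coarser grouping. The function names and the sample lines are hypothetical and not produced by this repository.

```python
from collections import defaultdict


def parse_paths_line(line):
    """Split one output line into (bit id, word, count).

    The word is everything between the first and the last field,
    so multiword tokens containing spaces stay intact.
    """
    bits, rest = line.rstrip("\n").split(None, 1)
    word, count = rest.rsplit(None, 1)
    return bits, word, int(count)


def group_by_prefix(lines, prefix_len=None):
    """Group words by their bit id, optionally truncated to prefix_len bits."""
    clusters = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip empty lines
        bits, word, count = parse_paths_line(line)
        key = bits if prefix_len is None else bits[:prefix_len]
        clusters[key].append((word, count))
    return dict(clusters)


# Hypothetical output lines, for illustration only.
sample = [
    "0010\tdog\t12",
    "0010\tcat\t9",
    "0011\tNew York\t4",
    "110\teats\t7",
]

coarse = group_by_prefix(sample, prefix_len=3)
for bits, members in coarse.items():
    print(bits, members)
```

With `prefix_len=None` the words are grouped by their full bit id, which corresponds to the clusters as written by the program.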

References

If you use this code, please cite:

Simon Šuster and Gertjan van Noord (2014) From neighborhood to parenthood: the advantages of dependency representation over bigrams in Brown clustering. COLING. See also induced clusters and experimental details.

Other references:

Brown, et al.: Class-Based n-gram Models of Natural Language
Liang: Semi-supervised learning for natural language processing
On dependency language models
- Chen et al. (2012) Utilizing Dependency Language Models for Graph-based Dependency Parsing Models
- Popel and Mareček (2010) Perplexity of n-Gram and Dependency Language Models
- Shen et al. (2008) A New String-to-Dependency Machine Translation Algorithm with a Target Dependency Language Model

Compile

make

Run

Cluster input.txt into 50 clusters (--max-ind-level controls the amount of verbose output):

./wcluster --text input.txt --c 50 --max-ind-level 3
# Output in input-c50-p1.out/paths

Changes for dependency clustering

Changes to the original code were made in the following files/functions:

wcluster.cc
- read_text_process_word()
- read_text()
- incorporate_new_phrase()
- create_initial_clusters()
- compute_cluster_distribs()
- main()
strdb.cc
- read_text()

All modifications in the source code are marked as comments beginning with “dlm”.

Acknowledgments

Thanks to Percy Liang for clarifications about parts of the original code.

Copyright

(C) Copyright Simon Šuster

Permission is granted for anyone to copy, use, or modify these programs and accompanying documents for purposes of research or education, provided this copyright notice is retained, and note is made of any changes that have been made.

These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user’s own risk.

http://www.let.rug.nl/suster/
