Attic

Some notes that should not be thrown away yet.

German: POS tagging with the Stanford tagger

Why not NLTK:

Hence, Stanford:

From the documentation (after adapting the shell script to use Java 8):

./stanford-postagger.sh models/wsj-0-18-left3words-distsim.tagger sample-input.txt

A test with German text:

./stanford-postagger.sh models/german-hgc.tagger german-input.txt

Problem: the tagger seems to mishandle umlauts :-(

Create a default configuration file with comments

java -classpath stanford-postagger.jar:lib/* edu.stanford.nlp.tagger.maxent.MaxentTagger \
    -genprops > myPropsFile.prop

Ideas for Improvements

  • collect values for y and omit matches which contain a frequent y, e.g., "Ezer Weizman President Israel"
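A minimal sketch of this idea, under assumptions these notes do not confirm: each match is taken to be a `(text, y)` pair, and a y counts as "frequent" when it occurs more than `max_count` times. The function name `drop_frequent_y`, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def drop_frequent_y(matches, max_count=2):
    """Return only the matches whose y value occurs at most max_count times.

    `matches` is assumed to be a list of (text, y) pairs; this layout is an
    illustration, not the actual data structure behind these notes.
    """
    # First pass: count how often each y value appears across all matches.
    y_counts = Counter(y for _, y in matches)
    # Second pass: keep a match only if its y is not too frequent.
    return [(text, y) for text, y in matches if y_counts[y] <= max_count]


# Hypothetical toy data: "Israel" appears three times, "Norway" once.
toy_matches = [
    ("match 1", "Israel"),
    ("match 2", "Israel"),
    ("match 3", "Israel"),
    ("match 4", "Norway"),
]

filtered = drop_frequent_y(toy_matches, max_count=2)
print(filtered)
```

A fixed count is the simplest cut-off; a relative threshold (for example, a share of all matches) might suit larger collections better, but that choice is left open here.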