Some notes that should not be thrown away yet.
Why not NLTK:
Hence, Stanford:
From the documentation (after adapting the shell script to use Java 8):
./stanford-postagger.sh models/wsj-0-18-left3words-distsim.tagger sample-input.txt
A test with German text:
./stanford-postagger.sh models/german-hgc.tagger german-input.txt
Problem: the tagger seems to have trouble with umlauts :-(
Create a default configuration file with comments:
java -classpath "stanford-postagger.jar:lib/*" edu.stanford.nlp.tagger.maxent.MaxentTagger \
-genprops > myPropsFile.prop
- Collect the values of y and omit matches that contain a frequent y, e.g., "Ezer Weizman President Israel"
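
A minimal sketch of the frequent-y filter, under assumptions that are not in these notes: each match is taken to be an (x, y) pair, "frequent" is taken to mean that y occurs more than max_count times across all matches, and the function name drop_frequent_y, the threshold and the sample pairs are made up for illustration.

```python
from collections import Counter


def drop_frequent_y(matches, max_count=2):
    """Keep only the matches whose y value occurs at most max_count times.

    matches is assumed to be a list of (x, y) pairs; max_count is an
    assumed threshold, not a value taken from the notes.
    """
    # Collect the values for y and count how often each one occurs.
    y_counts = Counter(y for _, y in matches)
    # Omit every match whose y is frequent.
    return [(x, y) for x, y in matches if y_counts[y] <= max_count]


# Hypothetical matches, for illustration only.
matches = [
    ("Ezer Weizman", "President"),
    ("Person A", "President"),
    ("Person B", "President"),
    ("Person C", "Chancellor"),
]

print(drop_frequent_y(matches))
# prints [('Person C', 'Chancellor')]
```

With the default threshold, "President" occurs three times and is treated as frequent, so only the "Chancellor" match is kept. A relative threshold (a share of all matches) might work better on a larger corpus; that choice is left open here.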