Demo at Taipei.py on 2013/04/05.
Slides : http://www.slideshare.net/rueshyna/text-mining-20087054
Video : https://www.youtube.com/watch?v=svGf5Vxyx60&feature=c4-feed-u
The larger your data set, the longer these programs take to run.
This talk uses the title column of the train-sample file from Stack Overflow as the example data.
freq.py counts terms and plots a chart.
> python freq.py
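The counting step can be sketched with the standard library alone. This is only a minimal sketch and not the actual freq.py: the `titles` list, the `count_terms` helper, and the whitespace tokenization are assumptions made for illustration, and the plotting step is left out.

```python
from collections import Counter


def count_terms(titles):
    """Count how often each lowercased, whitespace-separated term occurs."""
    counts = Counter()
    for title in titles:
        counts.update(title.lower().split())
    return counts


# Hypothetical sample titles standing in for the real title column.
titles = [
    "How to parse JSON in Python",
    "Parse XML in Java",
    "Python list comprehension",
]

counts = count_terms(titles)

# Show the three most frequent terms with their counts.
for term, n in counts.most_common(3):
    print(term, n)
```

The resulting `Counter` can be handed to any plotting library to draw the chart.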
You need to download the POS tagger model first.
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
This model uses the Penn Treebank II tag set.
> python pos.py
The Penn Treebank II tags are too detailed for this example, so I simplified them. Please refer to the NLTK API documentation for the simplified tags.
> python freq_pos.py
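The idea of collapsing detailed tags into coarse ones before counting can be sketched as follows. The prefix mapping and the helper names `simplify_tag` and `count_by_tag` are hypothetical; this is not NLTK's simplified tag set and not the actual freq_pos.py. The tagged words are written by hand, not produced by a tagger.

```python
from collections import Counter

# Hypothetical coarse mapping keyed by Penn Treebank tag prefix.
# NOT NLTK's simplified tag set; it only illustrates the idea.
PREFIX_TO_COARSE = [
    ("NN", "NOUN"),  # NN, NNS, NNP, NNPS
    ("VB", "VERB"),  # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "ADJ"),   # JJ, JJR, JJS
    ("RB", "ADV"),   # RB, RBR, RBS
]


def simplify_tag(tag):
    """Map a detailed Penn Treebank tag to a coarse label."""
    for prefix, coarse in PREFIX_TO_COARSE:
        if tag.startswith(prefix):
            return coarse
    return "OTHER"


def count_by_tag(tagged_words):
    """Count (word, tag) pairs by their coarse label."""
    return Counter(simplify_tag(tag) for _, tag in tagged_words)


# Hand-written (word, tag) pairs used as sample input.
tagged = [
    ("parsing", "VBG"),
    ("JSON", "NNP"),
    ("files", "NNS"),
    ("quickly", "RB"),
    ("in", "IN"),
]

tag_counts = count_by_tag(tagged)
print(tag_counts)
```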
Here the collocation window size is set to 5, which means it looks at the next 5 words.
I forgot to convert the text to lower case in this program, so please be careful about upper/lower case differences.
> python collocation.py
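A minimal sketch of windowed pair counting, including the lowercasing step that the program skips. The `window_pairs` helper and its "next `window` words" definition are assumptions based on the description above; NLTK's own collocation finders may define the window differently, and this is not the actual collocation.py.

```python
from collections import Counter


def window_pairs(words, window=5, lowercase=True):
    """Count (word, later_word) pairs where later_word is one of the
    next `window` words after word."""
    if lowercase:
        words = [w.lower() for w in words]
    pairs = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            pairs[(word, other)] += 1
    return pairs


# Hypothetical sample text with a case difference ("Python" vs "python").
words = "Python list and python dict".split()

cased = window_pairs(words, window=5, lowercase=False)
folded = window_pairs(words, window=5, lowercase=True)

# Without lowercasing, the two spellings are counted as different words.
print(cased[("Python", "dict")], cased[("python", "dict")])
# With lowercasing, both occurrences are merged into one pair count.
print(folded[("python", "dict")])
```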
Use a language model to generate sentences.
> python lm.py
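One simple way to generate a sentence is a bigram model that repeatedly samples the next word given the previous one. The sketch below assumes such a bigram model; lm.py may well use a different model or NLTK's own classes. The `train_bigrams` and `generate` helpers, the `<s>`/`</s>` markers, and the sample sentences are all hypothetical.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def train_bigrams(sentences):
    """Map each word to the list of words observed right after it."""
    followers = defaultdict(list)
    for sentence in sentences:
        tokens = [START] + sentence.lower().split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            followers[prev].append(nxt)
    return followers


def generate(followers, rng, max_words=20):
    """Sample words one at a time until END or max_words is reached."""
    word, out = START, []
    while len(out) < max_words:
        # Repeated followers appear several times in the list, so frequent
        # continuations are proportionally more likely to be picked.
        word = rng.choice(followers[word])
        if word == END:
            break
        out.append(word)
    return " ".join(out)


# Hypothetical training sentences standing in for the real titles.
sentences = [
    "How to parse JSON in Python",
    "Parse XML in Java",
    "Python list comprehension",
]

model = train_bigrams(sentences)
sentence = generate(model, random.Random(0))
print(sentence)
```

Every adjacent word pair in the output was seen in the training sentences, but the sentence as a whole may be new.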