The pyngram.py
script uses the multiprocessing module to calculate ngrams in a txt file.
Calculate all the 3-grams in the txt file HuckleberryFinn.txt
:
python pyngram.py HuckleberryFinn.txt 3
which will create a tab seperated file called 3grams.tsv
with the sorted counts of 3-grams in The Adventures of Huckleberry Finn by Mark Twain. Take a look at the output file:
head -15 3grams.tsv
3gram count
out of the 60
was going to 47
said it was 35
by and by 35
a lot of 32
and then he 29
the old man 28
the middle of 27
kind of a 25
and it was 25
it was a 24
one of them 23
a couple of 23
there was a 22
Although the demo input txt file is small, the ngram counting script in this repo is capable of counting ngrams in much larger input files (which are currently first turned into one long list of lines/strings).