This script processes Wikipedia article dumps from https://dumps.wikimedia.org/enwiki/ and gathers word frequency distribution data. It uses wikiextractor to extract raw article text from the dumps, then strips punctuation marks and normalizes unicode dashes and apostrophes. Words containing digits are discarded, and only words used in at least 3 different articles are counted.
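Roughly, the cleanup and counting work like the sketch below. The function names, regexes and translation tables here are illustrative and not the actual internals of gather_wordfreq.py:

    import re
    from collections import Counter

    # Illustrative translation table: fold common unicode dashes and
    # apostrophes to their ASCII counterparts.
    TRANSLATE = {ord(c): "-" for c in "\u2010\u2011\u2012\u2013\u2014\u2015"}
    TRANSLATE.update({ord(c): "'" for c in "\u2018\u2019\u02bc"})

    PUNCT = re.compile(r"[^\w'\- ]+")   # anything except word chars, ', - and space
    HAS_DIGIT = re.compile(r"\d")

    def tokenize(text):
        text = text.translate(TRANSLATE)
        text = PUNCT.sub(" ", text)
        # drop any token that contains a digit
        return [w.lower() for w in text.split() if not HAS_DIGIT.search(w)]

    def count_words(articles):
        """articles: an iterable of plain-text article bodies (wikiextractor output)."""
        totals = Counter()    # word -> total number of uses
        in_docs = Counter()   # word -> number of distinct articles using it
        for body in articles:
            words = tokenize(body)
            totals.update(words)
            in_docs.update(set(words))
        # keep only words used in at least 3 different articles
        return {w: n for w, n in totals.items() if in_docs[w] >= 3}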
The script was inspired by this article, which unfortunately provided rather inaccurate data: its counts included punctuation marks and other artifacts.
Install Git submodules:
git submodule init && git submodule update
Download the Wikipedia dumps (the example below uses the 20150602 snapshot):
wget -np -r --accept-regex 'https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-articles[0-9].*' https://dumps.wikimedia.org/enwiki/20150602/
Collect data:
./gather_wordfreq.py dumps.wikimedia.org/enwiki/20150602/*.bz2 > wordfreq.txt
The word frequency data for the enwiki-20150602 dump is provided at results/enwiki-20150602-words-frequency.txt:
- Total different words: 1,901,124
- Total word uses: 1,562,759,958
- Top 20 most popular words: the, of, and, in, to, was, is, for, as, on, with, by, he, that, at, from, his, it, an, were.
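The summary numbers above can be recomputed from the frequency file. The snippet below is a sketch that assumes one whitespace-separated "word count" pair per line; the actual output format of gather_wordfreq.py may differ:

    from collections import Counter

    # Assumption: wordfreq.txt has one "word count" pair per line.
    counts = Counter()
    with open("wordfreq.txt") as f:
        for line in f:
            word, count = line.split()
            counts[word] = int(count)

    print("Total different words:", len(counts))
    print("Total word uses:", sum(counts.values()))
    print("Top 20:", ", ".join(w for w, _ in counts.most_common(20)))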
A handy little script is included that converts the text data into a pickled dict of the logarithm of each word's probability, which can then be used for splitting run-together words with the Viterbi algorithm:
./wordfreq_to_viterbi.py < wordfreq.txt > wordfreq_log.pickle
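For reference, here is a minimal sketch of how such a pickled dict might be used for Viterbi-style word splitting. It assumes the pickle is a plain dict mapping word -> log probability; the constants (UNKNOWN, MAX_WORD_LEN) and the function name are illustrative and not part of the repository's scripts:

    import math
    import pickle

    # Assumption: wordfreq_log.pickle is a plain dict {word: log(probability)}.
    with open("wordfreq_log.pickle", "rb") as f:
        LOG_PROB = pickle.load(f)

    UNKNOWN = -30.0        # crude penalty for out-of-vocabulary chunks
    MAX_WORD_LEN = 30      # longest candidate word to consider

    def viterbi_split(text):
        """Split a run-together string into its most probable word sequence."""
        n = len(text)
        best = [0.0] + [-math.inf] * n   # best[i] = best log prob of text[:i]
        back = [0] * (n + 1)             # back[i] = start of the last word in text[:i]
        for i in range(1, n + 1):
            for j in range(max(0, i - MAX_WORD_LEN), i):
                word = text[j:i]
                score = best[j] + LOG_PROB.get(word, UNKNOWN)
                if score > best[i]:
                    best[i], back[i] = score, j
        # reconstruct the segmentation from the backpointers
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))

    print(viterbi_split("combinedwords"))   # e.g. ['combined', 'words'] with typical English frequencies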