This script processes Wikipedia article dumps from https://dumps.wikimedia.org/enwiki/ and gathers word frequency distribution data. It uses wikiextractor to extract raw article text from the dumps, then strips punctuation marks and normalizes unicode dashes and apostrophes. Words containing digits are discarded, and only words used in at least 3 different articles are counted.
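Roughly, the cleanup and counting work like the sketch below. The function names, regexes and translation tables here are illustrative and not the actual internals of gather_wordfreq.py:

    import re
    from collections import Counter

    # Illustrative translation table: fold common unicode dashes and
    # apostrophes to their ASCII counterparts.
    TRANSLATE = {ord(c): "-" for c in "\u2010\u2011\u2012\u2013\u2014\u2015"}
    TRANSLATE.update({ord(c): "'" for c in "\u2018\u2019\u02bc"})

    PUNCT = re.compile(r"[^\w'\- ]+")   # anything except word chars, ', - and space
    HAS_DIGIT = re.compile(r"\d")

    def tokenize(text):
        text = text.translate(TRANSLATE)
        text = PUNCT.sub(" ", text)
        # drop any token that contains a digit
        return [w.lower() for w in text.split() if not HAS_DIGIT.search(w)]

    def count_words(articles):
        """articles: an iterable of plain-text article bodies (wikiextractor output)."""
        totals = Counter()    # word -> total number of uses
        in_docs = Counter()   # word -> number of distinct articles using it
        for body in articles:
            words = tokenize(body)
            totals.update(words)
            in_docs.update(set(words))
        # keep only words used in at least 3 different articles
        return {w: n for w, n in totals.items() if in_docs[w] >= 3}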
The script was inspired by this article, which unfortunately provided rather inaccurate data: its counts included punctuation marks and other artifacts.
Install Git submodules:
git submodule init && git submodule update
Download the Wikipedia dumps (the example below uses the 20150602 snapshot):
wget -np -r --accept-regex 'https://dumps.wikimedia.org/enwiki/20150602/enwiki-20150602-pages-articles[0-9].*' https://dumps.wikimedia.org/enwiki/20150602/
Collect data:
./gather_wordfreq.py dumps.wikimedia.org/enwiki/20150602/*.bz2 > wordfreq.txt
The word frequency data for the enwiki-20150602 dump is provided at results/enwiki-20150602-words-frequency.txt:
- Total different words: 1,901,124
- Total word uses: 1,562,759,958
- Top 20 most popular words: the, of, and, in, to, was, is, for, as, on, with, by, he, that, at, from, his, it, an, were.
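The summary numbers above can be recomputed from the frequency file. The snippet below is a sketch that assumes one whitespace-separated "word count" pair per line; the actual output format of gather_wordfreq.py may differ:

    from collections import Counter

    # Assumption: wordfreq.txt has one "word count" pair per line.
    counts = Counter()
    with open("wordfreq.txt") as f:
        for line in f:
            word, count = line.split()
            counts[word] = int(count)

    print("Total different words:", len(counts))
    print("Total word uses:", sum(counts.values()))
    print("Top 20:", ", ".join(w for w, _ in counts.most_common(20)))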
A handy little script is included that converts the text data into a pickled dict of the logarithm of each word's probability, which can then be used for splitting run-together words with the Viterbi algorithm:
./wordfreq_to_viterbi.py < wordfreq.txt > wordfreq_log.pickle
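For reference, here is a minimal sketch of how such a pickled dict might be used for Viterbi-style word splitting. It assumes the pickle is a plain dict mapping word -> log probability; the constants (UNKNOWN, MAX_WORD_LEN) and the function name are illustrative and not part of the repository's scripts:

    import math
    import pickle

    # Assumption: wordfreq_log.pickle is a plain dict {word: log(probability)}.
    with open("wordfreq_log.pickle", "rb") as f:
        LOG_PROB = pickle.load(f)

    UNKNOWN = -30.0        # crude penalty for out-of-vocabulary chunks
    MAX_WORD_LEN = 30      # longest candidate word to consider

    def viterbi_split(text):
        """Split a run-together string into its most probable word sequence."""
        n = len(text)
        best = [0.0] + [-math.inf] * n   # best[i] = best log prob of text[:i]
        back = [0] * (n + 1)             # back[i] = start of the last word in text[:i]
        for i in range(1, n + 1):
            for j in range(max(0, i - MAX_WORD_LEN), i):
                word = text[j:i]
                score = best[j] + LOG_PROB.get(word, UNKNOWN)
                if score > best[i]:
                    best[i], back[i] = score, j
        # reconstruct the segmentation from the backpointers
        words, i = [], n
        while i > 0:
            words.append(text[back[i]:i])
            i = back[i]
        return list(reversed(words))

    print(viterbi_split("combinedwords"))   # e.g. ['combined', 'words'] with typical English frequencies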