Convert Wikipedia database dumps into plain text files (JSON). This can parse essentially all of Wikipedia with high fidelity. A converted copy is available on Kaggle Datasets.
- Download and unzip a Wikipedia dump (see Data Sources below); make sure you get a monolithic XML file
- Open up `wiki_to_text.py` and edit the filename to point at your XML file. Also update the `savedir` location
- Run `wiki_to_text.py` (a sketch of the conversion step follows below)
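For orientation, here is a minimal sketch of what that conversion step involves, assuming a standard MediaWiki pages-articles dump. The paths, the JSON-lines output layout, and the `local()` helper are illustrative rather than the actual interface of `wiki_to_text.py`, and the real script additionally converts wikitext markup to plain text, which this sketch skips:

```python
import json
import os
import xml.etree.ElementTree as ET

DUMP_PATH = "wiki/simplewiki.xml"  # hypothetical path; point this at your dump
SAVE_DIR = "wiki/out"              # hypothetical save directory

def local(tag):
    # Dump elements are namespaced, e.g. "{http://www.mediawiki.org/xml/export-0.10/}page"
    return tag.rsplit("}", 1)[-1]

os.makedirs(SAVE_DIR, exist_ok=True)
with open(os.path.join(SAVE_DIR, "articles.jsonl"), "w", encoding="utf-8") as out:
    # iterparse streams the file, so even the ~80GB English dump never sits in memory
    for _, elem in ET.iterparse(DUMP_PATH, events=("end",)):
        if local(elem.tag) != "page":
            continue
        title, text = None, None
        for child in elem.iter():
            if local(child.tag) == "title":
                title = child.text
            elif local(child.tag) == "text":
                text = child.text  # raw wikitext; the real script strips markup here
        if title and text:
            out.write(json.dumps({"title": title, "text": text}) + "\n")
        elem.clear()  # drop the finished subtree to keep memory use flat
```

Each output line is a single article as a JSON object, so downstream code can stream the result back in with one `json.loads` per line.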
There are two primary data sources you'll want to use. See the table below for the root URLs.
| Name | Description | Link |
|---|---|---|
| Simplified English Wikipedia | Only about 1 GB, which makes it a great test set | https://dumps.wikimedia.org/simplewiki/ |
| English Wikipedia | All of Wikipedia, about 80 GB unpacked | https://dumps.wikimedia.org/enwiki/ |
Navigate into the latest dump. You're likely looking for the very first file in the download section. The files will look something like this:
    enwiki-20210401-pages-articles-multistream.xml.bz2      18.1 GB
    simplewiki-20210401-pages-articles-multistream.xml.bz2  203.5 MB
Download and extract these to a storage directory. I usually shorten the folder name and filename; a sketch of this step follows below.
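Here is a minimal sketch of that download-and-extract step, assuming the Simple English dump listed above. The URL and paths are illustrative, and running `bzip2 -d` on the command line works just as well:

```python
import bz2
import os
import shutil
import urllib.request

# Illustrative URL and paths; substitute the dump file you actually want
URL = ("https://dumps.wikimedia.org/simplewiki/20210401/"
       "simplewiki-20210401-pages-articles-multistream.xml.bz2")
ARCHIVE = "wiki/simplewiki.xml.bz2"  # shortened names, as suggested above
XML_OUT = "wiki/simplewiki.xml"

os.makedirs("wiki", exist_ok=True)
urllib.request.urlretrieve(URL, ARCHIVE)

# bz2.open handles multistream archives; copying in chunks means the
# decompressed XML never has to fit in memory
with bz2.open(ARCHIVE, "rb") as fin, open(XML_OUT, "wb") as fout:
    shutil.copyfileobj(fin, fout, length=1 << 20)  # 1 MiB chunks
```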
Wikipedia is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA); see https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content for the reuse terms.
My script is published under the MIT license, but this does not confer the same privileges on the material you convert with it.