##Parsing Wikipedia markup is basically NP-Hard
It's really the worst. I'm trying my best.
wtf_wikipedia turns Wikipedia article markup into JSON, and handles many vile recursive template shenanigans, half-XML implementations, deprecated and obscure template variants, and illicit wiki-esque shorthands.
Making your own parser is never a good idea, but this library is a very detailed and deliberate creature. Place your faith well within.
npm install wtf_wikipedia
then:
var wikipedia = require("wtf_wikipedia")
//fetch wikipedia markup from api..
wikipedia.from_api("Toronto", "en", function(markup){
  var obj = wikipedia.parse(markup)
  // {text:[...], infobox:{}, categories:[...], images:[] }
  var mayor = obj.infobox.leader_name
  // "John Tory"
})
if you only want some nice plaintext, and no junk:
var text= wikipedia.plaintext(markup)
// "Toronto is the most populous city in Canada and the provincial capital..."
to call non-English Wikipedia APIs, pass the language code as the second parameter to from_api:
wikipedia.from_api("Toronto", "de", function(markup){
  var text = wikipedia.plaintext(markup)
  // Toronto ist mit 2,6 Millionen Einwohnern..
})
you may also pass the wikipedia page id as parameter instead of the page title:
wikipedia.from_api(64646, "de", function(markup){
//...
})
Wikimedia's Parsoid JavaScript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not into valid XML. That means to use it, you need a [wikiscript -> virtual DOM -> screen-scraping] flow, and getting structured data out of that is a challenge.
This library is built to work well with wikipedia-to-mongo, letting you parse a wikipedia dump in nodejs easily.
#What it does
- Detects and parses redirects and disambiguation pages
- Parses infoboxes into a formatted key-value object
- Handles recursive templates and links, like [[.. [[...]] ]]
- Per-sentence plaintext and link resolution
- Parses and formats internal links
- Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
- Parses images, files, and categories
- Eliminates xml, latex, css, table-sorting, and 'Egyptian hieroglyphics' cruft
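To give a feel for why nested links like [[.. [[...]] ]] are awkward, here is a minimal sketch of one way to find them: walk the string and keep a depth counter instead of using a regex. This is not how wtf_wikipedia does it internally; `findTopLevelLinks` is a made-up helper for illustration only.

```javascript
// Hypothetical helper, not part of the wtf_wikipedia API.
// Returns every outermost [[...]] span, including any links nested inside it.
function findTopLevelLinks(markup) {
  var spans = []
  var depth = 0
  var start = -1
  for (var i = 0; i < markup.length - 1; i++) {
    var pair = markup.slice(i, i + 2)
    if (pair === "[[") {
      if (depth === 0) {
        start = i // remember where the outermost link opens
      }
      depth += 1
      i += 1 // skip the second bracket
    } else if (pair === "]]" && depth > 0) {
      depth -= 1
      if (depth === 0) {
        spans.push(markup.slice(start, i + 2))
      }
      i += 1 // skip the second bracket
    }
  }
  return spans
}

var markup = "a [[File:x.jpg|thumb|see [[Toronto]] here]] b [[Canada]]"
console.log(findTopLevelLinks(markup))
// [ '[[File:x.jpg|thumb|see [[Toronto]] here]]', '[[Canada]]' ]
```

A plain regex like `/\[\[.*?\]\]/` would stop at the first `]]` and cut the image link in half, which is the kind of thing the depth counter avoids.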
m ok, lets write our own parser what culd go rong
It's a combination of instaview and txtwiki, and uses the inter-language data from the Parsoid JavaScript parser.
#Methods
- .parse(markup) - turns wikipedia markup into a nice json object
- .from_api(title, lang_or_wikiid, callback) - retrieves the raw contents of a wikipedia article, or of a page on another mediawiki wiki identified by its dbname
- .plaintext(markup) - returns only nice text of the article
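If you prefer promises to callbacks, the two calls can be chained in a small wrapper. This is only a sketch: `fetchPlaintext` is a hypothetical helper, not something the library exports, and it assumes the callback receives just the markup, as in the examples above. The library object is passed in as an argument so the helper can be tried against a stub.

```javascript
// Hypothetical wrapper, not part of wtf_wikipedia.
// `wikipedia` is expected to be the object returned by require("wtf_wikipedia").
function fetchPlaintext(wikipedia, title, lang) {
  return new Promise(function (resolve, reject) {
    wikipedia.from_api(title, lang, function (markup) {
      try {
        resolve(wikipedia.plaintext(markup))
      } catch (err) {
        // surface parser errors as a rejected promise
        reject(err)
      }
    })
  })
}
```

With the real library, that would look something like `fetchPlaintext(require("wtf_wikipedia"), "Toronto", "en").then(console.log)`.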
if you're scripting this from the shell, install it globally (npm install -g wtf_wikipedia), and:
$ wikipedia_plaintext George Clooney
# George Timothy Clooney (born May 6, 1961) is an American actor ...
$ wikipedia Toronto Blue Jays
# {text:[...], infobox:{}, categories:[...], images:[] }
#Output
Sample output for Royal Cinema
{
"text": {
"Intro": [
{
"text": "The Royal Cinema is an Art Moderne event venue and cinema in Toronto, Canada.",
"links": [
{
"page": "Art Moderne"
},
{
"page": "Movie theater",
"src": "cinema"
},
{
"page": "Toronto"
}
]
},
...
{
"text": "The Royal was featured in the 2013 film The F Word.",
"links": [
{
"page": "The F Word (2013 film)",
"src": "The F Word"
}
]
}
]
},
"categories": [
"National Historic Sites in Ontario",
"Cinemas and movie theatres in Toronto",
"Streamline Moderne architecture in Canada",
"Theatres completed in 1939"
],
"images": [
"Royal_Cinema.JPG"
],
"infobox": {
"former_name": {
"text": "The Pylon, The Golden Princess"
},
"address": {
"text": "608 College Street",
"links": [
{
"page": "College Street (Toronto)",
"src": "College Street"
}
]
},
"opened": {
"text": 1939
},
...
}
}
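Given output shaped like the sample above, pulling out every linked page is a short walk over sections and sentences. A minimal sketch, assuming the `text` object maps section names to arrays of sentences with an optional `links` array; `collectLinks` is a made-up name, and the sample data is trimmed from the Royal Cinema output.

```javascript
// Hypothetical helper: gather the `page` of every link in the parsed text.
function collectLinks(parsed) {
  var pages = []
  Object.keys(parsed.text || {}).forEach(function (section) {
    parsed.text[section].forEach(function (sentence) {
      var links = sentence.links || [] // sentences without links have no array
      links.forEach(function (link) {
        pages.push(link.page)
      })
    })
  })
  return pages
}

// Trimmed copy of the sample output
var sample = {
  text: {
    Intro: [
      {
        text: "The Royal Cinema is an Art Moderne event venue and cinema in Toronto, Canada.",
        links: [
          { page: "Art Moderne" },
          { page: "Movie theater", src: "cinema" },
          { page: "Toronto" }
        ]
      },
      {
        text: "The Royal was featured in the 2013 film The F Word.",
        links: [{ page: "The F Word (2013 film)", src: "The F Word" }]
      }
    ]
  }
}

console.log(collectLinks(sample))
// [ 'Art Moderne', 'Movie theater', 'Toronto', 'The F Word (2013 film)' ]
```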
Sample Output for Whistling
{ type: 'page',
text:
{ 'Intro': [ [Object], [Object], [Object], [Object] ],
'Musical/melodic whistling':
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
'Functional whistling': [ [Object], [Object], [Object], [Object], [Object], [Object] ],
'Whistling as a form of communication':
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
'Sport': [ [Object], [Object], [Object], [Object], [Object] ],
'Superstition':
[ [Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object],
[Object] ],
' Whistling competitions': [ [Object], [Object], [Object], [Object] ]
},
'categories': [ 'Oral communication', 'Vocal music', 'Vocal skills' ],
'images': [ 'Image:Duveneck Whistling Boy.jpg' ],
'infobox': {} }
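Since `text` is keyed by section, a quick per-section summary is easy to build. A small sketch, assuming the shape shown above; `sentenceCounts` is a hypothetical helper, and it trims section names because some come back with stray whitespace (like ' Whistling competitions').

```javascript
// Hypothetical helper: map each section name to its number of sentences.
function sentenceCounts(parsed) {
  var counts = {}
  Object.keys(parsed.text || {}).forEach(function (section) {
    counts[section.trim()] = parsed.text[section].length
  })
  return counts
}

// Tiny stand-in for a parsed page
var page = {
  text: {
    "Intro": [{}, {}, {}, {}],
    " Whistling competitions": [{}, {}, {}, {}]
  }
}

console.log(sentenceCounts(page))
// { Intro: 4, 'Whistling competitions': 4 }
```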
Don't be mad at me, be mad at them
MIT