GitHub - me-12/wtf_wikipedia: parse wikipedia markup into json

##Parsing Wikipedia markup is basically NP-Hard

It's really the worst. I'm trying my best.

wtf_wikipedia turns Wikipedia article markup into JSON, and handles many vile recursive template shenanigans, half-XML implementations, deprecated and obscure template variants, and illicit wiki-esque shorthands.

Making your own parser is never a good idea, but this library is a very detailed and deliberate creature. Place your faith well within.

npm install wtf_wikipedia

then:

var wikipedia = require("wtf_wikipedia")
//fetch wikipedia markup from api..
wikipedia.from_api("Toronto", "en", function(markup){
  var obj= wikipedia.parse(markup)
  // {text:[...], infobox:{}, categories:[...], images:[] }
  var mayor= obj.infobox.leader_name
  // "John Tory"
})

if you only want some nice plaintext, and no junk:

var text= wikipedia.plaintext(markup)
// "Toronto is the most populous city in Canada and the provincial capital..."

to call non-English Wikipedia APIs, pass the language code as the second parameter to from_api

wikipedia.from_api("Toronto", "de", function(markup){
  var text= wikipedia.plaintext(markup)
  //Toronto ist mit 2,6 Millionen Einwohnern..
})

you may also pass the Wikipedia page id as the first parameter instead of the page title:

wikipedia.from_api(64646, "de", function(markup){
  //...
})
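For a rough idea of what a title-or-id lookup looks like against the public MediaWiki API, here is a minimal sketch. `buildApiUrl` is a hypothetical helper written for this README, not a function of wtf_wikipedia, and the library may build its requests differently. It assumes a numeric first argument is a page id and anything else is a title.

```javascript
// Hypothetical helper, not part of wtf_wikipedia: builds a MediaWiki API URL
// that asks for the raw wikitext of one page on a given language edition.
function buildApiUrl(page, lang) {
  const params = new URLSearchParams({
    action: "query",
    prop: "revisions",
    rvprop: "content",
    format: "json",
  });
  // Assumption: a number is a page id, anything else is a page title.
  if (typeof page === "number") {
    params.set("pageids", String(page));
  } else {
    params.set("titles", page);
  }
  return "https://" + (lang || "en") + ".wikipedia.org/w/api.php?" + params.toString();
}

console.log(buildApiUrl("Toronto", "de"));
console.log(buildApiUrl(64646, "de"));
```

The only difference between the two calls is whether the query uses `titles` or `pageids`.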

Wikimedia's Parsoid JavaScript parser is the official wikiscript parser. It reliably turns wikiscript into HTML, but not into valid XML. To use it you need a [wikiscript -> virtual DOM -> screen-scraping] flow, and getting structured data out of that is a challenge.

This library is built to work well with wikipedia-to-mongo, letting you parse a wikipedia dump in nodejs easily.

#What it does

Detects and parses redirects and disambiguation pages
Parses infoboxes into a formatted key-value object
Handles recursive templates and links, like [[.. [[...]] ]]
Resolves plaintext and links per sentence
Parses and formats internal links
Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
Parses images, files, and categories
Eliminates xml, latex, css, table-sorting, and 'Egyptian hieroglyphics' cruft
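To show why nested links like [[.. [[...]] ]] are awkward for a plain regex, here is a minimal sketch that finds top-level link spans with a depth counter. It is an illustration only, not the code this library uses, and `topLevelLinks` is a made-up name.

```javascript
// Illustrative only: returns each top-level [[...]] span in the markup,
// keeping any nested [[...]] inside it intact.
function topLevelLinks(markup) {
  const spans = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < markup.length - 1; i++) {
    const pair = markup.slice(i, i + 2);
    if (pair === "[[") {
      if (depth === 0) start = i; // remember where the outermost link opens
      depth++;
      i++; // skip the second bracket of the pair
    } else if (pair === "]]" && depth > 0) {
      depth--;
      if (depth === 0) spans.push(markup.slice(start, i + 2));
      i++; // skip the second bracket of the pair
    }
  }
  return spans;
}

console.log(topLevelLinks("[[File:A.jpg|thumb|a [[cat]] photo]] and [[Toronto]]"));
```

The inner [[cat]] stays inside the File span instead of being reported on its own, which is the behaviour a non-greedy regex gets wrong.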

m ok, lets write our own parser what culd go rong

It's a combination of instaview and txtwiki, and it uses the inter-language data from the Parsoid javascript parser.

#Methods

.parse(markup) - turns wikipedia markup into a nice json object
.from_api(title, lang_or_wikiid, callback) - retrieves the raw contents of a Wikipedia article, or of an article on another MediaWiki wiki identified by its dbname
.plaintext(markup) - returns only nice text of the article

if you're scripting this from the shell, install it globally (npm install -g wtf_wikipedia), and:

$ wikipedia_plaintext George Clooney
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wikipedia Toronto Blue Jays
# {text:[...], infobox:{}, categories:[...], images:[] }

#Output

Sample output for Royal Cinema

{
  "text": {
    "Intro": [
      {
        "text": "The Royal Cinema is an Art Moderne event venue and cinema in Toronto, Canada.",
        "links": [
          {
            "page": "Art Moderne"
          },
          {
            "page": "Movie theater",
            "src": "cinema"
          },
          {
            "page": "Toronto"
          }
        ]
      },
      ...
      {
        "text": "The Royal was featured in the 2013 film The F Word.",
        "links": [
          {
            "page": "The F Word (2013 film)",
            "src": "The F Word"
          }
        ]
      }
    ]
  },
  "categories": [
    "National Historic Sites in Ontario",
    "Cinemas and movie theatres in Toronto",
    "Streamline Moderne architecture in Canada",
    "Theatres completed in 1939"
  ],
  "images": [
    "Royal_Cinema.JPG"
  ],
  "infobox": {
    "former_name": {
      "text": "The Pylon, The Golden Princess"
    },
    "address": {
      "text": "608 College Street",
      "links": [
        {
          "page": "College Street (Toronto)",
          "src": "College Street"
        }
      ]
    },
    "opened": {
      "text": 1939
    },
    ...
  }
}
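Since every sentence carries its own links, pulling out all linked pages is a short walk over the text object. A minimal sketch, using a trimmed copy of the sample above in place of a real wikipedia.parse(markup) result; `linkedPages` is a hypothetical helper, not part of the library.

```javascript
// Trimmed copy of the Royal Cinema sample; in practice this object would
// come from wikipedia.parse(markup).
const page = {
  text: {
    Intro: [
      {
        text: "The Royal Cinema is an Art Moderne event venue and cinema in Toronto, Canada.",
        links: [{ page: "Art Moderne" }, { page: "Movie theater", src: "cinema" }, { page: "Toronto" }],
      },
      {
        text: "The Royal was featured in the 2013 film The F Word.",
        links: [{ page: "The F Word (2013 film)", src: "The F Word" }],
      },
    ],
  },
};

// Hypothetical helper: collect every linked page title, across all
// sections and sentences. Sentences without links are skipped.
function linkedPages(parsed) {
  const pages = [];
  for (const sentences of Object.values(parsed.text || {})) {
    for (const sentence of sentences) {
      for (const link of sentence.links || []) {
        pages.push(link.page);
      }
    }
  }
  return pages;
}

console.log(linkedPages(page));
```

Note that `page` is the link target and `src` is the text as it appeared in the sentence, so "cinema" resolves to "Movie theater".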

Sample Output for Whistling

{ type: 'page',
  text:
   { 'Intro': [ [Object], [Object], [Object], [Object] ],
     'Musical/melodic whistling':
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     'Functional whistling': [ [Object], [Object], [Object], [Object], [Object], [Object] ],
     'Whistling as a form of communication':
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     'Sport': [ [Object], [Object], [Object], [Object], [Object] ],
     'Superstition':
      [ [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object],
        [Object] ],
     ' Whistling competitions': [ [Object], [Object], [Object], [Object] ] },
  'categories': [ 'Oral communication', 'Vocal music', 'Vocal skills' ],
  'images': [ 'Image:Duveneck Whistling Boy.jpg' ],
  'infobox': {} }
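The text object is keyed by section name, so a per-section sentence count takes a few lines. A minimal sketch with a stand-in object shaped like the output above; `sectionSizes` is a hypothetical helper. Section names are trimmed here because, as in ' Whistling competitions', they can come back with stray whitespace.

```javascript
// Stand-in for a parsed page: section name -> array of sentence objects.
const parsed = {
  text: {
    Intro: [{}, {}, {}, {}],
    Sport: [{}, {}, {}, {}, {}],
    " Whistling competitions": [{}, {}, {}, {}],
  },
};

// Hypothetical helper: map each (trimmed) section name to its sentence count.
function sectionSizes(page) {
  const sizes = {};
  for (const [name, sentences] of Object.entries(page.text)) {
    sizes[name.trim()] = sentences.length;
  }
  return sizes;
}

console.log(sectionSizes(parsed));
```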

Don't be mad at me, be mad at them

MIT
