Merge pull request #217 from spencermountain/dev
Dev
spencermountain authored Dec 4, 2018
2 parents d400df0 + c3329ce commit aabde6e
Showing 116 changed files with 6,428 additions and 4,264 deletions.
5 changes: 0 additions & 5 deletions .babelrc

This file was deleted.

1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ npm-debug.log
package-lock.json
.nyc_output
*.DS_Store
*.tsv
4 changes: 4 additions & 0 deletions .npmignore
@@ -1,7 +1,11 @@
tests/
demo/
scripts/
examples/
.babelrc
.esformatter
.eslintrc
scratch.js
*.tsv
changelog.md
contributing.md
83 changes: 63 additions & 20 deletions README.md
@@ -1,14 +1,20 @@
<div align="center">
<a href="https://www.codacy.com/app/spencerkelly86/wtf_wikipedia">
<img src="https://api.codacy.com/project/badge/grade/e84f69487c9348ba9cd8e31031a05a4f" />
</a>
<a href="https://npmjs.org/package/wtf_wikipedia">
<img src="https://img.shields.io/npm/v/wtf_wikipedia.svg?style=flat-square" />
</a>
<a href="https://codecov.io/gh/spencermountain/wtf_wikipedia">
<img src="https://codecov.io/gh/spencermountain/wtf_wikipedia/branch/master/graph/badge.svg" />
</a>

<div>wikipedia markup parser</div>
<div><img src="https://cloud.githubusercontent.com/assets/399657/23590290/ede73772-01aa-11e7-8915-181ef21027bc.png" /></div>

<div align="center">
<a href="https://npmjs.org/package/wtf_wikipedia">
<img src="https://img.shields.io/npm/v/wtf_wikipedia.svg?style=flat-square" />
</a>
<a href="https://codecov.io/gh/spencermountain/wtf_wikipedia">
<img src="https://codecov.io/gh/spencermountain/wtf_wikipedia/branch/master/graph/badge.svg" />
</a>
<a href="https://unpkg.com/wtf_wikipedia/builds/wtf_wikipedia.min.js">
<img src="https://badge-size.herokuapp.com/spencermountain/wtf_wikipedia/master/builds/wtf_wikipedia.min.js" />
</a>
</div>

<sub>
by
<a href="https://spencermountain.github.io/">Spencer Kelly</a> and
@@ -38,17 +44,21 @@ Consider:
* the unexplained [hashing scheme](https://commons.wikimedia.org/wiki/Commons:FAQ#What_are_the_strangely_named_components_in_file_paths.3F) for image paths,
* the [custom encoding](https://en.wikipedia.org/wiki/Wikipedia:Naming_conventions_(technical_restrictions)) of whitespace and punctuation,
* [right-to-left](https://www.youtube.com/watch?v=xpumLsaAWGw) values in left-to-right templates.
* as of Nov-2018, there are [634,755](https://s3-us-west-1.amazonaws.com/spencer-scratch/allTemplates-2018-10-26.tsv) templates in wikipedia

**wtf_wikipedia** supports many ***recursive shenanigans***, deprecated and **obscure template**
variants, and illicit 'wiki-esque' shorthands.

![image](https://user-images.githubusercontent.com/399657/43598341-75ca8f94-9652-11e8-9b91-cabae4fb1dce.png)

<div align="center">
It will try its best, and fail in reasonable ways.

> → building your own parser is never a good idea →
>
> ← but this library aims to be *****a straight-forward***** way to get data ***out of wikipedia***
<p></p>
<i>building your own parser is never a good idea</i>
<div><i>
but this library aims to be <b>a straight-forward</b> way to get data <b>out of wikipedia</b>
</i></div>
</div>

<div align="center">
<sub>
@@ -59,7 +69,13 @@ It will try its best, and fail in reasonable ways.
</sub>
</div>
<div align="center">
<h3><a href="https://beta.observablehq.com/@spencermountain/wtf_wikipedia">Demo</a></h3>
<h3>
<a href="https://beta.observablehq.com/@spencermountain/wtf_wikipedia">Demo</a>
&nbsp; • &nbsp;
<a href="https://beta.observablehq.com/@spencermountain/wtf_wikipedia-tutorial">Tutorial</a>
&nbsp; • &nbsp;
<a href="https://beta.observablehq.com/@spencermountain/wtf_wikipedia-api">Api</a>
</h3>
</div>

## well ok then,
@@ -114,7 +130,7 @@ wtf.fetch('Whistling').then(doc => {
## But what about...

### Parsoid:
Wikimedia's [Parsoid javascript parser](https://www.mediawiki.org/wiki/Parsoid) is the official wikiscript parser. It
Wikimedia's [Parsoid javascript parser](https://www.mediawiki.org/wiki/Parsoid) is the official wikiscript parser, and is pretty cool. It
reliably turns wikiscript into HTML, but not valid XML.

To use it for data-mining, you'll need to:
@@ -146,6 +162,11 @@ var doc = wtf(wikiText, [options])
wtf.fetch(64646, 'en', (err, doc) => {
console.log(doc.categories());
});

//get a random german page
wtf.random('de').then(doc => {
console.log(doc.text())
});
```

<div align="center">
@@ -207,8 +228,10 @@ flip your wikimedia markup into a `Document` object

```javascript
import wtf from 'wtf_wikipedia'
wtf("==In Popular Culture==\n*harry potter's wand\n* the simpsons fence");
// Document {plaintext(), html(), latex()...}
wtf(`==In Popular Culture==
* harry potter's wand
* the simpsons fence`);
// Document {text(), html(), lists()...}
```

### **wtf.fetch(title, [lang_or_wikiid], [options], [callback])**
@@ -231,6 +254,26 @@ wtf.fetch(64646, 'de').then(console.log).catch(console.log)
```
the fetch method follows redirects.

the optional-callback pattern is the same for **wtf.random()**

```js
wtf.random(lang, options, callback)
wtf.random(lang, options).then(doc => doc.infobox())
```
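The optional-callback pattern can be sketched generically like this. This is a minimal illustration of the style, not the library's internals; `randomish` and its result shape are made-up stand-ins:

```javascript
// Minimal sketch of a method that supports both callback and promise
// styles. `randomish` is a hypothetical stand-in, not wtf_wikipedia code.
function randomish(lang, options, callback) {
  // support randomish('de', cb) — shift arguments when options is omitted
  if (typeof options === 'function') {
    callback = options;
    options = {};
  }
  // stand-in for the real async work (fetching a random page)
  const promise = Promise.resolve({ lang: lang, options: options });
  if (typeof callback === 'function') {
    // route the result through the node-style (err, doc) callback
    promise.then(doc => callback(null, doc), err => callback(err));
  }
  return promise; // callers may also chain .then() on the result
}
```

Because the method always returns the promise, both `randomish('de').then(...)` and `randomish('de', (err, doc) => ...)` work against the same implementation.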

### **wtf.category(title, [lang_or_wikiid], [options], [callback])**
retrieves all pages and sub-categories belonging to a given category:
```js
let result = await wtf.category('Category:Politicians_from_Paris');
//{
// pages: [{title: 'Paul Bacon', pageid: 1266127 }, ...],
// categories: [ {title: 'Category:Mayors of Paris' } ]
//}

//this format works too
wtf.category('National Basketball Association teams', 'en', (err, result)=>{
//
});
```

### **doc.text()**
returns only nice plain-text of the article
```js
@@ -297,9 +340,9 @@ $ wtf_wikipedia Toronto Blue Jays --json

### Good practice:
The wikipedia api is [pretty welcoming](https://www.mediawiki.org/wiki/API:Etiquette#Request_limit), though it recommends three things if you're going to hit it heavily -
* 1️⃣ pass a `Api-User-Agent` as something so they can use to easily throttle bad scripts
* 2️⃣ bundle multiple pages into one request as an array
* 3️⃣ run it serially, or at least, [slowly](https://www.npmjs.com/package/slow).
* pass an `Api-User-Agent`, so they can easily identify and throttle bad scripts
* bundle multiple pages into one request as an array
* run it serially, or at least, [slowly](https://www.npmjs.com/package/slow).
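The bundling point can be sketched with a small helper that splits a long list of titles into batches, so each request carries several pages. `chunk` is a hypothetical helper for illustration, not part of the library:

```javascript
// Split a long list of titles into batches of `size`, so each
// wtf.fetch call can carry several pages in one request.
// `chunk` is an illustrative helper, not part of wtf_wikipedia.
function chunk(titles, size) {
  const batches = [];
  for (let i = 0; i < titles.length; i += size) {
    batches.push(titles.slice(i, i + size));
  }
  return batches;
}

// chunk(['a', 'b', 'c', 'd', 'e'], 2)
// → [['a', 'b'], ['c', 'd'], ['e']]
```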
```js
wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
'Api-User-Agent': '[email protected]'
6 changes: 6 additions & 0 deletions api/index.json
@@ -506,6 +506,12 @@
"example": "",
"description": "lookup properties from their key"
},
{
"name": "template",
"returns": "String",
"example": "",
"description": "which infobox, e.g. 'Infobox Person'"
},


{
