Data captured
aliases, categories, claims, contributors, description, disambiguation, exhtml, exrest, extext, extract, files, html, image, infobox, iwlinks, label, labels, languages, lead, length, links, modified, pageid, parsetree, properties, random, redirected, redirects, title, url, url_raw, views, watchers, what, wikibase, wikidata, wikidata_url, wikitext
<list> alternative names for items in Wikidata
This is the list of aliases from Wikidata.
Example:
>>> gandhi.data['aliases'][:3]
[u'M K Gandhi',
u'Mohandas Gandhi',
u'Bapu']
Methods: get_wikidata()
<list> all categories page belongs to
This is a list of all categories that appear on the page and which the page belongs to. We get this from the Mediawiki API:Categories module.
Example:
>>> gandhi.data['categories'][:3]
[u'Category:1869 births',
u'Category:1948 deaths',
u'Category:19th-century Indian lawyers']
Methods: get_more(), get_querymore()
<dict> Wikidata statements
These are Wikidata page statements reduced to just the entity ("P" property or "Q" item) claims. Labels for these entities are stored in page.data['labels'], and the claims are rewritten with those labels into page.data['wikidata'].
Example:
>>> gandhi.data['claims']
{u'P1003': [u'RUNLRAUTH77108557', u'RUNLRAUTH777162'],
u'P1005': [u'29946'],
u'P1006': [u'068712030', u'352073632'],
u'P1015': [u'90687101'],
u'P1017': [u'ADV10274916'],
u'P102': [u'Q10225'],
u'P103': [u'Q5137'],
u'P106': [u'Q82955',
...
Methods: get_wikidata()
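The relationship between the claims, labels, and wikidata attributes can be illustrated with a small sketch. This is not the wptools implementation: `resolve_claims` is a hypothetical helper, the sample data is a subset of the values shown on this page, and the label for `P106` is an assumption added for illustration.

```python
def resolve_claims(claims, labels):
    """Rewrite entity IDs in claims as 'label (ID)' where a label is known."""
    def show(entity):
        label = labels.get(entity)
        return "%s (%s)" % (label, entity) if label else entity

    return {show(prop): [show(value) for value in values]
            for prop, values in claims.items()}


# Sample data: a subset of the values shown on this page.
# The 'P106' label is an assumption added for illustration.
claims = {"P106": ["Q82955"], "P1005": ["29946"]}
labels = {"P106": "occupation", "Q82955": "politician"}

resolved = resolve_claims(claims, labels)
print(resolved)
```

Values with no matching label, such as plain identifier strings, pass through unchanged.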
<int> total number of contributors to this page
This is the total of logged-in AND anonymous contributors from Mediawiki API:Contributors. Fascinating!
Example:
>>> gandhi.data['contributors']
2608
Methods: get_more(), get_querymore()
<str> short description
This is a short description of the page from Mediawiki or Wikidata. When it is available it is often very enlightening.
Example:
>>> gandhi.data['description']
u'pre-eminent leader of Indian nationalism during British-ruled India'
Methods: get_query(), get_wikidata()
<int> count of disambiguation links
This indicates that the resulting page is a disambiguation page. The list of pages that the disambiguation page links to can be found in page.data['links'].
Example:
>>> page = wptools.page('Gandhi (disambiguation)').get()
...
>>> page.data['disambiguation']
29
>>> page.data['links'][:3]
[u'Anne McCue', u'Gandhi (American band)', u'Gandhi (Costa Rican band)']
Methods: get_query()
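A minimal sketch of how a caller might use this attribute, working on a plain dict shaped like page.data. `disambiguation_candidates` is a hypothetical helper, and the sketch assumes the 'disambiguation' key is missing or zero on ordinary pages.

```python
def disambiguation_candidates(data):
    """Return linked titles if data describes a disambiguation page, else []."""
    if data.get("disambiguation"):
        return list(data.get("links", []))
    return []


# Sample data copied from the example above (links truncated to three).
data = {
    "disambiguation": 29,
    "links": ["Anne McCue", "Gandhi (American band)", "Gandhi (Costa Rican band)"],
}

candidates = disambiguation_candidates(data)
print(candidates)
```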
<str> RESTBase page extract in HTML
This is the RESTBase "extract_html" (summary) in limited HTML. It does not contain interwiki links, citations, or infoboxes. It is basically a truncated version of page.data['extract'].
Example:
>>> gandhi.data['exhtml'][:80]
<p>Mahātmā <b>Mohandas Karamchand Gandhi</b> (<span></span>; <small>Hindustani:
Methods: get_restbase('/page/summary')
<str> RESTBase page extract in plain text
This is the RESTBase "extract" (summary) in plain text. It is basically a truncated version of page.data['extext'].
Example:
>>> gandhi.data['exrest'][:80]
Mahātmā Mohandas Karamchand Gandhi (; Hindustani: [ˈmoːɦənd̪aːs ˈkərəmtʃənd̪ ˈɡa
Methods: get_restbase('/page/summary')
<str> page extract in plain text
This is the lead section, or summary, of the page in plain text. It does not include infoboxes, and some data is removed by the API.
Example:
>>> gandhi.data['extext'][:80]
Mahātmā **Mohandas Karamchand Gandhi** (; Hindustani: [ˈmoːɦənd̪aːs
ˈkərəmtʃənd̪
Methods: get_query()
<str> page extract in limited HTML
This is the lead section, or summary, of the page in limited HTML from Extension:TextExtracts. It is simple markup only; no wikilinks, citations, infoboxes, etc.
Example:
>>> gandhi.data['extract'][:80]
<p>Mahātmā <b>Mohandas Karamchand Gandhi</b> (<span></span>; <small>Hindustani:
Methods: get_query()
<list> list of files embedded in this page
This is the list of embedded (image, audio, video) files included on this page from Mediawiki API:Images. Awesome!
Example:
>>> gandhi.data['files'][:3]
[u'File:Aum Om red.svg',
u'File:Commons-logo.svg',
u'File:Conscience and law.jpg']
Methods: get_more(), get_querymore()
<str> page content in full HTML
This is the most performant way to get page HTML outside of running your own Mediawiki instance. It is verbatim what Mediawiki is serving for that page.
Example:
>>> gandhi.data['html'][:80]
u'<!DOCTYPE html>\n<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki'
Methods: get_restbase('/page/html/{title}')
<list> representative image(s) for this page
The epitome ("single most appropriate") image data contained in each API response is stored in this attribute with a kind label. They are often all the same image file. These are NOT all the images/files contained in a page (that would be page.data['files']), only the so-called PageImage that aims to be a representative image for the page. See the Images documentation for more details.
Example:
>>> gandhi.pageimage()
['query-pageimage',
'query-thumbnail',
'parse-image',
'wikidata-image',
'restbase-image',
'restbase-thumb']
>>> gandhi.pageimage('thumb')
{'file': u'Portrait_Gandhi.jpg',
u'height': 240,
'kind': 'query-thumbnail',
'url': u'https://upload.wikimedia.org/wikipedia/commons/thumb/d/d1/Portrait_Gandhi.jpg/160px-Portrait_Gandhi.jpg',
u'width': 160}
Methods: get_imageinfo(), get_parse(), get_query(), get_restbase(), get_wikidata()
<dict> parsed infobox data
This attribute contains Infobox template data extracted from a page's parsetree. Unfortunately, there is usually more data available from a page's infobox than from wikidata. See the Infoboxes documentation for details.
Example:
>>> gandhi.data['infobox']
{'alma_mater': '[[University College London]]<ext><name>ref</name><attr/><inner>{{cite book|author1=Jeffrey M. Shaw |author2=Timothy J. Demy |title=War and Religion: An Encyclopedia of Faith and Conflict |url=https://books.google.com/books?id=KDlFDgAAQBAJ&pg=PA309|year=2017|publisher=ABC-CLIO|isbn=978-1-61069-517-6|pages=309 }}</inner><close></ref></close></ext><br>[[Inner Temple]]',
'alt': u'The face of Gandhi in old age\u2014smiling, wearing glasses, and with a white sash over his right shoulder',
'birth_date': '{{Birth date|df|=|yes|1869|10|2}}',
'birth_name': 'Mohandas Karamchand Gandhi',
'birth_place': '[[Porbandar]], [[Porbandar State]], [[Kathiawar Agency]], [[Bombay Presidency]], [[British Raj|British India]]<ext><name>ref</name><attr> name="Gandhi DOB"</attr><inner>[[#Rajmohan|Gandhi, Rajmohan (2006)]] [https://books.google.com/?id=FauJL7LKXmkC pp. 1–3].</inner><close></ref></close></ext><br />(present-day [[Gujarat]], [[India]])',
'children': '{{hlist|[[Harilal Gandhi|Harilal]]|[[Manilal Gandhi|Manilal]]|[[Ramdas Gandhi|Ramdas]]|[[Devdas Gandhi|Devdas]]}}',
'death_cause': '[[Assassination of Mahatma Gandhi|Assassination]]',
'death_date': '{{Death date and age|df|=|yes|1948|1|30|1869|10|2}}',
'death_place': '[[New Delhi]], [[Delhi]], [[Dominion of India]] (present-day [[India]])',
'father': '[[Karamchand Uttamchand Gandhi|Karamchand Gandhi]]',
'honorific_prefix': u'[[Mah\u0101tm\u0101]]',
'image': 'MKGandhi.jpg',
'known_for': '[[Indian Independence Movement]],<br>[[Peace movement]]',
'mother': 'Putlibai Gandhi',
'movement': '[[Indian independence movement]]',
'name': 'Mohandas Karamchand Gandhi',
'nationality': '[[Indian people|Indian]]',
'native_name': u'\u0aae\u0acb\u0ab9\u0aa8\u0aa6\u0abe\u0ab8 \u0a95\u0ab0\u0aae\u0a9a\u0a82\u0aa6 \u0a97\u0abe\u0a82\u0aa7\u0ac0',
'native_name_lang': 'Gujarati',
'occupation': '{{hlist|Lawyer|Politician|Activist|Writer|Soldier}}',
'other_names': 'Mahatma Gandhi, Bapu ji, Gandhi ji',
'party': '[[Indian National Congress]]',
'resting_place': '[[Raj Ghat and associated memorials|Raj Ghat]], [[Delhi]], [[India]]',
'signature': 'Mohandas K. Gandhi signature.svg',
'spouse': '{{marriage|[[Kasturba Gandhi]]|1883|1944|end|=|died}}'}
Methods: get_parse()
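Infobox values are raw wikitext, so they usually need some cleanup before display. The sketch below strips wikilink markup only; `strip_wikilinks` is a hypothetical helper, not part of wptools, and it does not attempt to handle templates such as {{hlist|...}} or embedded reference tags.

```python
import re

# Matches [[Target]] and [[Target|Label]], capturing the visible text.
WIKILINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]")


def strip_wikilinks(value):
    """Replace each wikilink with its visible text."""
    return WIKILINK.sub(r"\1", value)


# Sample values copied from the infobox example above.
nationality = strip_wikilinks("[[Indian people|Indian]]")
resting_place = strip_wikilinks(
    "[[Raj Ghat and associated memorials|Raj Ghat]], [[Delhi]], [[India]]"
)
print(nationality)
print(resting_place)
```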
<list> list of interwiki links
This is the list of interwiki links from the page's parsetree.
Example:
>>> gandhi.data['iwlinks'][:3]
[u'https://biblio.wiki/wiki/Mohandas_K._Gandhi',
u'https://commons.wikimedia.org/wiki/Special:Search/Mohandas_K._Gandhi',
u'https://en.wikiquote.org/wiki/Special:Search/Mohandas_K._Gandhi']
Methods: get_parse()
<str> Wikidata label
This is the Wikidata label (common name) in the language specified.
Example:
>>> gandhi.data['label']
u'Mahatma Gandhi'
Methods: get_query(), get_wikidata()
<dict> Wikidata entity labels
These are Wikidata labels, or common names, for Wikidata page entities in the language specified. They are needed to present Wikidata claims in page.data['wikidata'].
Example:
>>> gandhi.data['labels']
{u'Q808967': u'barrister',
u'Q82955': u'politician',
u'Q84': u'London',
u'Q9089': u'Hinduism',
u'Q9441': u'Gautama Buddha',
...
Methods: get_labels()
<list> languages available
This is the list of languages that this page can be found in on other Wikipedias from Mediawiki API:Langlinks. Each entry contains the language code and the name of the page rendered in that language. What a treasure!
Example:
>>> gandhi.data['languages'][:3]
[{u'lang': u'af', u'title': u'Mahatma Gandhi'},
{u'lang': u'als', u'title': u'Mohandas Karamchand Gandhi'},
{u'lang': u'am', u'title': u'\u121b\u1205\u1270\u121b \u130b\u1295\u12f2'}]
Methods: get_more(), get_querymore()
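Because each entry pairs a language code with a title, finding the title of the page on another Wikipedia is a simple lookup. A minimal sketch follows; `title_in` is a hypothetical helper, and the sample data is copied from the example above.

```python
def title_in(languages, lang):
    """Return the page title for the given language code, or None."""
    for entry in languages:
        if entry.get("lang") == lang:
            return entry.get("title")
    return None


# Sample data copied from the example above.
languages = [
    {"lang": "af", "title": "Mahatma Gandhi"},
    {"lang": "als", "title": "Mohandas Karamchand Gandhi"},
]

print(title_in(languages, "als"))
```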
<str> lead section full HTML
This is the page's lead section, or summary, in full HTML including references, citations, and infoboxes.
Example:
>>> gandhi.data['lead'][:80]
u'<span><p><a href="/wiki/Mah\u0101tm\u0101" title="Mah\u0101tm\u0101">Mah\u0101tm\u0101</a> <b>Mohandas Karamch'
Methods: get_restbase('/page/mobile-sections-lead/{title}')
<int> page length in bytes
This is the size of the page in bytes from Mediawiki API:Info.
Example:
>>> gandhi.data['length']
264127
Methods: get_query()
<list> list of page links
This is the list of links found in this page from API:Links. This turns out to be super helpful if you've been redirected to a disambiguation page; the page you want is probably in this list.
Example:
>>> gandhi.data['links'][:3]
[u'10 Janpath', u'14th Dalai Lama', u'1915 Singapore Mutiny']
Methods: get_query()
<dict> last modified dates
This attribute contains the last modified dates of the page and its associated wikidata.
Example:
>>> gandhi.data['modified']
{'page': u'2017-09-23T17:28:59Z', 'wikidata': u'2017-09-23T16:39:57Z'}
Methods: get_query(), get_restbase(), get_wikidata()
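The timestamps are ISO 8601 strings in UTC, so they can be parsed and compared with the standard library. The sketch below is only an illustration; `newest` is a hypothetical helper, and the format string assumes every value looks like the ones in the example above.

```python
from datetime import datetime

STAMP = "%Y-%m-%dT%H:%M:%SZ"  # assumed format, matching the example above


def newest(modified):
    """Return the key whose timestamp is the most recent."""
    parsed = {key: datetime.strptime(value, STAMP)
              for key, value in modified.items()}
    return max(parsed, key=parsed.get)


# Sample data copied from the example above.
modified = {"page": "2017-09-23T17:28:59Z", "wikidata": "2017-09-23T16:39:57Z"}
print(newest(modified))  # page
```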
<int> Wikipedia database ID
This is the numeric identifier of the page in the Mediawiki database. It is useful as a pivot point for wptools to gather information across APIs.
Example:
>>> gandhi.data['pageid']
19379
Methods: get_query(), get_parse(), get_restbase(), get_wikidata()
<str> page parsetree XML
This is the full parsetree XML for the page, which wptools uses to parse infoboxes. It is certainly useful for a great many other things too.
Example:
>>> gandhi.data['parsetree'][:80]
u'<root><template><title>Redirect</title><part><name index="1"/><value>Gandhi</val'
Methods: get_parse()
<dict> page wikidata properties
This attribute contains any properties found in the page's wikidata. These properties are basically wikidata values. In wikibase, entities have claims (labels) and properties (values). Properties can have claims as values.
Example:
>>> gandhi.data['properties']
{u'P18': [u'Portrait Gandhi.jpg'],
u'P27': [u'Q129286', u'Q668'],
u'P31': [u'Q5'],
u'P345': [u'nm0003987'],
u'P569': [u'+1869-10-02T00:00:00Z'],
u'P570': [u'+1948-01-30T00:00:00Z'],
u'P910': [u'Q6512732']}
Methods: get_wikidata()
<str> a random Mediawiki title
This attribute contains a random title that we get for free with some requests.
Example:
>>> gandhi.data['random']
u'Elfcon'
Methods: get_query()
<list> list of redirects
This is the list of redirects that your query went through to get to the resulting page from API:Redirects.
Example:
>>> gandhi.data['redirected']
[{u'from': u'Gandhi', u'to': u'Mahatma Gandhi'}]
Methods: get_query()
<list> list of redirect titles
This is the list of titles that redirect to this page from API:Redirects.
Example:
>>> len(gandhi.data['redirects'])
53
>>> gandhi.data['redirects'][:3]
[{u'ns': 0, u'pageid': 55342, u'title': u'Mahatma Ghandi'},
{u'ns': 0, u'pageid': 55343, u'title': u'Ghandi'},
{u'ns': 0, u'pageid': 155811, u'title': u'Mohandas K. Gandhi'}]
Methods: get_query()
<str> the page's normalized title
This is the normalized title of the page from the APIs. Pretty straightforward!
Example:
>>> gandhi.data['title']
u'Mahatma Gandhi'
Methods: get_parse(), get_query(), get_random(), get_restbase(), get_wikidata()
<str> canonical URL
This is the canonical URL formed from Mediawiki convention.
Example:
>>> gandhi.data['url']
u'https://en.wikipedia.org/wiki/Mahatma_Gandhi'
Methods: get_query(), get_restbase()
<str> raw wikitext URL
This is the ostensible direct link to a page's wikitext. However, this link does not always resolve correctly, for instance, if there is a period in the title like 'J.R.R. Tolkien'.
Example:
>>> gandhi.data['url_raw']
u'https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw'
Methods: get_query(), get_restbase()
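Both URLs follow the same pattern as the examples above: spaces in the title become underscores, and the raw link appends ?action=raw. The sketch below reconstructs them from a title. It is not how wptools builds these values; `page_urls` is a hypothetical helper, and the set of characters left unescaped is an assumption.

```python
from urllib.parse import quote


def page_urls(title, site="https://en.wikipedia.org"):
    """Return (url, url_raw) for a title, following the pattern shown above."""
    # Spaces become underscores; other unsafe characters are percent-encoded.
    path = quote(title.replace(" ", "_"), safe="/:")
    url = "%s/wiki/%s" % (site, path)
    return url, url + "?action=raw"


url, url_raw = page_urls("Mahatma Gandhi")
print(url)
print(url_raw)
```

As noted above, the raw link built this way is not guaranteed to resolve for every title.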
<int> average daily page views
We average the daily page views from the last WEEK from API:Query prop=pageviews. No way!
Example:
>>> gandhi.data['views']
21718
Methods: get_query(), get_querymore()
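The averaging step can be sketched as follows. This assumes the API returns a mapping of date to daily count in which days without data are null; the counts here are made up for illustration, `average_daily_views` is a hypothetical helper, and whether wptools rounds or truncates is not stated on this page.

```python
def average_daily_views(pageviews):
    """Average the daily counts, skipping days with no data (None)."""
    counts = [n for n in pageviews.values() if n is not None]
    if not counts:
        return 0
    return sum(counts) // len(counts)  # integer average, truncated


# Hypothetical week of daily counts; one day has no data.
pageviews = {
    "2017-09-17": 21000,
    "2017-09-18": 22000,
    "2017-09-19": 21500,
    "2017-09-20": None,
    "2017-09-21": 22500,
    "2017-09-22": 21000,
    "2017-09-23": 22000,
}

print(average_daily_views(pageviews))  # 21666
```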
<int> number of page watchers
This is simply the number of people watching the page from Mediawiki API:Info. Intriguing!
Example:
>>> gandhi.data['watchers']
1733
Methods: get_query()
<str> wikidata classification
This is Wikidata Property:P31 "instance of", which basically tells us something about what this page is. Incredibly useful if you're not familiar with the title and want to know what kind of data you are looking at.
Example:
>>> gandhi.data['what']
u'human'
Methods: get_wikidata()
<str> wikibase item ID
This is the wikibase item identifier that represents an object, concept, or event in Wikidata.
Example:
>>> gandhi.data['wikibase']
u'Q1001'
Methods: get_wikidata()
<dict> the actual wikidata for a page
This is the collection of Wikidata for a page presented with labels. As the Wikidata project matures, it will come closer to what we get with page.data['infobox'], but much better because it will have a standardized structure!
Example:
>>> len(gandhi.data['wikidata'])
105
>>> gandhi.data['wikidata'] # snippet
...
u'country of citizenship (P27)': [u'British Raj (Q129286)', u'India (Q668)'],
u'date of birth (P569)': u'+1869-10-02T00:00:00Z',
u'date of death (P570)': u'+1948-01-30T00:00:00Z',
...
Methods: get_wikidata()
<str> wikidata URL
This is simply the URL to a page's Wikidata page.
Example:
>>> gandhi.data['wikidata_url']
u'https://www.wikidata.org/wiki/Q1001'
Methods: get_parse(), get_query(), get_restbase(), get_wikidata()
<str> page wikitext
This is the raw wikitext used to render Mediawiki pages. It took me a while to figure out that there is absolutely no hope of reproducing the HTML that results from Mediawiki and its vast ecosystem of templates and add-ons from the raw wikitext yourself (see Parsoid). Phenomenal!
Example:
>>> gandhi.data['wikitext'][:80]
u'{{Redirect|Gandhi}}\n{{pp-move-indef}}\n{{pp-semi-indef|small=yes}}\n{{Use dmy date'
Methods: get_parse()
<int> category page ID
The category page ID from API:Random.
Example:
>>> cat.data['pageid']
44375025
Methods: wptools.category(), get_random()
<list> list of category members
This is the list of category members from API:Categorymembers.
Example:
>>> cat.data['members'][:3]
[{u'ns': 0, u'pageid': 43686772, u'title': u'The Jazz Messengers'},
{u'ns': 0, u'pageid': 10932853, u'title': u'Dale Barlow'},
{u'ns': 0, u'pageid': 32306397, u'title': u'Mickey Bass'}]
Methods: get_members()
<str> the category title
If you do not supply a title, a random category lookup will supply a title.
Example:
>>> cat = wptools.category()
en.wikipedia.org (random:14) 🍭
Category:Jazz Messengers (en) data
{
pageid: 44375025
title: Category:Jazz Messengers
}
Methods: wptools.category(), get_random()
info, mostviewed, site, sites, siteviews
<dict> general site info
This is the "general" site info from API:Siteinfo which includes mostly Mediawiki instance parameters like namespaces, magicwords, etc.
Example:
>>> site.data['info'].keys()[:10]
[u'invalidusernamechars',
u'phpversion',
u'imagewhitelistenabled',
u'legaltitlechars',
u'servername',
u'thumblimits',
u'linktrail',
u'hhvmversion',
u'favicon',
u'maxarticlesize']
Methods: get_info()
<list> the mostviewed pages for this site
This is the list of "mostviewed" pages from Extension:PageViewInfo based on the last day's pageview count. Compare to The Top 25 Report.
Example:
>>> site.data['mostviewed'][:10]
[{u'count': 16600645, u'ns': 0, u'title': u'Main Page'},
{u'count': 2144801, u'ns': -1, u'title': u'Special:Search'},
{u'count': 1980547, u'ns': 0, u'title': u'Curse, Inc.'},
{u'count': 523215, u'ns': 0, u'title': u'Antoine-Augustin Parmentier'},
{u'count': 330018, u'ns': 0, u'title': u'XHamster'},
{u'count': 211041, u'ns': 0, u'title': u'Logical conjunction'},
{u'count': 201357,
u'ns': 0,
u'title': u'Concerns and controversies at the 2008 Summer Olympics'},
{u'count': 184652, u'ns': 0, u'title': u'It (2017 film)'},
{u'count': 157837, u'ns': -1, u'title': u'Special:Book'},
{u'count': 150453, u'ns': 0, u'title': u'Rosh Hashanah'}]
Methods: get_info()
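The list mixes articles with special pages such as Special:Search (namespace -1). A minimal sketch of keeping only main-namespace entries (ns 0) is shown here; `top_articles` is a hypothetical helper, and the sample data is copied from the example above.

```python
def top_articles(mostviewed, limit=5):
    """Return the titles of the most viewed main-namespace (ns 0) pages."""
    articles = [page for page in mostviewed if page.get("ns") == 0]
    articles.sort(key=lambda page: page["count"], reverse=True)
    return [page["title"] for page in articles[:limit]]


# Sample data copied from the example above.
mostviewed = [
    {"count": 16600645, "ns": 0, "title": "Main Page"},
    {"count": 2144801, "ns": -1, "title": "Special:Search"},
    {"count": 1980547, "ns": 0, "title": "Curse, Inc."},
]

print(top_articles(mostviewed, limit=2))
```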
<str> site ID
This is the Wikimedia site ID from the API:Siteinfo general info wikiid item.
Example:
>>> site.data['site']
u'enwiki'
Methods: get_info()
<list> list of Mediawiki sites
This is the comprehensive list of Mediawiki sites hosted by the Wikimedia Foundation (currently 741) from Extension:SiteMatrix.
Example:
>>> sorted(site.data['sites'])[:10]
[u'https://ab.wikipedia.org',
u'https://ace.wikipedia.org',
u'https://ady.wikipedia.org',
u'https://af.wikibooks.org',
u'https://af.wikipedia.org',
u'https://af.wikiquote.org',
u'https://af.wiktionary.org',
u'https://ak.wikipedia.org',
u'https://als.wikipedia.org',
u'https://am.wikipedia.org']
Methods: get_sites()
<int> sitewide page views
This is the sitewide pageviews averaged over the last WEEK from Extension:PageViewInfo.
Example:
>>> site.data['siteviews']
240896573
Methods: get_info()