API-only export option (without Special:Export) #311
Comments
Does not yet work for Wikia, partly because they return a blank page for
|
Before even downloading the first revisions, on some wikis the export gets stuck in an endless loop of "Invalid JSON response. Trying the request again" or a similar message:
|
Now tested with a MediaWiki 1.12 wiki, http://meritbadge.org/wiki/index.php/Main_Page , courtesy of https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/090004.html : 27cbdfd 680145e |
For Wikia, the API export works without But facepalm: where the API help says "Export the current revisions of all given or generated pages", it really means that any revision other than the current one is ignored. http://00eggsontoast00.wikia.com/api.php?action=query&revids=3|80|85&export returns the same as http://00eggsontoast00.wikia.com/api.php?action=query&revids=85&export |
Here we go: 7143f7e. It's very fast on most wikis, because it makes far fewer requests when the average number of revisions per page is less than 50. The first dump produced with this method is: https://archive.org/download/wiki-ferstaberindecom_f2_en/ferstaberindecom_f2_en-20180519-history.xml.7z |
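To make the "fewer requests" argument concrete, here is a rough back-of-the-envelope sketch. It assumes a batch size of 50 revisions per API request, that a per-page method needs at least one request per page, and that a wiki-wide revision listing can pack revisions from several pages into one batch. The function names are hypothetical and are not taken from dumpgenerator.py.

```python
import math


def requests_per_page(revs_per_page, batch=50):
    """Estimate requests when every page is queried separately.

    Each page costs at least one request, plus one more for every
    additional batch of `batch` revisions.
    """
    return sum(max(1, math.ceil(n / batch)) for n in revs_per_page)


def requests_all_revisions(revs_per_page, batch=50):
    """Estimate requests when revisions are listed wiki-wide in batches."""
    return math.ceil(sum(revs_per_page) / batch)


# Hypothetical wiki: five pages, 76 revisions in total.
counts = [3, 12, 1, 53, 7]
print(requests_per_page(counts))       # 6
print(requests_all_revisions(counts))  # 2
```

The gap widens as the number of short pages grows, which would explain why the speed-up shows on most wikis but not on those dominated by pages with long histories.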
And now also Wikia, without the allrevisions module: https://archive.org/details/wiki-00eggsontoast00wikiacom The XML built "manually" with |
|
In testing this for Wikia, remember that the number of edits on Special:Statistics isn't always truthful (this is normal on MediaWiki). For instance http://themodifyers.wikia.com/wiki/Special:Statistics says 2333 edits, but dumpgenerator.py exports 1864, and that's the right number: entering all the titles on themodifyers.wikia.com/wiki/Special:Export and exporting all revisions gives the same number. Also, a page with 53 revisions on that wiki was correctly exported, which means that API continuation works; that's something! |
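For this kind of comparison, one way to count what actually ended up in a dump is to stream the XML and tally the page and revision elements. This is only a sketch: it assumes an already decompressed MediaWiki export file, and the function name is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def count_pages_and_revisions(source):
    """Stream a MediaWiki XML export and return (pages, revisions).

    `source` may be a file name or a binary file object.
    """
    pages = revisions = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop the "{namespace}" prefix so any export schema version matches.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()  # free the finished page to keep memory low
    return pages, revisions


# Tiny hand-written sample, not real wiki data.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>A</title><revision><id>1</id></revision><revision><id>2</id></revision></page>
<page><title>B</title><revision><id>3</id></revision></page>
</mediawiki>"""
print(count_pages_and_revisions(io.BytesIO(sample)))  # (2, 3)
```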
Not sure what's going on at http://zh.asoiaf.wikia.com/api.php
http://zhpad.wikia.com/api.php seems to eventually fail as well |
Next step: implementing resuming. I'll probably take the I think it would be the occasion to make sure that we log something to error.log when we catch an exception or call |
Later I'll post a series of errors.log from failed dumps. For now I tend to believe that, when the dump runs to the end, the XML really is as complete as possible. For instance, on a biggish wiki like http://finalfantasy.wikia.com/wiki/Special:Statistics :
That's over a million "missing" revisions compared to what Special:Statistics says, which however cannot really be trusted. The number of pages is pretty close. On the other hand, it could be that the continuation is not working in some cases... In clubpenguinwikiacom-20180523-history.xml, I'm not sure I see the 3200 revisions that the main page ought to have. |
Otherwise the query continuation may fail and only the top revisions will be exported. Tested with Wikia: http://clubpenguin.wikia.com/api.php?action=query&prop=revisions&titles=Club_Penguin_Wiki Also add parentid since it's available after all. #311 (comment)
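For readers unfamiliar with query continuation, the loop looks roughly like the sketch below. It is not the code from the commit above: `fetch` is a hypothetical callable that takes a dict of API parameters and returns the decoded JSON response, and the handling of both the newer "continue" block and the older "query-continue" block is an assumption about what different MediaWiki versions return.

```python
def iter_revisions(fetch, title, limit=50):
    """Yield every revision of `title`, following API continuation.

    `fetch` is a hypothetical helper: dict of parameters in, decoded JSON out.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": limit,
        "rvprop": "ids|timestamp|user|comment|content",
        "format": "json",
    }
    extra = {}
    while True:
        data = fetch({**params, **extra})
        for page in data.get("query", {}).get("pages", {}).values():
            for rev in page.get("revisions", []):
                yield rev
        if "continue" in data:
            # Newer style: send the whole block back unchanged.
            extra = dict(data["continue"])
        elif "query-continue" in data:
            # Older style: parameters are nested under the module name.
            extra = dict(data["query-continue"].get("revisions", {}))
        else:
            return
        if not extra:
            return  # nothing to continue with; avoid looping forever
```

If the continuation parameters are dropped between requests, the loop above would return only the first batch, which matches the symptom of exporting only the top revisions.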
Some wiki might be in a loop...
Or not: it seems legit, some bot is editing a series of pages every day. http://runescape.wikia.com/wiki/Module:Exchange/Dragon_crossbow_(u)/Data?limit=1000&action=history |
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
|
* It was just an old trick to get past some barriers which were waived with GET.
* It's not conformant and doesn't play well with some redirects.
* Some recent wikis seem to not like it at all, see also issue WikiTeam#311.
Sometimes
|
How nice some webservers are:
|
Gotta check for actual presence of the
|
HTTP 405:
|
Or even the
|
HTTP Error 493 :o
|
I'm not quite sure why this happens in my latest local code, will need to check:
|
|
mwclient doesn't seem to handle retries very well, need to check:
|
Seems fine now on a MediaWiki 1.16 wiki. There are some differences in what we get for some optional fields like parentid, userid and size of a revision; and our XML made by etree is less eager to escape Unicode characters. Hopefully this doesn't matter, although we should ideally test an import on a recent MediaWiki. |
This comes and goes; we could try adding it to status_forcelist together with the 406 seen for other wikis. Here we can do little: the index.php and api.php responses confuse the script, but there isn't much we can do as even the most basic response gets a DB error:
This is not helped by setting This is a misconfigured wiki, see #355 (comment)
This one now (MediaWiki 1.31.1) gives:
Still broken (MediaWiki 1.23)
Still broken (MediaWiki 1.27).
Still broken (MediaWiki 1.31) |
The number of revisions cannot always be a multiple of 50 (example from https://villainsrpg.fandom.com/ ):
It should be 51 in https://villainsrpg.fandom.com/wiki/Evil?offset=20111224190533&action=history Ouch no, we were not using the new batch at all. Ahem. |
The XML doesn't validate against the respective schema:
But then even the vanilla Special:Export output doesn't. Makes me sad.
|
Fine now
Fixed with API limit 50 at b162e7b
Fixed with automatic switch to HTTPS at d543f7d |
Still have to implement resume:
It should just be a matter of passing |
I'm happy to see that we sometimes receive fewer than the requested 50 revisions and nothing bad happens:
|
Except that they didn't check whether they had revisions bigger than that: |
Hm, I wonder why so many errors on this MediaWiki 1.25 wiki (the XML became half of the previous round) https://archive.org/download/wiki-wikimarionorg/wikimarionorg-20200224-history.xml.7z/errors.log |
|
http://www.veikkos-archiv.com/api.php fails completely |
Simple command with which I found some XML files which were actually empty (only the header):
|
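The original command is not preserved in this copy of the thread. As a stand-in, a check along the same lines could be sketched in Python: it assumes decompressed XML dumps and treats a file as "header only" when it contains no opening page tag at all. The function name is hypothetical.

```python
def is_header_only(path, chunk_size=1 << 20):
    """Return True if the XML file at `path` contains no <page> element."""
    needle = b"<page>"
    tail = b""
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                return True  # reached the end without seeing a page
            # Prepend the end of the previous chunk so a tag split
            # across two reads is still found.
            if needle in tail + chunk:
                return False
            tail = chunk[-(len(needle) - 1):]
```

Running it over a directory of dumps and listing the paths for which it returns True would flag the files that need to be regenerated.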
Wanted for various reasons. Current implementation: --xmlrevisions, false by default. If the default method to download wikis doesn't work for you, please try using the flag --xmlrevisions and let us know how it went. https://groups.google.com/forum/#!topic/wikiteam-discuss/ba2K-WeRJ-0
Previous takes:
#195
#280