Refurbish Re-CitationBot #127

Open
Daniel-Mietchen opened this issue Dec 11, 2015 · 6 comments
@Daniel-Mietchen
Member

Re-CitationBot currently does three main things:

  • uploading full-texts to Wikisource (cf. dedicated category)
    • if that text includes formulas or tables supplied as images, they go to Wikisource as well (since Commons won't accept them)
  • uploading images to Wikimedia Commons (cf. dedicated category)

Over time, it is also to take on part or all of the following functions:

  • uploading the metadata of Wikisource imports to Wikidata
  • updating the Wikipedia page that cited the original DOI that triggered the workflow, such that that Wikipedia page then also links to the relevant Wikisource, Wikimedia Commons and Wikidata entries.

There are a number of issues around that, which we will explore in more detail later. Some of the more urgent ones:

  • UI: since our workflow (Workflow #113) aims to automate as much as possible, we do not envision a shiny UI. It basically just has to work for us, and especially for me, so that we can monitor what happens at every step and test and debug each step individually.
  • Some of the full texts are not actually complete: parts are missing or misrepresented, which is why we are still working on the JATS-to-MediaWiki converter.
  • Merger with Open Access Media Importer: that bot has been active since mid-2012, its code is in a neighbouring repo here under wpoa, and it forms the basis of much of the current code of the Recitation-bot. It currently runs off a server at a German university but should eventually be moved to Wikimedia Labs, where Recitation-bot already is. But since its scope (audio and video files from open access scholarly articles available in JATS) is just a subset of the scope of Recitation-bot on Commons, it makes sense to merge the two.
  • Categorization: while uploading stuff to Wikimedia Commons, it is important to set useful categories, so that people (and, increasingly, tools) can find the files there. We are doing this by making use of JATS tags for keywords and subject matter, but this is still error-prone, since publishers (who provide the JATS) use these tags inconsistently. That's where @difranco's interest in topic modelling might come in handy. Categorization is less important on Wikisource. On Wikidata, it would be nice to enrich the basic article metadata with statements about the main subject (P921), as in this example.
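
As a rough illustration of the categorization point above, here is a minimal sketch of how keyword and subject strings could be pulled out of a JATS file and normalized before being mapped to categories. It is not the bot's actual code: the function name `candidate_categories` and the sample document are made up for this example, and only the JATS element names (`kwd-group`, `kwd`, `subj-group`, `subject`) come from the JATS tag set.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS fragment used only for this example.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
      </article-categories>
      <kwd-group>
        <kwd>Asteraceae</kwd>
        <kwd> new  species </kwd>
        <kwd>asteraceae</kwd>
      </kwd-group>
    </article-meta>
  </front>
</article>"""


def candidate_categories(jats_xml):
    """Collect whitespace-normalized, case-insensitively de-duplicated
    keyword and subject strings from a JATS document."""
    root = ET.fromstring(jats_xml)
    seen = set()
    result = []
    for tag in ("kwd", "subject"):
        for element in root.iter(tag):
            # itertext() also picks up text nested in inline markup
            # such as <italic>; split/join collapses stray whitespace.
            text = " ".join("".join(element.itertext()).split())
            key = text.casefold()
            if text and key not in seen:
                seen.add(key)
                result.append(text)
    return result


if __name__ == "__main__":
    print(candidate_categories(SAMPLE_JATS))
```

Normalizing case and whitespace only addresses the most superficial inconsistencies between publishers; mapping the resulting strings to existing Commons categories would still need a lookup or the topic-modelling step mentioned above.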

Pinging #118.

@Daniel-Mietchen
Member Author

To get started, I am now giving the bot a test run and documenting what happens.

  • I went to the Wikipedia DOI citation live stream and searched for a whitelisted DOI, finding 10.3897/phytokeys.57.6347 through this edit

  • I then pasted this DOI into the Recitation-bot UI, which triggered the creation of this tracking page but no upload within 10 min, and no entry in the fail log that would tell me what the problem was.

  • I then went to the Recitation-bot control and chose "show status", which gave

    job-ID   prior    name        user          state  submit/start at      queue                           slots  ja-task-ID
    805863   0.71865  python      tools.recita  Rr     08/17/2015 15:04:37  [email protected]               1
    1764835  0.35999  lighttpd-r  tools.recita  r      11/20/2015 19:26:58  webgrid-lighttpd@tools-webgrid  1

  • This means that the bot's Python job had been running since 17 August 2015, shortly after the last upload in July.

  • I checked the upload tracking page - no progress even after 20 min.

  • I then went to the Recitation-bot control and chose "shutdown bot", which gave

    attempted to qdel python jobs
    tools.recitation-bot has registered the job 805863 for deletion

  • After waiting for about 2 min for the job deletion to take place (and still no entry in the fail log), I selected "start bot" and then "show status", which gave

    job-ID   prior    name        user          state  submit/start at      queue                           slots  ja-task-ID
    1764835  0.36002  lighttpd-r  tools.recita  r      11/20/2015 19:26:58  webgrid-lighttpd@tools-webgrid  1
    272655   0.30000  python      tools.recita  r      12/11/2015 21:31:42  [email protected]               1

  • This finally resulted in an update to the tracking page and an entry in the fail log:

    FAIL MESSAGE:'pmcid'.

  • This makes sense, since PhytoKeys is a plant science journal and thus not tracked in PubMed Central - good use case for Import of JATS articles from non-PMC sources #126, since all its articles are available in JATS.

  • Now retrying.

  • I went to the Open Access Media Importer Bot, went to one of its most recent uploads (this file) and chose its DOI (10.1371/journal.pone.0142917) as the next one to test (that file was imported from PMC), so at least I should not get the same error this time.

  • Entering this DOI into the Recitation-bot UI triggered the creation of this tracking page and quickly an error both on that page and in the fail log:

    Invalid version number "1.27.0-wmf.8"

  • This means that the bot did not find the MediaWiki version it expected. Ideally, it should work with the latest version and have some provisions to deal with newer versions gracefully.
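
To illustrate what "dealing with newer versions gracefully" could look like, here is a small sketch of a tolerant parser for MediaWiki-style version strings such as "1.27.0-wmf.8". It is only an assumption about one possible approach, not a description of how pywikibot or the bot handles versions; the names `parse_mw_version` and `at_least` are hypothetical.

```python
import re

# "major.minor[.patch]" optionally followed by a suffix such as
# "-wmf.8", "-rc.1" or "alpha".
_VERSION_RE = re.compile(
    r"^(\d+)\.(\d+)(?:\.(\d+))?(?:[-.]?([A-Za-z]+)\.?(\d+)?)?"
)


def parse_mw_version(version):
    """Return (major, minor, patch, label, serial) for a version string.

    Missing parts default to 0 (numbers) or "" (label).
    """
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor, patch, label, serial = match.groups()
    return (int(major), int(minor), int(patch or 0),
            label or "", int(serial or 0))


def at_least(version, minimum):
    """Compare only the numeric part against a (major, minor[, patch]) tuple,
    ignoring any pre-release suffix."""
    numeric = parse_mw_version(version)[:3]
    return numeric >= tuple(minimum) + (0,) * (3 - len(minimum))


if __name__ == "__main__":
    print(parse_mw_version("1.27.0-wmf.8"))
    print(at_least("1.27.0-wmf.8", (1, 14)))
```

Ignoring the suffix in `at_least` is a deliberate simplification: for a "is this wiki new enough?" check, the weekly `wmf` branch number rarely matters, and unknown suffixes no longer abort the run.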

@notconfusing
Member

I think the invalid MediaWiki version is an error I have encountered before; it requires an update of pywikibot. @difranco

@difranco
Collaborator

difranco commented Jan 4, 2016

I've sat down with Max and we've figured this out. Moving on to the improvements.

@difranco
Collaborator

I've made a quick update to cocytus so that it checks license data for newly added DOIs against the Crossref API. For now it only logs the result; the next step would be to have it put those DOIs into the appropriate part of the recitation-bot queue.

[futures 9643ec5] add checking crossref api for license info
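
For readers unfamiliar with the Crossref lookup, here is a minimal sketch of such a license check. It is not the code from the commit above. It assumes the public Crossref REST API endpoint `https://api.crossref.org/works/{doi}`, whose JSON response carries an optional `license` list under `message`; the helper names and the rule for what counts as Commons-compatible (CC BY, CC BY-SA, CC0) are assumptions made for this example.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_work(doi, timeout=30):
    """Fetch the Crossref record for a DOI (performs a network request)."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def license_urls(work):
    """Extract license URLs from a Crossref work record, if any are present."""
    message = work.get("message", {})
    return [entry["URL"] for entry in message.get("license", [])
            if "URL" in entry]


def is_commons_compatible(url):
    """Assumption for this sketch: only CC BY, CC BY-SA and CC0 qualify."""
    parsed = urllib.parse.urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()
    return host.endswith("creativecommons.org") and path.startswith(
        ("/licenses/by/", "/licenses/by-sa/", "/publicdomain/zero/")
    )


if __name__ == "__main__":
    # DOI taken from the test run earlier in this thread.
    record = fetch_work("10.1371/journal.pone.0142917")
    for url in license_urls(record):
        print(url, is_commons_compatible(url))
```

Many Crossref records carry no license information at all, so an empty list from `license_urls` should be treated as "unknown" rather than "not open".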

@difranco
Collaborator

I'm now looking at the other February milestone in the proposal doc: "February 2016: Citation information from Wikidata can be transcluded into English Wikipedia and English Wikisource through Wikidata-supported {{Cite doi}} templates".

To work out the approach, I'm reviewing how the {{cite doi}} templates work and trying to find the relevant pywikibot documentation. I found some signs that this is in flux: {{cite doi}} appears to have been deprecated in favor of {{cite journal|doi}}, the relevant distinction being a change from one template subpage per DOI to a single journal template that references documents by DOI. I'm not sure yet what state the data is in or what implications these two approaches might have. There is a lengthy discussion, whose upshots I am still trying to assimilate, at https://en.wikipedia.org/wiki/Template_talk:Cite_doi/Archive_1#RfC:_Should_Template:cite_doi_cease_creating_a_separate_subpage_for_each_DOI.3F

Would appreciate any pointers you can provide to quality information on this, @Daniel-Mietchen and @notconfusing

@Daniel-Mietchen
Member Author

Yes, the {{Cite doi}} template has been deprecated on the English Wikipedia, and citations are handled instead through Module:Citation/CS1, which is used indirectly via citation templates that the community liked better than the {{Cite doi}} ones. A version of Module:Citation/CS1 also exists on the English Wikisource. As far as I can tell, those templates and modules do not make use of Wikidata yet, at least not for bibliographic metadata.

Closer to our goals of pulling bibliographic information from Wikidata is thus Module:Cite on the English Wikipedia, but it is in a very raw state and while good for experimentation in user or project namespaces, it is not fit for the article namespace yet.

We have a similar Lua module on the French Wikipedia that has begun to be used in templates in their main namespace, even though it still needs quite some work, and we have another such Lua module on Wikidata itself that has some basic functionality.

My approach would be to start by improving Module:Cite on Wikidata, so that all test articles display correctly there first, then improve its counterpart on the English Wikipedia and then on the English Wikisource (once a Lua module works fine on one wiki, getting it to work on another wiki is relatively straightforward), while keeping an eye on how things develop on the French Wikipedia.

In doing this, we might get help from places like Wikipedia:Lua/Requests, but I haven't tried that yet.
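
To make "pulling bibliographic information from Wikidata" a bit more concrete, here is a small sketch of the data flow such a module has to implement. The on-wiki modules discussed above are written in Lua; Python is used here only for illustration. The sketch assumes the standard Wikibase entity JSON layout (`claims` → property → `mainsnak` → `datavalue` → `value`) and the Wikidata properties P1476 (title), P577 (publication date) and P356 (DOI). The function names, the sample entity and its DOI are made up.

```python
def first_value(entity, prop):
    """Return the value of the first claim for `prop` that has one, else None."""
    for claim in entity.get("claims", {}).get(prop, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is not None:
            return datavalue["value"]
    return None


def format_citation(entity):
    """Build a very plain citation string from title, year and DOI."""
    title = first_value(entity, "P1476")  # monolingual text: {"text", "language"}
    date = first_value(entity, "P577")    # time value: {"time": "+YYYY-MM-DD...", ...}
    doi = first_value(entity, "P356")     # plain string
    parts = []
    if title:
        parts.append('"{}"'.format(title["text"]))
    if date:
        # Wikibase times look like "+2015-11-20T00:00:00Z"; keep the year only.
        parts.append("({})".format(date["time"][1:5]))
    if doi:
        parts.append("doi:{}".format(doi))
    return " ".join(parts)


# Hypothetical entity in Wikibase JSON shape; title and DOI are placeholders.
SAMPLE_ENTITY = {
    "claims": {
        "P1476": [{"mainsnak": {"datavalue": {
            "value": {"text": "Example article title", "language": "en"}}}}],
        "P577": [{"mainsnak": {"datavalue": {
            "value": {"time": "+2015-11-20T00:00:00Z", "precision": 11}}}}],
        "P356": [{"mainsnak": {"datavalue": {"value": "10.1234/example"}}}],
    }
}

if __name__ == "__main__":
    print(format_citation(SAMPLE_ENTITY))
```

A real module would also need to resolve item-valued properties such as authors or the journal (published in, P1433) to labels, which is where most of the remaining work lies.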
