[citeproc] CSL JSON in-field markup not recognized when using pandoc-zotxt.lua #6722

Closed
njbart opened this issue Oct 6, 2020 · 24 comments

njbart commented Oct 6, 2020

The latest master, self-compiled, identifying as 2.11, does not seem to recognize CSL JSON in-field markup (e.g., <i>...</i>) if biblio data are obtained from Zotero using the pandoc-zotxt.lua filter in its default mode, i.e., not having the filter write an intermediary/cache CSL JSON file to disk.

Note that for using pandoc-zotxt.lua, the Zotero add-ons zotxt and, IIRC, Better BibTeX (the latter for useable citation keys) need to be installed.

By contrast, after adding a line zotero-bibliography: test.json to the markdown document’s metadata (which leads to an intermediary/cache CSL JSON file being written to disk), the in-field markup is recognized, just as it used to be when using pandoc 2.10.1, with or without the intermediary .json file.

My hunch is that the most recent pandoc might not be properly identifying the incoming data from the pandoc-zotxt.lua filter as CSL JSON and hence might not be trying to parse the html-like in-field markup.

MWE; prerequisite: Zotero plus add-ons listed above installed and running, a Zotero item containing <i>...</i> in its title field:

pandoc -L pandoc-zotxt.lua -C -t markdown-citations << EOT

Blah [@a-zotero-item].

# References {-}

---
# zotero-bibliography: test.json
...
EOT

Actual output:

Blah (Doe 2020).

References {#references .unnumbered}
==========

::: {#refs .references .csl-bib-body .hanging-indent}
::: {#ref-a-zotero-item .csl-entry}
Doe, John 2020. "A Title with \<i\>a Few Words
Italicized\</i\>."
:::
:::

Expected output = output from pandoc 2.11 with the zotero-bibliography line uncommented = output from pandoc 2.10.1, regardless of the zotero-bibliography line:

Blah (Doe 2020).

References {#references .unnumbered}
==========

::: {#refs .references .csl-bib-body .hanging-indent}
::: {#ref-a-zotero-item .csl-entry}
Doe, John 2020. "A Title with *a Few Words
Italicized*."
:::
:::

Owner

jgm commented Oct 6, 2020

This sounds like an issue with pandoc-zotxt.lua (about which I know nothing).
What does it do?

Author

njbart commented Oct 6, 2020

From https://github.com/odkr/pandoc-zotxt.lua: “pandoc-zotxt.lua looks up sources of citations in Zotero and adds them either to a document's references metadata field or to a bibliography file, where pandoc-citeproc can pick them up.”

Maybe the output from the following might give a clue (again, requires Zotero plus add-ons installed and running, as above):

pandoc -L pandoc-zotxt.lua -t json << EOT

Blah [@item1].

# References {-}

EOT

Output from pandoc 2.10:

{"blocks":[{"t":"Para","c":[{"t":"Str","c":"Blah"},{"t":"Space"},{"t":"Cite","c":[[{"citationSuffix":[],"citationNoteNum":0,"citationMode":{"t":"NormalCitation"},"citationPrefix":[],"citationId":"item1","citationHash":0}],[{"t":"Str","c":"[@item1]"}]]},{"t":"Str","c":"."}]},{"t":"Header","c":[1,["references",["unnumbered"],[]],[{"t":"Str","c":"References"}]]}],"pandoc-api-version":[1,21],"meta":{"references":{"t":"MetaList","c":[{"t":"MetaMap","c":{"id":{"t":"MetaString","c":"item1"},"title":{"t":"MetaString","c":"A title with <i>a few words italicized</i>"},"type":{"t":"MetaString","c":"article-journal"}}}]}}}

Output from pandoc 2.11:

{"pandoc-api-version":[1,22],"meta":{"references":{"t":"MetaList","c":[{"t":"MetaMap","c":{"id":{"t":"MetaString","c":"item1"},"title":{"t":"MetaString","c":"A title with <i>a few words italicized</i>"},"type":{"t":"MetaString","c":"article-journal"}}}]}},"blocks":[{"t":"Para","c":[{"t":"Str","c":"Blah"},{"t":"Space"},{"t":"Cite","c":[[{"citationId":"item1","citationPrefix":[],"citationSuffix":[],"citationMode":{"t":"NormalCitation"},"citationNoteNum":1,"citationHash":0}],[{"t":"Str","c":"[@item1]"}]]},{"t":"Str","c":"."}]},{"t":"Header","c":[1,["references",["unnumbered"],[]],[{"t":"Str","c":"References"}]]}]}

Note that the title that appears ("A title with <i>a few words italicized</i>") is identical in both cases.

Question is: When formatting this further, why would pandoc 2.10 parse the in-field markup correctly while 2.11 does not?

Owner

jgm commented Oct 6, 2020

Looping in @odkr as pandoc-zotxt.lua maintainer, since I don't know the details of how the filter is doing this, and how the conversion worked before.

It may be that pandoc-zotxt.lua is going to need to check the pandoc version and behave differently for 2.11+. For pandoc 2.11, you can just use pandoc.read(string, "csljson") to convert a CSL JSON bibliography to a Pandoc document, and then just grab the references from the metadata. (Actually, we have a more directly relevant function, cslJsonToReferences, in the unexposed module Text.Pandoc.Citeproc.CslJson. I wouldn't mind exposing that module if it makes things easier, though we'd also have to expose it in the lua interface. It's probably easier just to use read, which should work already.)

Author

njbart commented Oct 7, 2020

I do hope @odkr will weigh in. Still, I might be wrong, but in my mind this seems to be a pandoc, not a pandoc-zotxt.lua issue:

The two .json files above show that pandoc-zotxt.lua seems to have done its job successfully in both cases, with both files containing essentially the same information. So much so that the first can be processed with pandoc 2.11 and the second with 2.10, provided pandoc-api-version is edited appropriately beforehand. Again, 2.10 can parse the <i>...</i> tags as expected in the .json file generated by 2.10 while 2.11 renders these tags in the .json created by 2.10 as-is (&lt;i&gt;a Few Words Italicized&lt;/i&gt; in html output).

So something relevant does seem to have changed between pandoc 2.10 and 2.11. But what?

njbart changed the title from "New citeproc regression: CSL JSON in-field markup not recognized when using pandoc-zotxt.lua" to "[citeproc] CSL JSON in-field markup not recognized when using pandoc-zotxt.lua" Oct 7, 2020

odkr commented Oct 7, 2020

Prima facie, I agree with @njbart. pandoc-zotxt.lua doesn’t do a whole lot. It collects citation keys and then sends an HTTP request for each one to zotxt, which responds with bibliography entries in CSL JSON. pandoc-zotxt.lua then parses those entries and either adds them to the references metadata field or writes them out to a file and adds the path to that file to the bibliography metadata field. It leaves all actual work to pandoc-citeproc. The only place where a difference between the references metadata field and the bibliography file could creep in is the call to lunajson's encode function—but that lunajson somehow makes the formatting work strikes me as, err, unlikely. The semester is just starting here, so it’ll take a while before I can have a closer look at this.

Owner

jgm commented Oct 7, 2020

pandoc-zotxt.lua then parses those entries and either adds them to the references metadata field

The key question is: how does it parse them? I've told you how you could parse them reliably using pandoc 2.11+. How were you doing it before? If you can tell me that, maybe I can explain why it stopped working.

odkr commented Oct 7, 2020

It takes the CSL JSON from zotxt and passes it on to lunajson's decode function, which returns a multidimensional table. Older versions of pandoc-citeproc didn’t like that numeric values in zotxt's CSL JSON are (were?) floats, so pandoc-zotxt.lua rounds down every number and converts it to a string. It then adds the citekey as id to the root of the multidimensional table.

The relevant blocks of code are:

function get_source (citekey)
    assert(type(citekey) == 'string', 'given citekey is not a string')
    assert(citekey ~= '', 'given citekey is the empty string')
    local _, reply
    for i = 1, #keytypes do
        local query_url = concat({ZOTXT_QUERY_BASE_URL, keytypes[i], '=', citekey})
        -- This is the HTTP get request to zotxt.
        reply = select(2, fetch(query_url, '.'))
        -- This is the call to lunajson's decode function.
        local ok, data = pcall(decode, reply)
        if ok then
            insert(keytypes, 1, remove(keytypes, i))
            -- This is the number conversion (see below).
            local source = convert_numbers_to_strings(data[1])
            source.id = citekey
            return source
        end
    end
    return nil, reply
end

and

function convert_numbers_to_strings (data, depth)
    if not depth then depth = 1 end
    assert(depth < 512, 'too many recursions')
    local data_type = type(data)
    if data_type == 'table' then
        local s = {}
        for k, v in pairs(data) do 
            s[k] = convert_numbers_to_strings(v, depth + 1)
        end
        return s
    elseif data_type == 'number' then
        return tostring(floor(data))
    else
        return data
    end
end

These entries are then added to the references metadata field or converted back to JSON again and written out to a bibliography file.
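
For illustration only, the normalisation described above can be sketched in Python (the filter itself is written in Lua; the sample reply and the variable names here are assumptions, not taken from pandoc-zotxt.lua or zotxt). The sketch suggests why the references metadata ends up holding plain strings: numbers are floored and stringified, while string fields such as the title pass through untouched, in-field markup included.

```python
import json
import math

def convert_numbers_to_strings(data, depth=1):
    """Recursively replace every number with the string of its floor."""
    if depth >= 512:
        raise RecursionError("too many recursions")
    if isinstance(data, dict):
        return {k: convert_numbers_to_strings(v, depth + 1) for k, v in data.items()}
    if isinstance(data, list):
        return [convert_numbers_to_strings(v, depth + 1) for v in data]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(data, (int, float)) and not isinstance(data, bool):
        return str(math.floor(data))
    return data

# Hypothetical zotxt reply for one item; the values are made up for illustration.
reply = (
    '[{"title": "A title with <i>a few words italicized</i>",'
    ' "type": "article-journal",'
    ' "issued": {"date-parts": [[2020.0, 10.0]]}}]'
)

source = convert_numbers_to_strings(json.loads(reply)[0])
source["id"] = "item1"  # the citekey is stored as the entry's id

print(json.dumps(source, indent=2, sort_keys=True))
```

Nothing in this step interprets the title, so whether `<i>` is later rendered as italics depends entirely on the consumer of the metadata.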

I’ll have a look at it and try a call to read (which is, of course, a good idea at any rate); it’s just that I’m not sure when I’ll get around to doing that.

I take it that pandoc-citeproc processes CSL that it reads from bibliography files differently than bibliographic data that it picks up from references?

Owner

jgm commented Oct 7, 2020

If you're just using a standard JSON parser, you'll get plain strings for the values of the variables (title etc.). If you add plain strings to a pandoc metadata field without parsing them into pandoc Inlines, then obviously things like <i> are just going to come out literally. From your description, it's rather obvious why it doesn't work -- but what I don't understand is why it did work before, if this is all you're doing.

Owner

jgm commented Oct 7, 2020

Note also that the relevant change in pandoc 2.11 is that pandoc-citeproc is no longer used. Instead, citation processing is built into pandoc (triggered by --citeproc), and a new library is used.

Like pandoc-citeproc, this library does indeed process CSL JSON differently from references in metadata. In pandoc's YAML metadata, the fields are interpreted as markdown. In CSL JSON, they are interpreted using the formatting rules for CSL JSON (with special treatment for <i> for example).

Owner

jgm commented Oct 7, 2020

Clarification: if you're directly adding things to metadata (e.g. references) in a lua filter, you need to add structured pandoc elements. Adding a string with markdown formatting is just adding a string. The reference fields are parsed as markdown by pandoc when they occur in the context of a document, but a lua filter manipulates the AST and hence works with pandoc elements (MetaVal, Inline, Block, etc.).
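
To make the distinction concrete, here is a rough Python sketch that contrasts a plain MetaString with a MetaInlines value built from the same title, using the shapes of pandoc's JSON AST. The parser is a toy written for this illustration: it is not pandoc's CSL JSON reader, it handles only the `<i>`, `<b>`, `<sup>` and `<sub>` tags, and the names `parse_inlines`, `TAGS` and `TOKEN` are assumptions.

```python
import json
import re

# Small subset of CSL JSON in-field tags, mapped to pandoc inline constructors.
TAGS = {"i": "Emph", "b": "Strong", "sup": "Superscript", "sub": "Subscript"}
TOKEN = re.compile(r"</?(?:i|b|sup|sub)>|\s+|[^\s<]+|<")

def parse_inlines(text):
    """Turn a field value into a list of pandoc-style inline dicts."""
    stack = [("root", [])]  # (tag name, children collected so far)
    for tok in TOKEN.findall(text):
        m = re.fullmatch(r"<(/?)(i|b|sup|sub)>", tok)
        if m and not m.group(1):
            stack.append((m.group(2), []))  # opening tag: start a nested list
        elif m and stack[-1][0] == m.group(2):
            name, children = stack.pop()  # matching closing tag: wrap children
            stack[-1][1].append({"t": TAGS[name], "c": children})
        elif tok.isspace():
            stack[-1][1].append({"t": "Space"})
        else:
            stack[-1][1].append({"t": "Str", "c": tok})  # text or a stray tag
    # Unclosed tags: keep their content but drop the formatting.
    while len(stack) > 1:
        _, children = stack.pop()
        stack[-1][1].extend(children)
    return stack[0][1]

title = "A title with <i>a few words italicized</i>"

# What a filter produces when it stores the raw string.
as_string = {"t": "MetaString", "c": title}
# What a structured value with real emphasis looks like.
as_inlines = {"t": "MetaInlines", "c": parse_inlines(title)}

print(json.dumps(as_string))
print(json.dumps(as_inlines))
```

In the first value the tags are ordinary characters, so a writer has no choice but to escape them; in the second the emphasis is an `Emph` node that any output format can render.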

odkr commented Oct 7, 2020

Hmm, could it be that pandoc-citeproc, but not the new library, of course, processes <i> tags even if they occur in metadata fields? That’s the only way I see how this could have worked so far (a very useful bug ;-)). At any rate, that’s clearly an issue for pandoc-zotxt.lua. @njbart, thanks for reporting this! I’ll update pandoc-zotxt.lua soon-ish.

odkr commented Oct 7, 2020

Oh, and while I’m at it. Thanks, @jgm, for taking the time to clear that up!

Owner

jgm commented Oct 7, 2020

I'm happy to help further if you have questions while revising pandoc-zotxt.lua. We can keep this open for now if you like.

odkr commented Oct 8, 2020

Thanks! That’d be great.

odkr commented Oct 12, 2020

@njbart, I’ve updated pandoc-zotxt.lua with v0.3.18b. It works for me, but please give it a try and let me know if it works.

odkr commented Oct 12, 2020

@jgm, I’ve encountered a bug when I updated my test suite:

pandoc --citeproc -t plain <<EOF
---
references:
    - {"id":"díaz-león:2013what","type":"article-journal","title":"What Is Social Construction?","container-title":"European Journal of Philosophy","page":"1-16","URL":"http://onlinelibrary.wiley.com/doi/10.1111/ejop.12033/abstract","DOI":"10.1111/ejop.12033","ISSN":"1468-0378","language":"en","author":[{"family":"Díaz-León","given":"Esa"}],"issued":{"date-parts":[[2013,5,1]]},"accessed":{"date-parts":[[2015,11,14]]},"container-title-short":"Eur J Philos"}
    - {"id":"díaz-león:2015defence","type":"article-journal","title":"In Defence of Historical Constructivism about Races","container-title":"Ergo, an Open Access Journal of Philosophy","volume":"2","URL":"http://hdl.handle.net/2027/spo.12405314.0002.021","DOI":"10.3998/ergo.12405314.0002.021","ISSN":"2330-4014","author":[{"family":"Díaz-León","given":"Esa"}],"issued":{"date-parts":[[2015]]}}
    - {"id":"díaz-león:2016woman","type":"article-journal","title":"<i>Woman</i> as a Politically Significant Term: A Solution to the Puzzle","container-title":"Hypatia","page":"245-258","URL":"http://onlinelibrary.wiley.com.uaccess.univie.ac.at/doi/10.1111/hypa.12234/abstract","DOI":"10.1111/hypa.12234","ISSN":"1527-2001","title-short":"<i>Woman</i> as a Politically Significant Term","language":"en","author":[{"family":"Díaz-León","given":"Esa"}],"issued":{"date-parts":[[2016,2,1]]},"accessed":{"date-parts":[[2016,2,18]]},"container-title-short":"Hypatia"}
---

p [cf. @díaz-león:2013what; @díaz-león:2015defence; @díaz-león:2016woman], \
yet p [@díaz-león:2013what; @díaz-león:2015defence; @díaz-león:2016woman]!

EOF

prints:

p (cf. Díaz-León 2013; Díaz-León 2015, 2016),
yet p (Díaz-León 2013, 2015, 2016)!
[...]

That is, Pandoc fails to group (some) references together, but only when they are prefixed. pandoc-citeproc doesn’t do that.

Shall I open a new issue for that?

Owner

jgm commented Oct 12, 2020

Yes, why don't you close this, and perhaps we can open up a new issue for the grouping issue.
This behavior is intentional -- but perhaps there is a better option? The reason for it is that if you allow citation items with prefixes to be rearranged in grouping, you can get bad results, e.g.

[@foo13;  @foo12; see also @foo10]

should NOT become

(see also Foo 2010, 2012, 2013)

In your particular case, though, it would be harmless to group, because the position of the prefix isn't affected by sorting. I guess it would be helpful to compare citeproc-js's output.
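
The grouping rule under discussion can be modelled with a small Python sketch. This is a simplified model for illustration, not citeproc's algorithm: it does no sorting, the `Item` class and the `group_prefixed` flag are hypothetical, and it assumes that a prefixed item always starts a new group. With `group_prefixed=False`, items following a prefixed item are not merged with it, which reproduces the output reported above; with `group_prefixed=True` they are.

```python
from dataclasses import dataclass

@dataclass
class Item:
    author: str
    year: int
    prefix: str = ""

def render(items, group_prefixed=False):
    """Render citation items, collapsing adjacent items by the same author."""
    groups = []  # each group is a list of adjacent items sharing an author
    for item in items:
        can_join = bool(
            groups
            and not item.prefix  # assumption: a prefixed item starts a new group
            and groups[-1][0].author == item.author
            and (group_prefixed or not groups[-1][0].prefix)
        )
        if can_join:
            groups[-1].append(item)
        else:
            groups.append([item])

    parts = []
    for group in groups:
        head = f"{group[0].prefix} " if group[0].prefix else ""
        years = ", ".join(str(i.year) for i in group)
        parts.append(f"{head}{group[0].author} {years}")
    return "(" + "; ".join(parts) + ")"

prefixed = [
    Item("Díaz-León", 2013, prefix="cf."),
    Item("Díaz-León", 2015),
    Item("Díaz-León", 2016),
]
plain = [Item("Díaz-León", y) for y in (2013, 2015, 2016)]

print(render(prefixed))                       # prefixed item kept apart
print(render(prefixed, group_prefixed=True))  # prefixed item grouped
print(render(plain))
```

Because the sketch never reorders items, it cannot produce the bad "see also" result described above; that risk only arises once grouping is combined with sorting.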

Author

njbart commented Oct 13, 2020

@odkr, v0.3.18b behaves very well so far. Many thanks.

odkr commented Oct 13, 2020

@njbart, great! Could you close the issue then? (If it turns out that v0.3.18b is buggy after all, you can open an issue at its repo).

odkr commented Oct 13, 2020

@jgm, if it’s intended then I think it’s fine as it is. What struck me as odd was the repetition of the author name. If there’s an easy way to fix this, that’d be good, IMHO. But if there isn’t, it’s fine, again IMHO, for that to be left to users to fix manually.

njbart closed this as completed Oct 13, 2020

Owner

jgm commented Oct 13, 2020

I think it would be good to open a new issue, though I'm still not sure what to think about this case. Here's what citeproc.js yields:

(see Díaz-León 2013; 2016; 2015)

If the prefix is moved to 2016, citeproc.js yields:

(see Díaz-León 2016; 2013; 2015)

I guess the repetition of the author's name would better be avoided, here, and it would be good to have an issue to track this.

jgm added a commit to jgm/citeproc that referenced this issue Oct 13, 2020
Previously if a citation item had a prefix, it would
not be grouped with following citations.

See jgm/pandoc#6722 for discussion.

Owner

jgm commented Oct 13, 2020

OK, I've pushed a fix to citeproc; now we get

(see Diaz-Leon 2013, 2016, 2015)

This differs from the citeproc.js output, but I think our output is right -- the cite-group-delimiter should be used here (and it defaults to comma). For relevant discussion see citation-style-language/test-suite#39.

odkr commented Oct 14, 2020

Oh, thanks a lot! I was about to open that issue, but that’s no longer needed, or is it?

Owner

jgm commented Oct 14, 2020

It's no longer needed.

odkr added a commit to odkr/pandoc-zotxt.lua that referenced this issue Oct 19, 2020