BibTeX ABS export: trailing <P /> #172

Open
golnazads opened this issue Jun 16, 2020 · 4 comments

Comments

@golnazads
Contributor

golnazads commented Jun 16, 2020

Alberto replied to you:
@Carolyn @golnaz Sorry, I neglected to let you know about this possible markup. Please translate <P /> to a blank line and <BR /> to a newline when outputting in a non-XML format. I think this means that for BibTeX it would be:
<P /> => \\

@golnazads
Contributor Author

@aaccomazzi
This is implemented for BibTeX ABS. Do I also need to remove these tags elsewhere, for example in the custom format with Unicode encoding? I am guessing the answer is yes for LaTeX encoding. If it is also yes for Unicode, then I guess I need to fix that for the XML and fielded formats as well, right? Thank you.

@aaccomazzi
Member

This is the situation with respect to encoding in our JSON fields (see e.g. 2016ApJ...818L..26F):

  1. Abstract and title text have the basic HTML entities encoded (these are <, > and &).
  2. They may also contain some markup in the form of <SUB> etc.

When creating custom output, we recognize and support three basic encodings:

  1. HTML: the entities and markup are kept as they are, so &lt; remains &lt;.
  2. LaTeX: the entities and markup are translated according to HTML -> LaTeX syntax.
  3. Unicode: the entities are turned into their Unicode equivalents. Here that means just the three entities above, which become <, >, and &. The treatment of markup under Unicode encoding has never been formally defined in our documentation, so I had to check the classic code to see what we do. It turns out classic simply strips the markup: <SUB> -> (empty string)
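To make the entity half of the three encodings concrete, here is a minimal Python sketch. It is not the export_service code: the function name, the encoding labels, and in particular the LaTeX replacements (`$<$`, `$>$`, `\&`) are assumptions chosen for illustration, and markup such as <SUB> is deliberately left untouched.

```python
# Hypothetical helper for illustration only; not taken from export_service.
# The LaTeX replacements are an assumed mapping, not the documented one.
UNICODE_ENTITIES = {"&lt;": "<", "&gt;": ">", "&amp;": "&"}
LATEX_ENTITIES = {"&lt;": "$<$", "&gt;": "$>$", "&amp;": r"\&"}


def convert_entities(text, encoding):
    """Convert the three basic HTML entities for the requested encoding."""
    if encoding == "html":
        # HTML output keeps entities exactly as stored in the JSON fields.
        return text
    if encoding == "unicode":
        table = UNICODE_ENTITIES
    elif encoding == "latex":
        table = LATEX_ENTITIES
    else:
        raise ValueError(f"unknown encoding: {encoding!r}")
    # &amp; is listed last in both tables, so "&amp;lt;" is not decoded twice.
    for entity, replacement in table.items():
        text = text.replace(entity, replacement)
    return text
```

With this sketch, `convert_entities("a &lt; b", "html")` returns the input unchanged, while the `"unicode"` encoding yields `a < b`.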

I feel that classic's handling of markup under Unicode encoding is wrong, because we provide a separate formatting option to control the treatment of markup (%ZMarkup:{keep|strip}), as documented here: http://adsabs.github.io/help/actions/export
So I'm in favor of passing markup through as it is and letting users customize the output via formatting options.

@golnazads
Contributor Author

Just for your information, export already has the markup keep|strip option: https://github.com/adsabs/export_service/blob/master/exportsrv/formatter/customFormat.py#L702. I can remove it if you want, @aaccomazzi.

@aaccomazzi
Member

We should keep the markup option; this way users can control what they do and do not get.
So I think the adjustments to make for Unicode encoding are:

  1. <P /> => \n\n (new paragraph)
  2. <BR /> => \n (newline)
  3. &amp;, &gt;, &lt; => &, >, <
  4. All other markup: controlled by %ZMarkup settings
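The four rules above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `to_unicode` and its `markup` parameter are hypothetical names standing in for whatever the %ZMarkup setting resolves to, and the tag-matching regexes are a guess at what counts as markup, not the export_service implementation.

```python
import re

# Hypothetical helper for illustration only; not taken from export_service.
_P_TAG = re.compile(r"<P\s*/>", re.IGNORECASE)
_BR_TAG = re.compile(r"<BR\s*/>", re.IGNORECASE)
_ANY_TAG = re.compile(r"</?[A-Za-z][^<>]*>")  # assumed definition of "markup"


def to_unicode(text, markup="keep"):
    """Apply the four Unicode-encoding rules to an abstract or title."""
    # Rules 1 and 2: structural tags become line breaks.
    text = _P_TAG.sub("\n\n", text)
    text = _BR_TAG.sub("\n", text)
    # Rule 4: all other markup is kept or stripped, as %ZMarkup would request.
    if markup == "strip":
        text = _ANY_TAG.sub("", text)
    # Rule 3: decode entities after markup handling, and &amp; last of all,
    # so escaped text such as "&lt;SUB&gt;" is never mistaken for a tag.
    text = text.replace("&lt;", "<").replace("&gt;", ">")
    return text.replace("&amp;", "&")
```

Decoding entities last is a design choice of this sketch: doing it first would turn literal text like `&lt;SUB&gt;` into something the strip step then removes.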
