Coming from your recent comment (edited) in the chemistry chat:
> Do you have any way of deciphering the DOI from a given arXiv URL, like for example https://arxiv.org/abs/1803.00014? This one has a DOI of 10.1103/PhysRevD.97.123524 but I find it impossible to comprehend from the URL (without loading and scraping the webpage of course). Am I right?

I said you are right, and that an arXiv article might not be peer-reviewed and published at all. That does not mean that you shouldn't be able to cite it. Going through all the motions whenever an arXiv source is cited will be fairly complicated, but we can leave that for another day and focus for now on supporting only the platform.
Basically you can use the arXiv API (help) at export.arxiv.org to extract the metadata. Let's go ahead with the example from above. Test set with different URL spaces:
The last one, arXiv:1803.00014, is the arxiv-id which you need.
This should all be straightforward to parse, taking the regex from Wikidata for the ID: `(\d{4}\.\d{4,5}|[a-z\-]+(\.[A-Z]{2})?/\d{7})(v\d+)?` (with the dot escaped so that it only matches a literal dot). You actually only need the digit part at the front, so `[Aa][Rr][Xx][Ii][Vv][^\d]+(\d{4}\.\d{4,5})` should already give you the number necessary, see here that it works.
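As a minimal sketch of that step, the shorter pattern can be wrapped in a small Python helper. The function name `extract_arxiv_id` is hypothetical, and the case-insensitive flag stands in for the `[Aa][Rr]...` character classes. The first two inputs are the ones from this issue; the third (a PDF link with a version suffix) is only an illustration and assumes that URL shape.

```python
import re

# Same idea as the short pattern above: the word "arxiv" in any casing,
# then one or more non-digits, then the new-style identifier.
ARXIV_ID_RE = re.compile(r"arxiv\D+(\d{4}\.\d{4,5})", re.IGNORECASE)


def extract_arxiv_id(text):
    """Return the new-style arXiv ID found in text, or None if there is none."""
    match = ARXIV_ID_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    for candidate in (
        "https://arxiv.org/abs/1803.00014",
        "arXiv:1803.00014",
        "https://arxiv.org/pdf/1803.00014v2",  # illustrative input, not from the issue
    ):
        print(candidate, "->", extract_arxiv_id(candidate))
```

Note that this deliberately ignores old-style identifiers (the `[a-z\-]+(\.[A-Z]{2})?/\d{7}` branch of the Wikidata regex) and drops any version suffix.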
Now you only have to plug it into the API: http://export.arxiv.org/api/query?id_list=1803.00014
and you'll get back the metadata.
You can extract the title with `<title>(.*)<\/title>`, the author(s) with `<author>(.*)<\/author>`, and the primary category with `<arxiv:primary_category.+?(?=term)term="([^"]+)"[^>]+>` [ref. How to match "anything up until this sequence of characters" in a regular expression?]. Add-on: get the DOI, for which I believe you already have a regex in place, or in this case easily with `<arxiv:doi[^>]+>(.*)<\/arxiv:doi>`.
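For comparison, here is a rough sketch of the same extraction done with an XML parser instead of regular expressions, which is a different technique from the one proposed above. It assumes the response is an Atom feed using the namespaces `http://www.w3.org/2005/Atom` and `http://arxiv.org/schemas/atom`, with each author wrapped as `<author><name>...</name></author>`. The names `fetch_feed`, `parse_entry` and `SAMPLE_FEED` are hypothetical, and `SAMPLE_FEED` is a hand-written, heavily trimmed stand-in for a real response (authors given with initials only), not a captured one.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace URIs assumed for the arXiv Atom feed.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written stand-in for an API response, for offline experiments only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <title>Horndeski dark matter and beyond</title>
    <author><name>A. Diez-Tejedor</name></author>
    <author><name>F. Flores</name></author>
    <author><name>G. Niz</name></author>
    <arxiv:doi>10.1103/PhysRevD.97.123524</arxiv:doi>
    <arxiv:primary_category term="gr-qc"/>
  </entry>
</feed>"""


def fetch_feed(arxiv_id):
    """Download the raw feed for one arXiv ID (needs network access)."""
    url = "http://export.arxiv.org/api/query?id_list=" + arxiv_id
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def parse_entry(feed_xml):
    """Return title, authors, primary category and DOI of the first entry."""
    root = ET.fromstring(feed_xml)
    entry = root.find("atom:entry", NS)
    if entry is None:
        return None
    # Collapse line breaks and repeated spaces inside the title.
    title = " ".join(entry.findtext("atom:title", default="", namespaces=NS).split())
    authors = [
        author.findtext("atom:name", default="", namespaces=NS).strip()
        for author in entry.findall("atom:author", NS)
    ]
    category = entry.find("arxiv:primary_category", NS)
    doi = entry.findtext("arxiv:doi", default=None, namespaces=NS)
    return {
        "title": title,
        "authors": authors,
        "category": category.get("term") if category is not None else None,
        "doi": doi.strip() if doi else None,  # None for unpublished preprints
    }


if __name__ == "__main__":
    print(parse_entry(SAMPLE_FEED))
```

Restricting the lookups to the `entry` element keeps any feed-level title out of the result, and a parser does not care whether an element spans several lines, which a plain `(.*)` would unless the regex is run in dot-matches-newline mode.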
Then cite it as follows:
short: arXiv:dddd.dddd(d) **\[prim_categ\]**
long:
1. Author, A.; Author, B. TITLE(.) arXiv:dddd.dddd(d) **\[prim_categ\]**
Specifically in this case:
short: arXiv:1803.00014 **[gr-qc]**
long:
1. Diez-Tejedor, A.; Flores, F.; Niz, G. Horndeski dark matter and beyond.
arXiv:1803.00014 **[gr-qc]**
Once you have the DOI, you can extend this further to include the journal publication (though I think that is only sensible in the long format):
long:
1. Diez-Tejedor, A.; Flores, F.; Niz, G. Horndeski dark matter and beyond.
arXiv:1803.00014 **[gr-qc]** <br />
Published as: Diez-Tejedor, A.; Flores, F.; Niz, G. Horndeski dark matter and beyond.
*Phys. Rev. D* **2018,** *97* (12), 123524.
[DOI: 10.1103/PhysRevD.97.123524](https://doi.org/10.1103/PhysRevD.97.123524).
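The short and long arXiv formats above could be assembled roughly as follows. This is only a sketch: the function names are hypothetical, and `to_surname_initials` is a naive heuristic that treats the last whitespace-separated word as the surname, which will not hold for every name. The journal part ("Published as: ...") is left out because it needs data from the DOI rather than from arXiv.

```python
def to_surname_initials(name):
    """Naively turn 'Given Names Surname' into 'Surname, G. N.'."""
    parts = name.split()
    if len(parts) < 2:
        return name
    initials = " ".join(part[0] + "." for part in parts[:-1])
    return f"{parts[-1]}, {initials}"


def format_short(arxiv_id, category):
    """Short format: arXiv:dddd.dddd(d) **[prim_categ]**"""
    return f"arXiv:{arxiv_id} **[{category}]**"


def format_long(authors, title, arxiv_id, category):
    """Long format: authors, title with a closing period, then the short form."""
    author_part = "; ".join(to_surname_initials(name) for name in authors)
    # TITLE(.): add a period only if the title has no closing punctuation yet.
    if not title.endswith((".", "?", "!")):
        title += "."
    return f"{author_part} {title} {format_short(arxiv_id, category)}"


if __name__ == "__main__":
    print(format_short("1803.00014", "gr-qc"))
    print(
        format_long(
            ["A. Diez-Tejedor", "F. Flores", "G. Niz"],
            "Horndeski dark matter and beyond",
            "1803.00014",
            "gr-qc",
        )
    )
```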