nrow(arxiv_search()) is unpredictable #54
I think the arXiv works best for searches that return a smaller number of documents, so if you're looking for a reproducible example, maybe focus on the results for a particular year.

res <- arxiv_search(
    "cat:hep-ph AND submittedDate:[1992 TO 1993]",
    limit=1000,
    batchsize=1000,
    sort_by="submitted",
    ascending=FALSE)

I don't know for sure, but I expect you're occasionally getting a connection error part-way through and getting truncated results.
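If truncation is indeed the cause, one way to at least detect it is to compare the number of rows returned with what arxiv_count() reports, and to repeat the search when the two disagree. The following is only a rough sketch: fetch_checked, max_tries and wait are hypothetical names, not part of aRxiv, and retrying is not guaranteed to help if the arXiv API keeps returning short results.

```r
# Hypothetical helper (not part of aRxiv): run a search, compare the row count
# with arxiv_count(), and retry a few times if the result looks truncated.
# search_fun and count_fun default to the aRxiv functions; they are arguments
# only so that the logic can be tried without a network connection.
fetch_checked <- function(query, max_tries = 3, wait = 3,
                          search_fun = aRxiv::arxiv_search,
                          count_fun = aRxiv::arxiv_count) {
  expected <- count_fun(query)
  if (expected == 0) return(data.frame())

  for (i in seq_len(max_tries)) {
    res <- search_fun(query, limit = expected, batchsize = 1000,
                      sort_by = "submitted", ascending = FALSE)
    if (nrow(res) == expected) return(res)
    message("try ", i, ": got ", nrow(res), " of ", expected, " rows")
    Sys.sleep(wait)  # pause before asking again
  }

  warning("result may be truncated: ", nrow(res), " of ", expected, " rows")
  res
}
```

It could then be called as fetch_checked("cat:hep-ph AND submittedDate:[1992 TO 1993]"). A warning instead of an error is used so that the partial result is still returned and can be inspected.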
Thanks for your help, I see. I found several similar issues on the arXiv API Google group, so I guess the problem (if any) is on their side. I'll either try to limit my queries or switch to the OAI-PMH interface.

Just for the record (not a reprex, just an instance of what I was saying above):

library(aRxiv)
query <- "cat:hep-ph AND submittedDate:[2020 TO 2021]"
res <- arxiv_search(
query,
limit = 10000,
batchsize = 1000,
sort_by = "submitted", ascending = F
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4
#> retrieved batch 5
#> retrieved batch 6
c(nrow(res), arxiv_count(query))
#> [1] 4100 6930
res <- arxiv_search(
query,
limit = 10000,
batchsize = 1000,
sort_by = "submitted", ascending = F
)
#> retrieved batch 1
#> retrieved batch 2
#> retrieved batch 3
#> retrieved batch 4
c(nrow(res), arxiv_count(query))
#> [1] 3000 6930

Created on 2021-07-18 by the reprex package (v2.0.0)
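In both runs above the search stopped after a few batches with fewer rows than arxiv_count() reports. One possible way to "limit my queries" is to page through the results manually with the start and limit arguments of arxiv_search(), repeating any page that comes back short. This is only a sketch under that assumption: fetch_paged, page_size, max_tries and wait are hypothetical names, and whether smaller pages are actually more reliable has not been verified here.

```r
# Hypothetical helper (not part of aRxiv): fetch `total` records page by page,
# retrying a page when it returns fewer rows than requested.
# search_fun defaults to aRxiv::arxiv_search; it is an argument only so that
# the paging logic can be tried offline with a stand-in function.
fetch_paged <- function(query, total, page_size = 500, max_tries = 3, wait = 3,
                        search_fun = aRxiv::arxiv_search) {
  stopifnot(total > 0, page_size > 0)
  starts <- seq(0, total - 1, by = page_size)  # `start` is a 0-based offset
  pages <- vector("list", length(starts))

  for (k in seq_along(starts)) {
    want <- min(page_size, total - starts[k])
    for (i in seq_len(max_tries)) {
      page <- search_fun(query, start = starts[k], limit = want,
                         batchsize = want,
                         sort_by = "submitted", ascending = FALSE)
      if (nrow(page) == want) break
      message("page at offset ", starts[k], ": got ", nrow(page),
              " of ", want, " rows (try ", i, ")")
      Sys.sleep(wait)
    }
    pages[[k]] <- page  # keeps the last attempt, even if it is still short
  }

  do.call(rbind, pages)
}
```

For the query above this might be used as res <- fetch_paged(query, total = arxiv_count(query)), after which nrow(res) can again be compared with arxiv_count(query).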
Hi and thanks for this very nice package (it made my day!).

I'm trying to scrape the last, say, 15k papers from the hep-ph category with arxiv_search(). However, the number of rows in the returned data frame varies from query to query (usually it is around 10k, but once I also got 1k)... I would love to provide a reproducible example but could not come up with one.

I'm not sure whether this is due to aRxiv or arXiv 😃 Have you ever noticed something similar? Might it have something to do with your comments on #14?

Thanks,
Valerio