Encoding issue with GetBibEntryWithDOI #24

Closed
robert-winkler opened this issue Dec 1, 2016 · 2 comments

Comments

@robert-winkler

Hi, I am processing a bib file with various Spanish authors.
Many of the names are garbled because of accents, e.g. Gutia'errez-Uribe instead of Gutiérrez-Uribe.
Is there any solution?
Thanks, Robert

library(RefManageR)
bib <- ReadBib("draft.bib", check = FALSE)
bib <- GetDOIs(bib)
bibfromdoi <- GetBibEntryWithDOI(bib$doi)
WriteBib(bibfromdoi, file = "cleanfromdoi.bib", biblatex=TRUE, verbose=TRUE)

draft.bib.zip

@mwmclean
Collaborator

mwmclean commented Dec 9, 2016

You can certainly use RefManageR to update the fields you encounter with bad accent formatting, e.g.

bib[[1]]$author <- "Doe, John and Gutiérrez-Uribe, Jane"

if you need to change the authors of the first entry, but that is not going to be the most efficient approach. Recent versions of R allow users to specify additional LaTeX macros in tools::parse_Rd; you may want to have a look at that, but I think your best bet is to just use regular expressions and gsub.
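The gsub suggestion can be sketched as follows. This is a minimal illustration, not part of RefManageR: `fixes` and `fix_accents` are hypothetical names, and only the first pair in the lookup table comes from the example reported in this thread (Gutia'errez-Uribe). The other pairs are guesses by analogy and would need checking against the real file.

```r
# Hypothetical lookup table: garbled sequence -> intended character.
# Only "a'e" is taken from the thread; the other two are assumptions.
# Unicode escapes keep the script independent of the file's own encoding.
fixes <- c(
  "a'e" = "\u00e9",  # e with acute accent
  "a'a" = "\u00e1",  # a with acute accent (assumed by analogy)
  "a'o" = "\u00f3"   # o with acute accent (assumed by analogy)
)

# Replace every garbled sequence in a character vector.
# fixed = TRUE treats each pattern as a literal string, not a regex.
fix_accents <- function(x, table = fixes) {
  for (bad in names(table)) {
    x <- gsub(bad, table[[bad]], x, fixed = TRUE)
  }
  x
}

print(fix_accents("Gutia'errez-Uribe"))
```

Literal replacement like this can also hit legitimate text that happens to contain the same sequence, so it is safer to review the changed strings before writing the file back out.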

@mwmclean mwmclean closed this as completed Jan 9, 2017
@hongyuanjia

hongyuanjia commented May 20, 2018

I encountered a similar issue with both RefManageR and the bibtex package when reading a BibTeX file with multi-byte characters on Windows. I opened an issue on bibtex, please see here.

# Get current locale info
Sys.getlocale()
#> [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

# Set locale to Chinese
Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <- "
    @misc{text,
        title = {{你好}},
        language = {zh-CN},
        author = {{你好}},
        month = jun,
        year = {2013},
        pages = {163}
    }
"
# change encoding to "UTF-8"
bib_text_utf8 <- enc2utf8(bib_text)
Encoding(bib_text_utf8)
#> [1] "UTF-8"

# make sure the saved BibTeX file is UTF-8 encoded
con <- file("test.bib", encoding = "UTF-8")
writeLines(bib_text_utf8, con)
close(con)

readLines("test.bib", encoding = "UTF-8")
#> [1] ""                                "        @misc{text,"            
#> [3] "            title = {{你好}},"   "            language = {zh-CN},"
#> [5] "            author = {{你好}},"  "            month = jun,"       
#> [7] "            year = {2013},"      "            pages = {163}"      
#> [9] "        }"                       "    "                           

It seems that bibtex::do_read_bib cannot correctly handle multi-byte characters even when all related encoding options are set to "UTF-8":

bib <- bibtex::do_read_bib("test.bib", "UTF-8", srcfile("test.bib", encoding = "UTF-8", "UTF-8"))
bib
#> [[1]]
#>      title   language     author      month       year      pages 
#> "{浣犲ソ}"    "zh-CN" "{浣犲ソ}"      "jun"     "2013"      "163" 
#> attr(,"entry")
#> [1] "misc"
#> attr(,"key")
#> [1] "text"
#> 
#> attr(,"include")
#> character(0)
#> attr(,"strings")
#> named character(0)
#> attr(,"preamble")
#> character(0)
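One plausible reading of the output above is that the UTF-8 bytes of the file are being interpreted in the native code page (936, i.e. GBK) instead of as UTF-8. The sketch below illustrates that mechanism on the raw bytes alone, without the bibtex package. It assumes the platform's iconv supports the "GBK" encoding name, and the variable names are illustrative only.

```r
# UTF-8 bytes of the two characters U+4F60 U+597D used in the test entry.
utf8_bytes <- as.raw(c(0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd))
raw_string <- rawToChar(utf8_bytes)  # encoding mark is "unknown"

# Misreading: treat the six UTF-8 bytes as GBK and convert to UTF-8.
# Six bytes become three two-byte GBK characters instead of two
# three-byte UTF-8 characters.
misread <- iconv(raw_string, from = "GBK", to = "UTF-8")

# Correct reading: declare that the bytes already are UTF-8.
correct <- raw_string
Encoding(correct) <- "UTF-8"

print(nchar(misread))  # number of characters after the misreading
print(nchar(correct))  # number of characters when read as UTF-8
```

If this reading is right, `misread` is expected to show the same three garbled characters as the title and author fields above, while `correct` shows the original two.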

Currently, one workaround is to manually mark the output of bibtex::do_read_bib as "UTF-8" in RefManageR::ReadBib, but I am not sure whether this is the right way to do it.

Encoding(bib[[1]])
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

Encoding(bib[[1]]) <- "UTF-8"
bib
#> [[1]]
#>      title   language     author      month       year      pages 
#> "{你好}"    "zh-CN" "{你好}"      "jun"     "2013"      "163" 
#> attr(,"entry")
#> [1] "misc"
#> attr(,"key")
#> [1] "text"
#> 
#> attr(,"include")
#> character(0)
#> attr(,"strings")
#> named character(0)
#> attr(,"preamble")
#> character(0)
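The manual fix shown here touches only the first entry. A minimal sketch of applying it to every entry follows, assuming the parsed object is a plain list of named character vectors as in the printed output. `mark_utf8` is a hypothetical helper name, not part of bibtex or RefManageR, and the sample `entries` list is built by hand for illustration rather than read from a file.

```r
# Hypothetical helper: mark every entry of a parsed bib list as UTF-8.
# Encoding<- only relabels the existing bytes; it does not convert them,
# so this is only appropriate when the bytes really are UTF-8.
mark_utf8 <- function(entries) {
  # entries[] <- ... keeps the list's own attributes (e.g. "preamble").
  entries[] <- lapply(entries, function(entry) {
    Encoding(entry) <- "UTF-8"
    entry
  })
  entries
}

# Hand-built stand-in for the parser output: UTF-8 bytes with no encoding mark.
title_bytes <- rawToChar(as.raw(c(0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd)))
entries <- list(c(title = title_bytes, year = "2013"))
attr(entries, "preamble") <- character(0)

fixed <- mark_utf8(entries)
print(Encoding(fixed[[1]]))
```

Pure ASCII fields such as the year keep the "unknown" mark after this step, which is normal in R since ASCII strings are never flagged with an encoding.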

@mwmclean Any insights? Thanks!
