
convert function #6

Closed
pitkant opened this issue Sep 6, 2024 · 11 comments

@pitkant

pitkant commented Sep 6, 2024

Hi, in the internal extractData function

`extractData` <- function(xml) {

more specifically in lines 561-577:

DDIwR/R/internals.R

Lines 561 to 577 in 6a24c63

if (all(is.element(
    c("xml_document", "xml_node"),
    class(xml)
))) {
    dns <- getDNS(xml) # default name space
    xpath <- sprintf("/%scodeBook/%sfileDscr/%snotes", dns, dns, dns)
    elements <- xml2::xml_find_all(xml, xpath)
    attrs <- lapply(elements, xml2::xml_attrs)
    wdata <- which(sapply(elements, function(x) {
        xml_attr(x, "ID") == "rawdata" &&
            grepl("serialized", xml_attr(x, "subject"))
    }))
    if (length(wdata) > 0) {
        notes <- xml2::xml_text(elements[wdata])
    }

the function attempts to find data at the xpath "/d1:codeBook/d1:fileDscr/d1:notes". In my file, the variable elements produces {xml_nodeset (0)} (a list of 0) and, consequently, the variable attrs is an empty list, list() (a list of 0). When the function then proceeds to lines 570-573

DDIwR/R/internals.R

Lines 570 to 573 in 6a24c63

wdata <- which(sapply(elements, function(x) {
    xml_attr(x, "ID") == "rawdata" &&
        grepl("serialized", xml_attr(x, "subject"))
}))

it runs into a problem, since my file only has notes elements under docDscr (not fileDscr), containing data like this:

<notes xml:lang="en">FSD:n aineistokuvailut (FSD metadata records) by Suomen yhteiskuntatieteellinen tietoarkisto (Finnish Social Science Data Archive) are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
      <ExtLink URI="https://creativecommons.org/licenses/by/4.0/deed.en" />
    </notes>

so I'm wondering whether my file specifically is not up to spec, or whether this function has hard-coded assumptions that prevent it from handling such files. When debugging something like this, it would be helpful if the package shipped some example files that could be used for testing and validation.
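To illustrate the failure mode (a minimal standalone sketch with a toy XML document, not my actual file): when the nodeset is empty, sapply() falls back to returning an empty list, which which() then rejects:

library(xml2)
doc <- read_xml("<codeBook><docDscr><notes>only notes here</notes></docDscr></codeBook>")
elements <- xml_find_all(doc, "/codeBook/fileDscr/notes")            # {xml_nodeset (0)}
sapply(elements, function(x) xml_attr(x, "ID") == "rawdata")         # list(), not logical(0)
which(sapply(elements, function(x) xml_attr(x, "ID") == "rawdata"))
# Error: argument to 'which' is not logical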

For context, what I'm attempting to do here is to see whether DDIwR could be used to provide labels for a dataset in .csv format. I have an openly available dataset with data in both .csv and SPSS's .sav format, alongside the DDI 2.5 XML file. I would like to try to replicate the .sav file using the DDIwR package, to see whether DDI would be a useful format for providing metadata for other types of datasets where such a neat arrangement is not available, mainly datasets that only have data in .csv format (or a .sav or similar file with no metadata) while the metadata resides in a separate data resource catalogue.

My function call looked like the following:

convert(from = "/Users/<username>/Downloads/FSD3840/Study/FSD3840_fin.xml", to = "R", csv = "/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.csv")

It could be that, since the function examples only cover cases where an "SPSS file called test.sav" is converted to other formats, my use of this function was not as intended? As stated earlier, since there is no test data provided with the package, I couldn't really run the functions as documented. However, I couldn't run the case that was more in line with the examples either:

convert(from = "/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.sav", to = "DDI")

I got Error: 'x' should contain at least one (possibly) numeric value. as the error message. The traceback looked like this:

8.
stop(simpleError(paste0(paste(message, collapse = "\n"), enter,
enter)))
7.
stopError("'x' should contain at least one (possibly) numeric value.")
6.
admisc::numdec(x) at internals.R#975
5.
getFormat(x, type = "Stata") at internals.R#536
4.
FUN(X[[i]], ...)
3.
lapply(dataset, function(x) {
result <- list(classes = class(x))
label <- attr(x, "label", exact = TRUE)
if (!is.null(label)) { ... at internals.R#458
2.
collectRMetadata(data) at convert.R#571
1.
convert(from = "/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.sav",
to = "DDI")

Am I using the function wrong? Or is there a problem in the code?

The dataset I used for testing is FSD3840 Citizen's Pulse 10/2023, and the data and metadata can be freely downloaded from the Finnish Social Science Data Archive website: https://services.fsd.tuni.fi/catalogue/FSD3840?tab=download&study_language=en&lang=en

(You need to use the Finnish metadata file, as the English metadata file only has general dataset descriptions and no fileDscr, for example.)

@dusadrian
Owner

Hello Pyry,

I get no error converting daF3840_fin.sav to DDI Codebook; the .xml file is produced just fine. To replicate that error, I would need the output of the following two commands: R.version and sessionInfo().

I also get no error converting the (Finnish version of the) metadata plus the .csv file; the result is a dataset containing 1132 rows and 78 variables:

fsd <- convert("/Users/dusadrian/Downloads/FSD3840/Study/Data/daF3840_fin.xml")

head(fsd$q23)
<declared<numeric>[6]> [q23] Kuinka tyytyväinen olet demokratian toimivuuteen Suomessa?
[1] 3 3 3 3 3 4

Labels:
 value                    label
     0            En osaa sanoa
     1  En lainkaan tyytyväinen
     2 En kovinkaan tyytyväinen
     3        Melko tyytyväinen
     4     Erittäin tyytyväinen
     9           En halua sanoa

In your R command, specifying the to = "R" argument creates an R dataset as a physical file. In my command above, omitting it simply imports the dataset into the R environment. Since the .csv file is located in the same subfolder and has the same name as the .xml file, the convert() function automatically finds it there. The csv argument is only needed if the .csv file is located elsewhere and/or named differently.
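In other words (the second path below is hypothetical, just to illustrate both situations):

# .csv next to the .xml with the same base name: picked up automatically
fsd <- convert("/Users/dusadrian/Downloads/FSD3840/Study/Data/daF3840_fin.xml")

# .csv stored elsewhere and/or named differently: point to it explicitly
fsd <- convert(
    "/Users/dusadrian/Downloads/FSD3840/Study/Data/daF3840_fin.xml",
    csv = "/some/other/folder/pulse_data.csv"   # hypothetical location and name
)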

Specific to the function extractData(): that is an internal function that identifies a specific type of notes in a specific location (codeBook/fileDscr/notes), where a specific type of data (serialized, gzip-ed) is located. All other notes, containing anything else, are left in place and read as they are.

Additionally, as the notes element is repeatable, there might even be multiple notes inside the codeBook/fileDscr element, but only that specific one is read and interpreted as a dataset. The other one(s) are read as text, and dedicated functions can be written to parse and interpret that text in various ways.
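Conceptually, the embedding works along these lines (an illustrative sketch only, not the exact code from the package; the base64 step via base64enc is my shorthand for storing the gzip-ed bytes as text):

# serialize a data.frame, gzip it, and store the result as text in a dedicated note
raw_data <- memCompress(serialize(mtcars, connection = NULL), type = "gzip")
note_text <- base64enc::base64encode(raw_data)
# extractData() looks for an ID of "rawdata" and a subject containing "serialized";
# the exact subject wording here is illustrative
note_xml <- sprintf('<notes ID="rawdata" subject="serialized datafile">%s</notes>', note_text)

# reading the dataset back reverses the steps
restored <- unserialize(memDecompress(base64enc::base64decode(note_text), type = "gzip"))
identical(restored, mtcars)  # TRUE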

There are examples and tests converting to DDI Codebook and back to R, more specifically lines 56-57.

It is just difficult to insert such examples in the appropriate section of the documentation, as they would make the CRAN checks very heavy.

@pitkant
Author

pitkant commented Sep 10, 2024

Thank you for your response and for confirming that I was not completely on the wrong track when I attempted to use the function. However, I couldn't replicate your result; I continued to get the same error, Error: 'x' should contain at least one (possibly) numeric value., when attempting to run the following code:

convert(from = "/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.sav", to = "DDI")

Here's my sessionInfo():

R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Helsinki
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] haven_2.5.4   declared_0.24 admisc_0.35   DDIwR_0.18.4 

loaded via a namespace (and not attached):
 [1] digest_0.6.37     utf8_1.2.4        base64enc_0.1-3   cellranger_1.1.0 
 [5] readxl_1.4.3      magrittr_2.0.3    glue_1.7.0        tibble_3.2.1     
 [9] pkgconfig_2.0.3   lifecycle_1.0.4   xml2_1.3.6        cli_3.6.3        
[13] fansi_1.0.6       vctrs_0.6.5       compiler_4.4.0    forcats_1.0.0    
[17] writexl_1.5.0     rstudioapi_0.16.0 tools_4.4.0       hms_1.1.3        
[21] pillar_1.9.0      rlang_1.1.4      

(DDIwR_0.18.4 refers to the latest version available here on GitHub in the master branch.)

@dusadrian
Owner

That error is most likely triggered by the package admisc: the GitHub version is some minor versions newer than CRAN's 0.35.
Also, the package declared is one version newer; even CRAN is already at 0.25.

So, just to make sure, could you please re-install all of them from GitHub? I use:

remotes::install_github("dusadrian/admisc")
remotes::install_github("dusadrian/declared")
remotes::install_github("dusadrian/DDIwR")

If the error still persists, perhaps a traceback() outcome would help.

@pitkant
Author

pitkant commented Sep 10, 2024

Thank you, that was it!

> Specific to the function extractData(): that is an internal function that identifies a specific type of notes in a specific location (codeBook/fileDscr/notes), where a specific type of data (serialized, gzip-ed) is located. All other notes, containing anything else, are left in place and read as they are.
>
> Additionally, as the notes element is repeatable, there might even be multiple notes inside the codeBook/fileDscr element, but only that specific one is read and interpreted as a dataset. The other one(s) are read as text, and dedicated functions can be written to parse and interpret that text in various ways.

It was only now, when I got to play with the function hands-on, that I realised that, without embed = FALSE, the function indeed embeds the serialised data into the .xml file. A little surprising to me, maybe a little unorthodox, but I admire the ingenuity.

However, while I could run the line

fsd <- convert("/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.xml")

with no problem when I had used embed = TRUE,

convert("/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.sav", to = "DDI", embed = TRUE)

I ran into the error I anticipated in my first message when I had set embed = FALSE:

convert("/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.sav", to = "DDI", embed = FALSE)
> fsd <- convert("/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.xml")

Error in which(sapply(elements, function(x) { : 
  argument to 'which' is not logical
3.
which(sapply(elements, function(x) {
xml_attr(x, "ID") == "rawdata" && grepl("serialized", xml_attr(x,
"subject"))
})) at internals.R#570
2.
extractData(xml) at convert.R#286
1.
convert("/Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.xml")

The .xml file and the .csv file were in the same folder and had the same base name, as the convert() function had generated them both.

@dusadrian
Owner

Good catch; I think it works alright now.
Could you try to re-install package DDIwR from the latest commit on GitHub?

@pitkant
Author

pitkant commented Sep 11, 2024

Thank you, the latest commits made everything work! This is great.

Returning to the first message in this issue: in the end, what I actually wanted to do was to use the /Users/<username>/Downloads/FSD3840/Study/FSD3840_fin.xml DDI file to label the /Users/<username>/Downloads/FSD3840/Study/Data/daF3840_fin.csv file.

I had two hurdles:

  1. The separator used in daF3840_fin.csv is ; instead of ,. Luckily this can be passed to the utils::read.csv call on line 339 of convert.R via the ... parameters, so all is good (a sketch follows after the name listing below). This is more of a note to myself.
  2. The comparison done on line 350 of convert.R, if (!identical(names(data), names(variables))), didn't pass, since the names in the (.csv) data file are in lower case while the names in the (.xml DDI) variables are in upper case. I turned the names in the data file to upper case in debug mode and then everything went smoothly. Maybe something to consider, since failing the identical check only because of upper / lower case differences doesn't sound ideal to me?
Browse[1]> names(data)
 [1] "fsd_no" "fsd_vr" "fsd_id" "fsd_li" "t1"     "t2"     "t5"     "t6"     "t3"     "q23"    "q1"     "q1_b"   "q1_c"   "q1_d"   "q1_e"   "q1_f"   "q1_g"  
[18] "q1_h"   "q5"     "q6"     "q7"     "q34"    "q8"     "q11k"   "q11l"   "q11m"   "q11n"   "q11o"   "q11p"   "q11q"   "q11r"   "q11s"   "q11t"   "q11x"  
[35] "q11y"   "q11z"   "q45b"   "q55_1"  "q55_2"  "q55_3"  "q55_4"  "q55_5"  "q55_6"  "q55_7"  "q55_8"  "q55_9"  "q55_10" "q55_11" "q55_12" "q55_13" "q55_14"
[52] "q56a"   "q56b"   "q37j"   "q35"    "q36b"   "q36d"   "q36e"   "q37b"   "q37c"   "q37g"   "q37h"   "q37i"   "q37d"   "q37e"   "q37f"   "q44a"   "q44b"  
[69] "q21g"   "q21h"   "q21c"   "q21f"   "q4a"    "q18w"   "t4"     "t7"     "bv1"    "paino" 
Browse[1]> names(variables)
 [1] "FSD_NO" "FSD_VR" "FSD_ID" "FSD_LI" "T1"     "T2"     "T5"     "T6"     "T3"     "Q23"    "Q1"     "Q1_B"   "Q1_C"   "Q1_D"   "Q1_E"   "Q1_F"   "Q1_G"  
[18] "Q1_H"   "Q5"     "Q6"     "Q7"     "Q34"    "Q8"     "Q11K"   "Q11L"   "Q11M"   "Q11N"   "Q11O"   "Q11P"   "Q11Q"   "Q11R"   "Q11S"   "Q11T"   "Q11X"  
[35] "Q11Y"   "Q11Z"   "Q45B"   "Q55_1"  "Q55_2"  "Q55_3"  "Q55_4"  "Q55_5"  "Q55_6"  "Q55_7"  "Q55_8"  "Q55_9"  "Q55_10" "Q55_11" "Q55_12" "Q55_13" "Q55_14"
[52] "Q56A"   "Q56B"   "Q37J"   "Q35"    "Q36B"   "Q36D"   "Q36E"   "Q37B"   "Q37C"   "Q37G"   "Q37H"   "Q37I"   "Q37D"   "Q37E"   "Q37F"   "Q44A"   "Q44B"  
[69] "Q21G"   "Q21H"   "Q21C"   "Q21F"   "Q4A"    "Q18W"   "T4"     "T7"     "BV1"    "PAINO" 

@dusadrian
Owner

I see.
For SPSS and Stata, variable naming (upper / lower case) should not matter. From that perspective, I could easily check by converting everything to lower case and then comparing.
But DDI is not only for SPSS and Stata; I imagine this can also be used for pure R datasets, where the case does matter.

I guess a more descriptive error message could be introduced (see the latest commit), so that the user is now better informed as to why the .csv file does not match the DDI .xml codebook.

Any suggestion is welcome. I agree it doesn't sound ideal, but I don't know any other way to avoid comparing apples and oranges.

@pitkant
Author

pitkant commented Sep 11, 2024

I agree it's a good idea to keep the identical() comparison as tight as possible. More informative error messages are always very welcome, in my opinion.

  1. Maybe some of the complications could be removed if there were an option, instead of pointing to a .csv file on disk, to pre-emptively read the csv object into memory using the appropriate method for the appropriate delimiter (read.csv or read.csv2), after which you could manipulate the header row (toupper(), tolower(), maybe something else as well). Then the object in memory could be labelled using the DDI codebook.
  2. Another option might be to treat upper and lower case as a special case, where the ifelse in lines 354-355

DDIwR/R/convert.R

Lines 352 to 358 in 6f2ee52

sprintf(
    "The .csv file does not match the DDI Codebook%s.",
    ifelse(
        identical(tolower(names(data)), tolower(names(variables))),
        ", the variable names have differences in upper / lower case",
        ""
    )

would be taken out of the admisc::stopError() call. The function could proceed if and only if the names were identical once their cases were harmonised (and maybe emit a message instead of a warning), and throw the error only if controlling for character case does not yield an identical result. Like this:

            if (ncol(data) == length(variables)) {
                if (header) {
                    if (!identical(names(data), names(variables))) {
                        if (identical(tolower(names(data)), tolower(names(variables)))) {
                            names(data) <- tolower(names(variables))
                            message("The variable names have been converted to lower case to ensure equality between the data and the DDI codebook")
                        } else {
                            admisc::stopError("The .csv file does not match the DDI Codebook")
                        }
                    }
                } else {
                    names(data) <- names(variables)
                }
            }

The risk of this kind of comparison accidentally ruining the dataset labels is very slim, non-existent even.

I guess the line

names(data) <- tolower(names(variables))

could just be

names(data) <- names(variables)

meaning that the variable names in the DDI codebook would take precedence over the csv names.

@dusadrian
Owner

dusadrian commented Sep 11, 2024

The latest commit now accepts an R data.frame as input for the "csv" argument; I think this is a decent addition:

csv <- read.csv("/Users/dusadrian/Downloads/FSD3840/Study/Data/daF3840_fin.csv")
# (use your own delimiters, etc.)
fsd <- convert("/Users/dusadrian/Downloads/FSD3840/Study/Data/daF3840_fin.xml", csv = csv)

As for your other suggestion, to lower the case of all variable names: I think it is not a good idea. Again, the point is not just to compare SPSS upper case variable names with SPSS lower case ones, but also to handle R variable names with mixed lower and upper case.

Something like:

foo <- data.frame(aa = 1:2, aA = 1:2, Aa = 1:2)
#   aa aA Aa
# 1  1  1  1
# 2  2  2  2

To R, these are three different variables, unlike in SPSS and Stata. Lowering the case of all three variable names would create a mess, and there is no guarantee such a situation is impossible in real life. DDI is not only about SPSS, and R is case sensitive...
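Just to spell out the mess:

tolower(names(foo))
# [1] "aa" "aa" "aa"

All three columns would end up with the same name, so matching them against a codebook by name would become ambiguous.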

@pitkant
Author

pitkant commented Sep 12, 2024

Yes, it's true that R makes a distinction between aa, aA and Aa. I don't think it's a good convention to give data.frames such names (and actually, wouldn't a csv file / data.frame with such names be impossible to label with a DDI file?), but since it's possible in R, your point is very much valid.

I think reading the csv file separately and then making sure that the names match is the most elegant solution in this case.

Again, thank you for your work and for accommodating my requests. Actually, I think the work done within the scope of this issue might help solve issues #3 and #4 in the future as well. I might have some feature requests related to combining csv and DDI files, but I'll open a separate issue / PR if the need arises.

@dusadrian
Owner

You're more than welcome; please keep the suggestions coming.
I can do only as much as my imagination allows, but it is the end users' real-life scenarios that should drive the development.
I am now working on a video tutorial on how to document a dataset using R commands, and all of that is the foundation for a proper guide, or why not a more extensive text.
