Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions for improving the vignettes #35

Open
reikookamoto opened this issue Nov 6, 2023 · 17 comments
Open

Suggestions for improving the vignettes #35

reikookamoto opened this issue Nov 6, 2023 · 17 comments

Comments

@reikookamoto
Copy link
Collaborator

reikookamoto commented Nov 6, 2023

  • Adopt conventions when writing variable, function, and package names in addition to file paths
  • Clarify terminology
    • What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes
  • Proofread for grammatical and spelling errors
  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?
  • Are there data type checks that confirm that the data in the sheets are of the correct data type?
    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?
@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 6, 2023

how_to_install.Rmd

  • Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
  • Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 7, 2023

variables_sheet.Rmd

  • State the source of the sample data
  • State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)
    • Provide the expected data types for each scenario
  • Unclear how variables was created in the rendered documentation
    • The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
  • "Try sorting the subject column by clicking the up..."
    • The top 10 rows of the table doesn't show what's expected according to the vignette
      • You'd need to filter(subject == "lab") to see what's expected
  • Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
  • Can we clarify the difference between section and subject?
  • Can the interactive table take up less space?
  • The section "Derived Variables" looks incomplete

@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 7, 2023

variable_details.Rmd

  • Explicitly state purpose of vignette (i.e., how to organize a details sheet)
  • Unclear how variable_details was created in the rendered documentation
    • The sheet is referred to as variable_details.csv but there's no indication that a CSV file was imported
  • "Each row in variable_details gives instructions to recode a single category of a final variable"
    • This statement doesn't apply to non-categorical variables
  • Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?
    • It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
  • Can the tables take up less space?
    • Lots of scrolling is required to read through the vignette
    • Can we also exclude the row ID values?
  • As a novice user, I found it difficult to wrap my head around the information included in Table 1
    • It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear don't matter but are the other pieces of information necessary?
  • variableStart: "If the variable name in a particular dataset is different from the recoded variable name..."
    • Does the original variable name need to be put in square brackets?
  • recStart: consider adding more information on using the square brackets
    • Can we only use square brackets? Is a meaninful error thrown if a user tries to include round parentheses in their details sheet?
  • "the function will not work if there different units between the rows of the same variable..."
    • Does the function throw a meaningful error?
  • "In variableStart, instead of database names being listed, DerivedVar:: is written..."
    • I think it should be "variable names" instead of "database names"
  • DerivedVar::[var1, var2, var3]: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?
  • How does recodeflow know whether the derived variable function/reference table is available for recoding?
    • Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?
      • Explanation provided here, but should be moved to a more obvious location:
        **Load the custom function** into your R environment. Load the customized function by either:
        - entering your functions into the console and running the code, or
        - attaching the functions to your own package using the build and install tool. Then load your package using "library("name of package")" or by using the `rec_with_table` parameter to pass the path to your function R script.
        If you don't load the customized function you cannot create the derived variable.

@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 7, 2023

missing_data_(tagged_na).Rmd

  • Avoid parentheses in file names
  • "Base R supports only one type of NA..."
    • Consider rephrasing this since base R includes different types of NA (e.g., NA_integer_, NA_real_, NA_character_, NA_complex_)
  • "Summary of tagged_na values and their corresponding category values."
    • Where did these category values come from? Is this some sort of domain knowledge?
  • Is it necessary to include the code chunk when the examples provided are similar to the similar to those in the haven documentation

@reikookamoto
Copy link
Collaborator Author

rec_with_table.Rmd

  • I think most of this information should be moved into the function documentation

@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 7, 2023

how_to_recode.Rmd

  • Example 1
    • Why is it necessary to pass var_labels = c(sex = "sex")
      • We know var_labels is an optional argument, but what are the circumstances under which you want to use it?
  • Example 2
    • Why is it necessary to pass var_labels = c(sex = "Sex") (and why is it capitalized this time round)
    • Why do we need to call set_data_labels() (and what does it mean for label to be lost)?
    • Why do we want labelled data?
  • Example 3
    • Why do we call bind_rows() without calling set_data_labels()?
  • Example 4
    • Same questions as example 2
  • Example 5
    • The print outs take up a lot of space; it's hard to see where the second rec_with_table() call begins
    • Again, why are calling set_data_labels()?
  • Example 6
    • Why are not using set_data_labels() in this example?
  • Example 7
    • Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
    • Is chol * bili an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
  • What other common operations should be included in the vignette?

@reikookamoto
Copy link
Collaborator Author

reikookamoto commented Nov 8, 2023

how_to_use_recodeflow_with_your_data.Rmd

  • This file contains similar information as variable_details.Rmd and variables_sheet.Rmd - there appears to be fewer grammatical errors in this version
  • This vignette suggests that the recoding process should be specified on a details sheet before organizing the variables sheet; if this is the suggested workflow, this information should be placed in a higher-level overview

@DougManuel
Copy link
Contributor

In general,

The suggestions look great. I suggest addressing anything that looks uncontroversial (reducing redundancy, fixing grammar, clarifying...) and isn't a breaking change. You could PR the changes and get feedback on specifics.

Another way of replying is saying there are quite a few comments and suggestions and so it is difficult to respond to all of them in this thread. Let's have separate issue discussions for a few questions.

That said, here is a few specific responses.

  • We decided to use tidyverse style guide. Let's add that to the 'contributing' section or somewhere in the documentation/repo. We have camelCase for variable and variableDetails, which could be kept as is because it is a distinct file, and we felt changing that would be too much of a breaking change.
  • I suggest using gt for tables. GT has many similar features and it is maintained by Rstudio.
  • Can fields be left blank in sheets? Good question. It requires further discussion, I think. My first thought is 'no' for most. 'notes' and some fields seem clearly optional. I am happy to be opinionated and say labels need to be completed, for instance. After all, you can't make a git commit without a subject -- and we are all happy about that.
  • we need better data type checks just about everywhere, IMO. In the sheets, derived functions, etc.

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

  • Adopt conventions when writing variable, function, and package names in addition to file paths

  • Clarify terminology

    • What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes

  • Proofread for grammatical and spelling errors

  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)

  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)

  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?

  • Are there data type checks that confirm that the data in the sheets are of the correct data type?

    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?

  • For the package name convention, did you mean for the flow style packages?
  • What convention were you thinking for file paths?
  • Aren't tester1 and tester2 already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context recodeflow.
  • For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
  • re blank values, we use the convention of encoding missing values with NA instead of blanks.
  • re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
  • re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

how_to_install.Rmd

  • Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
  • Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

These suggestion sound good. With runiverse, how would users install the development version?

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

variables_sheet.Rmd

  • State the source of the sample data

  • State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

    • Provide the expected data types for each scenario
  • Unclear how variables was created in the rendered documentation

    • The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
  • "Try sorting the subject column by clicking the up..."

    • The top 10 rows of the table doesn't show what's expected according to the vignette
      • You'd need to filter(subject == "lab") to see what's expected
  • Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?

  • Can we clarify the difference between section and subject?

  • Can the interactive table take up less space?

  • The section "Derived Variables" looks incomplete

  • re provide the data types for each scenario, what do you mean by scenario. Do you mean columns?
  • re where the variables sheet is coming from, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
  • re input validation, definitely. Ideally we would have a function that validates the sheets.

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

variable_details.Rmd

  • Explicitly state purpose of vignette (i.e., how to organize a details sheet)

  • Unclear how variable_details was created in the rendered documentation

    • The sheet is referred to as variable_details.csv but there's no indication that a CSV file was imported
  • "Each row in variable_details gives instructions to recode a single category of a final variable"

    • This statement doesn't apply to non-categorical variables
  • Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?

    • It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
  • Can the tables take up less space?

    • Lots of scrolling is required to read through the vignette
    • Can we also exclude the row ID values?
  • As a novice user, I found it difficult to wrap my head around the information included in Table 1

    • It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear don't matter but are the other pieces of information necessary?
  • variableStart: "If the variable name in a particular dataset is different from the recoded variable name..."

    • Does the original variable name need to be put in square brackets?
  • recStart: consider adding more information on using the square brackets

    • Can we only use square brackets? Is a meaninful error thrown if a user tries to include round parentheses in their details sheet?
  • "the function will not work if there different units between the rows of the same variable..."

    • Does the function throw a meaningful error?
  • "In variableStart, instead of database names being listed, DerivedVar:: is written..."

    • I think it should be "variable names" instead of "database names"
  • DerivedVar::[var1, var2, var3]: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?

  • How does recodeflow know whether the derived variable function/reference table is available for recoding?

    • Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?

      • Explanation provided here, but should be moved to a more obvious location:
        **Load the custom function** into your R environment. Load the customized function by either:
        - entering your functions into the console and running the code, or
        - attaching the functions to your own package using the build and install tool. Then load your package using "library("name of package")" or by using the `rec_with_table` parameter to pass the path to your function R script.
        If you don't load the customized function you cannot create the derived variable.
  • re how the sheet was created, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
  • re Does the original variable name need to be put in square brackets, only the default start variable needs to be put in square brackets. For example, if the databaseStart column for a row is db1; db2; db3; db4 and the variableStart column is db1::var_1; db2::var_3; [var_3]. This is effectively saying, for db1 the start variable is var_1, for db2 the start variable is var_2, for all other databases (db3, db4) use var_3.
  • re recStart, there are two kinds of information that can be put in here,
    1. Singular numbers: For example 1.
    2. Number ranges: For example 1 - 4. This uses the interval notation from mathematics. For example, to represent a range from 1 to 4 that does not include 1, the value for recStart would be (1,4]
  • re units error, I'm pretty sure it does not
  • re order of variables in DerivedVar, yes the order should match up with the parameters function that is used to recode the derived variable. No limit on the number of parameters.

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

missing_data_(tagged_na).Rmd

  • Avoid parentheses in file names

  • "Base R supports only one type of NA..."

    • Consider rephrasing this since base R includes different types of NA (e.g., NA_integer_, NA_real_, NA_character_, NA_complex_)
  • "Summary of tagged_na values and their corresponding category values."

    • Where did these category values come from? Is this some sort of domain knowledge?
  • Is it necessary to include the code chunk when the examples provided are similar to the similar to those in the haven documentation

  • re base R supports only one type of NA, I believe this is referring to the fact that you can't have different NAs for different missing reasons. For example, an NA if the user did not respond to the question, an NA type if the question was skipped for a user etc.
  • These come from the CCHS dataset which is what this package was built on. But are probably not applicable to other databases or we may need to expand the list.

@yulric
Copy link
Collaborator

yulric commented Nov 14, 2023

how_to_recode.Rmd

  • Example 1

    • Why is it necessary to pass var_labels = c(sex = "sex")

      • We know var_labels is an optional argument, but what are the circumstances under which you want to use it?
  • Example 2

    • Why is it necessary to pass var_labels = c(sex = "Sex") (and why is it capitalized this time round)
    • Why do we need to call set_data_labels() (and what does it mean for label to be lost)?
    • Why do we want labelled data?
  • Example 3

    • Why do we call bind_rows() without calling set_data_labels()?
  • Example 4

    • Same questions as example 2
  • Example 5

    • The print outs take up a lot of space; it's hard to see where the second rec_with_table() call begins
    • Again, why are calling set_data_labels()?
  • Example 6

    • Why are not using set_data_labels() in this example?
  • Example 7

    • Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
    • Is chol * bili an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
  • What other common operations should be included in the vignette?

  • re Example 1, when the user does not use a variables sheet but passes in the variables they want to recode as a vector of a strings, they can set the labels for the variables using this parameter.
  • re Example 2, see reason in example 1 above.
    • The capitalization is probably a mistake.
    • re loosing labels, I believe this was an issue where the labels were being removed from the data when they were passed through a dplyr function. Hence this function.
  • re Example 3, Probably a mistake
  • re Example 4, see example 1 above.
  • re Example 5, see example 1 above.
  • re Example 6, probably a mistake
  • re Examle 7, I agree about a single file for Derived variables. I don't think that's a common measurement so good to change.

@reikookamoto
Copy link
Collaborator Author

  • Adopt conventions when writing variable, function, and package names in addition to file paths

  • Clarify terminology

    • What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
    • tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
  • Clarify the order in which a new user should read the vignettes

  • Proofread for grammatical and spelling errors

  • Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)

  • Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)

  • DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

Questions that crossed my mind

  • Can fields be left blank in the sheets?

  • Are there data type checks that confirm that the data in the sheets are of the correct data type?

    • E.g., we wouldn't expect numeric data in the label column of the variables sheet
  • Under what circumstances does the details sheet get updated after calling rec_with_table()?

  • For the package name convention, did you mean for the flow style packages?
  • What convention were you thinking for file paths?
  • Aren't tester1 and tester2 already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context recodeflow.
  • For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
  • re blank values, we use the convention of encoding missing values with NA instead of blanks.
  • re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
  • re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.
  • Package naming convention: I meant being consistent with writing dplyr versus dplyr
  • File paths: I want to adopt the here package conventions for file referencing
  • Redundancy: Yup, some of the information appears in multiple places
  • Blank values: I meant what happens if a user passes in a variables sheet with variable filled in but label, labelLong, etc. are not provided for a given row?

@reikookamoto
Copy link
Collaborator Author

how_to_install.Rmd

  • Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
  • Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

These suggestion sound good. With runiverse, how would users install the development version?

I think it's possible to get R-universe to track and build the development branch of a particular GitHub repo: https://ropensci.org/blog/2021/06/22/setup-runiverse/#pro-tip-tracking-custom-branches-or-releases

@reikookamoto
Copy link
Collaborator Author

variables_sheet.Rmd

  • State the source of the sample data

  • State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

    • Provide the expected data types for each scenario
  • Unclear how variables was created in the rendered documentation

    • The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
  • "Try sorting the subject column by clicking the up..."

    • The top 10 rows of the table doesn't show what's expected according to the vignette
      • You'd need to filter(subject == "lab") to see what's expected
  • Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?

  • Can we clarify the difference between section and subject?

  • Can the interactive table take up less space?

  • The section "Derived Variables" looks incomplete

  • re provide the data types for each scenario, what do you mean by scenario. Do you mean columns?
  • re where the variables sheet is coming from, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
  • re input validation, definitely. Ideally we would have a function that validates the sheets.
  • Each scenario: I meant something along the lines of (1) if someone wants to use their own variables sheet, they should import a CSV file into R using a function like readr::read_csv() or (2) if someone wants to use a variables sheet that's part of a package, they should first install and then load that package. But ignore what I said earlier since the variables sheet is a data.frame in both scenarios.
  • Instead of sourcing test-data-generator.R in the vignette, we could include the cleaned data as part of the package: https://r-pkgs.org/data.html#sec-data-data-raw

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants