Suggestions for improving the vignettes #35

reikookamoto · 2023-11-06T22:39:42Z

Adopt conventions when writing variable, function, and package names in addition to file paths
Clarify terminology
- What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata
- tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data
Clarify the order in which a new user should read the vignettes
Proofread for grammatical and spelling errors
Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)
Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)
DT package (used across several files) is in maintenance-only mode so consider using another package for making tables
- https://gt.rstudio.com/
- https://hughjonesd.github.io/huxtable/

Questions that crossed my mind

Can fields be left blank in the sheets?
Are there data type checks that confirm that the data in the sheets are of the correct data type?
- E.g., we wouldn't expect numeric data in the label column of the variables sheet
Under what circumstances does the details sheet get updated after calling rec_with_table()?

The text was updated successfully, but these errors were encountered:

reikookamoto · 2023-11-06T22:58:21Z

how_to_install.Rmd

Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)
Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:

# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)

If we put our package on the R-universe, users can install the development version simply using install.packages()
- https://jeroen.github.io/runiverse2021/#15

reikookamoto · 2023-11-07T14:39:27Z

variables_sheet.Rmd

State the source of the sample data
State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)
- Provide the expected data types for each scenario
Unclear how variables was created in the rendered documentation
- The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported
"Try sorting the subject column by clicking the up..."
- The top 10 rows of the table doesn't show what's expected according to the vignette
- - You'd need to filter(subject == "lab") to see what's expected
Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?
Can we clarify the difference between section and subject?
Can the interactive table take up less space?
The section "Derived Variables" looks incomplete
- Complete it or at least define what a derived variable is and then point users to another vignette
- You can link one vignette to another - https://r-pkgs.org/vignettes.html#links

reikookamoto · 2023-11-07T15:20:23Z

variable_details.Rmd

Explicitly state purpose of vignette (i.e., how to organize a details sheet)
Unclear how variable_details was created in the rendered documentation
- The sheet is referred to as variable_details.csv but there's no indication that a CSV file was imported
"Each row in variable_details gives instructions to recode a single category of a final variable"
- This statement doesn't apply to non-categorical variables
Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?
- It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)
Can the tables take up less space?
- Lots of scrolling is required to read through the vignette
- Can we also exclude the row ID values?
As a novice user, I found it difficult to wrap my head around the information included in Table 1
- It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear don't matter but are the other pieces of information necessary?
variableStart: "If the variable name in a particular dataset is different from the recoded variable name..."
- Does the original variable name need to be put in square brackets?
recStart: consider adding more information on using the square brackets
- Can we only use square brackets? Is a meaninful error thrown if a user tries to include round parentheses in their details sheet?
"the function will not work if there different units between the rows of the same variable..."
- Does the function throw a meaningful error?
"In variableStart, instead of database names being listed, DerivedVar:: is written..."
- I think it should be "variable names" instead of "database names"
DerivedVar::[var1, var2, var3]: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?

How does recodeflow know whether the derived variable function/reference table is available for recoding?

Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?

Explanation provided here, but should be moved to a more obvious location:

recodeflow/vignettes/derived_variables.Rmd

Lines 54 to 59 in 04bcea0

    
           **Load the custom function** into your R environment. Load the customized function by either: 
        
           - entering your functions into the console and running the code, or  
        
           - attaching the functions to your own package using the build and install tool. Then load your package using "library("name of package")" or by using the `rec_with_table` parameter to pass the path to your function R script. 
        
           If you don't load the customized function you cannot create the derived variable.

reikookamoto · 2023-11-07T16:32:50Z

missing_data_(tagged_na).Rmd

Avoid parentheses in file names
"Base R supports only one type of NA..."
- Consider rephrasing this since base R includes different types of NA (e.g., NA_integer_, NA_real_, NA_character_, NA_complex_)
"Summary of tagged_na values and their corresponding category values."
- Where did these category values come from? Is this some sort of domain knowledge?
Is it necessary to include the code chunk when the examples provided are similar to the similar to those in the haven documentation
- Include a link instead - https://haven.tidyverse.org/reference/tagged_na.html

reikookamoto · 2023-11-07T18:59:09Z

rec_with_table.Rmd

I think most of this information should be moved into the function documentation

reikookamoto · 2023-11-07T19:32:23Z

how_to_recode.Rmd

Example 1
- Why is it necessary to pass var_labels = c(sex = "sex")
  - We know var_labels is an optional argument, but what are the circumstances under which you want to use it?
Example 2
- Why is it necessary to pass var_labels = c(sex = "Sex") (and why is it capitalized this time round)
- Why do we need to call set_data_labels() (and what does it mean for label to be lost)?
- Why do we want labelled data?
Example 3
- Why do we call bind_rows() without calling set_data_labels()?
Example 4
- Same questions as example 2
Example 5
- The print outs take up a lot of space; it's hard to see where the second rec_with_table() call begins
- Again, why are calling set_data_labels()?
Example 6
- Why are not using set_data_labels() in this example?
Example 7
- Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?
- Is chol * bili an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?
What other common operations should be included in the vignette?

reikookamoto · 2023-11-08T03:35:10Z

how_to_use_recodeflow_with_your_data.Rmd

This file contains similar information as variable_details.Rmd and variables_sheet.Rmd - there appears to be fewer grammatical errors in this version
This vignette suggests that the recoding process should be specified on a details sheet before organizing the variables sheet; if this is the suggested workflow, this information should be placed in a higher-level overview

DougManuel · 2023-11-09T18:59:07Z

In general,

The suggestions look great. I suggest addressing anything that looks uncontroversial (reducing redundancy, fixing grammar, clarifying...) and isn't a breaking change. You could PR the changes and get feedback on specifics.

Another way of replying is saying there are quite a few comments and suggestions and so it is difficult to respond to all of them in this thread. Let's have separate issue discussions for a few questions.

That said, here is a few specific responses.

We decided to use tidyverse style guide. Let's add that to the 'contributing' section or somewhere in the documentation/repo. We have camelCase for variable and variableDetails, which could be kept as is because it is a distinct file, and we felt changing that would be too much of a breaking change.
I suggest using gt for tables. GT has many similar features and it is maintained by Rstudio.
Can fields be left blank in sheets? Good question. It requires further discussion, I think. My first thought is 'no' for most. 'notes' and some fields seem clearly optional. I am happy to be opinionated and say labels need to be completed, for instance. After all, you can't make a git commit without a subject -- and we are all happy about that.
we need better data type checks just about everywhere, IMO. In the sheets, derived functions, etc.

yulric · 2023-11-14T14:24:23Z

Adopt conventions when writing variable, function, and package names in addition to file paths

Clarify terminology

What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata

tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data

Clarify the order in which a new user should read the vignettes

Proofread for grammatical and spelling errors

Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)

Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)

DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

https://gt.rstudio.com/

https://hughjonesd.github.io/huxtable/

Questions that crossed my mind

Can fields be left blank in the sheets?

Are there data type checks that confirm that the data in the sheets are of the correct data type?

E.g., we wouldn't expect numeric data in the label column of the variables sheet

Under what circumstances does the details sheet get updated after calling rec_with_table()?

For the package name convention, did you mean for the flow style packages?
What convention were you thinking for file paths?
Aren't tester1 and tester2 already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context recodeflow.
For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?
re blank values, we use the convention of encoding missing values with NA instead of blanks.
re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.
re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.

yulric · 2023-11-14T14:31:39Z

how_to_install.Rmd

Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)

Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)
If we put our package on the R-universe, users can install the development version simply using install.packages()

https://jeroen.github.io/runiverse2021/#15

These suggestion sound good. With runiverse, how would users install the development version?

yulric · 2023-11-14T14:40:03Z

variables_sheet.Rmd

State the source of the sample data

State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

Provide the expected data types for each scenario

Unclear how variables was created in the rendered documentation

The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported

"Try sorting the subject column by clicking the up..."

The top 10 rows of the table doesn't show what's expected according to the vignette

You'd need to filter(subject == "lab") to see what's expected

Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?

Can we clarify the difference between section and subject?

Can the interactive table take up less space?

The section "Derived Variables" looks incomplete

Complete it or at least define what a derived variable is and then point users to another vignette

You can link one vignette to another - https://r-pkgs.org/vignettes.html#links

re provide the data types for each scenario, what do you mean by scenario. Do you mean columns?
re where the variables sheet is coming from, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
re input validation, definitely. Ideally we would have a function that validates the sheets.

yulric · 2023-11-14T14:55:49Z

variable_details.Rmd

Explicitly state purpose of vignette (i.e., how to organize a details sheet)

Unclear how variable_details was created in the rendered documentation

The sheet is referred to as variable_details.csv but there's no indication that a CSV file was imported

"Each row in variable_details gives instructions to recode a single category of a final variable"

This statement doesn't apply to non-categorical variables

Can we refrain from using the dollar sign "$" to refer to a variable in the dataset?

It's more likely that users will interact with these sheets via Excel (i.e., using base R syntax may be confusing for novice users of the package)

Can the tables take up less space?

Lots of scrolling is required to read through the vignette

Can we also exclude the row ID values?

As a novice user, I found it difficult to wrap my head around the information included in Table 1

It's helpful to know (1) which columns are optional and (2) that the order in which the columns appear don't matter but are the other pieces of information necessary?

variableStart: "If the variable name in a particular dataset is different from the recoded variable name..."

Does the original variable name need to be put in square brackets?

recStart: consider adding more information on using the square brackets

Can we only use square brackets? Is a meaninful error thrown if a user tries to include round parentheses in their details sheet?

"the function will not work if there different units between the rows of the same variable..."

Does the function throw a meaningful error?

"In variableStart, instead of database names being listed, DerivedVar:: is written..."

I think it should be "variable names" instead of "database names"

DerivedVar::[var1, var2, var3]: Does the order in which you put the variables in the square brackets matter? Is there a limit to the number of parameters you can pass?

How does recodeflow know whether the derived variable function/reference table is available for recoding?

Does the R script containing the function/reference table definitions need to sit in a specific location on the machine?

Explanation provided here, but should be moved to a more obvious location:

recodeflow/vignettes/derived_variables.Rmd

Lines 54 to 59 in 04bcea0

**Load the custom function** into your R environment. Load the customized function by either:

- entering your functions into the console and running the code, or

- attaching the functions to your own package using the build and install tool. Then load your package using "library("name of package")" or by using the `rec_with_table` parameter to pass the path to your function R script.

If you don't load the customized function you cannot create the derived variable.

re how the sheet was created, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.
re Does the original variable name need to be put in square brackets, only the default start variable needs to be put in square brackets. For example, if the databaseStart column for a row is db1; db2; db3; db4 and the variableStart column is db1::var_1; db2::var_3; [var_3]. This is effectively saying, for db1 the start variable is var_1, for db2 the start variable is var_2, for all other databases (db3, db4) use var_3.
re recStart, there are two kinds of information that can be put in here,
1. Singular numbers: For example 1.
2. Number ranges: For example 1 - 4. This uses the interval notation from mathematics. For example, to represent a range from 1 to 4 that does not include 1, the value for recStart would be (1,4]
re units error, I'm pretty sure it does not
re order of variables in DerivedVar, yes the order should match up with the parameters function that is used to recode the derived variable. No limit on the number of parameters.

yulric · 2023-11-14T14:59:44Z

missing_data_(tagged_na).Rmd

Avoid parentheses in file names

"Base R supports only one type of NA..."

Consider rephrasing this since base R includes different types of NA (e.g., NA_integer_, NA_real_, NA_character_, NA_complex_)

"Summary of tagged_na values and their corresponding category values."

Where did these category values come from? Is this some sort of domain knowledge?

Is it necessary to include the code chunk when the examples provided are similar to the similar to those in the haven documentation

Include a link instead - https://haven.tidyverse.org/reference/tagged_na.html

re base R supports only one type of NA, I believe this is referring to the fact that you can't have different NAs for different missing reasons. For example, an NA if the user did not respond to the question, an NA type if the question was skipped for a user etc.
These come from the CCHS dataset which is what this package was built on. But are probably not applicable to other databases or we may need to expand the list.

yulric · 2023-11-14T18:46:10Z

how_to_recode.Rmd

Example 1

Why is it necessary to pass var_labels = c(sex = "sex")

We know var_labels is an optional argument, but what are the circumstances under which you want to use it?

Example 2

Why is it necessary to pass var_labels = c(sex = "Sex") (and why is it capitalized this time round)

Why do we need to call set_data_labels() (and what does it mean for label to be lost)?

Why do we want labelled data?

Example 3

Why do we call bind_rows() without calling set_data_labels()?

Example 4

Same questions as example 2

Example 5

The print outs take up a lot of space; it's hard to see where the second rec_with_table() call begins

Again, why are calling set_data_labels()?

Example 6

Why are not using set_data_labels() in this example?

Example 7

Derived variables are discussed across multiple vignettes; is it better for these pieces of information to be consolidated in a single file?

Is chol * bili an actual measurement? Otherwise, could we provide a different example that deals with a measurement used in practice?

What other common operations should be included in the vignette?

re Example 1, when the user does not use a variables sheet but passes in the variables they want to recode as a vector of a strings, they can set the labels for the variables using this parameter.
re Example 2, see reason in example 1 above.
- The capitalization is probably a mistake.
- re loosing labels, I believe this was an issue where the labels were being removed from the data when they were passed through a dplyr function. Hence this function.
re Example 3, Probably a mistake
re Example 4, see example 1 above.
re Example 5, see example 1 above.
re Example 6, probably a mistake
re Examle 7, I agree about a single file for Derived variables. I don't think that's a common measurement so good to change.

reikookamoto · 2023-11-15T17:06:58Z

Adopt conventions when writing variable, function, and package names in addition to file paths

Clarify terminology

What is the difference between a variable name and a label, and how does this relate to other programming languages like Stata

tester1 and tester2 should be referred to as datasets rather than databases; the latter usually refers to an organized collection of data

Clarify the order in which a new user should read the vignettes

Proofread for grammatical and spelling errors

Reduce redundancy in writing (i.e., how_to_use_recodeflow_with_your_data.Rmd contains the same information as variable_details.Rmd and variables_sheet.Rmd)

Make sure each vignette shows a particular workflow from start to finish (e.g., how to organize a variables sheet by walking through a sample, how to fill in a details sheet, how to call the main function to address common tasks)

DT package (used across several files) is in maintenance-only mode so consider using another package for making tables

https://gt.rstudio.com/

https://hughjonesd.github.io/huxtable/

Questions that crossed my mind

Can fields be left blank in the sheets?

Are there data type checks that confirm that the data in the sheets are of the correct data type?

E.g., we wouldn't expect numeric data in the label column of the variables sheet

Under what circumstances does the details sheet get updated after calling rec_with_table()?

For the package name convention, did you mean for the flow style packages?

What convention were you thinking for file paths?

Aren't tester1 and tester2 already being referred to as datasets? At least that's what I'm seeing in the how_to_recode vignette on master. In general, I believe we use dataset and database interchangeably within the context recodeflow.

For the redundancy example, did you mean the text at the bottom of the how_to_use_recodeflow vignette with the columns for the variables sheet?

re blank values, we use the convention of encoding missing values with NA instead of blanks.

re data type checks, we do not have any kind of validation for the data in the sheets but we should definitely have some.

re updating the details sheet during a recode call, ideally never. The sheets should be read-only during a recode process.

Package naming convention: I meant being consistent with writing dplyr versus dplyr
File paths: I want to adopt the here package conventions for file referencing
Redundancy: Yup, some of the information appears in multiple places
Blank values: I meant what happens if a user passes in a variables sheet with variable filled in but label, labelLong, etc. are not provided for a given row?

reikookamoto · 2023-11-15T17:10:59Z

how_to_install.Rmd

Installation instructions should be in README.md instead of a vignette (i.e., this vignette could be removed)

Be aware that anyone who installs directly from GitHub will need to explicitly request vignettes:
# install.packages("devtools")
devtools::install_github("Big-Life-Lab/recodeflow", dependencies = TRUE, build_vignettes = TRUE)
If we put our package on the R-universe, users can install the development version simply using install.packages()

https://jeroen.github.io/runiverse2021/#15
These suggestion sound good. With runiverse, how would users install the development version?

I think it's possible to get R-universe to track and build the development branch of a particular GitHub repo: https://ropensci.org/blog/2021/06/22/setup-runiverse/#pro-tip-tracking-custom-branches-or-releases

reikookamoto · 2023-11-15T19:18:58Z

variables_sheet.Rmd

State the source of the sample data

State that users are expected to write their own variables sheet or use one that's been published as part of another package (e.g., elderflow)

Provide the expected data types for each scenario

Unclear how variables was created in the rendered documentation

The variable sheet is referred to as variables.csv but there's no indication that a CSV file was imported

"Try sorting the subject column by clicking the up..."

The top 10 rows of the table doesn't show what's expected according to the vignette

You'd need to filter(subject == "lab") to see what's expected

Would it be helpful to specify the expected data type of each column in the variables sheet (or implement some kind of input validation)?

Can we clarify the difference between section and subject?

Can the interactive table take up less space?

The section "Derived Variables" looks incomplete

Complete it or at least define what a derived variable is and then point users to another vignette

You can link one vignette to another - https://r-pkgs.org/vignettes.html#links

re provide the data types for each scenario, what do you mean by scenario. Do you mean columns?

re where the variables sheet is coming from, its coming from sourcing the test-data-generator.R file which is done in this line. I'm not sure why this was done but we should move to CSV files.

re input validation, definitely. Ideally we would have a function that validates the sheets.

Each scenario: I meant something along the lines of (1) if someone wants to use their own variables sheet, they should import a CSV file into R using a function like readr::read_csv() or (2) if someone wants to use a variables sheet that's part of a package, they should first install and then load that package. But ignore what I said earlier since the variables sheet is a data.frame in both scenarios.
Instead of sourcing test-data-generator.R in the vignette, we could include the cleaned data as part of the package: https://r-pkgs.org/data.html#sec-data-data-raw

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions for improving the vignettes #35

Suggestions for improving the vignettes #35

reikookamoto commented Nov 6, 2023 •

edited

Loading

reikookamoto commented Nov 6, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 8, 2023 •

edited

Loading

DougManuel commented Nov 9, 2023

yulric commented Nov 14, 2023

Questions that crossed my mind

yulric commented Nov 14, 2023

how_to_install.Rmd

yulric commented Nov 14, 2023

variables_sheet.Rmd

yulric commented Nov 14, 2023

variable_details.Rmd

yulric commented Nov 14, 2023 •

edited

Loading

missing_data_(tagged_na).Rmd

yulric commented Nov 14, 2023

how_to_recode.Rmd

reikookamoto commented Nov 15, 2023

Questions that crossed my mind

reikookamoto commented Nov 15, 2023

how_to_install.Rmd

reikookamoto commented Nov 15, 2023

variables_sheet.Rmd

Suggestions for improving the vignettes #35

Suggestions for improving the vignettes #35

Comments

reikookamoto commented Nov 6, 2023 • edited Loading

Questions that crossed my mind

reikookamoto commented Nov 6, 2023 • edited Loading

reikookamoto commented Nov 7, 2023 • edited Loading

reikookamoto commented Nov 7, 2023 • edited Loading

reikookamoto commented Nov 7, 2023 • edited Loading

reikookamoto commented Nov 7, 2023

reikookamoto commented Nov 7, 2023 • edited Loading

reikookamoto commented Nov 8, 2023 • edited Loading

DougManuel commented Nov 9, 2023

yulric commented Nov 14, 2023

Questions that crossed my mind

yulric commented Nov 14, 2023

yulric commented Nov 14, 2023

yulric commented Nov 14, 2023

yulric commented Nov 14, 2023 • edited Loading

yulric commented Nov 14, 2023

reikookamoto commented Nov 15, 2023

Questions that crossed my mind

reikookamoto commented Nov 15, 2023

reikookamoto commented Nov 15, 2023

reikookamoto commented Nov 6, 2023 •

edited

Loading

reikookamoto commented Nov 6, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 7, 2023 •

edited

Loading

reikookamoto commented Nov 8, 2023 •

edited

Loading

yulric commented Nov 14, 2023 •

edited

Loading