Review #126

Karim-Mane · 2024-04-02T14:30:17Z

This is a PR for a full package review of {cleanepi}

github-actions · 2024-04-02T14:32:06Z

This pull request:

Adds 77 new dependencies (direct and indirect)
Adds 13 new system dependencies
Removes 6 existing dependencies (direct and indirect)
Removes 3 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

Bisaloo

Thanks for your work on this. I like the reporting functionality, and I think it can provide a unique value compared to other data cleaning packages.

I've left quite a lot of comments. Please focus on the ones regarding the user interface before the CRAN release.

Two general comments applicable throughout the codebase:

- When possible, please try to address the edge cases as part of the general cases. Otherwise, if every edge case is special-cased, the code becomes very long and difficult to follow.
- In general, user input should be properly formatted as standard R objects. We don't want to have to clean and parse user input on top of already messy data.
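To make these two points concrete, here is a minimal base R sketch. The helper `select_targets()` and its arguments are hypothetical and not part of {cleanepi}. It assumes that column names are passed as a character vector rather than a comma-separated string, and that "all columns" is expressed through the default value instead of a special-cased `NULL` branch.

```r
# Hypothetical helper, for illustration only (not part of {cleanepi}).
select_targets <- function(data, target_columns = names(data)) {
  # Expect standard R objects: a data frame and a character vector of names
  stopifnot(is.data.frame(data), is.character(target_columns))

  # A comma-separated string such as "a, b" is not parsed; it simply fails
  # here because no column carries that name
  unknown <- setdiff(target_columns, names(data))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", toString(unknown))
  }

  # The "all columns" case needs no extra branch: it is just the default
  data[, target_columns, drop = FALSE]
}

linelist <- data.frame(id = 1:3, sex = c("f", "m", "f"))
select_targets(linelist)          # all columns via the default
select_targets(linelist, "sex")   # explicit character vector
```

With this shape, the caller is responsible for providing clean input, and the function body stays short because the default covers what would otherwise be an edge case.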

DESCRIPTION

R/check_date_sequence.R

R/clean_data.R

R/find_and_remove_duplicates.R

Bisaloo · 2024-04-08T15:34:57Z

R/span.R

+   # end_date can be a column of the input data or
+   # a vector of Date values with the same length as number of row in data or
+   # a Date value
+   if (is.character(end_date) && end_date %in% colnames(data)) {
+     span_result <- abs(unclass(data[[target_column]]) -
+                          unclass(data[[end_date]]))
+   } else {
+     span_result <- abs(unclass(data[[target_column]]) - unclass(end_date))
+   }
+   units <- c(365.25, 30.0, 7.0, 1.0)
+   names(units) <- c("years", "months", "weeks", "days")
+   if (!is.null(span_remainder_unit)) {
+     data[, span_column_name] <- floor(span_result / units[span_unit])
+     data[, sprintf("remainder_%s", span_remainder_unit)] <- round(
+       (span_result %% units[span_unit]) / units[span_remainder_unit],
+       digits = 2L)
+   } else {
+     data[, span_column_name] <- round(span_result / units[span_unit],
+                                       digits = 2L)
+   }


I believe lubridate or base R have good functionality to deal with date differences. Any reasons to not use these?

Karim-Mane · 2024-04-09

I was making use of lubridate functions, which @pratikunterwegs polished further.

At some point we did a proof of concept on reducing dependencies by implementing functions with base R only. This function happened to be the test case, and we managed to achieve the same results as with lubridate. So we decided to keep this version.

chartgerink · 2024-04-23

If we are including lubridate anyway (as it currently stands), keeping the custom function because it works does not seem like a convincing reason to include it.

A custom function also increases maintenance complexity. This custom function is not as well validated and tested as lubridate, which risks us having to deal with bugs and edge cases down the line.

I would recommend tracking this further in an issue, so that we don't lose track of it and a decision on this does not block the package review process. As a nice side benefit: using lubridate directly could also resolve #134 😊

Karim-Mane · 2024-05-07

I have restored the usage of {lubridate} functionalities in this function and changed the function name from span() to timespan(). See commit 1a448d1.
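For reference, the base R route mentioned at the start of this thread could look roughly like the sketch below. `date_span()` is a hypothetical helper, not the function from this PR or from commit 1a448d1. It only covers the units that `difftime()` supports directly (days and weeks here); months and years need calendar-aware arithmetic, which is one argument for relying on {lubridate} instead of fixed divisors such as 30 or 365.25.

```r
# Hypothetical helper, for illustration only: absolute span between two
# Date vectors using base R's difftime().
date_span <- function(start, end, unit = c("days", "weeks")) {
  unit <- match.arg(unit)
  stopifnot(inherits(start, "Date"), inherits(end, "Date"))
  # difftime() keeps the unit explicit; `end` is recycled if it has length 1
  abs(as.numeric(difftime(end, start, units = unit)))
}

onset     <- as.Date(c("2024-01-01", "2024-02-15"))
reference <- as.Date("2024-03-01")

date_span(onset, reference, "days")   # 60 and 15 days (2024 is a leap year)
date_span(onset, reference, "weeks")  # the same spans divided by 7
```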

R/standardize_subject_ids.R

R/convert_to_numeric.R

chartgerink

Thanks @Karim-Mane - the most critical issue has been resolved from my end (creating and deleting tmp folders). The dependencies are something that can (and in my opinion, should) be revisited at a later time.

DESCRIPTION

github-actions · 2024-04-22T09:48:49Z

This pull request:

Adds 63 new dependencies (direct and indirect)
Adds 13 new system dependencies
Removes 6 existing dependencies (direct and indirect)
Removes 3 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

github-actions · 2024-05-06T16:57:32Z

This pull request:

Adds 62 new dependencies (direct and indirect)
Adds 13 new system dependencies
Removes 6 existing dependencies (direct and indirect)
Removes 3 existing system dependencies

(Note that results may be inaccurate if you branched from an outdated version of the target branch.)

Karim-Mane · 2024-05-13T09:29:18Z

@Bisaloo, @chartgerink - I have made some changes to account for your latest reviews. Kindly have a look and let me know your thoughts. I am aiming to close the PR by Wednesday at the latest.

Bisaloo · 2024-05-21T09:47:19Z

> @Bisaloo, @chartgerink - I have made some changes to account for your latest reviews. Kindly have a look and let me know your thoughts. I am aiming to close the PR by Wednesday at the latest.

Restarting conversation thread as it's getting lost in the noise otherwise:

> @chartgerink - I have provided two examples on the usage of the remove argument in the "find and remove duplicates" section of the package vignette.

I still don't follow. The example in the vignette reads:

no_dups <- remove_duplicates(
  data           = readRDS(system.file("extdata", "test_linelist.RDS",
                                       package = "cleanepi")),
  target_columns = "linelist_tags",
  remove         = -c(33, 55)
)

But if I already know that rows 33 and 55 are to be removed, why would I use remove_duplicates() rather than slice() or any other existing method directly?

As far as I can tell, this has not been addressed: https://github.com/epiverse-trace/cleanepi/pull/126/files#r1575871964

Karim-Mane · 2024-05-21T11:32:53Z

Thanks @Bisaloo for pointing these out.

On the first point: I must admit that, for simplicity, one could use:
- dplyr::slice() or,
- dat[-c(idx1, idx2, ...), ]. This syntax is used in remove_duplicates().

Without this argument, remove_duplicates() will always keep the first instance of a duplicated group of rows and delete the others.
Hence the most important function of this module will be find_duplicates(). Users can choose whichever method they find convenient for deleting the unneeded rows. Does this make sense?

On the second point: pasting here what I DMed @chartgerink on Slack (with a bit more detail).

After looking into the date_guess() function, the following are my findings and recommendation:

- The function is useful for dealing with date columns where values can be associated with multiple formats.
- It is mainly based on lubridate::parse_date_time(), which tries each of the proposed formats on the date values.
- When more than one format suits some values, the function picks the first matching format from the orders list.
- The rest of the code handles values that {lubridate} could not parse:
  * convert numbers to dates depending on the specified version of Excel,
  * check if character values comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".

  lubridate::parse_date_time() would have converted such values if any of the formats specified in the orders argument corresponded to them. This step is only relevant if the actual format of a date value is not among the formats proposed in the orders argument, in which case it maximises the chance of converting all values in a column.

I suggest keeping the function and continuing to use it. We could delete it and write a new function, but the code in that new function would not be very different (if at all) from what is there currently.

I have added some code to output rows where the date values comply with multiple formats. This information is returned as a data frame and will be shown in the report to help the user decide whether to confirm or amend the result.
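The "first format wins, but report ambiguous values" behaviour described above can be sketched in base R as follows. This is not the date_guess() implementation: `guess_dates()` and its output columns are hypothetical, it uses `as.Date()` rather than lubridate::parse_date_time(), and it leaves out the Excel number conversion. The formats are chosen so that each one consumes the whole string, since `as.Date()` ignores trailing characters it does not need.

```r
# Hypothetical sketch, for illustration only (not the date_guess() code).
guess_dates <- function(x, formats) {
  out     <- rep(as.Date(NA), length(x))     # parsed dates
  used    <- rep(NA_character_, length(x))   # format that was applied
  n_match <- integer(length(x))              # how many formats parsed the value

  for (fmt in formats) {
    parsed  <- as.Date(x, format = fmt)      # NA where the format does not fit
    hit     <- !is.na(parsed)
    n_match <- n_match + hit
    # Only fill values that no earlier format has parsed: first format wins
    fill       <- hit & is.na(out)
    out[fill]  <- parsed[fill]
    used[fill] <- fmt
  }

  data.frame(
    value     = x,
    date      = out,
    format    = used,
    ambiguous = n_match > 1,                 # candidates for the report
    stringsAsFactors = FALSE
  )
}

res <- guess_dates(
  c("01/02/2024", "13/02/2024", "02/13/2024", "not a date"),
  formats = c("%d/%m/%Y", "%m/%d/%Y")
)
res[res$ambiguous, ]  # only "01/02/2024" fits both formats
```

Returning the `ambiguous` flag alongside the parsed value keeps the decision with the user: the first format is applied, and the rows worth double-checking are easy to list.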

Bisaloo · 2024-05-22T13:19:08Z

> On the first point: I must admit that, for simplicity, one could use:
>
> - dplyr::slice() or,
> - dat[-c(idx1, idx2, ...), ]. This syntax is used in remove_duplicates().
>
> Without this argument, remove_duplicates() will always keep the first instance of a duplicated group of rows and delete the others.
> Hence the most important function of this module will be find_duplicates(). Users can choose whichever method they find convenient for deleting the unneeded rows. Does this make sense?

No, it still doesn't really make sense to me. I understand and agree that find_duplicates() & remove_duplicates() provide useful functionality. But the remove argument mixes several distinct tasks in this function.
If users provide indices to be removed, it's no longer a remove_duplicates() function, it's a remove() (or actually a slice()) function.
This argument and its unrelated functionality make it much more difficult to understand what this function's purpose is, and I believe it should be removed.

> After looking into the date_guess() function, the following are my findings and recommendation:
>
> - The function is useful for dealing with date columns where values can be associated with multiple formats.
> - It is mainly based on lubridate::parse_date_time(), which tries each of the proposed formats on the date values.
> - When more than one format suits some values, the function picks the first matching format from the orders list.
> - The rest of the code handles values that {lubridate} could not parse:
>   * convert numbers to dates depending on the specified version of Excel,
>   * check if character values comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".
>
>   lubridate::parse_date_time() would have converted such values if any of the formats specified in the orders argument corresponded to them. This step is only relevant if the actual format of a date value is not among the formats proposed in the orders argument, in which case it maximises the chance of converting all values in a column.
>
> I suggest keeping the function and continuing to use it. We could delete it and write a new function, but the code in that new function would not be very different (if at all) from what is there currently.

There are many other packages with solid functionality for just this (e.g. the anytime R package). In order to not get stuck on this discussion and delay the release, let's please:

* open a dedicated issue
* keep date_guess() for now
* reconsider for the next release

Karim-Mane · 2024-05-22T13:52:19Z

> > On the first point: I must admit that, for simplicity, one could use:
> >
> > - dplyr::slice() or,
> > - dat[-c(idx1, idx2, ...), ]. This syntax is used in remove_duplicates().
> >
> > Without this argument, remove_duplicates() will always keep the first instance of a duplicated group of rows and delete the others.
> > Hence the most important function of this module will be find_duplicates(). Users can choose whichever method they find convenient for deleting the unneeded rows. Does this make sense?
>
> No, it still doesn't really make sense to me. I understand and agree that find_duplicates() & remove_duplicates() provide useful functionality. But the remove argument mixes several distinct tasks in this function. If users provide indices to be removed, it's no longer a remove_duplicates() function, it's a remove() (or actually a slice()) function. This argument and its unrelated functionality make it much more difficult to understand what this function's purpose is, and I believe it should be removed.

I guess what I was trying to say was that I will delete this remove parameter, and add to the documentation that once a user uses find_duplicates(), the identified duplicates can be removed using dat[-c(idx1, idx2, ..., idxN), ] or dplyr::slice().
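A minimal sketch of that "find first, then remove with plain subsetting" workflow is shown below. It uses base R's `duplicated()` as a stand-in for find_duplicates(), whose actual return value is not shown in this thread, and the data frame is made up for illustration.

```r
# Illustration only: base R stand-in for the find/remove workflow.
dat <- data.frame(
  id   = c(1, 2, 2, 3, 3, 3),
  name = c("a", "b", "b", "c", "c", "c")
)

# Row indices of every repeat after the first occurrence in each group
dup_idx <- which(duplicated(dat[, c("id", "name")]))

# Guard the empty case: dat[-integer(0), ] would select zero rows, not all rows
no_dups <- if (length(dup_idx) > 0) dat[-dup_idx, , drop = FALSE] else dat
no_dups
```

The guard is worth mentioning in the documentation, because negative indexing with an empty index vector silently drops every row.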

> > After looking into the date_guess() function, the following are my findings and recommendation:
> >
> > - The function is useful for dealing with date columns where values can be associated with multiple formats.
> > - It is mainly based on lubridate::parse_date_time(), which tries each of the proposed formats on the date values.
> > - When more than one format suits some values, the function picks the first matching format from the orders list.
> > - The rest of the code handles values that {lubridate} could not parse:
> >   * convert numbers to dates depending on the specified version of Excel,
> >   * check if character values comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".
> >
> >   lubridate::parse_date_time() would have converted such values if any of the formats specified in the orders argument corresponded to them. This step is only relevant if the actual format of a date value is not among the formats proposed in the orders argument, in which case it maximises the chance of converting all values in a column.
> >
> > I suggest keeping the function and continuing to use it. We could delete it and write a new function, but the code in that new function would not be very different (if at all) from what is there currently.
>
> There are many other packages with solid functionality for just this (e.g. the anytime R package). In order to not get stuck on this discussion and delay the release, let's please:
>
> * open a dedicated issue

Issue #133 can be used for this.

> * keep `date_guess()` for now
> * reconsider for the next release

Sounds good to me.

Karim-Mane self-assigned this Apr 2, 2024

Bisaloo self-requested a review April 2, 2024 16:15

coerce id column into character

d87e0e3

Bisaloo reviewed Apr 8, 2024

Karim-Mane added 2 commits April 8, 2024 16:20

rename some files and functions and use standardize

ed4c52c

allow for using column names before column names standardisation is performed

8e93416

actions-user and others added 7 commits April 8, 2024 19:20

Automatic readme update

9ed144f

remove unnecesary line in DESCRIPTION file

a97c504

disable option for removing bad date sequences found by check_date_sequence()

207d829

disable possibility to provide a comma-separated list of column names.

a85ce72

remove the internal call of standardize_dates() from check_date_sequence()

47821a3

use || instead of |

4e46588

optimize on the code in detect_to_numeric_columns() as suggested by Hugo

0792230

Karim-Mane added 2 commits April 9, 2024 14:03

use '.data$var' from rlang in some dplyr functions

6dafba4

use regex to check prefix and suffix

f7595f8

allow for multiple prefix and suffix

5ad6e62

fix pkgdown issue

4c4aec9

Karim-Mane added 3 commits April 11, 2024 15:24

create function to set the default cleaning operations.

87eb266

update clean_data() documentation and account for the default cleaning arguments.

36917e8

send warning the missing character is not found.

9f3f819

update documentation for standardize_dates() function

05bd258

chartgerink reviewed Apr 22, 2024

use paste()/paste0() instead of glue

225ad05

add arbitrary to documentation of first_date argument in date_check_timeframe() function

d64d09f

use {lubridate} functionalities to calculate time span and rename span() into timespan()

1a448d1

use {lubridate} functionalities to calculate time span and rename span() into timespan()

aef8c53

make sure date_guess() also returns a dataframe with all possible format for every date value.

a7ec9d3

make sure date_guess() also returns a dataframe with all possible format for every date value.

f9822b9

remove the remove parameter from the find_duplicates() function.

2f64748

update the DESCRIPTION file with funder and copyright information

43a86ba

update the way columns are checked and their names matched with the original column names

32447fb

Bisaloo approved these changes May 29, 2024

Karim-Mane changed the base branch from empty to main May 29, 2024 13:20

Karim-Mane requested a review from chartgerink May 29, 2024 13:27

chartgerink approved these changes May 30, 2024

Karim-Mane merged commit 8c87a73 into main May 31, 2024
8 checks passed

Karim-Mane deleted the review branch May 31, 2024 10:34
