Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review #126

Merged
merged 59 commits into from
May 31, 2024
Merged

Review #126

merged 59 commits into from
May 31, 2024

Conversation

Karim-Mane
Copy link
Member

This is a PR for a full package review of {cleanepi}

@Karim-Mane Karim-Mane self-assigned this Apr 2, 2024
Copy link

github-actions bot commented Apr 2, 2024

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

@Bisaloo Bisaloo self-requested a review April 2, 2024 16:15
Copy link

github-actions bot commented Apr 5, 2024

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link
Member

@Bisaloo Bisaloo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this. I like the reporting functionality, and I think it can provide a unique value compared to other data cleaning packages.

I've left quite a lot of comments. Please focus on the ones regarding the user interface before the CRAN release.

Two general comments applicable throughout the codebase:

  • when possible, please try to address the edge cases as part of the general cases. Otherwise, if every edge case if special cased, the code becomes very long and difficult to follow.
  • in general, user input should be properly formatted as standard R objects. We don't want to have to clean and parse user input on top of already messy data.

DESCRIPTION Outdated Show resolved Hide resolved
R/check_date_sequence.R Outdated Show resolved Hide resolved
R/check_date_sequence.R Outdated Show resolved Hide resolved
R/check_date_sequence.R Outdated Show resolved Hide resolved
R/clean_data.R Outdated Show resolved Hide resolved
R/find_and_remove_duplicates.R Outdated Show resolved Hide resolved
R/span.R Outdated
Comment on lines 74 to 93
# end_date can be a column of the input data or
# a vector of Date values with the same length as number of row in data or
# a Date value
if (is.character(end_date) && end_date %in% colnames(data)) {
span_result <- abs(unclass(data[[target_column]]) -
unclass(data[[end_date]]))
} else {
span_result <- abs(unclass(data[[target_column]]) - unclass(end_date))
}
units <- c(365.25, 30.0, 7.0, 1.0)
names(units) <- c("years", "months", "weeks", "days")
if (!is.null(span_remainder_unit)) {
data[, span_column_name] <- floor(span_result / units[span_unit])
data[, sprintf("remainder_%s", span_remainder_unit)] <- round(
(span_result %% units[span_unit]) / units[span_remainder_unit],
digits = 2L)
} else {
data[, span_column_name] <- round(span_result / units[span_unit],
digits = 2L)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe lubridate or base R have good functionality to deal with date differences. Any reasons to not use these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was making use of lubridate functions, which @pratikunterwegs polished furhter.

We were at some point doing a proof of concept about reducing dependencies by making the function with only base R. This function happened to be the test function and we managed to archieve the same as when lubridate was used. So we decided to keep this version.

Copy link
Member

@chartgerink chartgerink Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are including lubridate anyway (as it currently stands), keeping the custom function because it work does not seem like a convincing reason to include it.

A custom function also increases maintenance complexity. This custom function is not as well validated and tested as lubridate, which risks us having to deal with bugs and edge cases down the line.

I would recommend tracking this further in an issue, so that we don't lose track of it and a decision on this does not block the package review process. As a nice side benefit: using lubridate directly could also resolve #134 😊

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have restored the usage of {lubridate} functionalities in this function and changed the function name from span() to timespan(). See commit 1a448d1.

R/standardize_subject_ids.R Outdated Show resolved Hide resolved
R/standardize_subject_ids.R Outdated Show resolved Hide resolved
R/convert_to_numeric.R Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Apr 8, 2024

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented Apr 9, 2024

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 77 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link
Member

@chartgerink chartgerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Karim-Mane - the most critical issue has been resolved from my end (creating and deleting tmp folders). The dependencies are something that can (and in my opinion, should) be revisited at a later time.

DESCRIPTION Outdated Show resolved Hide resolved
Copy link

This pull request:

  • Adds 63 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented May 6, 2024

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented May 7, 2024

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented May 7, 2024

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented May 8, 2024

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

github-actions bot commented May 8, 2024

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

@Karim-Mane
Copy link
Member Author

@Bisaloo, @chartgerink - I have made some changes to account for your latest reviews. Kindly have a look and let me know your thoughts. I am aiming to close the PR on Wednesday latest.

@Bisaloo
Copy link
Member

Bisaloo commented May 21, 2024

@Bisaloo, @chartgerink - I have made some changes to account for your latest reviews. Kindly have a look and let me know your thoughts. I am aiming to close the PR on Wednesday latest.

Restarting conversation thread as it's getting lost in the noise otherwise:

  • @chartgerink - I have provided two examples on the usage of the remove argument in the find and remove duplicates section of the package vignette.

I still don't follow. The example in the vignette reads:

no_dups <- remove_duplicates(
  data           = readRDS(system.file("extdata", "test_linelist.RDS",
                                       package = "cleanepi")),
  target_columns = "linelist_tags",
  remove         = -c(33, 55)
)

But if I already know that rows 33 and 55 are to be removed, why would I use remove_duplicates() rather than slice() or any other existing method directly?

@Karim-Mane
Copy link
Member Author

Thanks @Bisaloo for pointing these out.

  • On the first point: I must admit that for simplicity reasons, one could use:
    • dplyr::slice() or,
    • dat[-c(idx1, idx2, ...), ]. This syntax is used in the remove_duplicates().

Without this argument, the remove_duplicates() function will always keep the first instance of a duplicated group of rows, while deleting the others.
Hence the most important function of this module will be find_duplicates(). Users can choose the method to use for deleting the not needed rows at their own convinience. Does this make sense?

  • On the second point: pasting here what I DMed @chartgerink on slack (with a bit more details)

After looking into the date_guess() function, followings are my findings and recommendation:

  1. The function is useful for dealing with date columns where values can be associated with multiple formats.
  2. It is mainly based on lubridate::parse_date_time() to subject the date values to the proposed formats.
  3. When more than 1 format suits some values, the function picks the first given format from the orders list.
  4. The rest of the code makes sure to account for values that {lubridate} could not handle:
    * convert numbers to date depending on the specified version of excel,
    * check if characters comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".
    lubridate::parse_date_time() would have converted such values if any of the specified format in the orders argument correspond to them. This is only relevant if the actual format of a date value is not part of the proposed formats in the orders argument, in which case this extra step maximises on the chance of converting all values in a column.

I suggest to keep the function and continue using it. We could delete it and write a new function. But the code base in that new function will not be too different (if it is) from what is there currently.

I have added some code to output rows where the date values comply with multiple formats. This information is returned as a data frame and will be shown in the report to help the user decide whether to confirm or amend the result.

@Bisaloo
Copy link
Member

Bisaloo commented May 22, 2024

  • On the first point: I must admit that for simplicity reasons, one could use:
    • dplyr::slice() or,
    • dat[-c(idx1, idx2, ...), ]. This syntax is used in the remove_duplicates().

Without this argument, the remove_duplicates() function will always keep the first instance of a duplicated group of rows, while deleting the others.
Hence the most important function of this module will be find_duplicates(). Users can choose the method to use for deleting the not needed rows at their own convinience. Does this make sense?

No, it still doesn't really make sense to me. I understand and I agree find_duplicates() & remove_duplicates() provide useful functionality. But the remove argument mixes several distinct tasks in this function.
If users provide indices to be removed, it's no longer a remove_duplicates() function, it's a remove() (or actually slice() function).
This argument & unrelated functionality makes it much more difficult to understand what this function purpose is and I believe it should be removed.

After looking into the date_guess() function, followings are my findings and recommendation:

  1. The function is useful for dealing with date columns where values can be associated with multiple formats.
  2. It is mainly based on lubridate::parse_date_time() to subject the date values to the proposed formats.
  3. When more than 1 format suits some values, the function picks the first given format from the orders list.
  4. The rest of the code makes sure to account for values that {lubridate} could not handle:
    * convert numbers to date depending on the specified version of excel,
    * check if characters comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".
    lubridate::parse_date_time() would have converted such values if any of the specified format in the orders argument correspond to them. This is only relevant if the actual format of a date value is not part of the proposed formats in the orders argument, in which case this extra step maximises on the chance of converting all values in a column.

I suggest to keep the function and continue using it. We could delete it and write a new function. But the code base in that new function will not be too different (if it is) from what is there currently.

There are many other packages with solid functionality for just this (e.g. the anytime R package). In order to not get stuck on this discussion and delay the release, let's please:

  • open a dedicated issue
  • keep date_guess() for now
  • reconsider for the next release

@Karim-Mane
Copy link
Member Author

  • On the first point: I must admit that for simplicity reasons, one could use:

    • dplyr::slice() or,
    • dat[-c(idx1, idx2, ...), ]. This syntax is used in the remove_duplicates().

Without this argument, the remove_duplicates() function will always keep the first instance of a duplicated group of rows, while deleting the others.
Hence the most important function of this module will be find_duplicates(). Users can choose the method to use for deleting the not needed rows at their own convinience. Does this make sense?

No, it still doesn't really make sense to me. I understand and I agree find_duplicates() & remove_duplicates() provide useful functionality. But the remove argument mixes several distinct tasks in this function. If users provide indices to be removed, it's no longer a remove_duplicates() function, it's a remove() (or actually slice() function). This argument & unrelated functionality makes it much more difficult to understand what this function purpose is and I believe it should be removed.

I guess what I was trying to say was that I will delete this remove parameter, and add to the documentation that once a user uses find_duplicates(), the identified duplicates can be removed using dat[-c(idx1, idx2, ..., idxN), ] or dplyr::slice().

After looking into the date_guess() function, followings are my findings and recommendation:

  1. The function is useful for dealing with date columns where values can be associated with multiple formats.
  2. It is mainly based on lubridate::parse_date_time() to subject the date values to the proposed formats.
  3. When more than 1 format suits some values, the function picks the first given format from the orders list.
  4. The rest of the code makes sure to account for values that {lubridate} could not handle:
    • convert numbers to date depending on the specified version of excel,
    • check if characters comply with the following formats: "%Y-%m-%d", "%d-%m-%Y", "%d-%b-%Y", "%Y-%b-%d".
      lubridate::parse_date_time() would have converted such values if any of the specified format in the orders argument correspond to them. This is only relevant if the actual format of a date value is not part of the proposed formats in the orders argument, in which case this extra step maximises on the chance of converting all values in a column.

I suggest to keep the function and continue using it. We could delete it and write a new function. But the code base in that new function will not be too different (if it is) from what is there currently.

There are many other packages with solid functionality for just this (e.g. the anytime R package). In order to not get stuck on this discussion and delay the release, let's please:

* open a dedicated issue

Issue #133 can be used for this

* keep `date_guess()` for now

* reconsider for the next release

Sounds good with me.

Copy link

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

Copy link

This pull request:

  • Adds 62 new dependencies (direct and indirect)
  • Adds 13 new system dependencies
  • Removes 6 existing dependencies (direct and indirect)
  • Removes 3 existing system dependencies

(Note that results may be inacurrate if you branched from an outdated version of the target branch.)

@Karim-Mane Karim-Mane changed the base branch from empty to main May 29, 2024 13:20
@Karim-Mane Karim-Mane merged commit 8c87a73 into main May 31, 2024
8 checks passed
@Karim-Mane Karim-Mane deleted the review branch May 31, 2024 10:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants