Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor convert_input to Perform tasks via helper function #3338

Open
wants to merge 24 commits into
base: develop
Choose a base branch
from

Conversation

Sweetdevil144
Copy link
Contributor

@Sweetdevil144 Sweetdevil144 commented Jul 18, 2024

Description

Following changes were performed :

  • Shift functions to check for missing files
  • Return from convert_input via a helper function
  • Refactor extra variables in run.meta.analysis
  • Update corresponding test files and add tests to ensure do_conversions isn't affected by current applied changes

Motivation and Context

The main motive for these changes is to simplify convert_input by trying to break some of its components and branch it to other functions.

This PR may fix a task within #3307

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Return from convert_input via a helper function

Update corresponding test files and add tests to ensure do_conversions isn't affected by current applied changes

Signed-off-by: Abhinav Pandey <[email protected]>
Signed-off-by: Abhinav Pandey <[email protected]>
Comment on lines 740 to 743
result_sizes <- checked.missing.files$result_sizes;
outlist <- checked.missing.files$outlist;
existing.input <- checked.missing.files$existing.input;
existing.dbfile <- checked.missing.files$existing.dbfile;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is their any other way to unwrap these variables in a single context line. For example, inheriting the above list and utilising the list$var syntax in-place. A lot of variable operations(assignment and reassignment) would consume extra memory in the heap. Should we also implement a Garbage Collection (gc()) or is it automatically applied in R ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than try to unwrap them in one line I'd consider whether they need to be assigned separate names at all -- e.g. can all places below that refer to result_sizes refer to checked.missing.files$result_sizes instead? If that's what you're suggesting by "utilising the list$var syntax in-place", then yes, I like that approach.

That said these objects are all very small and the runtime/memory overhead from reassignment will be negligible and R's reference-counting garbage collection will take care of the cleanup for us. Happily though, the same optimization gives (what I think is) a substantial improvement in code clarity, so I endorse it on those grounds.

@Sweetdevil144

This comment was marked as outdated.

#' @return list of machine, input, and dbfile records
#' @author Betsy Cowdery, Michael Dietze, Ankur Desai, Tony Gardella, Luke Dramko

get.machine.info <- function(host, input.args, input.id = NULL, con = NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use underscore instead of dot in the name here to avoid possible confusion with S3 methods. These days I only use dot where it's needed for consistency with existing code, and the db package is already inconsistent about it 😉

Suggested change
get.machine.info <- function(host, input.args, input.id = NULL, con = NULL) {
get_machine_info <- function(host, input.args, input.id = NULL, con = NULL) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to change the usages below to match, naturally

#' @param con database connection
#' @return list of machine host and machine information
#' @author Abhinav Pandey
get.machine.host <- function(host, con = NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. rename as discussed for get.machine.info above
  2. db.query will fail if con = NULL, might as well make it mandatory by providing no default value
Suggested change
get.machine.host <- function(host, con = NULL) {
get_machine_host <- function(host, con) {

return(NULL)
}

if (missing(input.id) || is.na(input.id) || is.null(input.id)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the missing() here because get.machine.info now defaults it to NULL

Suggested change
if (missing(input.id) || is.na(input.id) || is.null(input.id)) {
if (is.na(input.id) || is.null(input.id)) {


if (nrow(machine) == 0) {
PEcAn.logger::logger.error("machine not found", host$name)
return(NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that returning from get.machine.info could have different behavior than returning from convert.input. I'll need to read to the end to know whether that difference matters, but mentioning it now before I forget

Comment on lines +13 to +14
# Print the structure of `res` for debugging
str(res)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Print the structure of `res` for debugging
str(res)

# This is to tell input.insert if we are writing ensembles
# Why does it need it? Because it checks for inputs with the same time period, site, and machine
# and if it returns something it does not insert anymore, but for ensembles, it needs to bypass this condition
ens.flag <- if (!is.null(ensemble) | is.null(ensemble_name)) TRUE else FALSE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ens.flag <- if (!is.null(ensemble) | is.null(ensemble_name)) TRUE else FALSE
ens.flag <- (!is.null(ensemble) || is.null(ensemble_name))

}
} # End for loop

successful <- TRUE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like successful will need to be passed back to convert.input somehow -- it uses it to determine what files to clean up when it exits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a quick skim it looks like none of the cases that can set successful = FALSE occur inside add.database.entries. Maybe a possible solution is to remove them here and have convert.input set successful after it calls add.database.entries and before returning?

Comment on lines +3 to +4
mocked_res <- mockery::mock(data.frame(file = c("A", "B"), file_size = c(100, 200), missing = c(FALSE, FALSE), empty = c(FALSE, FALSE)))
mockery::stub(check_missing_files, "purrr::map_dfr", mocked_res)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what happens inside map_dfr is part of what we want to be testing here -- might be better to stub out file.size since that's the only place that would be looking at the filesystem.

Suggested change
mocked_res <- mockery::mock(data.frame(file = c("A", "B"), file_size = c(100, 200), missing = c(FALSE, FALSE), empty = c(FALSE, FALSE)))
mockery::stub(check_missing_files, "purrr::map_dfr", mocked_res)
mocked_size <- mockery::mock(100,200)
mockery::stub(check_missing_files, "file.size", mocked_res)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be good to include cases with missing and empty files and confirm that it produces the expected error

@@ -0,0 +1,19 @@
test_that("`check_missing_files()` able to return correct missing files", {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filename should match function name (so use _ not .)
Come to think of it, same for the files containing the function definitions

existing.input = list(data.frame(file = character(0))),
existing.dbfile = list(data.frame(file = character(0)))
))
mockery::stub(convert_input, "add.database.entries", list(input.id = 1, dbfile.id = 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a lot of mocks and stubs, but I don't see any that aren't needed. Hopefully as we break up the function further we'll be able to simplify these.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants