Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swap config to a class with static fields #98

Closed
4 of 6 tasks
natemcintosh opened this issue Nov 21, 2024 · 0 comments · Fixed by #99
Closed
4 of 6 tasks

Swap config to a class with static fields #98

natemcintosh opened this issue Nov 21, 2024 · 0 comments · Fixed by #99
Assignees

Comments

@natemcintosh
Copy link
Collaborator

natemcintosh commented Nov 21, 2024

Goal

Create a Config class with static fields for all the required input parameters for the model.

Context

Configuration files for this pipeline are generated by this very handy config generator. However, because it is in a different repo, differences in the required fields between the repos can sneakily make their way in, only to be found at run time. From personal experience doing a test run of the pipeline, I found two fields that had gotten out of sync, and had to manually fix them in the config files.

At the moment, all of the requried parameters that go in the input configuration JSON file can be found by reading through the code, and seeing what fields are accessed from the config list. For example, grepping through the repo for config[[:

~/dev/cfa-epinow2-pipeline> rg --fixed-strings --glob "**/*.R" "config[["
R/pipeline.R
80:    job_id = config[["job_id"]],
81:    task_id = config[["task_id"]]
87:    config[["job_id"]],
89:    config[["task_id"]]
106:  cli::cli_alert_info("Using job id {.field {config[['job_id']]}}")
107:  cli::cli_alert_info("Using task id {.field {config[['task_id']]}}")
147:    data_path = config[["data"]][["path"]],
... (many more lines cut out)

tests/testthat/test-pipeline.R
40:  job_path <- file.path(output_dir, config[["job_id"]])
41:  task_path <- file.path(job_path, "tasks", config[["task_id"]])
50:          config[["task_id"]],
... (many more lines cut out)

This works, but could be easier.

Since this pipeline is the "source of truth" for all the input parameters, I suggest creating a Config class, where all of the fields are explicitly defined, all in one place. This would make it much simpler to keep both repos in sync, ideally moving config sync issues to dev time vs. run time.

The goal of this is not to lock down the config generator repo into this class only. In the very near future, we hope to have other models doing Rt estimation for us, and the symmetric difference of their parameters will probably not be the empty set. The goal is only to make the job ensuring the two are synced up simpler, and remove run time errors.

Requirements

  • A Config class (maybe R6 or S7?) with all of the fields defined in one place.
  • A function to convert the raw list data from a JSON file into a Config instance.
  • Use the resulting Config instance throughout the code.

Nice to haves

  • Method for serialization of an instance into JSON.
  • Method for creating a JSON schema based on the class definition. Would make syncing the repos even simpler.
  • Automatic validation of used parameters, letting in unused parameters that might be necessary for other models. Perhaps also warning the user that that those parameters are not being used.

Out of scope

Anything outside of replacing the current config list that is used.

Related

  • This conversation about where validation should happen. Quorum was reached that it should happen outside this repo default.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant