Config reader with expected schema validation #7

Closed
zsusswein wants to merge 19 commits from zs-read-config

Collaborator

@zsusswein zsusswein commented Aug 7, 2024

Note

Edit 2024-09-12: Marking as draft while reworking. Will re-open when ready for further review.


Read in the Rt run config (this is meant to be task-specific), validate it against a specified schema, and throw an error if anything doesn't match. As part of the error, dump a description of what's wrong with the config.

Note that this process assumes that all keys are specified. We don't, for example, fall back to a default prior when one is missing from the config.
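
To make the all-keys-specified assumption concrete, here is a minimal base R sketch of the two checks it implies: every expected key must be present, and no unexpected key is allowed (the analogue of `required` plus `"additionalProperties": false` in the schema). `check_config_keys()` and the example key list are hypothetical names used only for illustration; the PR itself validates against the JSON schema file rather than hand-rolling checks like this.

```r
# Hypothetical helper: compare the keys of a parsed config (a named list)
# against the keys the schema expects, and describe every mismatch.
check_config_keys <- function(config, expected_keys) {
  missing_keys <- setdiff(expected_keys, names(config))
  extra_keys <- setdiff(names(config), expected_keys)

  problems <- character(0)
  if (length(missing_keys) > 0) {
    problems <- c(
      problems,
      paste("Missing required keys:", paste(missing_keys, collapse = ", "))
    )
  }
  if (length(extra_keys) > 0) {
    problems <- c(
      problems,
      paste("Unexpected keys:", paste(extra_keys, collapse = ", "))
    )
  }

  # Fail loudly with a description of everything that is wrong
  if (length(problems) > 0) {
    stop(paste(c("Invalid config.", problems), collapse = "\n"), call. = FALSE)
  }
  invisible(config)
}

# Example: `seed` is missing, so the check fails and names the missing key
bad_config <- list(job_id = "abc", task_id = "def")
result <- tryCatch(
  check_config_keys(bad_config, c("job_id", "task_id", "seed")),
  error = function(e) conditionMessage(e)
)
```

Collecting every problem before stopping means a single failed run reports all mismatches at once, which matches the goal of dumping a description of what's wrong with the config.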

Please take a look at the proposed sample config in tests/testthat/data/sample_config.json! I tried to make it both opinionated and flexible. There are some particular choices in there, like specifying both job and task IDs in the config as UUIDs. Maybe that's too stringent and it would be good to catch that now.

@zsusswein zsusswein force-pushed the zs-read-config branch 3 times, most recently from 0e566c7 to 196d7e2 Compare August 7, 2024 21:53
@zsusswein zsusswein marked this pull request as ready for review August 8, 2024 12:58
@zsusswein
Collaborator Author

There's one NOTE in R CMD check but it's because the azure download is specified in #7 and I don't include it here.

@zsusswein
Collaborator Author

@kaitejohnson feedback on extensibility of this setup for WW modeling would be great!

@zsusswein zsusswein requested a review from kaitejohnson August 8, 2024 13:13
Collaborator

@natemcintosh natemcintosh left a comment

Just a few questions. Overall, I think having the schema like this is going to make our lives so much simpler. We just have to make sure we always use the config values throughout the Rt code now 😅

R/config.R
inst/extdata/config_schema.json
inst/extdata/config_schema.json Outdated
tests/testthat/data/sample_config.json
tests/testthat/data/sample_config.json
"format": "date"
}
},
"reference_date": {
Collaborator

@natemcintosh natemcintosh Aug 8, 2024

Each report date runs on all the reference dates?

Collaborator Author

Hmmmmm -- yeah this is a good flag. That's a bad assumption. Let me revisit.

Is this intended to be, at least for the EpiNow2 example, the vector of dates corresponding to the time series data passed in? E.g., the dates of admissions?

And for EpiNow2, you would just have a single report date? For example, if I ran EpiNow2 today, assuming the old NHSN data reporting, I'd have:
as_of_date = "2024-08-08",
report_date = "2024-08-07",
reference_date = a vector of dates going back some specified calibration period up until "2024-08-02" (last Friday)

inst/extdata/config_schema.json
"additionalProperties": false,
"properties": {
"mean": {
"type": "integer"
Collaborator

why integer instead of number?

"type": "object",
"additionalProperties": false,
"properties": {
"job_id": {
Collaborator

@natemcintosh natemcintosh Aug 8, 2024

My only slight hesitation with UUIDs for the job IDs is if we run multiple jobs, it would just be a bit harder to know which job is which. That said, it would mean we never run into the annoying error "This job already exists" because we forgot to delete it.

What about, e.g. Rt-estimation-2024-08-08T10:08:34 as job name?

This assumes that you will pass in a UUID that is generated somewhere else right?

Probably not for this PR, but I would add more metadata. For example, if this job ID is under the "EpiNow2" umbrella, I would want something that names the job based on the name of the package being used.

Collaborator

I like @natemcintosh's suggestion as long as we (1) store the date timestamp inside the metadata, not just in the path name, and (2) as long as there are no special character concerns using this as a path name 😬

Collaborator

I was testing out this naming scheme idea on something else, and discovered that Azure was not happy with :, so I replaced it with -.

So this might be something more like Rt-estimation-2024-08-08T10-08-34
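
A minimal sketch of that naming scheme in base R, assuming the timestamp is taken in UTC (the thread does not say which time zone to use). `make_job_name()` is a hypothetical helper, not something defined in this PR.

```r
# Hypothetical helper: build a job name from a prefix and a timestamp,
# using "-" instead of ":" in the time portion because Azure rejected ":".
make_job_name <- function(prefix = "Rt-estimation", time = Sys.time()) {
  # Assumption: format the timestamp in UTC so names are comparable across machines
  stamp <- format(time, "%Y-%m-%dT%H-%M-%S", tz = "UTC")
  paste(prefix, stamp, sep = "-")
}

# Example with a fixed time so the result is reproducible
fixed_time <- as.POSIXct("2024-08-08 10:08:34", tz = "UTC")
job_name <- make_job_name(time = fixed_time)
```

Per the earlier comment, the unformatted timestamp would still need to be stored in the job metadata as well, since the name alone is a lossy encoding of it.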

inst/extdata/config_schema.json

@kaitejohnson kaitejohnson left a comment

Probably asking some obvious questions... feel free to ignore if they are obviously googleable.

My main question is about the flexibility of the config format. E.g., does it need to take in a separate arg for Priors, and do those Priors have to have the very specific args rt and Gp?

How would multiple data paths be specified in this workflow?

R/config.R Outdated
#' `blob_storage_container` is specified, the path is assumed to be within
#' the specified container; otherwise it is assumed to be in the local
#' filesystem.
#' @param local_dest The local directory to write the config to when downloading

What happens if local_dest doesn't exist?
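
One possible answer, sketched in base R under the assumption that the desired behavior is to create the directory rather than fail: check for `local_dest` up front and create it (including parents) if it is missing. `ensure_local_dest()` is a hypothetical helper; the PR does not show how the actual function handles this case.

```r
# Hypothetical helper: make sure the destination directory exists before
# anything is downloaded or written into it.
ensure_local_dest <- function(local_dest) {
  if (!dir.exists(local_dest)) {
    # recursive = TRUE also creates any missing parent directories
    created <- dir.create(local_dest, recursive = TRUE)
    if (!created) {
      stop("Could not create local_dest: ", local_dest, call. = FALSE)
    }
  }
  invisible(local_dest)
}

# Example: a nested directory under the session's temp dir
dest <- file.path(tempdir(), "configs", "nested")
ensure_local_dest(dest)
```

The alternative design is to stop with an informative error when the directory is missing; either way, documenting the choice in the `@param local_dest` text would answer the question for future readers.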

#' The validation relies on `inst/data/config_schema.json`. This
#' file is in `json-schema` notation and generated programmatically via
#' https://www.jsonschema.net/.
#'

General comment, I think these are all character strings but I find it helpful when reading documentation when the type is explicitly specified.

inst/extdata/config_schema.json
"seed",
"task_id"
],
"title": "Epinow2"

Just want to make sure I understand, currently this example is for "Epinow2" but you want to be able to swap this for another package name right?

],
"title": "Parameters"
},
"Priors": {

Would you always need to pass in these arguments, or is this specific to Epinow2? I think there might be other packages where you would handle specifying priors differently (e.g., in the ww package we're developing, priors are lumped in with parameters...)

Collaborator

My assumption is that you'd have to edit this in another version of the repo to enforce that schema, but I think the framework is adaptable!

Collaborator

Given that something like fetch_config is not EpiNow2-specific, there is some argument that it shouldn't be in this package. If we are to have another package, cfa-newpackage-pipeline, then it'll also need these functions. So the schema is EpiNow2-specific but the surrounding functions are not.

"blob_storage_container": null
},
"data": {
"path": "gold/",

As written, how would you point to multiple data sources when it seems that data has just one path option?

Collaborator

@kgostic kgostic left a comment

This is good but I have some big picture questions and comments:

  1. As noted within, I think there are a few runtime pars we want to add to the schema
  2. Will the output file path be specified elsewhere? Did I miss this?
  3. How will this be applied across jurisdictions? Is this meant to be a stem that we can later append the jurisdiction ID to? One use case that isn't covered here is the ability to change a parameter for just one jurisdiction.
  4. Are you envisioning that config specification and validation happen before kicking off the Batch job or inside Azure?
  5. I think this should be out of scope for now, but I'm wondering if it's worth adding at some point "optional_validators" to the schema -- this would be a list of function names that could be turned on in the config, e.g. "validate_for_production" could check that we're using the right days of week, etc. but also could be easily turned off.
  6. I think we need to start writing documentation as we merge PRs or we're never going to be able to remember ourselves how things work, let alone enforce adherence to the existing standards and architecture and avoid this disintegrating back into spaghetti code. At minimum, I think for the config we need:
     • a "data dictionary" explaining what each parameter means
     • some guidance for how to build, store, edit, and pass a config file into a run
     • some guidance on how to change the config schema (edit the example files, the tests, the config_schema.json)
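
The "optional_validators" idea in point 5 could look roughly like the base R sketch below: a registry of named validator functions, with the config listing which ones to switch on. Everything here is hypothetical, including the function names and the Wednesday rule, which is only an illustrative stand-in for "using the right days of week".

```r
# Hypothetical registry of optional validators, keyed by name. Each one
# takes the parsed config and stops with an error if its check fails.
optional_validators <- list(
  validate_for_production = function(config) {
    # Assumed rule for illustration only: production report dates fall on
    # a Wednesday. POSIXlt weekdays run 0-6 with 0 = Sunday.
    wday <- as.POSIXlt(as.Date(config$report_date))$wday
    if (wday != 3) {
      stop("report_date must be a Wednesday for production runs", call. = FALSE)
    }
    invisible(TRUE)
  }
)

# Run only the validators that the config asks for
run_optional_validators <- function(config, registry = optional_validators) {
  requested <- config$optional_validators
  unknown <- setdiff(requested, names(registry))
  if (length(unknown) > 0) {
    stop("Unknown validators: ", paste(unknown, collapse = ", "), call. = FALSE)
  }
  for (name in requested) {
    registry[[name]](config)
  }
  invisible(config)
}

# Example: 2024-08-07 is a Wednesday, so this passes silently
config <- list(
  report_date = "2024-08-07",
  optional_validators = "validate_for_production"
)
run_optional_validators(config)
```

Turning a validator off is then just a matter of removing its name from the config, and an unrecognized name fails early instead of being silently ignored.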

}
},
"required": [
"as_of_date",
Collaborator

Suggested change
- "as_of_date",
+ "as_of_date",
+ "timeseries_end_date",
+ "timeseries_length_weeks",

These are needed so that we can do things like:

  • Change the number of weeks in the sliding window
  • Kick off retrospective runs (e.g. where the as_of_date is much later than the timeseries_end_date)
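
As a rough illustration of how these two keys could define the sliding window, here is a base R sketch. It assumes daily data and a window that includes the end date; `make_reference_dates()` is a hypothetical helper, not part of this PR.

```r
# Hypothetical helper: the reference dates covered by a sliding window that
# ends on `timeseries_end_date` and spans `timeseries_length_weeks` weeks.
make_reference_dates <- function(timeseries_end_date, timeseries_length_weeks) {
  end <- as.Date(timeseries_end_date)
  # Assumption: daily data, with the end date counted as part of the window
  start <- end - (7 * timeseries_length_weeks) + 1
  seq(start, end, by = "day")
}

# Example: an 8-week window ending on Friday 2024-08-02
reference_dates <- make_reference_dates("2024-08-02", timeseries_length_weeks = 8)
```

Because the window depends only on the end date and length, a retrospective run can keep `timeseries_end_date` fixed in the past while `as_of_date` is set to something much later.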

Collaborator

What is the logical relationship between as_of_date here, and report_date below inside the data block? Do we need to enforce / validate this relationship somewhere?

},
"required": [
"as_of_date",
"data",
Collaborator

I'm sure this is explained later, but what does "data" mean?

"type": "string",
"format": "uuid"
},
"as_of_date": {
Collaborator

See comment below

],
"title": "Data"
},
"Parameters": {
Collaborator

What if we want to point to more than one container to pull in parameters?

"alpha_sd"
],
"title": "Gp"
},
Collaborator

I think we want to add specification of the ls mean via gp_opts() https://epiforecasts.io/EpiNow2/reference/gp_opts.html

"SamplerOpts": {
"type": "object",
"additionalProperties": false,
"properties": {
Collaborator

Can we add the number of samples run here as a parameter -- seems like something we could want to change at runtime, and I think helpful to encode it explicitly instead of using the defaults

expect_equal(actual, expected)
})

test_that("Bad config errors", {
Collaborator

Use more descriptive test name?

)
})

test_that("Test config validates", {
Collaborator

Same: use more descriptive test name?

R/config.R Outdated
R/config.R Outdated
#' `blob_storage_container` is specified), reads the config in from the
#' filesystem, and validates that it matches expectations. If any of these steps
#' fails, the pipeline fails with an informative error message. Note, however,
#' that a failure in this initial step suggests that something fundamental is
Collaborator

"fundamental" is quite vague

@athowes
Collaborator

athowes commented Aug 16, 2024

I think I might be getting at similar things as Katie but:

  • Can we do documentation of extdata using standard R package approaches?
    • On that topic, is there any way to automatically create human-readable versions of schemas like inst/extdata/config_schema.json?
  • As I put in a comment, some of these functions seem to be non-EpiNow2 specific

@zsusswein zsusswein marked this pull request as draft September 12, 2024 23:13
@zsusswein zsusswein mentioned this pull request Oct 2, 2024
Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: Kaitlyn Johnson <[email protected]>
Co-authored-by: Nate McIntosh <[email protected]>
zsusswein and others added 2 commits October 15, 2024 17:51
* Fix read_data incomplete return checks

* Update @damonabayer's patch to pass CI

Really 3 changes:

1. Fix the lint error in the creation of `missing_dates`
2. Reformat `missing_dates` to a string so that the missing dates are
  pretty-printed. They're printed as an int if they're left as dates.
3. Update the test for this warning to a classed **snapshot** test.
  This change should prevent regression of the warning message.

* Bump NEWS

---------

Co-authored-by: Damon Bayer <[email protected]>
* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/lorenzwalthert/precommit: v0.4.3 → v0.4.3.9001](lorenzwalthert/precommit@v0.4.3...v0.4.3.9001)
- [github.com/astral-sh/ruff-pre-commit: v0.6.4 → v0.6.5](astral-sh/ruff-pre-commit@v0.6.4...v0.6.5)

* Bump NEWS

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zachary Susswein <[email protected]>
gvegayon and others added 12 commits October 15, 2024 17:52
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.6.5 → v0.6.7](astral-sh/ruff-pre-commit@v0.6.5...v0.6.7)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Explicit date casting in queries

To fix test failures caused by DuckDB v1.1.1 release

* Bump NEWS
* Changes for CFA Azure ACR

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added to news.md

* testing image on jk branch

* update workflow name; resubmit job

* change runs-on to new cdcgov runner

* removed unworking cache check

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* azure batch scaffolding (maybe not necessary here)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* another attempt at cacheing and simplification; splitting the buiild/dependencies workflows

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* better github actions ux via build names

* organized workflows for contributor/tester ux

* removed cacheing and made names easier to read

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comprehensive workflow renaming for ux/trackability

* Explicit date casting in queries

To fix test failures caused by DuckDB v1.1.1 release

* Bump NEWS

* pipeline with batch code - not yet working

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* commenting out the job01 dependencies build to claw us back some test cycle time

* nektos gh-act tests and pool creation code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix some env variables for auto scale formula

* more cowbell

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix needs issue

* we need quotes around var names"

* autoscale formula as cat'd variable

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* newline syntax fix for bash batch cli code

* fixed endpoint uri

* autoscale enablement?;

* autoscale as a separate step

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* pool id var needs fixing in the last step

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* environment variables

* quotes?

* comments

* run name in quotes

* Update .github/workflows/1-Build-Dependency-Image.yaml

Co-authored-by: Nate McIntosh <[email protected]>

* Update .gitignore

Co-authored-by: Zachary Susswein <[email protected]>

* added cron trigger and removed jk-azure-readiness push trigger

* documentation edits; file renames; revived "cacheing" for testing

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* ubuntu image update?

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* node agent sku also...

* attempting ubuntu 22 as 24 not yet supported

* simplified commit message display in workflow gui

* reverted to ubuntu 20. will have to investigate

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zachary Susswein <[email protected]>
Co-authored-by: Nate McIntosh <[email protected]>
Co-authored-by: Zachary Susswein <[email protected]>
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.6.7 → v0.6.8](astral-sh/ruff-pre-commit@v0.6.7...v0.6.8)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Drop unused hooks

From the template repo that are for non-R languages.

Closes #57

* Bump NEWS
* Moved workflow 1 into workflow 2 and renamed workflow 2 as workflow 1 :)

* Forgot to add the workflow

* Making pre-commit happy

* Adding tag as key for cache

* Right string comparison

* Right string comparison v2

* Adding build args

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Trying to pass arguments to docker build

* Using the ref to id if run in main

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Additional comments by @zsusswein

* Now adding 1-3 into 1 (workflows)

* Update .github/workflows/1_pre-Test-Model-Image-Build.yaml

Co-authored-by: Zachary Susswein <[email protected]>

* Update .github/workflows/1_pre-Test-Model-Image-Build.yaml

Co-authored-by: Zachary Susswein <[email protected]>

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Zachary Susswein <[email protected]>
updates:
- [github.com/pre-commit/pre-commit-hooks: v4.6.0 → v5.0.0](pre-commit/pre-commit-hooks@v4.6.0...v5.0.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Ignore CI/CD stuff in Rbuildignore

* Extract diagnostics from fitted model

* Basic output schema

* Use `.pre-commit.config.yaml` from main

To fix weirdness with unicode parsing error from.....somewhere?

* Update output schema

* Bump NEWS

* Bump NEWS

* Expand on readme

* Use setequal for column name checks

h/t @natemcintosh

* Apply suggestions from code review

Co-authored-by: Adam Howes <[email protected]>

* Update with Adam's review

* Update R/write_output.R

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update R/extract_diagnostics.R

Co-authored-by: Katie Gostic (she/her) <[email protected]>

* Add alert with dates for low case count diagnostic

* Apply suggestions from code review

Co-authored-by: Adam Howes <[email protected]>

* Use new R-universe Stan repository

* Update README.md

Co-authored-by: Katie Gostic (she/her) <[email protected]>

* Update README.md

Co-authored-by: Katie Gostic (she/her) <[email protected]>

* Condense dir creation

* Expose quantiles for summarization

* Save the description of the different EpiNow2 params

* Add comment explaining why dates work

* Clarify comment on EpiNow2 param outputs

* Add `reports` to output

---------

Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Katie Gostic (she/her) <[email protected]>
* Update key `iter_samples` to `iter_sampling`

It was already documented as `iter_sampling` in the docs, but the code
expected `iter_samples`.

`iter_sampling` follows `{cmdstanr}` syntax: https://mc-stan.org/cmdstanr/reference/model-method-sample.html

`iter_samples` is an accidental portmanteau of EpiNow2's desired
`samples` arg and `iter_sampling`.

Closes #73

* Bump NEWS
* Fix NOTE from unassigned variable

Additional NSE problems.

Closes #75

* Bump NEWS
@zsusswein
Collaborator Author

Superseded by #99

@zsusswein zsusswein closed this Dec 9, 2024