Config reader with expected schema validation #7

Closed · wants to merge 19 commits

Changes from 1 commit:
Config reader with expected schema validation
zsusswein committed Aug 8, 2024

Verified: This commit was created on GitHub.com and signed with GitHub's verified signature.
commit a7602bb23250462300e3c7969458661aaccfcc67
1 change: 0 additions & 1 deletion .pre-commit-config.yaml
@@ -7,7 +7,6 @@ repos:
hooks:
- id: style-files
args: [--style_pkg=styler, --style_fun=tidyverse_style]
- id: roxygenize
- id: use-tidy-description
- id: lintr
- id: readme-rmd-rendered
7 changes: 6 additions & 1 deletion DESCRIPTION
@@ -14,8 +14,13 @@ Description: Add logging, metadata handling, and data handling
License: Apache License (>= 2)
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
RoxygenNote: 7.3.2
Suggests:
testthat (>= 3.0.0)
Config/testthat/edition: 3
URL: https://cdcgov.github.io/cfa-epinow2-pipeline/
Imports:
cli,
jsonlite,
jsonvalidate,
rlang
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -1,3 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(add_two_numbers)
export(fetch_config)
export(validate_config)
1 change: 1 addition & 0 deletions NEWS.md
@@ -1,4 +1,5 @@
# CFAEpiNow2Pipeline (development version)

* Config reader with schema validation and enforcement
* CI running on Ubuntu only & working pkgdown deploy to Github Pages
* Initial R package with checks running in CI
100 changes: 100 additions & 0 deletions R/config.R
@@ -0,0 +1,100 @@
#' Fetch the config from an external resource
#'
#' This step is the first part of the modeling pipeline. It looks to Azure Blob
#' and downloads the Rt model run's config to the local config (if
#' `blob_storage_container` is specified), reads the config in from the
#' filesystem, and validates that it matches expectations. If any of these steps
#' fails, the pipeline fails with an informative error message. Note, however,
#' that a failure in this initial step suggests that something fundamental is
[Review comment, Collaborator] "fundamental" is quite vague

#' misspecified and the logs will likely not be preserved in a Blob Container if
#' running in Azure.
#'
#' The validation relies on `inst/data/config_schema.json` for validation. This
#' file is in `json-schema` notation and generated programmatically via
#' https://www.jsonschema.net/.
#'

[Review comment] General comment: I think these are all character strings, but I find it helpful when reading documentation for the type to be explicitly specified.

#' @param config_path The path to the config file, either in the local
#' filesystem or with an Azure Blob Storage container. If
#' `blob_storage_container` is specified, the path is assumed to be within
#' the specified container; otherwise it is assumed to be in the local
#' filesystem.
#' @param local_dest The local directory to write the config to when downloading

[Review comment] What happens if local_dest doesn't exist?

#' it from `blob_storage_container`. This argument is ignored unless
#' `blob_storage_container` is specified.
#' @param blob_storage_container The storage container holding the config at
#' `config_path`
#' @param config_schema_path The path to the file holding the schema for the
#' config json for the validator to use.
#'
#' @return A list of lists, the config for the run.
#' @export
fetch_config <- function(
config_path,
local_dest,
blob_storage_container,
config_schema_path = system.file("extdata/config_schema.json",
package = "CFAEpiNow2Pipeline"
)) {
if (!rlang::is_null(blob_storage_container)) {
download_from_azure_blob(
config_path,
local_dest,
container_name = blob_storage_container
)
} else {
cli::cli_alert(
"No blob storage container provided. Reading from local path."
)
}

cli::cli_alert_info("Loading config from {.path {config_path}}")
validate_config(config_path, config_schema_path)

config <- rlang::try_fetch(
jsonlite::read_json(config_path),
error = function(con) {
cli::cli_abort(
"Error loading config from {.path {config_path}}",
parent = con,
class = "CFA_Rt"
)
}
)

return(config)
}

#' Compare loaded json against expectation in `inst/data/config-schema.json`
#'
#' @inheritParams fetch_config
#' @return NULL, invisibly
#' @export
validate_config <- function(
config_path,
config_schema_path = system.file("extdata/config_schema.json",
package = "CFAEpiNow2Pipeline"
)) {
is_config_valid <- rlang::try_fetch(
jsonvalidate::json_validate(
json = config_path,
schema = config_schema_path,
engine = "ajv",
verbose = TRUE,
greedy = TRUE,
error = TRUE
),
error = function(con) {
cli::cli_abort(
c(
"Error while validating config",
"!" = "Config path: {.path {config_path}}",
"!" = "Schema path: {.path {config_schema_path}}"
),
parent = con,
class = "CFA_Rt"
)
}
)

invisible(is_config_valid)
}
191 changes: 191 additions & 0 deletions inst/extdata/config_schema.json
@@ -0,0 +1,191 @@
{
"$schema": "http://json-schema.org/draft-06/schema#",
"$ref": "#/definitions/Epinow2",
"definitions": {
"Epinow2": {
"type": "object",
"additionalProperties": false,
"properties": {
"job_id": {
[Review comment, natemcintosh (Collaborator), Aug 8, 2024] My only slight hesitation with UUIDs for the job IDs is that if we run multiple jobs, it would be a bit harder to know which job is which. That said, it would mean we never run into the annoying error "This job already exists" because we forgot to delete it. What about, e.g., Rt-estimation-2024-08-08T10:08:34 as the job name?
[Review comment] This assumes that you will pass in a UUID that is generated somewhere else, right? Probably not for this PR, but I would add more metadata. For example, if this job ID is under the "EpiNow2" umbrella, I would want something that names the job based on the name of the package being used.
[Review comment, Collaborator] I like @natemcintosh's suggestion as long as we (1) store the date timestamp inside the metadata, not just in the path name, and (2) there are no special-character concerns using this as a path name 😬
[Review comment, Collaborator] I was testing out this naming scheme idea on something else, and discovered that Azure was not happy with ":", so I replaced it with "-". So this might be something more like Rt-estimation-2024-08-08T10-08-34
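The naming scheme this thread converges on is straightforward to generate. A sketch (the function name is hypothetical; the `Rt-estimation` prefix and the `-` in place of `:` for Azure compatibility come from the thread):

```python
from datetime import datetime, timezone


def job_name(prefix="Rt-estimation", now=None):
    """Build a readable job name like 'Rt-estimation-2024-08-08T10-08-34'.

    Azure rejects ':' in these names (per the thread), so the time
    portion uses '-' separators instead of the ISO-8601 default.
    """
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now.strftime('%Y-%m-%dT%H-%M-%S')}"
```

Per the reviewers' point (1), the exact timestamp (and a UUID, if kept) should still live in the run metadata, not only in the name.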

"type": "string",
"format": "uuid"
},
"task_id": {
"type": "string",
"format": "uuid"
},
"as_of_date": {
[Review comment, Collaborator] See comment below

"type": "string",
"format": "date"
},
"disease": {
"type": "string"
},
"geo_value": {
"type": "array",
"items": {
"type": "string"
}
},
"geo_type": {
"type": "string"
},
"parameters": {
"$ref": "#/definitions/Parameters"
},
"data": {
"$ref": "#/definitions/Data"
},
"seed": {
"type": "integer"
},
"horizon": {
"type": "integer"
},
"priors": {
"$ref": "#/definitions/Priors"
},

[Review comment] Where do passed-in pmfs go in this workflow? Are they part of parameters?

"sampler_opts": {
"$ref": "#/definitions/SamplerOpts"
}
},
"required": [
"as_of_date",
[Review comment, Collaborator] Suggested change: in addition to "as_of_date", also require "timeseries_end_date" and "timeseries_length_weeks". These are needed so that we can do things like:
  • Change the number of weeks in the sliding window
  • Kick off retrospective runs (e.g. where the as_of_date is much later than the timeseries_end_date)
[Review comment, Collaborator] What is the logical relationship between as_of_date here and report_date below inside the data block? Do we need to enforce / validate this relationship somewhere?

"data",
[Review comment, Collaborator] I'm sure this is explained later, but what does "data" mean?

"disease",
"geo_type",
"geo_value",
"horizon",
"job_id",
"parameters",
"priors",
"sampler_opts",
"seed",
"task_id"
],
"title": "Epinow2"

[Review comment] Just want to make sure I understand: currently this example is for "Epinow2", but you want to be able to swap this for another package name, right?

},
"Data": {
"type": "object",
"additionalProperties": false,
"properties": {
"path": {
"type": "string"
},
"blob_storage_container": {
"type": ["null", "string"]
},
"report_date": {
"type": "array",
"items": {
"type": "string",
"format": "date"
}
},
"reference_date": {
[Review comment, natemcintosh (Collaborator), Aug 8, 2024] Each report date runs on all the reference dates?
[Review comment, Collaborator (Author)] Hmmmmm -- yeah this is a good flag. That's a bad assumption. Let me revisit.
[Review comment] Is this intended to be, at least for the EpiNow2 example, the vector of dates corresponding to the time series data passed in? E.g. the date of admissions.
[Review comment] And for EpiNow2, you would just have a single report date? So for example, if I ran EpiNow2 today (assuming old NHSN data reporting), I'd have:
  as_of_date = "2024-08-08",
  report_date = "2024-08-07",
  reference_date = a vector of dates going back some specified calibration period up until "2024-08-02" (last Friday)

"type": "array",
"items": {
"type": "string",
"format": "date"
}
}
},
"required": [
"blob_storage_container",
"path",
"reference_date",
"report_date"
],
"title": "Data"
},
"Parameters": {
[Review comment, Collaborator] What if we want to point to more than one container to pull in parameters?

"type": "object",
"additionalProperties": false,
"properties": {
"path": {
"type": "string"
},
"blob_storage_container": {
"type": ["string", "null"]
}
},
"required": [
"blob_storage_container",
"path"
],
"title": "Parameters"
},
"Priors": {

[Review comment] Would you always need to pass in these arguments, or is this specific to Epinow2? I think there might be other packages where you would handle specifying priors differently (e.g. in the ww package we're developing, priors are lumped in with parameters...)
[Review comment, Collaborator] My assumption is that you'd have to edit this in another version of the repo to enforce that schema, but I think the framework is adaptable!
[Review comment, Collaborator] Given that something like fetch_config is not EpiNow2-specific, there is some argument it shouldn't be in this package. It seems like if we are to have another package, cfa-newpackage-pipeline, then it'll also need this function. So the schema is EpiNow2-specific, but the surrounding functions are not.

"type": "object",
"additionalProperties": false,
"properties": {
"rt": {
"$ref": "#/definitions/Rt"
},
"gp": {
"$ref": "#/definitions/Gp"
}
},
"required": [
"gp",
"rt"
],
"title": "Priors"
},
"Gp": {
"type": "object",
"additionalProperties": false,
"properties": {
"alpha_sd": {
"type": "number"
}
},
"required": [
"alpha_sd"
],
"title": "Gp"
},
[Review comment, Collaborator] I think we want to add specification of the ls mean via gp_opts(): https://epiforecasts.io/EpiNow2/reference/gp_opts.html

"Rt": {
"type": "object",
"additionalProperties": false,
"properties": {
"mean": {
"type": "integer"
[Review comment, Collaborator] Why integer instead of number?

},
"sd": {
"type": "number"
}
},
"required": [
"mean",
"sd"
],
"title": "Rt"
},
"SamplerOpts": {
"type": "object",
"additionalProperties": false,
"properties": {
[Review comment, Collaborator] Can we add the number of samples run here as a parameter? Seems like something we could want to change at runtime, and I think it's helpful to encode it explicitly instead of using the defaults.

"cores": {
"type": "integer"
},
"chains": {
"type": "integer"
},
"adapt_delta": {
"type": "number"
},
"max_treedepth": {
"type": "integer"
}
},
"required": [
"adapt_delta",
"chains",
"cores",
"max_treedepth"
],
"title": "SamplerOpts"
}
}
}
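A quick way to sanity-check a candidate config against the shape this schema expects is to compare its keys to the top-level `required` list (copied from the `Epinow2` definition above); the helper below is a hypothetical illustration, not part of the PR:

```python
# Top-level keys the Epinow2 definition marks as required.
REQUIRED_TOP_LEVEL = [
    "as_of_date", "data", "disease", "geo_type", "geo_value", "horizon",
    "job_id", "parameters", "priors", "sampler_opts", "seed", "task_id",
]


def missing_keys(config):
    """Return the required top-level keys absent from a parsed config dict."""
    return [k for k in REQUIRED_TOP_LEVEL if k not in config]
```

Note this flags structural omissions only; the full validator also enforces types, formats (uuid, date), and `additionalProperties: false`.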
49 changes: 49 additions & 0 deletions man/fetch_config.Rd


28 changes: 28 additions & 0 deletions man/validate_config.Rd
32 changes: 32 additions & 0 deletions tests/testthat/data/bad_sample_config.json
@@ -0,0 +1,32 @@

{
"as_of_date": "2023-01-01",
"geo_value": "test",
"geo_type": "test",
"report_date": [
"01-01"
],
"reference_date": [
"2023-01-01",
"2022-12-30",
"2022-12-29"
],
"seed": "abc",
"horizon": 14,
"priors": {
"rt": {
"mean": 1.0,
"sd": 0.2
},
"gp": {
"alpha_sd": 0.01
}
},
"sampler_opts": {
"cores": 4,
"chains": 4,
"adapt_delta": 0.99,
"max_treedepth": 12,
"not_a_parameter": -12
}
}
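This fixture is invalid in several independent ways: `report_date` contains `"01-01"` (not a full `YYYY-MM-DD` date), `seed` is a string where the schema requires an integer, `sampler_opts` carries `not_a_parameter` even though `additionalProperties` is `false`, and required keys such as `job_id`, `task_id`, `disease`, `parameters`, and `data` are missing entirely. A few of these checks, hand-rolled with the standard library (illustrative only; the PR relies on the ajv engine via `jsonvalidate`):

```python
import re

DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
ALLOWED_SAMPLER_KEYS = {"cores", "chains", "adapt_delta", "max_treedepth"}


def spot_violations(config):
    """Collect a few of the schema violations in a parsed config dict."""
    problems = []
    if not all(DATE.match(d) for d in config.get("report_date", [])):
        problems.append("report_date entries must be YYYY-MM-DD")
    if not isinstance(config.get("seed"), int):
        problems.append("seed must be an integer")
    extra = set(config.get("sampler_opts", {})) - ALLOWED_SAMPLER_KEYS
    if extra:  # additionalProperties: false forbids unknown keys
        problems.append(f"unexpected sampler_opts keys: {sorted(extra)}")
    return problems
```

Running this against the fixture above would report all three problems, which is why the test suite expects validation of this file to error.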
41 changes: 41 additions & 0 deletions tests/testthat/data/sample_config.json
@@ -0,0 +1,41 @@
{
"job_id": "6183da58-89bc-455f-8562-4f607257a876",
"task_id": "bc0c3eb3-7158-4631-a2a9-86b97357f97e",
"as_of_date": "2023-01-01",
"disease": "test",
"geo_value": ["test"],
"geo_type": "test",
"parameters": {
"path": "data/parameters.parquet",
"blob_storage_container": null
},
"data": {
"path": "gold/",

[Review comment] As written, how would you point to multiple data sources, when it seems that data just has one path option?

"blob_storage_container": null,
"report_date": [
"2023-01-01"
],
"reference_date": [
"2023-01-01",
"2022-12-30",
"2022-12-29"
]
},
"seed": 42,
"horizon": 14,
"priors": {
"rt": {
"mean": 1.0,
"sd": 0.2
},
"gp": {
"alpha_sd": 0.01
}
},
"sampler_opts": {
"cores": 4,
"chains": 4,
"adapt_delta": 0.99,
"max_treedepth": 12
}
}
44 changes: 44 additions & 0 deletions tests/testthat/test-fetch_config.R
@@ -0,0 +1,44 @@
test_that("Test config loads", {
config_path <- test_path("data/sample_config.json")

expected <- jsonlite::read_json(config_path)
actual <- fetch_config(
config_path = config_path,
local_dest = NULL,
blob_storage_container = NULL
)

expect_equal(actual, expected)
})

test_that("Bad config errors", {
[Review comment, Collaborator] Use a more descriptive test name?

config_path <- test_path("data/bad_sample_config.json")

expect_error(
{
fetch_config(
config_path = config_path,
local_dest = NULL,
blob_storage_container = NULL
)
},
class = "CFA_Rt"
)
})

test_that("Test config validates", {
[Review comment, Collaborator] Same: use a more descriptive test name?

config_path <- test_path("data/sample_config.json")

expect_true(
validate_config(
config_path
)
)
})


test_that("Bad config errors", {
config_path <- test_path("data/bad_sample_config.json")

expect_error(validate_config(config_path))
})