Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[1/2]: Read in case data #17

Merged
merged 34 commits into from
Sep 10, 2024
Merged

[1/2]: Read in case data #17

merged 34 commits into from
Sep 10, 2024

Conversation

zsusswein
Copy link
Collaborator

@zsusswein zsusswein commented Aug 31, 2024

Important

Do not merge. Using a stacked PR workflow.

Read the case data from a local parquet file in our standard ETL output schema. The various errors & warnings are linked to the tests via classes.

This approach preserves our existing handling of the data with respect to exclusions: it aggregates before there have been any exclusions. Applying exclusions to the aggregated data will be handled in the second stack in this PR.

I also add in here the data from Gostic, 2020. I use it in the unit tests to practice reading in data from tests/testthat/data/ but I also add it as package data (with documentation). We use this dataset in our end-to-end testing, so its addition here is also meant with an eye to that future use.

Copy link

codecov bot commented Aug 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

📢 Thoughts on this report? Let us know!

@zsusswein zsusswein force-pushed the zs-read-data branch 3 times, most recently from 602ed84 to 767db01 Compare September 2, 2024 11:40
@zsusswein

This comment was marked as resolved.

@natemcintosh
Copy link
Collaborator

So calling /document causes it to run the documentation CI for you?

@zsusswein
Copy link
Collaborator Author

Yes!

@zsusswein

This comment was marked as resolved.

1 similar comment
@zsusswein

This comment was marked as resolved.

@zsusswein zsusswein changed the title Read in case data [1/2]: Read in case data Sep 3, 2024
@zsusswein zsusswein marked this pull request as ready for review September 3, 2024 21:42
@zsusswein

This comment was marked as resolved.

DESCRIPTION Show resolved Hide resolved
R/read_data.R Show resolved Hide resolved
R/read_data.R Outdated Show resolved Hide resolved
R/read_data.R Outdated Show resolved Hide resolved
R/read_data.R Show resolved Hide resolved
R/read_data.R Outdated Show resolved Hide resolved
R/read_data.R Outdated Show resolved Hide resolved
data-raw/convert_gostic_toy_rt_to_test_dataset.R Outdated Show resolved Hide resolved
Copy link
Collaborator

@athowes athowes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just superficial things! (I think reading this is helping me get up to speed nonetheless)

R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/read_data.R Show resolved Hide resolved
R/read_data.R Show resolved Hide resolved
tests/testthat/test-read_data.R Show resolved Hide resolved
tests/testthat/test-read_data.R Show resolved Hide resolved
@athowes
Copy link
Collaborator

athowes commented Sep 5, 2024

This approach preserves our existing handling of the data with respect to exclusions: it aggregates before there have been any exclusions. Applying exclusions to the aggregated data will be handled in the second stack in this PR.

Just to note without context here that this sounds odd to me. If you aggregate data it'd seem to be harder to then retroactively go in and remove data points from it, rather than do the removing data first then to aggregate. I guess you must have a reason to be doing this. Since I guess now you'd have to be tracking somehow in the code all of the entries that went into a particular aggregate so that if you remove data from them you can remove data from the aggregate. (As in this seems more challenging than it need be.)

@zsusswein
Copy link
Collaborator Author

zsusswein commented Sep 5, 2024

Yeah this is a good flag. I don't know if this repo is the right place to document it, but here's the gist of our ongoing discussion around data exclusions:

  • We receive and store facility-level data. These facilities are tagged with their corresponding state.
  • We want to produce state- and national-level estimates, so on-the-fly aggregate our stored data up to the desired level (this PR!)
  • We include right-truncation in reporting in the model. We provide a right-truncation PMF based on historical geographic-aggregate-level right-truncation.
  • Sometimes one geographic aggregate or another will have a one-off week where right-truncation empirically deviates from historical trend (e.g. 0 cases reported when we normally see ~some)
  • In those cases, we consult with the dataset expert and sometimes drop the single day with anomalous reporting for that particular geographic aggregate by converting the aggregate point to NA.
  • Sometimes the anomalous reporting is widespread enough that we judge that the modeling is not reflective of disease trends and exclude the whole state's model fit for the week.
  • This exclusion does not propagate to other geographic aggregates (e.g. the US overall aggregate) or future report dates. All data are included, even if a different aggregate using a portion of the same data is excluded.

With the key question being, should the NA propagate? When we exclude a state's point, should we also exclude it from the US overall? What about when we exclude a whole state's time series?

We've landed on "No, we should make judgement calls about each aggregated time series individually." But this discussion is still live and we're trying to get to a better place from both the data review/cleaning side and the modeling side.

@zsusswein zsusswein requested a review from athowes September 5, 2024 13:58
R/data.R Outdated Show resolved Hide resolved
@zsusswein

This comment was marked as resolved.

Copy link
Collaborator

@kgostic kgostic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good!

R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
R/read_data.R Show resolved Hide resolved
R/utils.R Outdated Show resolved Hide resolved
R/data.R Outdated Show resolved Hide resolved
@zsusswein zsusswein merged commit a212f50 into main Sep 10, 2024
6 checks passed
@zsusswein zsusswein deleted the zs-read-data branch September 10, 2024 15:37
zsusswein added a commit to CDCgov/cfa-gam-rt that referenced this pull request Sep 12, 2024
* Add simulated data from Gostic, 2020 for benchmarking

This commit re-uses the data processing and documentation from CDCgov/cfa-epinow2-pipeline#17. That repo is open source and public domain by nature of being USG property. I think it's convenient to re-use, but @athowes and @kgostic there's a little text in there that you both suggested, so I'd appreciate if you could give your permission for the re-use here! I've made sure to credit you both in the commit as it's partially your writing.

It's probably best practice to make the data prep and processing here fully reproducible in `data-raw/` but it's quite a pain to do, with a mixture of shells cripting and R scripting needed. I skipped it out of convenience, but if you have a really strong feeling that it's required @seabbs, let me know and we can revisit.

Closes #9

Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: Katie Gostic <[email protected]>

* Bump NEWS

* Coerce new `obs_incidence` col to integer from double

* Generate GI PMF in `data-raw/`

* Point to specific CFAEpiNow2Pipeline commit

* Remove copied & pasted unrelated text

* Re-render roxygen docs

* Add primarycensoreddist to remotes

* Typo

* Rename synthetic dataset and GI dist

* Pin primarycensorreddist to R-universe not github

---------

Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: Katie Gostic <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants