[1/2]: Read in case data #17
Codecov Report: All modified and coverable lines are covered by tests ✅
So calling
Yes!
Just superficial things! (I think reading this is helping me get up to speed nonetheless)
Just to note, without context, that this sounds odd to me. If you aggregate the data first, it seems harder to go back and remove data points retroactively than to remove them first and then aggregate. I guess you must have a reason for doing it this way, since you'd now have to track in the code every entry that went into a particular aggregate, so that removing data from an entry also removes it from the aggregate. (As in, this seems more challenging than it needs to be.)
Yeah this is a good flag. I don't know if this repo is the right place to document it, but here's the gist of our ongoing discussion around data exclusions:
With the key question being: should the NA propagate? When we exclude a state's point, should we also exclude it from the US overall? What about when we exclude a whole state's time series? We've landed on "No, we should make judgement calls about each aggregated time series individually." But this discussion is still live, and we're trying to get to a better place from both the data review/cleaning side and the modeling side.
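To make the "should the NA propagate?" question concrete, here is a minimal sketch in Python. It is only an illustration and is not this package's R code: the state labels, the counts, and the helpers `aggregate` and `exclude_point` are all hypothetical. It shows the position described above, where the national series is aggregated from the raw data and excluding a state's point leaves the US series untouched unless someone makes a separate call to exclude it there too.

```python
# Hypothetical daily counts per state (made-up numbers for illustration only).
state_counts = {
    "AA": [10, 12, 500, 11],  # day index 2 looks like a reporting artifact
    "BB": [20, 21, 19, 22],
}


def aggregate(series_by_state):
    """Sum counts across states day by day, before any exclusions."""
    return [sum(day) for day in zip(*series_by_state.values())]


def exclude_point(series, index):
    """Return a copy of one series with a single point marked missing (None)."""
    out = list(series)
    out[index] = None
    return out


# Aggregate first, on the raw data.
us_counts = aggregate(state_counts)

# Exclude the suspicious point from the state series only.
state_counts_clean = dict(state_counts)
state_counts_clean["AA"] = exclude_point(state_counts["AA"], 2)

# The missing value does not propagate: us_counts still holds its day-2 total.
# Excluding that day from the US series is a separate, explicit judgement call.
us_counts_clean = exclude_point(us_counts, 2)
```

Because the aggregate is stored as its own series, nothing has to track which state entries fed into it. The cost is that each aggregated series needs its own review.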
Overall looks good!
Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: Katie Gostic (she/her) <[email protected]>
* Add simulated data from Gostic, 2020 for benchmarking

  This commit re-uses the data processing and documentation from CDCgov/cfa-epinow2-pipeline#17. That repo is open source and public domain by nature of being USG property. I think it's convenient to re-use, but @athowes and @kgostic, there's a little text in there that you both suggested, so I'd appreciate it if you could give your permission for the re-use here! I've made sure to credit you both in the commit as it's partially your writing.

  It's probably best practice to make the data prep and processing here fully reproducible in `data-raw/`, but it's quite a pain to do, with a mixture of shell scripting and R scripting needed. I skipped it out of convenience, but if you have a really strong feeling that it's required, @seabbs, let me know and we can revisit.

  Closes #9

  Co-authored-by: Adam Howes <[email protected]>
  Co-authored-by: Katie Gostic <[email protected]>

* Bump NEWS
* Coerce new `obs_incidence` col to integer from double
* Generate GI PMF in `data-raw/`
* Point to specific CFAEpiNow2Pipeline commit
* Remove copied & pasted unrelated text
* Re-render roxygen docs
* Add primarycensoreddist to remotes
* Typo
* Rename synthetic dataset and GI dist
* Pin primarycensoreddist to R-universe not github

---------

Co-authored-by: Adam Howes <[email protected]>
Co-authored-by: Katie Gostic <[email protected]>
Important
Do not merge. Using a stacked PR workflow.
Read the case data from a local parquet file in our standard ETL output schema. The various errors & warnings are linked to the tests via classes.
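As a rough illustration of linking errors to tests via classes, the sketch below uses Python exception subclasses in place of R's classed conditions. Everything in it is an assumption made for the example: the class names, the `check_case_data` helper, and the column names are hypothetical and are not taken from this package or its ETL schema. The point is that a test asserts on the class of the error rather than on its message text, so messages can be reworded without breaking tests.

```python
class CaseDataError(Exception):
    """Base class for all (hypothetical) case-data validation errors."""


class EmptyDataError(CaseDataError):
    """Raised when the input has no rows."""


class MissingColumnError(CaseDataError):
    """Raised when a required column is absent."""


# Hypothetical column names; the real ETL schema is not shown in this PR.
REQUIRED_COLUMNS = ("report_date", "geo_value", "confirm")


def check_case_data(rows):
    """Validate a list of row dicts and raise a classed error on failure."""
    if not rows:
        raise EmptyDataError("No rows found in case data")
    missing = [col for col in REQUIRED_COLUMNS if col not in rows[0]]
    if missing:
        raise MissingColumnError("Missing columns: " + ", ".join(missing))
    return rows


def error_class_name(rows):
    """Return the class name of the error raised for `rows`, or None if valid."""
    try:
        check_case_data(rows)
    except CaseDataError as err:
        return type(err).__name__
    return None
```

A test then checks `error_class_name([])` against `"EmptyDataError"` instead of matching on the message string.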
This approach preserves our existing handling of the data with respect to exclusions: it aggregates before any exclusions have been applied. Applying exclusions to the aggregated data will be handled in the second PR in this stack.
I also add in here the data from Gostic, 2020. I use it in the unit tests to practice reading in data from `tests/testthat/data/`, but I also add it as package data (with documentation). We use this dataset in our end-to-end testing, so its addition here is also meant with an eye to that future use.