add initial 'prioritize_dt' helper function #32
    quiet = TRUE
  )
  testthat::expect_identical(output_dt, expected_dt)
})
Similar to Katie's suggestions above, would it be helpful to add a test where there are two sources with the same method for the same year to make sure it errors out how we would expect?
Agreed, I think what you're describing is what I added on line 40. I'm including the details below; let me know if you think this doesn't test that scenario.
where the test input has two separate reports with two methods each, but the rank_order
only includes prioritization of method, which leads to duplicate prioritization of
> input_dt[year == 2000 & age_start == 0]
   location year age_start report method value
1:      USA 2000         0   2015      A     1
2:      USA 2000         0   2015      B     1
3:      USA 2000         0   2020      A     1
4:      USA 2000         0   2020      B     1
> rank_order["method"]
$method
[1] "B" "A"
The error looks like
> prioritize_dt(
+ dt = input_dt,
+ rank_by_cols = c("location", "year"),
+ unique_id_cols = c("location", "year", "age_start"),
+ rank_order = rank_order["method"]
+ )
Error in prioritize_dt(dt = input_dt, rank_by_cols = c("location", "year"), :
Specified `rank_by_cols`, `rank_order` & returned `priority` do not uniquely identify each row of `dt`.
- use `warn_non_unique_priority=TRUE` to return `dt` and run demUtils::identify_non_unique_dt
with `id_cols = c('location', 'year', 'age_start', 'priority')`
   location year age_start priority
1:      USA 2000         0        1
2:      USA 2000         0        2
3:      USA 2000         5        1
4:      USA 2000         5        2
If we make the same call with warn_non_unique_priority = TRUE, the output looks like the following, where the same priority is assigned to the different reports of the same method.
   location year age_start report method value priority
1:      USA 2000         0   2015      A     1        2
2:      USA 2000         0   2015      B     1        1
3:      USA 2000         0   2020      A     1        2
4:      USA 2000         0   2020      B     1        1
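To make the duplication concrete, here is a minimal base R sketch of the same situation. It is not the prioritize_dt implementation: the data.frame is a hand-built copy of the four rows above, the "priority is the position of the method in rank_order" rule is an assumption that happens to reproduce the priorities shown, and the idea of adding report to the rank order (newest first) is only one possible way to break the ties.

```r
# Hypothetical reconstruction of the example rows above, as a base R data.frame.
input_df <- data.frame(
  location = "USA", year = 2000, age_start = 0,
  report = c(2015, 2015, 2020, 2020),
  method = c("A", "B", "A", "B"),
  value = 1
)

# Rank order on method only: "B" is preferred over "A".
rank_order <- list(method = c("B", "A"))

# Assumed rule: priority is the position of the row's method in the rank order.
input_df$priority <- match(input_df$method, rank_order$method)

# Flag rows that are not uniquely identified by the id columns plus priority.
id_cols <- c("location", "year", "age_start", "priority")
is_dup <- duplicated(input_df[, id_cols]) |
  duplicated(input_df[, id_cols], fromLast = TRUE)

# Assumed fix: also rank by report (newest first) so ties on method are broken.
rank_order$report <- c(2020, 2015)
ord <- order(
  match(input_df$method, rank_order$method),
  match(input_df$report, rank_order$report)
)
input_df$priority_with_report <- NA_integer_
input_df$priority_with_report[ord] <- seq_along(ord)
```

With method alone, every row is flagged in is_dup because both reports of a method share a priority. Once report is part of the ordering, each of the four rows gets its own priority.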
Describe changes
Adds initial prioritization/deduplication function that we can use internally to prioritize different reports of values from the same data collection origin.
This is based off of the population data prioritization code. Are there other scenarios that wouldn't be handled by this function? See function example and tests for example scenarios.
What issues are related
Fixes #1
Checklist
Packages Repositories
ihmeuw-demographics R packages?
devtools::check() locally?
devtools::document()?
ihmeuw-demographics code style?
docker-base or docker-internal? If so follow directions in those repositories to rebuild and redeploy the images.