Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add initial 'prioritize_dt' helper function #32

Merged
merged 3 commits into from
Apr 2, 2021

Conversation

chacalle
Copy link
Contributor

Describe changes

Adds initial prioritization/deduplication function that we can use internally to prioritize different reports of values from the same data collection origin.

This is based off of the population data prioritization code. Are there other scenarios that wouldn't be handled by this function? See function example and tests for example scenarios.

What issues are related

Fixes #1

Checklist

Packages Repositories

  • Have you read the contributing guidelines for ihmeuw-demographics R packages?
  • Have you successfully run devtools::check() locally?
  • Have you updated or added function (and vignette if applicable) documentation? Did you update the 'man' and 'NAMESPACE' files with devtools::document()?
  • Have you added in tests for the changes included in the PR?
  • Do the changes follow the ihmeuw-demographics code style?
  • Do the changes need to be immediately included in a new build of docker-base or docker-internal? If so follow directions in those repositories to rebuild and redeploy the images.
  • Do the changes require updates to other repositories which use this package? If yes, make the necessary updates in those repos, and consider integration tests for those repositories.

@chacalle chacalle added the enhancement New feature or request label Mar 29, 2021
@chacalle chacalle self-assigned this Mar 29, 2021
krpaulson
krpaulson previously approved these changes Mar 29, 2021
R/prioritize_dt.R Outdated Show resolved Hide resolved
R/prioritize_dt.R Show resolved Hide resolved
R/prioritize_dt.R Show resolved Hide resolved
R/prioritize_dt.R Show resolved Hide resolved
quiet = TRUE
)
testthat::expect_identical(output_dt, expected_dt)
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Katie's suggestions above, would it be helpful to add a test where there are two sources with the same method for the same year to make sure it errors out how we would expect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I think what you're describing is what I added on line 40. Including the details below and let me know if you think this doesn't test that scenario.

where the test input has two separate reports with two methods each but the rank_order only includes prioritization of method which leads to duplicate prioritization of

> input_dt[year == 2000 & age_start == 0]
   location year age_start report method value
1:      USA 2000         0   2015      A     1
2:      USA 2000         0   2015      B     1
3:      USA 2000         0   2020      A     1
4:      USA 2000         0   2020      B     1
> rank_order["method"]
$method
[1] "B" "A"

The error looks like

> prioritize_dt(
+       dt = input_dt,
+       rank_by_cols = c("location", "year"),
+       unique_id_cols = c("location", "year", "age_start"),
+       rank_order = rank_order["method"]
+     )
Error in prioritize_dt(dt = input_dt, rank_by_cols = c("location", "year"),  : 
  Specified `rank_by_cols`, `rank_order` & returned `priority` do not uniquely identify each row of `dt`.
	- use `warn_non_unique_priority=TRUE` to return `dt` and run demUtils::identify_non_unique_dt
	 with `id_cols = c('location', 'year', 'age_start', 'priority')`
    location year age_start priority
 1:      USA 2000         0        1
 2:      USA 2000         0        2
 3:      USA 2000         5        1
 4:      USA 2000         5        2

If we do the same call with warn_non_unique_priority = TRUE, the output looks below where the same priority is assigned to the different reports of the same method.

     location year age_start report method value priority
  1:      USA 2000         0   2015      A     1        2
  2:      USA 2000         0   2015      B     1        1
  3:      USA 2000         0   2020      A     1        2
  4:      USA 2000         0   2020      B     1        1

@chacalle chacalle merged commit 86b7903 into master Apr 2, 2021
@chacalle chacalle deleted the feature/prioritize_dt branch April 2, 2021 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplication function
3 participants