Only restrict to first if many observations per subjectId #419

Merged
merged 1 commit into from
Oct 31, 2023

Conversation

egillax
Collaborator

@egillax egillax commented Oct 31, 2023

@jreps

When we create our population with firstExposureOnly set to `TRUE`, a very expensive group_by operation is performed:

population <- population %>%
    dplyr::arrange(.data$subjectId, .data$cohortStartDate) %>%
    dplyr::group_by(.data$subjectId) %>%
    dplyr::filter(dplyr::row_number(.data$subjectId) == 1)

The number of groups equals the number of subjects, so there can be millions of them.

When there is only one observation per subjectId this is a no-op and can be skipped.

In this PR I've added a condition:

(nrow(population) > dplyr::n_distinct(population$subjectId))

If the population has more rows than distinct subjects (that is, at least one subject has more than one observation), then and only then do we run this expensive operation.
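For illustration, here is a minimal sketch of how the guard and the grouped filter could fit together. The helper name `restrictToFirstExposure` and the toy data are hypothetical and not taken from the package; the sketch also uses `dplyr::row_number()` without an argument and adds `dplyr::ungroup()`, which are my choices rather than the merged code.

```r
library(dplyr)

# Hypothetical helper for illustration; not the package's actual function.
restrictToFirstExposure <- function(population) {
  # Only pay for the grouped filter when some subject has more than one row.
  if (nrow(population) > dplyr::n_distinct(population$subjectId)) {
    population <- population %>%
      dplyr::arrange(.data$subjectId, .data$cohortStartDate) %>%
      dplyr::group_by(.data$subjectId) %>%
      dplyr::filter(dplyr::row_number() == 1) %>%
      dplyr::ungroup()  # assumption: return an ungrouped result
  }
  population
}

# Toy data: subject 1 has two exposures, subject 2 has one.
population <- data.frame(
  subjectId = c(1, 1, 2),
  cohortStartDate = as.Date(c("2020-03-01", "2020-01-01", "2020-02-01"))
)

firstOnly <- restrictToFirstExposure(population)

# Every subject appears once, so the guard skips the grouped filter.
uniquePopulation <- data.frame(
  subjectId = c(1, 2, 3),
  cohortStartDate = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))
)
unchanged <- restrictToFirstExposure(uniquePopulation)
```

With the toy data, `firstOnly` keeps two rows (the 2020-01-01 row for subject 1 and the single row for subject 2), while `unchanged` is returned as passed in because the condition is false.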

This took the population generation time down from about 1.4 minutes to 0.8 seconds for a cohort of about 460k.
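To get a rough feel for the difference on your own machine, a sketch like the following times the unconditional grouped filter against the cheap distinct-count check. The synthetic data and the size `nSubjects` are assumptions chosen to keep the run short; they are not the cohort from this PR, and the timings will differ from the figures quoted above.

```r
library(dplyr)

set.seed(1)
nSubjects <- 20000  # hypothetical size, far smaller than the cohort in the PR

# Synthetic population with exactly one observation per subject.
population <- data.frame(
  subjectId = seq_len(nSubjects),
  cohortStartDate = as.Date("2020-01-01") +
    sample(0:364, nSubjects, replace = TRUE)
)

# Cost of the grouped filter when it is run unconditionally.
timeGroupedFilter <- system.time({
  grouped <- population %>%
    dplyr::arrange(.data$subjectId, .data$cohortStartDate) %>%
    dplyr::group_by(.data$subjectId) %>%
    dplyr::filter(dplyr::row_number() == 1)
})

# Cost of the check that decides whether the filter is needed at all.
timeCheck <- system.time({
  needsFilter <- nrow(population) > dplyr::n_distinct(population$subjectId)
})

print(timeGroupedFilter[["elapsed"]])
print(timeCheck[["elapsed"]])
```

Here `needsFilter` is `FALSE`, so the guarded version would do only the check and skip the grouped filter entirely.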

@egillax egillax requested a review from jreps October 31, 2023 09:50
@jreps jreps merged commit 3dbcd90 into develop Oct 31, 2023
8 checks passed
@egillax egillax deleted the faster_population branch November 30, 2023 14:42