only restrict to first if many observations per subjectId #419

egillax · 2023-10-31T09:50:26Z

When we create our population with firstExposureOnly set to `TRUE`, a very expensive group_by operation is performed:

population <- population %>%
  dplyr::arrange(.data$subjectId, .data$cohortStartDate) %>%
  dplyr::group_by(.data$subjectId) %>%
  dplyr::filter(dplyr::row_number(.data$subjectId) == 1)

The number of groups equals the number of subjects, so there can be millions of them.

When there is only one observation per subjectId this is a no-op and can be skipped.

In this PR I've added a condition:

(nrow(population) > dplyr::n_distinct(population$subjectId))

If population has more rows than distinct subjects (i.e. at least one subject has more than one observation), then and only then do we run this expensive operation.
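For illustration, here is a minimal sketch of how the guard and the grouped filter could fit together. The helper name `restrictToFirstExposure` and the toy data are hypothetical and not taken from the package. The sketch also uses `dplyr::row_number()` without an argument and adds a trailing `dplyr::ungroup()`, both of which are assumptions rather than what the PR necessarily does.

```r
library(dplyr)

# Hypothetical helper, for illustration only (not the package's actual function).
restrictToFirstExposure <- function(population) {
  # Only pay for the grouped filter when at least one subject has several rows.
  if (nrow(population) > dplyr::n_distinct(population$subjectId)) {
    population <- population %>%
      dplyr::arrange(.data$subjectId, .data$cohortStartDate) %>%
      dplyr::group_by(.data$subjectId) %>%
      dplyr::filter(dplyr::row_number() == 1) %>%
      dplyr::ungroup()  # assumption: return an ungrouped result
  }
  population
}

# Toy data: subject 1 has two exposures, subject 2 has one.
population <- data.frame(
  subjectId = c(1, 1, 2),
  cohortStartDate = as.Date(c("2020-03-01", "2020-01-01", "2021-05-01"))
)
result <- restrictToFirstExposure(population)
```

When every subject already has exactly one row, the condition is false and the input is returned untouched, so the only cost is one `n_distinct` pass over `subjectId`.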

This took the population generation time down from about 1.4 minutes to 0.8 seconds for a cohort of about 460k.

only restict to first if many observations per subjectId

81b9d02

egillax requested a review from jreps October 31, 2023 09:50

jreps merged commit 3dbcd90 into develop Oct 31, 2023
8 checks passed

egillax deleted the faster_population branch November 30, 2023 14:42
