Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cohort templates #134

Draft
wants to merge 6 commits into
base: develop
Choose a base branch
from
Draft

Cohort templates #134

wants to merge 6 commits into from

Conversation

azimov
Copy link
Collaborator

@azimov azimov commented Apr 8, 2024

See issue #133

TODO:
*Vignette on implementation (will wait for agreement on approach before writing)

@azimov azimov marked this pull request as draft April 8, 2024 18:28
@azimov
Copy link
Collaborator Author

azimov commented Apr 8, 2024

@mdlavallee92 I would appreciate your input here, from the perspective of Capr which could be used to generate templates for broad, highly sensitive phenotypes.

@mdlavallee92
Copy link

Thanks for tagging me to the review @azimov. In terms of a code review, I will admit there is not much I can contribute since this falls outside of the Capr realm and my grasp of R6 is limited. Most of my comments are philosophical wrt the purpose of the tool.

Conceptual Understanding
If I understand this correctly, essentially this would be like helpers to instantiate a set of cohorts based on a vocabulary (i.e. all rxNorm Ingredients or all ATC 4th class). So, the workflow would be like build Circe study pop plus all drug ingredient cohorts in the database based on rxNorm (for example). What is the motivation for templating out ingredient cohorts? Are you using them for characterization or modelling features or both? Also how does this differentiate from using either of the eras tables in the cdm? If characterizing, an approach would be to use concept sets instead of cohorts.

Efficiency
Using the vocabularies to build simple cohorts up front is a far more efficient than using Capr (since I am not building all these vocab based cohorts on 'n' CIRCE sql). I think it would be something I would use.

Other templates
My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

Non-CIRCE workflow
Final comment is more general. I am softening my stance towards using non-Circe cohorts, particularly in internal studies. While my preferred approach is to always be in CIRCE for its reproducibility attributes, there are many times I have to build non-trivial subsets on say measurements or a cdm domain not covered in circe (ie. episode or cost). Generally, my approach for an OHDSI study is that the study population (target) should be in Circe because it is something I can reproduce across different studies, iterate on changes with colleagues via ATLAS and I have an auditable file via json. While in an internal study in theory you can approach the CDM in any way you want, a more principled approach is to treat any study as if it were a network study. Having said this it becomes a real headache to keep track of all these non-CIRCE cohorts. The best I can come up with is do custom sql magic and then reroute this work into the nx4 cohort table structure to access this cohort (subset or ow) later on in my study. So, I appreciate ways to standardize some of these non-Circe routes of cohort building as you have done here.

@azimov
Copy link
Collaborator Author

azimov commented Apr 9, 2024

My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

I would say that this is already possible with subset operations included in this repository (saving, loading and using in strategus designs is already well supported too).

For your example question:

library(CohortGenerator)

cohortDefinitionSet <- ROhdsiWebApi::getMyCohorts(...) # could be the target cohorts in this PR

subsetDefinition <- createSubsetDefinition(
    name = "MALES, in 2010",
    definitionId = 1,
    subsetOperations = list(
       # Applied in order
       createLimitSubset(
           name = "Year 2010",
           calendarStartDate = "2010/1/1",
           calendarEndDate = "2010/12/31"
       ),
       createDemographicSubset(
          name = ">=18",
          ageMin = 18
        )
    )
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef, targetCohortIds = c(1,2,3,4)) # If not specified, it will apply to all

generateCohortDefinitionSet(...)

This will save you a lot of time and keep clutter out of atlas.

@azimov
Copy link
Collaborator Author

azimov commented Apr 9, 2024

Conceptual Understanding
If I understand this correctly, essentially this would be like helpers to instantiate a set of cohorts based on a vocabulary (i.e. all rxNorm Ingredients or all ATC 4th class). So, the workflow would be like build Circe study pop plus all drug ingredient cohorts in the database based on rxNorm (for example).

I definitely see the benefits of getting characteristics for concept sets, but the motivation here is really to have cohorts generated from pure SQL to be treated in the same manner as all other cohorts in packages. At the moment we are doing sql in a lot of places anyway. Personally I have used a lot of hacks to get this done and I want an approach that is cleaner and re-usable in any tools.

What is the motivation for templating out ingredient cohorts?

We are using these in our comparator selection tool and other places to do things at scale. The logic is a bit more than the drug eras table (for example, requiring prior observation).

Using the vocabularies to build simple cohorts up front is a far more efficient than using Capr (since I am not building all these vocab based cohorts on 'n' CIRCE sql). I think it would be something I would use.

This is good to know. In a dream world I would like to go from a templated cohort to a circe/capr compatable definition with a work flow being like:

1. Template a bunch of sensitive cohorts
2. Evaluate which is best, propose potential changes
3. Extract circe definition
4. Add extra features (e.g. relevant measurement, procedures or other codes unrelated to primary diagnosis)
5. Publish to atlas as phenotype

@mdlavallee92
Copy link

mdlavallee92 commented Apr 9, 2024

My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

I would say that this is already possible with subset operations included in this repository (saving, loading and using in strategus designs is already well supported too).

For your example question:

library(CohortGenerator)

cohortDefinitionSet <- ROhdsiWebApi::getMyCohorts(...) # could be the target cohorts in this PR

subsetDefinition <- createSubsetDefinition(
    name = "MALES, in 2010",
    definitionId = 1,
    subsetOperations = list(
       # Applied in order
       createLimitSubset(
           name = "Year 2010",
           calendarStartDate = "2010/1/1",
           calendarEndDate = "2010/12/31"
       ),
       createDemographicSubset(
          name = ">=18",
          ageMin = 18
        )
    )
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef, targetCohortIds = c(1,2,3,4)) # If not specified, it will apply to all

generateCohortDefinitionSet(...)

This will save you a lot of time and keep clutter out of atlas.

I meant the other way around. denominator is the Males in 2010 for the entire db and the numerator is the study population. Can CG build a set of year, gender, race cohorts with no index event as well? The term "cohort" is used liberally here, more like id gender race year counts...could be out of scope of CG.

But yes, the subsetting you outline here is highly useful and have done it within studies.

@azimov
Copy link
Collaborator Author

azimov commented Apr 9, 2024

I meant the other way around. denominator is the Males in 2010 for the entire db and the numerator is the study population. Can CG build a set of year, gender, race cohorts with no index event as well? The term "cohort" is used liberally here, more like id gender race year counts...could be out of scope of CG.

This could be done with templates, but it might make your DBA sweat as storing all demographics for males in a given year could be huge percentage of peeople, so I would say that it would be much better to just generate counts in this context (Patrick will have a lot to say to you about computing incidence rates in this sort of population though).

Generally something like this should live in the CohortIncidence package but there we're trying to be quite specific here - for meaningful measures you want a target population and an outcome as well as a time at risk.

# Conflicts:
#	R/CohortConstruction.R
Copy link

codecov bot commented Oct 24, 2024

Codecov Report

Attention: Patch coverage is 81.38686% with 51 lines in your changes missing coverage. Please review.

Project coverage is 95.45%. Comparing base (4ce1ae0) to head (73d3692).
Report is 59 commits behind head on develop.

Files with missing lines Patch % Lines
R/TemplateImplementations.R 66.15% 44 Missing ⚠️
R/CohortConstruction.R 89.47% 4 Missing ⚠️
R/TemplateCohorts.R 96.80% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #134      +/-   ##
===========================================
- Coverage    98.80%   95.45%   -3.36%     
===========================================
  Files           12       17       +5     
  Lines         1336     2154     +818     
===========================================
+ Hits          1320     2056     +736     
- Misses          16       98      +82     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL cohorts as first class citizens - SQL templates for non-standard cohorts or large, bulk operations
2 participants