HADES-wide dependency management and strategy #11
Comments
This conversation might be better framed as a tradeoff between "build yourself" vs "borrow and share", a decision that Stephen Wali outlined in the open source workshop earlier this year. A significant benefit of taking on dependencies is that we save ourselves the work of implementing and maintaining some functionality. I like the idea of trying to limit dependencies and analyzing our dependencies. If a package is only using one function from a large package it might make sense to implement that single function ourselves and remove the dependency. We also have the option to add dependencies as SUGGESTS packages for optional functionality that might only be needed by a subset of users (e.g. some PLP libraries). I'd like to suggest rOpenSci as a potential source for reliable packages we can depend on. I second that RStudio's tidyverse packages (as well as supporting packages like rlang) tend to be reliable, and RStudio seems to do a good job of communicating changes using lifecycles. This would be a great project for someone in Kheiron. This talk from the 2020 RStudio conference outlines their strategy around lifecycle management (starting at 13 minutes in). https://www.rstudio.com/resources/rstudioconf-2020/state-of-the-tidyverse/ Criteria suggestions for discussion: "Must" indicates a requirement. "Should" indicates a guideline.
|
Thanks @ablack3, points well taken. I don't think we currently depend on any rOpenSci packages? I would love for someone in Khieron to work on this. A first step would be an analysis of current dependencies, then a proposal for requirements and guidelines, followed by implementation of those. |
@schuemie and @ablack3 I am part of the 2022 Khieron (or is it Kheiron?) cohort and would like to see if I have the chops to do this. This can be broken down into two phases: (a) build a database of direct dependencies (b) scan for candidates which can be substituted based on frequency of usage across packages, availability of alternatives and effort involved. If I were to start with https://ohdsi.github.io/Hades/packageStatuses.html, go through each github repo listed here:
Your thoughts? |
Awesome @gkovaig! I like your idea of using the namespace file so we can track which functions are being imported. I found some ideas on how to quickly list all package dependencies here. Here is one idea for getting at the list of direct dependencies. (Update 7/20: add drat repo to capture non-CRAN Hades packages)
library(dplyr)
# install.packages("drat")
# install all of the Hades packages - can take a few tries.
# devtools::install_github("OHDSI/Hades")
drat::addRepo("OHDSI") # so that available.packages includes Hades packages not on CRAN
hadesPackages <- Hades::listHadesPackages()[,"name"]
# one row per (package, dependency type, dependency)
df <- available.packages() %>%
  as_tibble() %>%
  filter(Package %in% hadesPackages) %>%
  select(Package, Depends, Imports, Suggests) %>%
  tidyr::pivot_longer(cols = c(-Package), names_to = "type", values_to = "dependency") %>%
  mutate(dependency = stringr::str_split(dependency, ",")) %>%
  tidyr::unnest(c(dependency)) %>%
  mutate(dependency = stringr::str_trim(dependency))
df
#> # A tibble: 415 × 3
#> Package type dependency
#> <chr> <chr> <chr>
#> 1 Andromeda Depends dplyr
#> 2 Andromeda Imports RSQLite
#> 3 Andromeda Imports DBI
#> 4 Andromeda Imports zip
#> 5 Andromeda Imports methods
#> 6 Andromeda Imports dbplyr
#> 7 Andromeda Imports tidyselect
#> 8 Andromeda Imports cli
#> 9 Andromeda Imports rlang
#> 10 Andromeda Imports pillar
#> # … with 405 more rows
#> # ℹ Use `print(n = ...)` to see more rows
# list Hades packages that are not included
setdiff(hadesPackages, df$Package)
#> [1] "EnsemblePatientLevelPrediction"
readr::write_csv(df, "hadesDependencies.csv")
df %>%
  count(dependency, sort = TRUE) %>%
  print(n = 20)
#> # A tibble: 166 × 2
#> dependency n
#> <chr> <int>
#> 1 testthat 22
#> 2 knitr 19
#> 3 rmarkdown 19
#> 4 rlang 16
#> 5 dplyr 13
#> 6 methods 12
#> 7 rJava 9
#> 8 Eunomia 8
#> 9 readr 8
#> 10 withr 8
#> 11 checkmate 7
#> 12 ggplot2 7
#> 13 R (>= 3.5.0) 7
#> 14 RJSONIO 6
#> 15 utils 6
#> 16 cli 5
#> 17 jsonlite 5
#> 18 shiny 5
#> 19 stringr 5
#> 20 survival 5
#> # … with 146 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-07-20 by the reprex package (v2.0.1)
It would be nice to know which functions are being used from these and how many times they are used. renv::dependencies function will scan source code for dependencies so that might be helpful. |
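One possible way to count which functions are used from each dependency is to tokenize the source with base R's parser and tally every `pkg::fun` reference. This is only a minimal sketch: the helper name `count_namespaced_calls` is made up for illustration, it works on a character vector of code rather than a repository, and it only sees explicitly namespaced references (functions imported via NAMESPACE and called without `pkg::` are not counted).

```r
# Hypothetical helper: count explicit pkg::fun references in R source text.
count_namespaced_calls <- function(code) {
  exprs <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(exprs)
  # keep only terminal tokens, in source order
  tokens <- tokens[tokens$terminal, ]
  tokens <- tokens[order(tokens$line1, tokens$col1), ]
  pkg_idx <- which(tokens$token == "SYMBOL_PACKAGE")
  if (length(pkg_idx) == 0) {
    return(integer(0))
  }
  # a package symbol is followed by the operator (:: or :::) and then the object name
  refs <- paste0(tokens$text[pkg_idx], tokens$text[pkg_idx + 1], tokens$text[pkg_idx + 2])
  counts <- table(refs)
  counts <- setNames(as.integer(counts), names(counts))
  sort(counts, decreasing = TRUE)
}

# Made-up example code; the packages do not need to be installed because the
# code is only parsed, never evaluated.
example_code <- c(
  "x <- jsonlite::toJSON(list(a = 1))",
  "y <- jsonlite::fromJSON(x)",
  "z <- jsonlite::toJSON(y)",
  "s <- stringr::str_trim(' a ')"
)
usage <- count_namespaced_calls(example_code)
print(usage)
```

To apply this to a package, one could read each file under its `R/` directory with `readLines()` and pass the lines to the helper.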
Law of dependent (software) origination: "All programs arise in dependence upon other programs" 😆 |
Thank you @ablack3 - that link to SO was super helpful! You already have a first-cut solution here! Building on your idea, a core issue (that Perhaps I have looked at samples which are the exception, rather than the rule. @schuemie I recall seeing some posts where you have passionately advocated for using |
renv is used to reproducibly run R analysis scripts/projects and is not intended to be used as a way to capture R package dependencies.
For package dependencies we use the DESCRIPTION file, which allows you to specify all dependencies (and their exact or minimum versions if you need to).
We need renv for ohdsi-studies because we need to specify the exact environment that the study ran under so it can be reproduced. The main difference between R packages and R projects is that packages are imported into other packages or projects so they need to have their own namespace and need to be maintained over time. R projects are what you might think of as a data analysis. They are not maintained and are not imported into other projects. They do not need a namespace. However the line can be blurred a bit and we can adopt aspects of R packages into our studies/projects if we want. For example we can add a tests directory to a project and use testthat to add unit tests to a project. If you want to make your project into a full-blown R package that is fine too, but usually a data analysis does not need as much structure as a package because it only needs to run once. In OHDSI, since we are running a study across multiple data nodes, we have tended to use R packages as projects (which is why they have both a DESCRIPTION file and an renv.lock file). But these packages often just export a single R packages (i.e. tools maintained by developers) contain functionality that is used/imported into a study or data analysis. Users care that the functions in your package work today (the day that they are using the function). That means as an R package author your job is to make sure your code works today with the current versions of all of its dependencies. This means you don't need to worry too much about updates to other packages that don't break your code. It really does not matter if one of your dependencies is updated as long as the interface (function input/output) you are using does not change. It also means that if a lot of packages depend on your code then you should be very careful about changing the interface, and I think CRAN expects some effort to that effect.
This is perhaps a different perspective than one taken in other languages like Java where it is common to specify exact versions for all dependencies. A good analogy in the OHDSI world is the difference between a "concept set expression" and a "list of concept ids". A "concept set expression" resolves to a list of concepts based on the current vocabulary version. A list of concept IDs is immutable and will always represent the exact same concept set. The DESCRIPTION file lists direct dependencies that resolve to a complete (recursive) dependency list based on the current versions of all dependencies. The renv.lock file captures the complete (recursive) dependency list so that the whole R code stack stays exactly the same which is needed for reproducibility. In the DESCRIPTION file you can specify an exact package version for your dependency but people usually don't do this because it is an unnecessary restriction. More often if a dependency introduced a breaking change we would just say the version has to be >= (greater than or equal to) the one that introduced the change. The key thing I think we should remember is that we are depending on interfaces (function inputs/outputs) and only need to worry about breaking changes to those interfaces. Stability in our software is about stable interfaces. Sorry for jumping on the soapbox a bit. I have a lot to say on this topic apparently 😃 |
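To make the "version has to be >= X" idea concrete, here is a small sketch that reads an `Imports` field from a DESCRIPTION-style file and checks a candidate version against the stated minimum. The package name `myStudyTools`, the version numbers, and the helpers `parse_dependency` and `satisfies` are all invented for illustration; R's own installation tooling does this check for you, so this is only meant to show how such a constraint resolves.

```r
# Write a toy DESCRIPTION file (contents are made up for illustration).
desc_path <- tempfile()
writeLines(c(
  "Package: myStudyTools",
  "Version: 0.1.0",
  "Imports: dplyr (>= 1.0.0), jsonlite"
), desc_path)

imports <- read.dcf(desc_path, fields = "Imports")[1, 1]
specs <- trimws(strsplit(imports, ",")[[1]])

# Hypothetical helper: split "pkg (>= 1.2.3)" into name, operator and version.
parse_dependency <- function(spec) {
  spec <- trimws(spec)
  name <- sub("\\s*\\(.*\\)$", "", spec)
  if (!grepl("\\(", spec)) {
    return(list(name = name, op = NA_character_, version = NA_character_))
  }
  constraint <- sub("^.*\\((.*)\\)$", "\\1", spec)
  list(
    name = name,
    op = sub("\\s*[0-9].*$", "", constraint),
    version = sub("^[<>=]+\\s*", "", constraint)
  )
}

# Hypothetical helper: does an installed version satisfy the constraint?
satisfies <- function(installed, op, required) {
  if (is.na(op)) {
    return(TRUE) # no constraint: any version is acceptable
  }
  a <- package_version(installed)
  b <- package_version(required)
  switch(op,
    ">=" = a >= b,
    ">" = a > b,
    "==" = a == b,
    "<=" = a <= b,
    "<" = a < b,
    stop("unsupported operator: ", op)
  )
}

parsed <- lapply(specs, parse_dependency)
dplyr_spec <- parsed[[1]]
# A newer release still satisfies the minimum; an older one does not.
satisfies("1.1.4", dplyr_spec$op, dplyr_spec$version)
satisfies("0.8.5", dplyr_spec$op, dplyr_spec$version)
```

An renv.lock file, by contrast, would record one exact version for every package in the recursive dependency list.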
@gkovaig Here are some resources that might be helpful. RUniverse: technical, social, and scientific indicators of the health of open source R packages. CRAN downloads app. As an example, Hades is using both jsonlite and RJSONIO but I suspect we might be able to pick one. We can see that jsonlite is much more heavily downloaded than RJSONIO. Comparing the number of reverse dependencies and recursive reverse dependencies between RJSONIO and jsonlite shows us that many more packages rely on jsonlite than RJSONIO.
library(tools)
summary(package_dependencies(c("RJSONIO", "jsonlite"), reverse = TRUE))
#> Length Class Mode
#> RJSONIO 38 -none- character
#> jsonlite 1024 -none- character
summary(package_dependencies(c("RJSONIO", "jsonlite"), reverse = TRUE, recursive = TRUE))
#> Length Class Mode
#> RJSONIO 63 -none- character
#> jsonlite 4803 -none- character
Created on 2022-07-20 by the reprex package (v2.0.1)
Looking at the source code of ROhdsiWebApi it appears that RJSONIO is used only to translate R objects to and from json which I think jsonlite provides with |
Great work @gkovaig ! I seem to remember there are subtle differences between RJSONIO and jsonlite which means we sometimes need one and sometimes the other. But I don't remember the details, and I think it is good to revisit the discussion, and if possible choose one. Will you be joining the HADES meeting tomorrow? It would be good to let people know you're working on this, and maybe brainstorm some ideas. |
Thank you @schuemie - appears the meeting is at 9am PST. Can join for the first 25 minutes. |
Making a note to also explore the pkgndep package as mentioned at our last HADES meeting. |
Thank you @schuemie - chatted with Edward Burns right after our meeting and obtained this starter script.
Based on the discussion at this meeting, task #1 is to obtain a comprehensive list of dependencies, with recursion. Optionally include reverse dependency counts. Potential candidates are: Scope is all packages in OHDSI/Hades. Task # 2 could be zeroing in on these potential candidates for consolidation. Obtain detailed usage in HADES and 'popularity' (via reverse dependencies) in other CRAN packages (based on pointers from @ablack3 ) p.s. should we move these notes and discussion into Teams for easier editing, commenting, etc.? |
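Task #1 (recursive dependencies plus reverse dependency counts) could be sketched with `tools::package_dependencies()`, which accepts a package database via its `db` argument. To keep the sketch runnable offline, the database below is a tiny made-up matrix (`HadesPkgA` and `HadesPkgB` are placeholders, and the listed imports are not the real ones); in practice `db` would come from `available.packages()` with the OHDSI drat repository added.

```r
library(tools)

# Toy package database with the same column layout as available.packages().
# All entries are invented for illustration.
db <- rbind(
  c(Package = "HadesPkgA", Depends = NA, Imports = "dplyr, jsonlite", LinkingTo = NA),
  c(Package = "HadesPkgB", Depends = NA, Imports = "jsonlite", LinkingTo = NA),
  c(Package = "dplyr", Depends = NA, Imports = "rlang, tibble", LinkingTo = NA),
  c(Package = "tibble", Depends = NA, Imports = "rlang", LinkingTo = NA),
  c(Package = "jsonlite", Depends = NA, Imports = NA, LinkingTo = NA),
  c(Package = "rlang", Depends = NA, Imports = NA, LinkingTo = NA)
)

hades <- c("HadesPkgA", "HadesPkgB")

# Direct dependencies only
direct <- package_dependencies(hades, db = db, which = "Imports")

# Complete (recursive) dependency list
all_deps <- package_dependencies(hades, db = db, which = "Imports", recursive = TRUE)

# Reverse dependencies: who relies on jsonlite?
rev_deps <- package_dependencies("jsonlite", db = db, which = "Imports", reverse = TRUE)

lengths(direct)
lengths(all_deps)
rev_deps
```

Comparing `lengths(direct)` with `lengths(all_deps)` gives a rough per-package indication of how much of the dependency footprint is inherited rather than declared directly.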
My vote would be to keep the progress updates and topic-specific discussion in this thread and use Teams for 1:1 communication or for discussion with the broader OSC group or Hades group. With regard to dependencies, we should also include Java dependencies since several Hades packages depend on Java packages. In at least one case, a Hades package depends on another Hades package through Java but not R. I think having a careful strategy about the interaction between Java and R components, and how packages in one language depend on packages in the other, would be quite helpful. For example: The DatabaseConnector and SqlRender repositories contain both R and Java packages combined together.
The SqlRender java package is imported into WebAPI and pulled in from repo.ohdsi.org I think. I think credentials are needed to see the packages in the OHDSI nexus repo, right @leeevans? https://repo.ohdsi.org/nexus/ However WebAPI is not part of Hades so maybe out of scope for this discussion. CirceR is part of Hades and contains compiled java code that is from the circe Java package that is in a separate repo called circe-be. The circe java package also depends on SqlRender so I think it is fair to say that CirceR depends on SqlRender even though that dependency is not listed in the R package DESCRIPTION file. Are all the jar files in CirceR built from the source code in circe-be? I'm not sure. Each of the ohdsi java packages has its own java package dependencies. |
Although I agree mapping Java dependencies is a nice-to-have, I don't want the scope to creep too much. My original reason for starting this thread was to get a handle on our external dependencies to
With regards to the second point, I think the biggest opportunities may not come from consolidating multiple packages that do the same thing, but rather by identifying dependencies that can very easily be removed. An example that comes to mind is the fact that DatabaseConnector used to depend on the uuid package. The only reason for that was one point in the code where random strings need to be generated. Generating random strings can easily be done with base R functions, so it was easy to drop this dependency that no other HADES package had. In this case it didn't have a big impact, since uuid doesn't depend on anything else, but I'm sure we can identify dependencies where removing them will greatly reduce the number of unique dependencies of HADES. |
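As an illustration of how small such a replacement can be, here is one way to generate a random string with base R only. This is a sketch, not the code that DatabaseConnector actually uses; the function name `random_string` and the default length are assumptions made for the example.

```r
# Hypothetical helper: random lowercase alphanumeric string using base R only.
random_string <- function(n = 20) {
  alphabet <- c(letters, 0:9)
  paste(sample(alphabet, n, replace = TRUE), collapse = "")
}

temp_name <- random_string(12)
print(temp_name)
```

Note that this does not give the uniqueness guarantees of a real UUID, so it only fits cases where a random-looking name is all that is needed.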
@schuemie @ablack3 I have forked OHDSI/Hades and created a Rmd notebook in the extras folder https://github.com/gkovaig/Hades/blob/main/extras/OHDSI_Hades_R_Package_Dependencies.Rmd. Next step: find a way to parse DESCRIPTION of non-CRAN packages in Hades and append them to a dataframe of metadata from CRAN packages, then run through tools::package_dependencies() . Appreciate if you can eye-ball this to see if I'm headed in the right direction and captured your intent correctly. |
Hi @gkovaig, I think you're correct that the first step is to create a list of all Hades R package dependencies. I think the code example above that uses the OHDSI drat repository might capture all of them but you should double check. The second step is to review the complete list of dependencies and identify pairs of packages that provide overlapping functionality (for example jsonlite and RJSONIO). It would be helpful to have some metrics on how heavily packages are used. For each Hades dependency we would like to know
Read the pkgndep vignettes and explore the heaviness of Hades R package dependencies. These are some ideas. |
@ablack3 Thank you for the detailed notes and hints. I went ahead and extended the code to use Updates are in https://github.com/gkovaig/Hades/blob/main/extras/OHDSI_Hades_R_Package_Dependencies.Rmd. There are also two more artifacts in the same 'extras' folder, for the 'CSV' and the 'HTML' produced by Some observations:
Specifically, I tried to use Hope this is a good starting point. Looking forward to your comments. |
This is great! Could you share some more information on all HADES packages? (like heaviness per HADES package) I think |
Awesome! I've uploaded the all-HADES page here for now, for easy viewing. The two PLP packages bring in a lot of dependencies unique to those two, which doesn't surprise me. What does surprise me is the heaviness of EvidenceSynthesis. If you look at its list of imports, it's not obvious where this comes from. Is there a way to analyze what causes this explosion of dependencies? |
Thank you @schuemie . I added a viz https://github.com/gkovaig/Hades/blob/5f0045af687d3ab5d2c8a5b27dc57d7e3b7b0010/extras/OHDSI_Hades_R_Package_Dependencies.html for EvidenceSynthesis based on pkgdepR. This seems to add the function-level dependencies that @ablack3 requested earlier. The viz itself is interactive, so not sure if it renders in a useful way in the knitted HTML; in which case, you can re-run the last R code block in the Rmd file https://github.com/gkovaig/Hades/blob/5f0045af687d3ab5d2c8a5b27dc57d7e3b7b0010/extras/OHDSI_Hades_R_Package_Dependencies.Rmd. |
Thanks! Although I'm not sure how this visualization helps me understand where EvidenceSynthesis' heaviness comes from. Some manual digging suggests most of these unique dependencies come through the |
@schuemie You are right - the pkgdepR viz only brings up inter-dependencies between functions within EvidenceSynthesis, and not to specific calls to external packages like Came across this. So, with a stub like this: one could scan all R programs in a particular repo and collect all references to the external package (e.g. As a bonus, this could also help identify potential namespace conflicts. Downside: the repo in question must be cloned in the local filesystem, all referenced external packages must be installed and loaded into memory before this can be run. Would it be useful to dig further into this approach? |
@gkovaig This package might be helpful for enumerating function use https://brshallo.github.io/funspotr/ |
Dependencies are bad. They take a long time to install (e.g. by renv), and tend to be the cause when things break. We therefore would like to minimize our number of dependencies.
Some HADES packages have a lot of dependencies, and one reason is that some direct dependencies have a large list of dependencies themselves. If possible, different HADES packages should have overlapping dependencies, to minimize the overall dependencies of HADES. Therefore, one effort could focus on analyzing the list of dependencies per HADES package, and the overlap between dependencies.
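The overlap analysis described above could start from a long table of (package, dependency) pairs and count how many HADES packages use each dependency. The data below is invented for illustration (`PkgA` to `PkgC` are placeholders); a real run would use a table such as the `hadesDependencies.csv` file produced elsewhere in this thread. Dependencies used by only one package are natural candidates to examine for removal.

```r
# Toy long-format dependency table; package names and edges are made up.
deps <- data.frame(
  Package = c("PkgA", "PkgA", "PkgB", "PkgB", "PkgC"),
  dependency = c("dplyr", "uuid", "dplyr", "jsonlite", "dplyr"),
  stringsAsFactors = FALSE
)

# Number of distinct packages that use each dependency
n_users <- vapply(
  split(deps$Package, deps$dependency),
  function(p) length(unique(p)),
  integer(1)
)

# Shared dependencies add little to the overall footprint ...
shared <- names(n_users)[n_users > 1]
# ... while single-use dependencies are candidates for a closer look
single_use <- names(n_users)[n_users == 1]

sort(n_users, decreasing = TRUE)
deps[deps$dependency %in% single_use, ]
```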
If possible, we'd like to restrict our set of dependencies to those that we believe are reliable. Criteria could be defined for what are considered 'reliable' dependencies, based for example on age and development state of the package, and an assessment of the community behind the package. For example, most of tidyVerse is probably reliable. Once criteria have been established, we could form a list of 'approved' packages we can depend on.