Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds learners table and overloads lrn #142

Closed
wants to merge 4 commits into from
Closed

Conversation

RaphaelS1
Copy link

This is related to #81 and #82. As discussed with @pat-s I am opening this PR here to discuss my current implementation for this functionality, though it need not necessarily live in this repo (and is likely far from an end version).

In this branch are:

  • Functionality for generating a table of all mlr3 learners. Ideally this would be automated in builds so that all required packages are installed and loaded with the latest versions. Note: I'm not sure if this is deliberate or not but mlr3learners.C50 does not currently work because it gets pulled as mlr3learners.C.50.
  • User-facing function for loading this table, with functionality for filtering along rows and columns. Also gives option to print as tibble or data.table
  • Overloads lrn function to first search the learner table to detect if the required mlr3 package is installed/attached. If this isn't the case then usethis is utilised to ask user if they want package installed. If so then learner is loaded, if not then errors. Probably makes more sense to move this functionality to mlr3misc::dictionary_sugar_get but this would also required learner table creation moved here, or there is a dependency loop. A vectorised version for lrns/mlr3misc::dictionary_sugar_mget follows naturally but is not included here.

Code is currently commented (not roxygen) and is expected to break builds on this branch (also probably not lintr friendly yet)

@lintr-bot

This comment has been minimized.

@mllg
Copy link
Member

mllg commented Jul 22, 2020

I'm not a fan of lrn() requiring internet access (available.packages()). There must be a better, more robust solution. We can discuss this during the sprint.

@RaphaelS1
Copy link
Author

I'm not a fan of lrn() requiring internet access (available.packages()). There must be a better, more robust solution. We can discuss this during the sprint.

lrn wouldn't use that directly, the idea would be for the table to be created automatically using ghactions (or similar), then lrn would only require internet access for install.packages() but this could simply error when internet access isn't available.

I agree we should discuss this properly during the sprint (and a good solution will help with the discussion about where to store learners/measures/pipelines0

@pat-s
Copy link
Member

pat-s commented Jul 22, 2020

I also do not understand why we should overload lrn() instead of having a dedicated function for this.

Could you add a reprex to the initial post showing the output?

@RaphaelS1
Copy link
Author

I also do not understand why we should overload lrn() instead of having a dedicated function for this.

Because then a user doesn't need to do anything more than the usually would to load a function, why make it harder for them? Also if the learner is already attached then there is no extra overhead.

Could you add a reprex to the initial post showing the output?

library(mlr3)
library(mlr3learners)
library(mlr3proba)
library(data.table)

extra_learners = rownames(
  available.packages(repos = "https://mlr3learners.github.io/mlr3learners.drat")
)
lapply(extra_learners, require, character.only = TRUE, quietly = TRUE)
keys = mlr_learners$keys()
all_lrns = suppressWarnings(mlr3::lrns(keys))
learner_table = data.table(t(rbindlist(list(mlr3misc::map(all_lrns, function(.x) {
  idsplt = strsplit(.x$id, ".", TRUE)[[1]]
  list(idsplt[[2]], idsplt[[1]], .x$id, strsplit(.x$man, "::", TRUE)[[1]][1],
       .x$packages[1], .x$properties, .x$feature_types, .x$predict_types)
})))))

colnames(learner_table) = c("name", "class", "id", "mlr3_package", "required_package",
                            "properties", "feature_types", "predict_types")
learner_table[, 1:4] = lapply(learner_table[, 1:4], as.character)
rm(all_lrns, extra_learners, keys)
list_mlr3learners = function(hide_cols = NULL, filter = NULL, tibble = FALSE) {
  
  dt = copy(learner_table)
  
  class = mlr3_package = required_package = NULL # hacky fix to prevent NOTE for global binding
  
  if (!is.null(filter)) {
    if (!is.null(filter$class)) {
      dt = subset(dt, class %in% filter$class)
    }
    if (!is.null(filter$mlr3_package)) {
      dt = subset(dt, mlr3_package %in% filter$mlr3_package)
    }
    if (!is.null(filter$required_package)) {
      dt = subset(dt, required_package %in% filter$required_package)
    }
    if (!is.null(filter$properties)) {
      dt = subset(dt, mlr3misc::map_lgl(dt$properties,
                                        function(.x) any(filter$properties %in% .x)))
    }
    if (!is.null(filter$feature_types)) {
      dt = subset(dt, mlr3misc::map_lgl(dt$feature_types,
                                        function(.x) any(filter$feature_types %in% .x)))
    }
    if (!is.null(filter$predict_types)) {
      dt = subset(dt, mlr3misc::map_lgl(dt$predict_types,
                                        function(.x) any(filter$predict_types %in% .x)))
    }
  }
  
  if (!is.null(hide_cols)) {
    dt = subset(dt, select = !(colnames(dt) %in% hide_cols))
  }
  
  if (tibble) {
    return(tibble::tibble(dt))
  } else {
    return(dt)
  }
}
lrn = function(.key, ...) {
  id = NULL
  pkg = unlist(subset(list_mlr3learners(), id == .key)$mlr3_package)
  inst = suppressWarnings(require(pkg, quietly = FALSE, character.only = TRUE))
  if (!inst) {
    ans = usethis::ui_yeah(
      sprintf("%s is not installed but is required, do you want to install this now?", pkg),
      n_no = 1
    )
    if (ans) {
      install.packages(pkg, repos = "https://mlr3learners.github.io/mlr3learners.drat")
    } else {
      stop(sprintf("%s is not installed but is required.", pkg))
    }
  }
  mlr3misc::dictionary_sugar_get(mlr_learners, .key, ...)
}

list_mlr3learners(hide_cols = c("properties", "feature_types"),
                                       filter = list(class = "surv", predict_types = "distr"))
#>           name class              id                 mlr3_package
#>  1:    akritas  surv    surv.akritas           mlr3learners.proba
#>  2:      coxph  surv      surv.coxph                    mlr3proba
#>  3:   flexible  surv   surv.flexible        mlr3learners.flexsurv
#>  4:     kaplan  surv     surv.kaplan                    mlr3proba
#>  5:     nelson  surv     surv.nelson        mlr3learners.survival
#>  6: obliqueRSF  surv surv.obliqueRSF      mlr3learners.obliquersf
#>  7: parametric  surv surv.parametric        mlr3learners.survival
#>  8:  penalized  surv  surv.penalized       mlr3learners.penalized
#>  9:      rfsrc  surv      surv.rfsrc mlr3learners.randomforestsrc
#> 10:      rpart  surv      surv.rpart                    mlr3proba
#>       required_package  predict_types
#>  1: mlr3learners.proba    crank,distr
#>  2:           survival distr,crank,lp
#>  3:           flexsurv distr,crank,lp
#>  4:           survival    crank,distr
#>  5:           survival    crank,distr
#>  6:         obliqueRSF    crank,distr
#>  7:           survival distr,lp,crank
#>  8:          penalized    distr,crank
#>  9:    randomForestSRC    crank,distr
#> 10:              rpart    crank,distr
list_mlr3learners(tibble = TRUE)
#> # A tibble: 91 x 8
#>    name  class id    mlr3_package required_package properties feature_types
#>    <chr> <chr> <chr> <chr>        <list>           <list>     <list>       
#>  1 AdaB… clas… clas… mlr3learner… <chr [1]>        <chr [2]>  <chr [3]>    
#>  2 C5    clas… clas… mlr3learner… <chr [1]>        <chr [4]>  <chr [3]>    
#>  3 cfor… clas… clas… mlr3learner… <chr [1]>        <chr [4]>  <chr [4]>    
#>  4 ctree clas… clas… mlr3learner… <chr [1]>        <chr [3]>  <chr [4]>    
#>  5 cv_g… clas… clas… mlr3learners <chr [1]>        <chr [3]>  <chr [3]>    
#>  6 debug clas… clas… mlr3         <chr [1]>        <chr [3]>  <chr [6]>    
#>  7 extr… clas… clas… mlr3learner… <chr [1]>        <chr [3]>  <chr [2]>    
#>  8 feat… clas… clas… mlr3         <chr [1]>        <chr [5]>  <chr [6]>    
#>  9 fnn   clas… clas… mlr3learner… <chr [1]>        <chr [2]>  <chr [2]>    
#> 10 gamb… clas… clas… mlr3learner… <chr [1]>        <chr [2]>  <chr [4]>    
#> # … with 81 more rows, and 1 more variable: predict_types <list>
lrn("regr.featureless")
#> <LearnerRegrFeatureless:regr.featureless>
#> * Model: -
#> * Parameters: robust=FALSE
#> * Packages: stats
#> * Predict Type: response
#> * Feature types: logical, integer, numeric, character, factor, ordered
#> * Properties: importance, missings, selected_features

Created on 2020-07-22 by the reprex package (v0.3.0)

Copy link
Member

@pat-s pat-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I really like the overall idea and think is an important addition.
We should discuss about the details though but I am sure we'll get there and find a good minimal approach.

See the comments in-line in addition.

Because then a user doesn't need to do anything more than the usually would to load a function, why make it harder for them?

This is good if one knows that {mlr3learners} overloads it. But this is not obvious. I see confusion incoming when people do not have mlr3learners loaded and call lrns() and get a different behavior for plain {mlr3}.

If we can get a robust implementation, this could then also be an addition to lrns() in mlr3 directly?

@@ -56,6 +56,7 @@ Suggests:
ranger,
rmarkdown,
testthat,
tibble,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not pull in tibble as a soft dep.
The following overloads the printer function for data.frame (can be extended for data.tables) and can live in everyone's .Rprofile:

# tibble > data.frame
if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
  print.data.frame = function(x, ...) {
    tibble:::print.tbl(tibble::as_tibble(x), ...)
  }
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't CRAN give a warning/note when using triple colon in packages?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not for inclusion in a package, this is for a local .Rprofile.

Comment on lines +10 to +12
extra_learners = rownames(
available.packages(repos = "https://mlr3learners.github.io/mlr3learners.drat")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Wrap in tryCatch() because it should not block if no internet is available.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is automated and created during builds then surely there is always internet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I did not see that this is not part of the function.

We could possibly store it in the package though the content then depends on the version which users have installed.

I guess querying an online resource (could also be the mlr3 GH repo) while requiring internet access would be better? What about both: Trying to query the online table and fall back to the local one included in the package (with a warning message that this one might not include all learners).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess querying an online resource (could also be the mlr3 GH repo) while requiring internet access would be better?

What's/where's the mlr3 GH repo?

all back to the local one included in the package (with a warning message that this one might not include all learners).

I think this would slightly defeat the point because say a user wants to install xgboost but does not know that it lives in mlr3learners and have therefore only installed mlr3. Then the table will not show xgboost nor will it be able to install it if called.

Unless you just mean only fall back to local when internet is not available? I guess that would make sense and be more intuitive than just erroring. Assuming the local one is identical to the code above except with no call to install.packages/available.packages?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you just mean only fall back to local when internet is not available?

I meant this, yes.

What's/where's the mlr3 GH repo?

The mlr3 GitHub repo.

Outlining the process again:

  1. lrns() queries an online resource which could live in the mlr3data repo (to not bloat mlr3 with too much CI updates) GitHub repo. This resource gets generated once a day.
  2. If no internet access is available, a static version included in the mlr3 package installation is used. This option included a warning that a static version was used.
  3. In every mlr3 release, mlr3 ships with the most recent version of this static learner table.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry completely misread and thought you were saying there's a separate repo just for certain GHaction automations.
I completely agree with the process outlined above. But would this not result in mlr3data becoming an import not suggest as the lrn/lrns functions would depend on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The latest version of the learner table (which would be updated by the CI every day) would live in mlr3data and would be queried via the web
  • The static version would live in mlr3 and ship with the package

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds good and just so I understand would the static version just have to be manually typed out and updated. Do you want me to try and set-up the build for mlr3data so we can close the first part (i.e. the online table)?

Copy link
Member

@pat-s pat-s Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First I would like to ask @mllg if he agrees that mlr3data would be a good place to store a .csv file containing this information?

That sounds good and just so I understand would the static version just have to be manually typed out

Not sure I understand what you mean by this.

  1. Read the .csv from mlr3data into an .rda file into mlr3
  2. Ship this .rda file (static snapshot) with an mlr3 release and use it as the static fallback

Apart from both you can continue to write a .csv containing the table that should be read in later.
Oh wait - is a .csv suitable? We have a nested structure here, don't we? We could write a JSON?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, sorry I have no experience with external data structures in R so just trying to visualise it, but understand it now :)

R/mlr3learners_table.R Outdated Show resolved Hide resolved
R/mlr3learners_table.R Outdated Show resolved Hide resolved
@RaphaelS1
Copy link
Author

Because then a user doesn't need to do anything more than the usually would to load a function, why make it harder for them?

This is good if one knows that {mlr3learners} overloads it. But this is not obvious. I see confusion incoming when people do not have mlr3learners loaded and call lrns() and get a different behavior for plain {mlr3}.

If we can get a robust implementation, this could then also be an addition to lrns() in mlr3 directly?

Sorry should have made this clear, I don't actually think lrn/lrns should be in mlr3learners but didn't make sense to me to PR somewhere else. Ideally I think table should be made automatically and all functionality could live in mlr3

@pat-s
Copy link
Member

pat-s commented Jul 25, 2020

Sorry should have made this clear, I don't actually think lrn/lrns should be in mlr3learners but didn't make sense to me to PR somewhere else. Ideally I think table should be made automatically and all functionality could live in mlr3

No worries, I also have a better overview of the whole approach now. Should have suggested this in the first place probably. It is not a problem, we can simply open the final version once we are done iterating here :)

@pat-s
Copy link
Member

pat-s commented Aug 15, 2020

@RaphaelS1 What is the status here?

@RaphaelS1
Copy link
Author

@RaphaelS1 What is the status here?

this is now implemented in mlr3extralearners, either it will stay in there if we use that or I can port it into another repo. I'll close this PR

@RaphaelS1 RaphaelS1 closed this Aug 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants