Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speeding up the Highly Adaptive Lasso with recursive screening via MARS #105

Open
wants to merge 53 commits into
base: devel
Choose a base branch
from

Conversation

Larsvanderlaan
Copy link
Collaborator

@Larsvanderlaan Larsvanderlaan commented Jan 3, 2023

Implements a variant of the selectively adaptive lasso (SAL) that uses MARS (earth) to learn important variables and important variable interactions. While MARS is used by SAL to select important variables and interaction variable subgroups, MARS is not used by SAL for selecting specific spline basis functions. In particular, all basis functions, as specified by the params max_degree and num_knots, are generated for the variables and variable subgroups found by MARS. So SAL tends to be much more expressive than MARS.

The earth model is fit with its own internal cross-validation method for pruning (so not the default gcv approach).
Also, earth parameters for the number of basis functions generated/searched are set high/maximal.

Nested cross-validation is implemented to take into account the outcome-dependent variable selection. SAL is able to handle both large n and large p very well and can lead to both substantial speedups and better performance than HAL when there are many noisy variables.

-- I changed defaults of num_knots argument so it varies as a function of sample size.
-- I made a minor change to how basis functions are generated with the num_knots argument. Before, an edge basis function was being generated that could lead to instability in the CV sometimes.
-- The formula bug fixes of #101 are also incorporated here.

Things still to do:
Write some tests
a bit more documentation
minor changes to make sure fit_hal and fit_sal have near identical functionality

@Larsvanderlaan
Copy link
Collaborator Author

Im getting the error ``Error: Unable to resolve action r-lib/actions@master, unable to find version `master`" in the code check. Anyone have any idea what this is?

@nhejazi
Copy link
Member

nhejazi commented Jan 3, 2023

Looks interesting, Lars --- I'd be curious to learn more about this. Is your intention to create a separate fit_sal() function, paralleling fit_hal(), or to have this alternative algorithm available via the same constructor?

For the question about the failing builds, this is because the GitHub Actions file is out of date, as r-lib/actions@master doesn't point to a valid branch anymore. Please change the relevant line in .github/workflows/R-CMD-check.yml to https://github.com/r-lib/actions#releases-and-tags

@Larsvanderlaan
Copy link
Collaborator Author

Larsvanderlaan commented Jan 3, 2023

Hi @nhejazi, thank you for the help! The current implementation has separate functions, mainly because fit_sal() calls fit_hal() internally. I am fine with having a single constructor function, and I think this would be easier for users as well. Though, it might be cleanest to have the fitting parts of fit_hal() and fit_sal() be separate internal functions. What do you think?

If the functions are merged, we could add options screen_variables = TRUE and screen_interactions = TRUE.
We could also make a screening function an argument to fit_hal so that users can specify arbitrary screening algorithms and the internal CV handles it appropriately. glmnet has a similar feature with the exclude argument. However, since we want to exclude variables and not specific basis functions we can't use the glmnet implementation.
The fit_sal() implementation can be adjusted to work with any screening function that outputs either a hal_formula or a vector of column indices.

@Larsvanderlaan
Copy link
Collaborator Author

Larsvanderlaan commented Jan 4, 2023

I have incorporated fit_sal into fit_hal through the arguments: screen_variables, screen_interactions, and screener_max_degree. At the moment, all the original tests pass with screen_variables = TRUE and screen_interactions = TRUE as default.

@nhejazi
Copy link
Member

nhejazi commented Jan 5, 2023

thanks, @Larsvanderlaan, for these changes. thinking through it some more, i think it may be a good idea to keep the constructors for the two separate, as in retaining both fit_sal() and fit_hal() as user-facing functions. this way, there could be a distinct vignette for SAL, which could reference the pre-print on that algorithm. the advantage here is that there would be greater clarity of the user as to where to look for guidance on the given choice of algorithm (especially since SAL's construction differs somewhat from the more familiar LASSO construction of HAL). beyond that, i think it's important to avoid "argument creep" in the constructor, which we worked against in one of the major version changes. does this sound like a workable path forward (and sorry for the extra work in separating the two functions)?

also, a few comments on the construction of the PR:

  • in future, please try to use at least somewhat descriptive commit messages, instead of the repetitive messages seen when glancing above. this is important for keeping commits modular (by thinking of the message before you submit the change), so that bugs can be identified and so that reviews are possible to conduct. in this instance, we'll squash the commits at the merge phase, so that they disappear in the history, but that's generally a practice we should avoid.
  • as you get to the stage of wrapping up the commit, please bump the version of the package in DESCRIPTION and add information about the various changes (which you already very helpfully summarize in your first comment) in the NEWS file, taking care to follow the style already used in that document

@Larsvanderlaan
Copy link
Collaborator Author

Larsvanderlaan commented Jan 5, 2023

Hi @nhejazi , I should note that the SAL implementation here is totally different from the algorithm in the preprint. Essentially I use MARS to learn a formula_hal object of the form ~ h(X1) + h(X2, X1) + h(X3, X2, X1) and then the standard HAL implementation provided by fit_hal is used with this formula. The basis functions are still generated using the original version of fit_hal and the arguments max_degree and num_knots. So MARS is really just used as a variable screener. I called it SAL because it has the same goal as the preprint (selectively choosing basis functions/variables) but accomplishes its goal without fundamentally changing HAL. For clarity, I will go ahead and remove the name SAL. Also, MARS uses the same basis functions as first-order HAL but fits them in a greedy manner. So even as a screening algorithm, MARS stays close to what HAL does.

I now personally think it should be a part of the fit_hal method due to the substantial speedup and performance gains I am seeing in simulations (including the datasets from ACIC2019). Regarding argument creep, that is a good point and to address this I can add a single ``screener_control" list argument that contains all the relevant arguments.

Also, thank you very much for the other comments. I agree that more informative commit names are a better route to go.

@nhejazi
Copy link
Member

nhejazi commented Jan 6, 2023

That's very helpful clarification, @Larsvanderlaan. I had taken the mention of SAL to reference the pre-print, and I see from your comment that it's similar, but the fact that it is a screening-based variant of the already well-studied HAL (and uses the same underlying functionality implemented in the package) makes a convincing case for it being included within the existing fit_hal() constructor. So, I think we agree on the design choice you've made, and I think removing the SAL name is helpful since it would be confusing to reference an algorithm other than what's implemented (maybe you could call it "filtered HAL" or "MARS HAL" or similar).

Also, I think following the style set in the constructor should be good enough to avoid argument creep, since by adding these arguments as a list-argument in screener_control, you'd only be adding a single argument, the components of which would be documented in the internal function to which they get passed. And that's very exciting about the speedup! Personally, I'll be curious to try this out in the settings where I've run into significant performance issues (e.g., hazard-based density estimation).

@Larsvanderlaan Larsvanderlaan changed the title Selectively adaptive lasso with MARS Speeding up the Highly Adaptive Lasso with recursive screening via MARS Mar 22, 2023
@rachaelvp
Copy link
Member

Hi @Larsvanderlaan I just merged your other PR (fix to quantile binning) into devel, leading to a merge conflict with this PR. Could you please address this so we can also merge this PR into devel?

Also, @nhejazi do you agree this PR is ready for merging once the conflict is addressed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants