Changes to control model sparsity and MTSS improvements #331

Open · wants to merge 34 commits into base: 1.7.0
Conversation

mcloughlin2
Collaborator

Improvements to MultitaskScaffoldSplitter:

  • Notable speedup of GA optimization
  • Added early stopping for optimization loop
  • Changed the Tanimoto distance term of the fitness function to grade splits by the fraction of valid/test scaffolds with no training scaffold closer than a given Tanimoto distance (a minimal sketch of the idea follows this list)
  • Track evolution of fitness function terms during optimization to support the plot_split_diagnostics function.
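For intuition, here is a minimal sketch of that grading idea. The real far_frac_fitness in MultitaskScaffoldSplitter is more involved (per-task handling, precomputed scaffold distance matrices), so the function name, signature, and cutoff below are illustrative only:

```python
# Minimal sketch of the "far fraction" fitness idea described above;
# not the actual AMPL implementation.
import numpy as np

def far_fraction(dist_matrix, tanimoto_cutoff=0.3):
    """Fraction of valid/test scaffolds whose nearest train scaffold is
    at least `tanimoto_cutoff` away.

    dist_matrix: (n_valid_test, n_train) array of Tanimoto distances between
    valid/test scaffolds (rows) and train scaffolds (columns).
    """
    nearest_train_dist = dist_matrix.min(axis=1)  # closest train scaffold per row
    return float((nearest_train_dist >= tanimoto_cutoff).mean())

# Example: 3 valid/test scaffolds vs 4 train scaffolds
dists = np.array([[0.1, 0.5, 0.9, 0.4],
                  [0.6, 0.7, 0.8, 0.95],
                  [0.2, 0.25, 0.3, 0.5]])
print(far_fraction(dists))  # only row 2 has min distance >= 0.3 -> 0.333...
```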

New module split_diagnostic_plots:

  • Provides a single function to plot multiple aspects of split quality, given split parameters or a ModelPipeline as input: response value distributions by subset, Tanimoto distance histograms, actual vs. requested split fractions, and fitness term evolution.
  • New individual functions to plot split fractions and fitness term evolution during GA optimization.

Sparsity-related parameters for XGBoost models:

  • xgb_alpha: Controls strength of the L1 penalty in the cost function
  • xgb_lambda: Controls strength of the L2 penalty in the cost function (see the sketch below)
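For context, the sketch below shows the corresponding knobs on a bare XGBoost model. The mapping of AMPL's xgb_alpha/xgb_lambda onto XGBoost's reg_alpha/reg_lambda is an assumption based on the parameter descriptions above:

```python
# Hedged sketch: setting L1/L2 regularization directly on an XGBoost model.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 200)

model = xgb.XGBRegressor(
    n_estimators=50,
    reg_alpha=1.0,    # L1 penalty on leaf weights: pushes weights to exactly zero
    reg_lambda=10.0,  # L2 penalty on leaf weights: shrinks weights smoothly
)
model.fit(X, y)
```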

New search domain parameters for hyperopt optimization of sparsity parameters:

  • wdp – for weight_decay_penalty
  • wdt – for weight_decay_penalty_type
  • xgba – for xgb_alpha
  • xgbb – for xgb_lambda (xgbl is already taken; a sketch of the corresponding search space follows)
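As a rough illustration, the underlying hyperopt search space for these four parameters might look like the following; AMPL's own domain-specification strings differ, and the ranges and distributions here are placeholders, not AMPL's defaults:

```python
# Hedged sketch of a hyperopt search space over the four new parameters.
from hyperopt import hp

search_space = {
    'weight_decay_penalty': hp.loguniform('wdp', -7, 0),         # ~1e-3 to 1
    'weight_decay_penalty_type': hp.choice('wdt', ['l1', 'l2']),
    'xgb_alpha': hp.loguniform('xgba', -7, 2),                   # L1 strength
    'xgb_lambda': hp.loguniform('xgbb', -7, 2),                  # 'xgbl' is taken
}
```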

New feature_importance function to draw a line plot of summed absolute NN feature weights vs. epoch (illustrated below).
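The function itself isn't shown in this summary; the sketch below only illustrates the kind of plot described, with simulated weight sums standing in for the per-epoch values the new function would extract from a real NN model:

```python
# Illustration only: fake per-epoch sums of absolute input-layer weights.
import numpy as np
import matplotlib.pyplot as plt

n_epochs, n_features = 50, 5
rng = np.random.default_rng(0)
# Simulated weight sums that decay over training, as they might under weight decay
weight_sums = np.abs(rng.normal(1.0, 0.2, (n_epochs, n_features)))
weight_sums *= np.exp(-0.03 * np.arange(n_epochs))[:, None]

for i in range(n_features):
    plt.plot(weight_sums[:, i], label=f'feature {i}')
plt.xlabel('Epoch')
plt.ylabel('Sum of |input-layer weights|')
plt.legend()
plt.show()
```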

mcloughlin2 and others added 11 commits July 19, 2024 12:05

  • …n layer weights for each feature as it evolves over epochs, to see effect of setting weight decay parameters; (2) Added model parameters xgb_alpha and xgb_lambda and corresponding hyperopt parameters xgba and xgbb to control strength of L1 and L2 regularization penalties.
  • …hing on weight_decay_penalty and weight_decay_penalty_type parameters. Completed implementation of hyperopt domain specification for xgb_alpha and xgb_lambda.
  • …penalty_type, xgb_alpha and xgb_lambda to set of model parameters displayed by the various compare_models functions.
  • … step(); changed order of operations so that grading happens at end of step() method rather than at beginning. Added serial_grade_population method for debugging. Simplified code to address some performance issues.
  • …eeing that it runs much faster than the multithreaded version. Added documentation.
  • …rity between test and training set scaffold structures. Fixed a bug where the splitter always returned the split from the last generation rather than the best-ever split. Added code to track the individual fitness function terms over generations so that they can be displayed in diagnostic plots.
  • …ore to the fitness_scores dictionary so that it can be plotted together with the component scores.
  • …specified number of generations. Replaced print() calls with log messages so we can control verbosity of output. Changed split() to use log_every_n argument to control frequency of messages during GA operation.
  • …nd XGBoost classification models now support class balancing weights when weight_transform_type parameter is set to 'balancing'.
  • …ght absolute sums vs epoch, to assess effect of weight decay penalty.
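On the class balancing commit above: the standard inverse-frequency weighting is sketched below as a reference point; AMPL's exact normalization for weight_transform_type='balancing' may differ:

```python
# Sketch of inverse-frequency class weights, normalized so each class
# contributes equally to the loss overall.
import numpy as np

def balancing_weights(y):
    """Per-sample weights proportional to 1 / class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([class_weight[label] for label in y])

y = np.array([0, 0, 0, 0, 1])    # 4:1 imbalance
print(balancing_weights(y))      # [0.625 0.625 0.625 0.625 2.5]
```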
@paulsonak
Collaborator

Hi, I have tested the following so far, mainly by running the functions through various tutorial notebooks. I will update this comment when I have finished trying out the rest of the features.

Improvements to MultitaskScaffoldSplitter:

  • Notable speedup of GA optimization
  • Added early stopping for optimization loop
  • Changed the Tanimoto distance term of the fitness function to grade splits by the fraction of valid/test scaffolds with no training scaffold closer than a given Tanimoto distance
  • Track evolution of fitness function terms during optimization to support the plot_split_diagnostics function.

Comments:

  • definitely way faster, although there's a TensorFlow deprecation warning:
WARNING:tensorflow:From /Users/apaulson/atomsci-venv/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:588: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_relax_shapes is deprecated and will be removed in a future version.
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
  • early stopping works as expected
  • updated grading function makes sense and the code looks fine. I had trouble following all the distance matrix calculations, but the logic makes sense (_generate_scaffold_dist_matrix, far_frac_fitness)

New module split_diagnostic_plots:

  • Provides a single function to plot multiple aspects of split quality, given split parameters or a ModelPipeline as input: response value distributions by subset, Tanimoto distance histograms, actual vs. requested split fractions, and fitness term evolution.
  • New individual functions to plot split fractions and fitness term evolution during GA optimization.

Comments:

  • plot_split_diagnostics() is great. I would recommend moving the Tanimoto distance distribution plots to the end, with the unweighted fitness scores. That way, all per-task plots come first and the user can specify num_cols=num_tasks for a neater layout.
  • It is confusing that axes can't be None for plot_split_fractions() and plot_fitness_terms(). You can just add a simple if axes is None: fig, ax = plt.subplots() check, as sketched below.
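Concretely, the suggested guard would look something like this (the function name and signature here are illustrative, not the actual ones in split_diagnostic_plots):

```python
# Sketch of the optional-axes pattern suggested above.
import matplotlib.pyplot as plt

def plot_split_fractions(subsets, fractions, axes=None):
    if axes is None:                 # create axes when the caller passes none
        _, axes = plt.subplots()
    axes.bar(subsets, fractions)
    axes.set_ylabel('Fraction of compounds')
    return axes

plot_split_fractions(['train', 'valid', 'test'], [0.8, 0.1, 0.1])
plt.show()
```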

Sparsity-related parameters for XGBoost models:

  • xgb_alpha: Controls strength of the L1 penalty in the cost function
  • xgb_lambda: Controls strength of the L2 penalty in the cost function

New search domain parameters for hyperopt optimization of sparsity parameters:

  • wdp – for weight_decay_penalty
  • wdt – for weight_decay_penalty_type
  • xgba – for xgb_alpha
  • xgbb – for xgb_lambda (xgbl is already taken)

Feature_importance

  • New function to draw a line plot of summed absolute NN feature weights vs. epoch.
