Changes to control model sparsity and MTSS improvements #331

Open · wants to merge 34 commits into base: 1.7.0
Conversation

mcloughlin2
Collaborator

Improvements to MultitaskScaffoldSplitter:

  • Notable speedup of GA optimization
  • Added early stopping for optimization loop
  • Changed the Tanimoto distance term of the fitness function to grade splits by the fraction of valid/test scaffolds with no training scaffold closer than a given Tanimoto distance (a minimal sketch of the idea follows this list)
  • Track evolution of fitness function terms during optimization to support the plot_split_diagnostics function.
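For intuition, here is a minimal sketch of that grading idea. The real far_frac_fitness in MultitaskScaffoldSplitter is more involved (per-task handling, precomputed scaffold distance matrices), so the function name, signature, and cutoff below are illustrative only:

```python
# Minimal sketch of the "far fraction" fitness idea described above;
# not the actual AMPL implementation.
import numpy as np

def far_fraction(dist_matrix, tanimoto_cutoff=0.3):
    """Fraction of valid/test scaffolds whose nearest train scaffold is
    at least `tanimoto_cutoff` away.

    dist_matrix: (n_valid_test, n_train) array of Tanimoto distances between
    valid/test scaffolds (rows) and train scaffolds (columns).
    """
    nearest_train_dist = dist_matrix.min(axis=1)  # closest train scaffold per row
    return float((nearest_train_dist >= tanimoto_cutoff).mean())

# Example: 3 valid/test scaffolds vs 4 train scaffolds
dists = np.array([[0.1, 0.5, 0.9, 0.4],
                  [0.6, 0.7, 0.8, 0.95],
                  [0.2, 0.25, 0.3, 0.5]])
print(far_fraction(dists))  # only row 2 has min distance >= 0.3 -> 0.333...
```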

New module split_diagnostic_plots:

  • Provides a single function to plot multiple aspects of split quality, given split parameters or a ModelPipeline as input: response value distributions by subset, Tanimoto distance histograms, actual vs. requested split fractions, and fitness term evolution.
  • New individual functions to plot split fractions and fitness term evolution during GA optimization.

Sparsity-related parameters for XGBoost models:

  • xgb_alpha: Controls strength of the L1 penalty in the cost function
  • xgb_lambda: Controls strength of the L2 penalty in the cost function (see the sketch below)
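For context, the sketch below shows the corresponding knobs on a bare XGBoost model. The mapping of AMPL's xgb_alpha/xgb_lambda onto XGBoost's reg_alpha/reg_lambda is an assumption based on the parameter descriptions above:

```python
# Hedged sketch: setting L1/L2 regularization directly on an XGBoost model.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 200)

model = xgb.XGBRegressor(
    n_estimators=50,
    reg_alpha=1.0,    # L1 penalty on leaf weights: pushes weights to exactly zero
    reg_lambda=10.0,  # L2 penalty on leaf weights: shrinks weights smoothly
)
model.fit(X, y)
```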

New search domain parameters for hyperopt optimization of sparsity parameters:

  • wdp – for weight_decay_penalty
  • wdt – for weight_decay_penalty_type
  • xgba – for xgb_alpha
  • xgbb – for xgb_lambda (xgbl is already taken; a sketch of the corresponding search space follows)
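As a rough illustration, the underlying hyperopt search space for these four parameters might look like the following; AMPL's own domain-specification strings differ, and the ranges and distributions here are placeholders, not AMPL's defaults:

```python
# Hedged sketch of a hyperopt search space over the four new parameters.
from hyperopt import hp

search_space = {
    'weight_decay_penalty': hp.loguniform('wdp', -7, 0),         # ~1e-3 to 1
    'weight_decay_penalty_type': hp.choice('wdt', ['l1', 'l2']),
    'xgb_alpha': hp.loguniform('xgba', -7, 2),                   # L1 strength
    'xgb_lambda': hp.loguniform('xgbb', -7, 2),                  # 'xgbl' is taken
}
```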

New feature_importance function to draw a line plot of summed absolute NN feature weights vs. epoch (illustrated below).
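The function itself isn't shown in this summary; the sketch below only illustrates the kind of plot described, with simulated weight sums standing in for the per-epoch values the new function would extract from a real NN model:

```python
# Illustration only: fake per-epoch sums of absolute input-layer weights.
import numpy as np
import matplotlib.pyplot as plt

n_epochs, n_features = 50, 5
rng = np.random.default_rng(0)
# Simulated weight sums that decay over training, as they might under weight decay
weight_sums = np.abs(rng.normal(1.0, 0.2, (n_epochs, n_features)))
weight_sums *= np.exp(-0.03 * np.arange(n_epochs))[:, None]

for i in range(n_features):
    plt.plot(weight_sums[:, i], label=f'feature {i}')
plt.xlabel('Epoch')
plt.ylabel('Sum of |input-layer weights|')
plt.legend()
plt.show()
```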

mcloughlin2 and others added 11 commits July 19, 2024 12:05

  • …n layer weights for each feature as it evolves over epochs, to see effect of setting weight decay parameters; (2) Added model parameters xgb_alpha and xgb_lambda and corresponding hyperopt parameters xgba and xgbb to control strength of L1 and L2 regularization penalties.
  • …hing on weight_decay_penalty and weight_decay_penalty_type parameters. Completed implementation of hyperopt domain specification for xgb_alpha and xgb_lambda.
  • …penalty_type, xgb_alpha and xgb_lambda to set of model parameters displayed by the various compare_models functions.
  • … step(); changed order of operations so that grading happens at end of step() method rather than at beginning. Added serial_grade_population method for debugging. Simplified code to address some performance issues.
  • …eeing that it runs much faster than the multithreaded version. Added documentation.
  • …rity between test and training set scaffold structures. Fixed a bug where the splitter always returned the split from the last generation rather than the best-ever split. Added code to track the individual fitness function terms over generations so that they can be displayed in diagnostic plots.
  • …ore to the fitness_scores dictionary so that it can be plotted together with the component scores.
  • …specified number of generations. Replaced print() calls with log messages so we can control verbosity of output. Changed split() to use log_every_n argument to control frequency of messages during GA operation.
  • …nd XGBoost classification models now support class balancing weights when weight_transform_type parameter is set to 'balancing'.
  • …ght absolute sums vs epoch, to assess effect of weight decay penalty.
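On the class balancing commit above: the standard inverse-frequency weighting is sketched below as a reference point; AMPL's exact normalization for weight_transform_type='balancing' may differ:

```python
# Sketch of inverse-frequency class weights, normalized so each class
# contributes equally to the loss overall.
import numpy as np

def balancing_weights(y):
    """Per-sample weights proportional to 1 / class frequency."""
    classes, counts = np.unique(y, return_counts=True)
    class_weight = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}
    return np.array([class_weight[label] for label in y])

y = np.array([0, 0, 0, 0, 1])    # 4:1 imbalance
print(balancing_weights(y))      # [0.625 0.625 0.625 0.625 2.5]
```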
@paulsonak
Collaborator

Hi, I have tested the following so far, mainly by running the functions through various tutorial notebooks. I will update this comment when I have finished trying out the rest of the features.

Improvements to MultitaskScaffoldSplitter:

  • Notable speedup of GA optimization
  • Added early stopping for optimization loop
  • Changed the Tanimoto distance term of the fitness function to grade splits by the fraction of valid/test scaffolds with no training scaffold closer than a given Tanimoto distance
  • Track evolution of fitness function terms during optimization to support the plot_split_diagnostics function.

Comments:

  • definitely way faster, although there's a TensorFlow deprecation warning:
WARNING:tensorflow:From /Users/apaulson/atomsci-venv/lib/python3.9/site-packages/tensorflow/python/util/deprecation.py:588: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_relax_shapes is deprecated and will be removed in a future version.
Instructions for updating:
experimental_relax_shapes is deprecated, use reduce_retracing instead
  • early stopping works as expected
  • updated grading function makes sense and the code looks fine. I had trouble following all the distance matrix calculations, but the logic makes sense (_generate_scaffold_dist_matrix, far_frac_fitness)

New module split_diagnostic_plots:

  • Provides a single function to plot multiple aspects of split quality, given split parameters or a ModelPipeline as input: response value distributions by subset, Tanimoto distance histograms, actual vs. requested split fractions, and fitness term evolution.
  • New individual functions to plot split fractions and fitness term evolution during GA optimization.

Comments:

  • plot_split_diagnostics() is great. I would recommend moving the Tanimoto distance distribution plots to the end, with the unweighted fitness scores. That way, all per-task plots come first and the user can specify num_cols=num_tasks for a neater layout.
  • It is confusing that axes can't be None for plot_split_fractions() and plot_fitness_terms(). You can just add a simple if axes is None: fig, ax = plt.subplots() check, as sketched below.
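Concretely, the suggested guard would look something like this (the function name and signature here are illustrative, not the actual ones in split_diagnostic_plots):

```python
# Sketch of the optional-axes pattern suggested above.
import matplotlib.pyplot as plt

def plot_split_fractions(subsets, fractions, axes=None):
    if axes is None:                 # create axes when the caller passes none
        _, axes = plt.subplots()
    axes.bar(subsets, fractions)
    axes.set_ylabel('Fraction of compounds')
    return axes

plot_split_fractions(['train', 'valid', 'test'], [0.8, 0.1, 0.1])
plt.show()
```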

Sparsity-related parameters for XGBoost models:

  • xgb_alpha: Controls strength of the L1 penalty in the cost function
  • xgb_lambda: Controls strength of the L2 penalty in the cost function

New search domain parameters for hyperopt optimization of sparsity parameters:

  • wdp – for weight_decay_penalty
  • wdt – for weight_decay_penalty_type
  • xgba – for xgb_alpha
  • xgbb – for xgb_lambda (xgbl is already taken)

Feature_importance

  • New function to draw a line plot of summed absolute NN feature weights vs. epoch.
