
Adding seed for reproducibility and sampling methods #344

Open · wants to merge 50 commits into base: 1.7.0
Conversation

@rwilfong (Collaborator) commented Jul 31, 2024

Hi, this pull request introduces two new features to AMPL:

  1. imbalanced-learn Sampling

Users now have the option to use SMOTE or RandomUnderSampler from imbalanced-learn on their classification datasets. The sampling module was created at atomsci/ddm/pipeline/sampling.py. It takes the input training dataset, applies the sampling strategy, updates the IDs and weights, and returns the resampled dataset. The module is integrated into model_pipeline.py under load_featurized_data and only runs if the model is a classification model and a sampling method is defined in the parameters dictionary (the default is None).
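Roughly, the per-fold flow looks like the sketch below. This is a simplified, hypothetical version for illustration only; the function name apply_sampling and the handling of synthetic IDs and weights are assumptions, not the exact contents of sampling.py.

import numpy as np
import deepchem as dc
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def apply_sampling(train_dset, method, random_state=None):
    """Resample a DeepChem classification dataset and rebuild its ids and weights."""
    y = train_dset.y.ravel().astype(int)
    if method == "SMOTE":
        sampler = SMOTE(random_state=random_state)
        # imbalanced-learn appends the synthetic rows after the original samples
        X_res, y_res = sampler.fit_resample(train_dset.X, y)
        n_new = len(y_res) - len(y)
        ids_res = np.concatenate([train_dset.ids,
                                  [f"synthetic_{i}" for i in range(n_new)]])
        # simplified: give synthetic rows unit weights (see the weight discussion in the review below)
        w_res = np.concatenate([train_dset.w.ravel(), np.ones(n_new)])
    elif method == "undersampling":
        sampler = RandomUnderSampler(random_state=random_state)
        X_res, y_res = sampler.fit_resample(train_dset.X, y)
        keep = sampler.sample_indices_  # rows retained by the undersampler
        ids_res, w_res = train_dset.ids[keep], train_dset.w.ravel()[keep]
    else:
        return train_dset
    return dc.data.NumpyDataset(X_res, y_res, w_res, ids_res)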

Because the sampling strategy is applied separately to each fold in k-fold CV splitting, Stewart updated combined_training_data in model_datasets.py to combine the data from every fold (rather than relying on the prior assumption that the same combined train/valid data is shared across all folds).

Tests were created under atomsci/ddm/test/integrative/sampling_test. The suite is fairly exhaustive, testing each model type (RF, NN, XGBoost) with each sampling method. It is based on the test_kfold_split.py test.

  2. Seed for Reproducibility

Each trained model now has a seed associated with it, so that the model can be reproduced given the seed and the exact same hyperparameters and data. The seed is created by a new script, random_seed.py, under atomsci/ddm/pipeline, which sets a random seed and creates a random number generator when called. The script parses the seed from the parameter parser Namespace object; if it is None, it generates a seed using the uuid library. It then sets the seed for NumPy, TensorFlow, random, and PyTorch. The module is imported into model_pipeline.py, and a seed is initialized in __init__. ModelPipeline also accepts a seed and random state as arguments, in case a user initializes it from another function and wants to reuse an existing seed and random state. The seed is passed into the DeepChem models and splitting modules.
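The seed-initialization logic is roughly as follows. This is a minimal sketch, not the literal contents of random_seed.py; the function name init_seed and the 32-bit truncation of the UUID are illustrative assumptions.

import random
import uuid
import numpy as np

def init_seed(params):
    """Return (seed, rng), generating a seed from a UUID when none was supplied."""
    seed = getattr(params, "seed", None)
    if seed is None:
        # uuid4().int is a 128-bit integer; truncate it to a 32-bit seed
        seed = uuid.uuid4().int % (2 ** 32)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    rng = np.random.default_rng(seed)
    return seed, rng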

In model_pipeline.py I also added two new functions: create_split_metadata and save_split_metadata. These functions create a split_metadata.json for each split dataset and save it under the output directory, recording the seed and the split parameters so that the exact split is reproducible.
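The file-writing step amounts to something like the sketch below; the exact set of parameters recorded may differ, and the field names shown here are assumptions.

import json
import os

def save_split_metadata(output_dir, seed, params):
    """Write the seed and split parameters so the split can be recreated later."""
    metadata = {
        "seed": seed,
        "split_params": {
            "splitter": getattr(params, "splitter", None),
            "split_strategy": getattr(params, "split_strategy", None),
            "split_uuid": getattr(params, "split_uuid", None),
        },
    }
    with open(os.path.join(output_dir, "split_metadata.json"), "w") as fp:
        json.dump(metadata, fp, indent=2)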

There are two types of tests saved under atomsci/ddm/test/integrative/seed_test/. The first type tests the reproducibility of split datasets and is executed using seed_test_splitting.py. It tests each split type (scaffold, random, fingerprint) and split strategy (train_valid_test, k_fold_cv). The second type tests the reproducibility of created models and is executed using seed_test_models.py. This tests the recreation of each model type (RF, NN, XGBoost) and prediction type (regression, classification).

I've run all the tests and used the ruff linter to check for problems; everything currently works on my end.

@stewarthe6 (Collaborator) commented:

Before merging this I'd like to add some reproducibility tests for graphconv models and the other automatically added deepchem models.

I'd also like to set seeds for models that are tested so we can avoid tests failing because a model didn't do as well as expected.

stewarthe6 and others added 18 commits September 24, 2024 08:37
…. Set response column to 'active' since that's the classification column. Added warning for when the expected number of classes doesn't match the number classes found
…a and KFoldClassificationPerfData. Updated the parameters and documentation to match behavior. New tests for SimpleClassificationPerfData and SimpleRegressionPerfData
… to see that there is more of the major class
@stewarthe6 (Collaborator) commented:

Ok, I think that this PR is ready to be reviewed again.

@paulsonak (Collaborator) left a comment:

I think it all makes sense, but I have several questions in-line that might be worth addressing before merging.

Overall, this was a huge PR, so in the future we should try to break things up, such as seeds in one PR and sampling methods in another, maybe k-fold as a third. git cherry-pick is really useful for moving certain commits from one branch to another.

The testing suite looks extensive (huge). I guess this is good, although it might be redundant. Is there a way we can reduce the integrative tests and increase the unit tests to make it simpler?

@@ -1039,6 +1044,7 @@ def parse_args():
parser.add_argument('id_col', type=str, help='the column containing ids')
parser.add_argument('response_cols', type=str, help='comma seperated string of response columns')
parser.add_argument('output', type=str, help='name of the split file')
parser.add_argument('seed', type=int, default=0, help='name of the split file')
Review comment (Collaborator):
This should have an updated help statement
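For example, the help text could be updated along these lines (the exact wording is up to the author):

parser.add_argument('seed', type=int, default=0,
                    help='random seed used to make the split reproducible')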

@@ -655,11 +661,31 @@ def combined_training_data(self):
# All of the splits have the same combined train/valid data, regardless of whether we're using
# k-fold or train/valid/test splitting.
if self.combined_train_valid_data is None:
# normally combining one fold is sufficient, but if SMOTE or undersampling is being used
# just combining the first fold isn't enough
Review comment (Collaborator):
For undersampling, it looks like it assumes that K-fold undersampling would sample the entire non-test dataset. What if this isn't the case? Is this assumption ensured elsewhere?

self.data.save_split_dataset()
# write split metadata
self.create_split_metadata()
Review comment (Collaborator):
Does the random seed get saved into model metadata? If you wanted to retrain a model 10 times to see how variable the predictions are, would you end up training a model with the same random seed or 10 different ones?

@@ -1994,7 +2000,7 @@ def make_dc_model(self, model_dir):
reg_lambda=1,
scale_pos_weight=1,
base_score=0.5,
random_state=0,
random_state= self.seed, #0,
Review comment (Collaborator):
The #0 comments can probably be removed.

@@ -697,7 +723,8 @@ def get_subset_responses_and_weights(self, subset, transformers):
"""
if subset not in self.subset_response_dict:
if subset in ('train', 'valid', 'train_valid'):
dataset = self.combined_training_data()
for fold, (train, valid) in enumerate(self.train_valid_dsets):
Review comment (Collaborator):
If you are just looking for new compounds in each fold, can you just concatenate all train_valid_dsets and then call set(ids) or drop_duplicates() or something? Might make the code more efficient than multiple for loops, but I'm not sure if it is actually easier based on the way the datasets are stored.
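A rough sketch of that suggestion, assuming train_valid_dsets is a list of (train, valid) DeepChem dataset pairs (the helper name combine_folds_dedup is hypothetical):

import numpy as np
import deepchem as dc

def combine_folds_dedup(train_valid_dsets):
    """Concatenate the training data from every fold, keeping the first occurrence of each compound id."""
    X = np.concatenate([train.X for train, _ in train_valid_dsets])
    y = np.concatenate([train.y for train, _ in train_valid_dsets])
    w = np.concatenate([train.w for train, _ in train_valid_dsets])
    ids = np.concatenate([train.ids for train, _ in train_valid_dsets])
    # np.unique returns the index of the first occurrence of each unique id
    _, first_idx = np.unique(ids, return_index=True)
    first_idx.sort()  # preserve the original row order
    return dc.data.NumpyDataset(X[first_idx], y[first_idx], w[first_idx], ids[first_idx])

Whether this is actually simpler than the per-fold loop depends on how the resampled datasets are stored, as noted above.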


#adjust weights and ids
resampled_indices = undersampler.sample_indices_
resampled_weights = train.w[resampled_indices]
Review comment (Collaborator):
Shouldn't the weights all be 1 if undersampling makes the class sizes equal?

# set the new weights equal to 1
average_weight = 1 #np.mean(train.w)
synthetic_weights=np.full((num_synthetic,1), average_weight, dtype=np.float64)
resampled_weights=np.concatenate([train.w, synthetic_weights])
Review comment (Collaborator):
I think the weights need to be fully recalculated according to the specified weighting strategy. If you have 10% class 1 and 90% class 0, and SMOTE brings class 1 up so the split is 30%/70%, then with balanced class weights you need to recompute all of the weights from the new 0.3/0.7 class balance instead of keeping the original 0.1/0.9-based weights and giving the synthetic samples a weight of 1.
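As a hedged sketch of that recalculation (not the PR's code), balanced sample weights can be recomputed from the post-sampling class frequencies:

import numpy as np

def balanced_sample_weights(y_resampled):
    """Weight each sample by n_samples / (n_classes * n_samples_in_its_class)."""
    classes, counts = np.unique(y_resampled, return_counts=True)
    class_weight = {c: len(y_resampled) / (len(classes) * n)
                    for c, n in zip(classes, counts)}
    return np.array([class_weight[c] for c in y_resampled])

sklearn.utils.class_weight.compute_sample_weight("balanced", y_resampled) computes the same quantity.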

@@ -0,0 +1,15 @@
{"verbose": "True",
Review comment (Collaborator):
Are the seed_test JSONs supposed to pass a seed in?

pparams.split_uuid = split_uuid
return pparams

def extract_seed(metadata_path):
Review comment (Collaborator):
Should these functions be written twice, or should they be imported from one test into another?

def test_SimpleRegressionPerfData():
res_dir, tmp_dskey = setup_paths()

params = {"verbose": "True",
Review comment (Collaborator):
It seems like the convention is now to create a JSON file in a separate folder to encapsulate this info.
