
Adding seed for reproducibility and sampling methods #344

Open · wants to merge 50 commits into base: 1.7.0
Conversation

@rwilfong (Collaborator) commented Jul 31, 2024

Hi, this pull request introduces two new features to AMPL:

  1. imbalanced-learn Sampling

Users now have the option to use SMOTE or RandomUnderSampler from imbalanced-learn on their classification datasets. The sampling module was created at atomsci/ddm/pipeline/sampling.py. It takes the input training dataset, applies the sampling strategy, updates the IDs and weights, and returns the resampled dataset. The module is integrated into model_pipeline.py under load_featurized_data and only runs if the model is a classification model and a sampling method is defined in the parameters dictionary (the default is None).
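Roughly, the per-fold flow looks like the sketch below. This is a simplified, hypothetical version for illustration only; the function name apply_sampling and the handling of synthetic IDs and weights are assumptions, not the exact contents of sampling.py.

import numpy as np
import deepchem as dc
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

def apply_sampling(train_dset, method, random_state=None):
    """Resample a DeepChem classification dataset and rebuild its ids and weights."""
    y = train_dset.y.ravel().astype(int)
    if method == "SMOTE":
        sampler = SMOTE(random_state=random_state)
        # imbalanced-learn appends the synthetic rows after the original samples
        X_res, y_res = sampler.fit_resample(train_dset.X, y)
        n_new = len(y_res) - len(y)
        ids_res = np.concatenate([train_dset.ids,
                                  [f"synthetic_{i}" for i in range(n_new)]])
        # simplified: give synthetic rows unit weights (see the weight discussion in the review below)
        w_res = np.concatenate([train_dset.w.ravel(), np.ones(n_new)])
    elif method == "undersampling":
        sampler = RandomUnderSampler(random_state=random_state)
        X_res, y_res = sampler.fit_resample(train_dset.X, y)
        keep = sampler.sample_indices_  # rows retained by the undersampler
        ids_res, w_res = train_dset.ids[keep], train_dset.w.ravel()[keep]
    else:
        return train_dset
    return dc.data.NumpyDataset(X_res, y_res, w_res, ids_res)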

Because the sampling strategy is applied separately to each fold in k-fold CV splitting, Stewart updated combined_training_data in model_datasets.py to combine the data from every fold (rather than relying on the prior assumption that the same combined train/valid data is shared across all folds).

Tests were created under atomsci/ddm/test/integrative/sampling_test. The suite is fairly exhaustive, testing each model type (RF, NN, XGBoost) with each sampling method. It is based on the test_kfold_split.py test.

  2. Seed for Reproducibility

Each trained model now has a seed associated with it, so that the model can be reproduced given the seed and the exact same hyperparameters and data. The seed is created by a new script, random_seed.py, under atomsci/ddm/pipeline, which sets a random seed and creates a random number generator when called. The script parses the seed from the parameter parser Namespace object; if it is None, it generates a seed using the uuid library. It then sets the seed for NumPy, TensorFlow, random, and PyTorch. The module is imported into model_pipeline.py, and a seed is initialized in __init__. ModelPipeline also accepts a seed and random state as arguments, in case a user initializes it from another function and wants to reuse an existing seed and random state. The seed is passed into the DeepChem models and splitting modules.
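The seed-initialization logic is roughly as follows. This is a minimal sketch, not the literal contents of random_seed.py; the function name init_seed and the 32-bit truncation of the UUID are illustrative assumptions.

import random
import uuid
import numpy as np

def init_seed(params):
    """Return (seed, rng), generating a seed from a UUID when none was supplied."""
    seed = getattr(params, "seed", None)
    if seed is None:
        # uuid4().int is a 128-bit integer; truncate it to a 32-bit seed
        seed = uuid.uuid4().int % (2 ** 32)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
    rng = np.random.default_rng(seed)
    return seed, rng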

In model_pipeline.py I also added two new functions: create_split_metadata and save_split_metadata. These functions create a split_metadata.json for each split dataset and save it under the output directory, recording the seed and the split parameters so that the exact split is reproducible.
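The file-writing step amounts to something like the sketch below; the exact set of parameters recorded may differ, and the field names shown here are assumptions.

import json
import os

def save_split_metadata(output_dir, seed, params):
    """Write the seed and split parameters so the split can be recreated later."""
    metadata = {
        "seed": seed,
        "split_params": {
            "splitter": getattr(params, "splitter", None),
            "split_strategy": getattr(params, "split_strategy", None),
            "split_uuid": getattr(params, "split_uuid", None),
        },
    }
    with open(os.path.join(output_dir, "split_metadata.json"), "w") as fp:
        json.dump(metadata, fp, indent=2)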

There are two types of tests saved under atomsci/ddm/test/integrative/seed_test/. The first type tests the reproducibility of split datasets and is executed using seed_test_splitting.py. It tests each split type (scaffold, random, fingerprint) and split strategy (train_valid_test, k_fold_cv). The second type tests the reproducibility of created models and is executed using seed_test_models.py. This tests the recreation of each model type (RF, NN, XGBoost) and prediction type (regression, classification).

I've run all the tests and used the ruff linter to check for problems; everything currently works on my end.

@stewarthe6 (Collaborator) commented:

Before merging this I'd like to add some reproducibility tests for graphconv models and the other automatically added deepchem models.

I'd also like to set seeds for models that are tested so we can avoid tests failing because a model didn't do as well as expected.

stewarthe6 and others added 18 commits September 24, 2024 08:37
…. Set response column to 'active' since that's the classification column. Added warning for when the expected number of classes doesn't match the number classes found
…a and KFoldClassificationPerfData. Updated the parameters and documentation to match behavior. New tests for SimpleClassificationPerfData and SimpleRegressionPerfData
… to see that there is more of the major class
@stewarthe6 (Collaborator) commented:

Ok, I think that this PR is ready to be reviewed again.

@paulsonak (Collaborator) left a comment:

I think it all makes sense, but I have several questions in-line that might be worth addressing before merging.

Overall, this was a huge PR, so in the future we should try to break things up, such as seeds in one PR and sampling methods in another, maybe k-fold as a third. git cherry-pick is really useful for moving certain commits from one branch to another.

The testing suite looks extensive (huge). I guess this is good, although it might be redundant. Is there a way we can reduce the integrative tests and increase the unit tests to make it simpler?

@@ -1039,6 +1044,7 @@ def parse_args():
parser.add_argument('id_col', type=str, help='the column containing ids')
parser.add_argument('response_cols', type=str, help='comma seperated string of response columns')
parser.add_argument('output', type=str, help='name of the split file')
parser.add_argument('seed', type=int, default=0, help='name of the split file')
Review comment (Collaborator):
This should have an updated help statement
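For example, the help text could be updated along these lines (the exact wording is up to the author):

parser.add_argument('seed', type=int, default=0,
                    help='random seed used to make the split reproducible')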

@@ -655,11 +661,31 @@ def combined_training_data(self):
# All of the splits have the same combined train/valid data, regardless of whether we're using
# k-fold or train/valid/test splitting.
if self.combined_train_valid_data is None:
# normally combining one fold is sufficient, but if SMOTE or undersampling is being used
# just combining the first fold isn't enough
Review comment (Collaborator):
For undersampling, it looks like it assumes that K-fold undersampling would sample the entire non-test dataset. What if this isn't the case? Is this assumption ensured elsewhere?

self.data.save_split_dataset()
# write split metadata
self.create_split_metadata()
Review comment (Collaborator):
Does the random seed get saved into model metadata? If you wanted to retrain a model 10 times to see how variable the predictions are, would you end up training a model with the same random seed or 10 different ones?

@@ -1994,7 +2000,7 @@ def make_dc_model(self, model_dir):
reg_lambda=1,
scale_pos_weight=1,
base_score=0.5,
random_state=0,
random_state= self.seed, #0,
Review comment (Collaborator):
The #0 comments can probably be removed.

@@ -697,7 +723,8 @@ def get_subset_responses_and_weights(self, subset, transformers):
"""
if subset not in self.subset_response_dict:
if subset in ('train', 'valid', 'train_valid'):
dataset = self.combined_training_data()
for fold, (train, valid) in enumerate(self.train_valid_dsets):
Review comment (Collaborator):
If you are just looking for new compounds in each fold, can you just concatenate all train_valid_dsets and then call set(ids) or drop_duplicates() or something? Might make the code more efficient than multiple for loops, but I'm not sure if it is actually easier based on the way the datasets are stored.
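A rough sketch of that suggestion, assuming train_valid_dsets is a list of (train, valid) DeepChem dataset pairs (the helper name combine_folds_dedup is hypothetical):

import numpy as np
import deepchem as dc

def combine_folds_dedup(train_valid_dsets):
    """Concatenate the training data from every fold, keeping the first occurrence of each compound id."""
    X = np.concatenate([train.X for train, _ in train_valid_dsets])
    y = np.concatenate([train.y for train, _ in train_valid_dsets])
    w = np.concatenate([train.w for train, _ in train_valid_dsets])
    ids = np.concatenate([train.ids for train, _ in train_valid_dsets])
    # np.unique returns the index of the first occurrence of each unique id
    _, first_idx = np.unique(ids, return_index=True)
    first_idx.sort()  # preserve the original row order
    return dc.data.NumpyDataset(X[first_idx], y[first_idx], w[first_idx], ids[first_idx])

Whether this is actually simpler than the per-fold loop depends on how the resampled datasets are stored, as noted above.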


#adjust weights and ids
resampled_indices = undersampler.sample_indices_
resampled_weights = train.w[resampled_indices]
Review comment (Collaborator):
Shouldn't the weights all be 1 if undersampling makes the class sizes equal?

# set the new weights equal to 1
average_weight = 1 #np.mean(train.w)
synthetic_weights=np.full((num_synthetic,1), average_weight, dtype=np.float64)
resampled_weights=np.concatenate([train.w, synthetic_weights])
Review comment (Collaborator):
I think the weights need to be fully recalculated according to the specified weighting strategy. If you have 10% class 1 and 90% class 0, and SMOTE brings class 1 up so the split is 30%/70%, then with balanced class weights you need to recompute all of the weights from the new 0.3/0.7 class balance instead of keeping the original 0.1/0.9-based weights and giving the synthetic samples a weight of 1.
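As a hedged sketch of that recalculation (not the PR's code), balanced sample weights can be recomputed from the post-sampling class frequencies:

import numpy as np

def balanced_sample_weights(y_resampled):
    """Weight each sample by n_samples / (n_classes * n_samples_in_its_class)."""
    classes, counts = np.unique(y_resampled, return_counts=True)
    class_weight = {c: len(y_resampled) / (len(classes) * n)
                    for c, n in zip(classes, counts)}
    return np.array([class_weight[c] for c in y_resampled])

sklearn.utils.class_weight.compute_sample_weight("balanced", y_resampled) computes the same quantity.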

@@ -0,0 +1,15 @@
{"verbose": "True",
Review comment (Collaborator):
Are the seed_test JSONs supposed to pass a seed in?

pparams.split_uuid = split_uuid
return pparams

def extract_seed(metadata_path):
Review comment (Collaborator):
Should these functions be written twice, or should they be imported from one test into another?

def test_SimpleRegressionPerfData():
res_dir, tmp_dskey = setup_paths()

params = {"verbose": "True",
Review comment (Collaborator):
It seems like the convention is now to create a JSON file in a separate folder to encapsulate this info.
