Adding seed for reproducibility and sampling methods #344

Open

wants to merge 53 commits into base: 1.7.0

Changes from 50 commits

Commits (53)
e4d8871
sampling and seed
rwilfong Jul 31, 2024
22b0318
now it runs
stewarthe6 Jul 19, 2024
30ea360
kfold changes
stewarthe6 Jul 19, 2024
dc1f7c4
seed test
rwilfong Jul 31, 2024
7b13967
ruff linter suggestions
rwilfong Jul 31, 2024
6fb1c62
updated kfoldregression
rwilfong Aug 1, 2024
480e5f1
Merge remote-tracking branch 'upstream/1.7.0' into 1.7.0
stewarthe6 Sep 11, 2024
fc24463
added imblearn to pip requirements
stewarthe6 Sep 11, 2024
561c3bb
unpin imblearn
stewarthe6 Sep 11, 2024
49dc67b
Clean up unused random_state or seed parameters or assignments.
stewarthe6 Sep 11, 2024
b41b7d5
fixed merging error
stewarthe6 Sep 11, 2024
b65ba09
Fixed find and replace bug
stewarthe6 Sep 11, 2024
84babd2
make_dc_model does not need random_state or seed arguments
stewarthe6 Sep 11, 2024
ecf23bd
fhnew changes
rwilfong Sep 12, 2024
a821f6a
Changed constructor of ProductionSplitter to call Splitting's init fu…
stewarthe6 Sep 12, 2024
319b2f0
resolving errors
rwilfong Sep 12, 2024
31f3d5f
removed heads
rwilfong Sep 12, 2024
d074f65
removed unused library
rwilfong Sep 12, 2024
b0ecc05
Merge remote-tracking branch 'upstream/1.7.0' into 1.7.0
stewarthe6 Sep 12, 2024
2992bdf
Added more models for seeding test.
stewarthe6 Sep 12, 2024
ccebaed
Fixed seed for GCNModel. Should pass regularly now.
stewarthe6 Sep 12, 2024
dcc4809
Set seed to guarantee results in class_config_delaney_fit_nn_ecfp.json
stewarthe6 Sep 12, 2024
922bf0c
Moved 'test' from suffix to prefix
stewarthe6 Sep 18, 2024
82838d1
Renamed these test files to start with test_ so they're caught by the…
stewarthe6 Sep 19, 2024
4e471cb
Changed MultitaskScaffoldSplit and GeneticAlgorithm to use a Generate…
stewarthe6 Sep 19, 2024
baa5478
Added test for MTSS seed and fixed a few cases where the wrong random …
stewarthe6 Sep 19, 2024
4eb4ee4
renamed this file to match what's in test_seed_splitting.py
stewarthe6 Sep 19, 2024
4588a9d
renamed this to match the test
stewarthe6 Sep 19, 2024
ff58d02
Removed try except blocks in test code. We need to see these errors
stewarthe6 Sep 24, 2024
0028ed7
Added seed to this test so that it passes more consistently
stewarthe6 Sep 24, 2024
0c83b6b
combined_training_data now accounts for synthetic datasets
stewarthe6 Sep 24, 2024
ada3ea8
accept changes
rwilfong Sep 24, 2024
4dd5d99
integrate changes
rwilfong Sep 24, 2024
0a616b2
set uncertainty false for classification test since it is unsupported…
stewarthe6 Sep 24, 2024
16c2a4a
update branch: Merge branch '1.7.0' of https://github.com/rwilfong/AMPL…
rwilfong Sep 25, 2024
c3b1922
updated tests
rwilfong Sep 25, 2024
f2a30a9
resolve errors
rwilfong Sep 25, 2024
410f03d
Added seed to test_balancing_transformer for more consistent outputs
stewarthe6 Sep 25, 2024
f247893
added a test to make sure that multitask problems don't work with SMOTE
stewarthe6 Sep 25, 2024
2e03fef
Used parameter to determine if SMOTE or undersampling is being used
stewarthe6 Sep 25, 2024
b48ed02
Added a seed to this test for more consistent results
stewarthe6 Sep 25, 2024
567264a
Changed balancing transformer to just check to see if the weights cha…
stewarthe6 Sep 26, 2024
627cc20
Set the seed to make sure the number of positive and negative compoun…
stewarthe6 Sep 26, 2024
8decc0e
Removed unnecessary loop and printed out results from the perf_data test
stewarthe6 Sep 30, 2024
317cc29
accumulate_preds ignores the id parameter for SimpleRegressionPerfDat…
stewarthe6 Sep 30, 2024
5055889
the positive and negative counts are inconsistent, instead just check…
stewarthe6 Sep 30, 2024
6d0abbd
Merge branch 'ATOMScience-org:1.7.0' into 1.7.0
stewarthe6 Oct 2, 2024
16d50f8
Undo transformations before calculating mean and std of predictions
stewarthe6 Oct 28, 2024
3e58819
Merge branch '1.7.0' of github.com:rwilfong/AMPL into 1.7.0
stewarthe6 Oct 28, 2024
0280941
Removed pdb imports
stewarthe6 Oct 28, 2024
a4c2b83
Updated help for 'seed' input
stewarthe6 Nov 27, 2024
8e29047
Removed commented out seed
stewarthe6 Nov 27, 2024
268ba05
model_retrain has an option to either keep or discard the saved seed.…
stewarthe6 Nov 27, 2024
32 changes: 32 additions & 0 deletions atomsci/ddm/docs/PARAMETERS.md
@@ -276,6 +276,14 @@ The AMPL pipeline contains many parameters and options to fit models and make predictions
|*Description:*|True/False flag for setting verbosity|
|*Default:*|FALSE|
|*Type:*|Bool|

- **seed**

|||
|-|-|
|*Description:*|Seed used to initialize the random number generator so that results are reproducible. Default is None, in which case a random seed is generated.|
|*Default:*|None|
|*Type:*|int|

- **production**

@@ -529,6 +537,30 @@ the model will train for max_epochs regardless of validation error.|
|*Default:*|scaffold|
|*Type:*|str|

- **sampling_method**

|||
|-|-|
|*Description:*|The sampling method for addressing class imbalance in classification datasets. Options include 'undersampling' and 'SMOTE'.|
|*Default:*|None|
|*Type:*|str|

- **sampling_ratio**

|||
|-|-|
|*Description:*|The desired ratio of the minority class to the majority class after sampling. May be given as a string (e.g., 'minority', 'not minority') or as a float (e.g., 0.2, 1.0).|
|*Default:*|auto|
|*Type:*|str|

- **sampling_k_neighbors**

|||
|-|-|
|*Description:*|The number of nearest neighbors to consider when generating synthetic samples (e.g., 5, 7, 9). Used only with the SMOTE sampling method.|
|*Default:*|5|
|*Type:*|int|

- **mtss\_num\_super\_scaffolds**

|||
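The new options above can be combined in a single pipeline configuration. Below is a minimal illustrative sketch, not taken from this PR: the file name and the keys other than seed, sampling_method, sampling_ratio, and sampling_k_neighbors are assumptions about a typical AMPL JSON config, written here as the Python dict that would be serialized.

import json

# Hypothetical config fragment exercising the parameters documented above.
# Only seed and the sampling_* keys come from this PR; the rest are assumed.
config = {
    "prediction_type": "classification",
    "splitter": "scaffold",
    "seed": 42,                   # fixed seed -> reproducible split and training
    "sampling_method": "SMOTE",   # or "undersampling"
    "sampling_ratio": "auto",     # str ('minority', 'not minority') or float (e.g. 0.2)
    "sampling_k_neighbors": 5,    # only consulted when sampling_method == "SMOTE"
}

with open("example_seeded_smote_config.json", "w") as fp:
    json.dump(config, fp, indent=2)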
25 changes: 16 additions & 9 deletions atomsci/ddm/pipeline/GeneticAlgorithm.py
@@ -1,10 +1,10 @@
import numpy as np
import uuid
import scipy.spatial.distance as scipy_distance
import multiprocessing
import random
from tqdm import tqdm
import timeit
from typing import Any, Callable, List, Tuple
from typing import Any, Callable, List, Tuple, Optional

N_PROCS = multiprocessing.cpu_count()

@@ -22,7 +22,8 @@ def __init__(self,
init_pop: List[List[Any]],
fitness_func: Callable,
crossover_func: Callable,
mutate_func: Callable):
mutate_func: Callable,
seed: Optional[int]):
"""
Creates a GeneticAlgorithm object

@@ -40,8 +41,14 @@ mutate_func: Callable
mutate_func: Callable
A callable that takes a list of chromosomes and returns another list of mutated
chromosomes
seed: Optional[int]
Seed for random number generator
"""

if seed is None:
seed = uuid.uuid4().int % (2**32)
self.random_state = np.random.default_rng(seed)

self.pop = init_pop
self.pop_scores = None
self.num_pop = len(init_pop)
@@ -177,13 +184,13 @@ def step(self, print_timings: bool = False):

# select parents using rank selection
i = timeit.default_timer()
new_pop = self.crossover_func(parents, self.num_pop)
new_pop = self.crossover_func(parents, self.num_pop, random_state=self.random_state)
if print_timings:
print('\tcrossover %0.2f min'%((timeit.default_timer()-i)/60))

# mutate population
i = timeit.default_timer()
self.pop = self.mutate_func(new_pop)
self.pop = self.mutate_func(new_pop, random_state=self.random_state)
if print_timings:
print('\tmutate %0.2f min'%((timeit.default_timer()-i)/60))
print('total %0.2f min'%((timeit.default_timer()-start)/60))
@@ -199,23 +206,23 @@ def step(self, print_timings: bool = False):
def fitness_func(chromosome):
return 1 - scipy_distance.rogerstanimoto(chromosome, target_chromosome)

def crossover_func(parents, pop_size):
def crossover_func(parents, pop_size, random_state):
new_pop = []
for i in range(num_pop):
parent1 = parents[i%len(parents)]
parent2 = parents[(i+1)%len(parents)]

crossover_point = random.randint(0, len(parents[0])-1)
crossover_point = random_state.integers(0, len(parents[0])-1, 1)[0]
new_pop.append(parent1[:crossover_point]+parent2[crossover_point:])

return new_pop

def mutate_func(pop, mutate_chance=0.01):
def mutate_func(pop, random_state, mutate_chance=0.01):
new_pop = []
for chromosome in pop:
new_chromosome = list(chromosome)
for i, g in enumerate(new_chromosome):
if random.random() < mutate_chance:
if random_state.random() < mutate_chance:
if new_chromosome[i] == 0:
new_chromosome[i] = 1
else:
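The key change above is that GeneticAlgorithm now owns a single np.random.Generator, seeded either by the caller or by a UUID-derived fallback, and passes it to the crossover and mutation callables. A stand-alone sketch of that pattern follows; the helper names are invented for the example and are not part of the PR.

import uuid
import numpy as np

def make_rng(seed=None):
    # Mirrors the fallback in GeneticAlgorithm.__init__: derive a 32-bit seed
    # from a UUID when the caller does not supply one.
    if seed is None:
        seed = uuid.uuid4().int % (2**32)
    return np.random.default_rng(seed)

def crossover_point(rng, chromosome_len):
    # Same call pattern as the patched crossover_func above.
    return int(rng.integers(0, chromosome_len - 1, 1)[0])

# Two generators built from the same seed reproduce the same crossover points,
# which is what makes seeded runs repeatable.
rng_a, rng_b = make_rng(42), make_rng(42)
assert [crossover_point(rng_a, 100) for _ in range(5)] == [crossover_point(rng_b, 100) for _ in range(5)]

# Omitting the seed still works; the run is just no longer reproducible.
_ = crossover_point(make_rng(), 100)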
27 changes: 17 additions & 10 deletions atomsci/ddm/pipeline/MultitaskScaffoldSplit.py
@@ -1,6 +1,5 @@
import argparse
import logging
import random
import timeit
import tempfile
from typing import List, Optional, Set, Tuple
@@ -636,8 +635,8 @@ def split(self,
A tuple with 3 elements that are training, validation, and test compound
indices into dataset, respectively
"""
if seed is not None:
np.random.seed(seed)
self.seed = seed

self.dataset = dataset
self.diff_fitness_weight_tvt = diff_fitness_weight_tvt
self.diff_fitness_weight_tvv = diff_fitness_weight_tvv
@@ -674,7 +673,7 @@ def split(self,
population.append(split_chromosome)

gene_alg = ga.GeneticAlgorithm(population, self.grade, ga_crossover,
ga_mutate)
ga_mutate, self.seed)
#gene_alg.iterate(num_generations)
for i in range(self.num_generations):
gene_alg.step(print_timings=print_timings)
@@ -859,7 +858,8 @@ def train_valid_test_split(self,
return train_dataset, valid_dataset, test_dataset

def ga_crossover(parents: List[List[str]],
num_pop: int) -> List[List[str]]:
num_pop: int,
random_state: np.random.Generator) -> List[List[str]]:
"""Create the next generation from parents

A random index is chosen and genes up to that index from
@@ -872,6 +872,8 @@ def ga_crossover(parents: List[List[str]],
A list of chromosomes.
num_pop: int
The number of new chromosomes to make
random_state: np.random.Generator
Random number generator
Returns
-------
List[List[str]]
@@ -883,13 +885,14 @@ def ga_crossover(parents: List[List[str]],
parent1 = parents[i%len(parents)]
parent2 = parents[(i+1)%len(parents)]

crossover_point = random.randint(0, len(parents[0])-1)
crossover_point = random_state.integers(low=0, high=len(parents[0])-1, size=1)[0]
new_pop.append(parent1[:crossover_point]+parent2[crossover_point:])

return new_pop

def ga_mutate(new_pop: List[List[str]],
mutation_rate: float = .02) -> List[List[str]]:
random_state: np.random.Generator,
mutation_rate: float = .02,) -> List[List[str]]:
"""Mutate the population

Each chromosome is copied and mutated at mutation_rate.
@@ -900,6 +903,8 @@ def ga_mutate(new_pop: List[List[str]],
----------
new_pop: List[List[str]]
A list of chromosomes.
random_state: np.random.Generator
Random number generator
mutation_rate: float
How often a mutation occurs. 0.02 is a good rate for
my test sets.
@@ -912,8 +917,8 @@ def ga_mutate(new_pop: List[List[str]],
for solution in new_pop:
new_solution = list(solution)
for i, gene in enumerate(new_solution):
if random.random() < mutation_rate:
new_solution[i] = ['train', 'valid', 'test'][random.randint(0,2)]
if random_state.random() < mutation_rate:
new_solution[i] = ['train', 'valid', 'test'][random_state.integers(low=0, high=3, size=1)[0]]
mutated.append(new_solution)

return mutated
@@ -1039,6 +1044,7 @@ def parse_args():
parser.add_argument('id_col', type=str, help='the column containing ids')
parser.add_argument('response_cols', type=str, help='comma seperated string of response columns')
parser.add_argument('output', type=str, help='name of the split file')
parser.add_argument('seed', type=int, default=0, help='random seed used to make the split reproducible')

return parser.parse_args()

@@ -1054,5 +1060,6 @@ def parse_args():
mss = MultitaskScaffoldSplitter()
mss_split_df = split_with(total_df, mss,
smiles_col=args.smiles_col, id_col=args.id_col, response_cols=response_cols,
diff_fitness_weight=dfw, ratio_fitness_weight=rfw, num_generations=args.num_gens)
diff_fitness_weight=dfw, ratio_fitness_weight=rfw, num_generations=args.num_gens,
seed=args.seed)
mss_split_df.to_csv(args.output, index=False)
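ga_crossover and ga_mutate now draw from the np.random.Generator handed to them instead of the module-level random state, so a seeded MultitaskScaffoldSplitter run is repeatable end to end. A self-contained sketch of the mutation side follows; the names are local to this example, while the real functions operate on train/valid/test chromosomes exactly as shown in the diff above.

from typing import List
import numpy as np

def mutate(pop: List[List[str]], rng: np.random.Generator,
           mutation_rate: float = 0.02) -> List[List[str]]:
    # Same idea as ga_mutate: copy each chromosome and reassign genes at
    # mutation_rate, drawing every random number from the shared generator.
    subsets = ['train', 'valid', 'test']
    mutated = []
    for chromosome in pop:
        new_chromosome = list(chromosome)
        for i in range(len(new_chromosome)):
            if rng.random() < mutation_rate:
                new_chromosome[i] = subsets[int(rng.integers(0, len(subsets)))]
        mutated.append(new_chromosome)
    return mutated

pop = [['train'] * 20, ['valid'] * 20]
# Identical seeds give identical mutations; that determinism is the point of the change.
assert mutate(pop, np.random.default_rng(7)) == mutate(pop, np.random.default_rng(7))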
37 changes: 32 additions & 5 deletions atomsci/ddm/pipeline/model_datasets.py
@@ -433,7 +433,7 @@ def get_dataset_tasks(self, dset_df):
return self.tasks is not None

# ****************************************************************************************
def split_dataset(self):
def split_dataset(self, random_state=None, seed=None):
"""Splits the dataset into paired training/validation and test subsets, according to the split strategy
selected by the model params. For traditional train/valid/test splits, there is only one training/validation
pair. For k-fold cross-validation splits, there are k different train/valid pairs; the validation sets are
@@ -452,7 +452,7 @@ def split_dataset(self):

# Create object to delegate splitting to.
if self.splitting is None:
self.splitting = split.create_splitting(self.params)
self.splitting = split.create_splitting(self.params, random_state=random_state, seed=seed)
self.train_valid_dsets, self.test_dset, self.train_valid_attr, self.test_attr = \
self.splitting.split_dataset(self.dataset, self.attr, self.params.smiles_col)
if self.train_valid_dsets is None:
@@ -479,6 +479,12 @@ def _check_classes(self):
(Boolean): boolean specifying if all classes are specified in all splits
"""
ref_class_set = get_classes(self.train_valid_dsets[0][0].y)
num_classes = len(ref_class_set)
if num_classes != self.params.class_number:
logger = logging.getLogger('ATOM')
logger.warning(f"Expected class_number:{self.params.class_number} "
f"classes but got {num_classes} instead. Double check "
"response columns or class_number parameter.")
for train, valid in self.train_valid_dsets:
if not ref_class_set == get_classes(train.y):
return False
@@ -563,7 +569,7 @@ def create_dataset_split_table(self):
return split_df

# ****************************************************************************************
def load_presplit_dataset(self, directory=None):
def load_presplit_dataset(self, directory=None, random_state=None, seed=None):
"""Loads a table of compound IDs assigned to split subsets, and uses them to split
the currently loaded featurized dataset.

Expand All @@ -590,7 +596,7 @@ def load_presplit_dataset(self, directory=None):
"""

# Load the split table from the datastore or filesystem
self.splitting = split.create_splitting(self.params)
self.splitting = split.create_splitting(self.params, random_state=random_state, seed=seed)

try:
split_df, split_kv = self.load_dataset_split_table(directory)
@@ -655,11 +661,31 @@ def combined_training_data(self):
# All of the splits have the same combined train/valid data, regardless of whether we're using
# k-fold or train/valid/test splitting.
if self.combined_train_valid_data is None:
# normally combining one fold is sufficient, but if SMOTE or undersampling is being used
# just combining the first fold isn't enough
Review comment (Collaborator): For undersampling, it looks like it assumes that K-fold undersampling would sample the entire non-test dataset. What if this isn't the case? Is this assumption ensured elsewhere?

Reply (Collaborator): I don't think a compound can ever be wiped entirely out of existence due to undersampling. Undersampling is only applied to the training set of each fold, and since every compound has a 'turn' in the validation set, that compound must appear at least once. This isn't tested, but do we need to test it anywhere? I think it's ok if a compound is dropped entirely, since that's what happens when using undersampling without k-fold validation.

(train, valid) = self.train_valid_dsets[0]
combined_X = np.concatenate((train.X, valid.X), axis=0)
combined_y = np.concatenate((train.y, valid.y), axis=0)
combined_w = np.concatenate((train.w, valid.w), axis=0)
combined_ids = np.concatenate((train.ids, valid.ids))

if self.params.sampling_method=='SMOTE' or self.params.sampling_method=='undersampling':
# for each successive fold, merge in any new compounds
# this loop just won't run if there are no additional folds
for train, valid in self.train_valid_dsets[1:]:
fold_ids = np.concatenate((train.ids, valid.ids))
new_id_indexes = [i for i, cid in enumerate(fold_ids) if cid not in combined_ids]

fold_ids = fold_ids[new_id_indexes]
fold_X = np.concatenate((train.X, valid.X), axis=0)[new_id_indexes]
fold_y = np.concatenate((train.y, valid.y), axis=0)[new_id_indexes]
fold_w = np.concatenate((train.w, valid.w), axis=0)[new_id_indexes]

combined_X = np.concatenate((combined_X, fold_X), axis=0)
combined_y = np.concatenate((combined_y, fold_y), axis=0)
combined_w = np.concatenate((combined_w, fold_w), axis=0)
combined_ids = np.concatenate((combined_ids, fold_ids))

self.combined_train_valid_data = NumpyDataset(combined_X, combined_y, w=combined_w, ids=combined_ids)
return self.combined_train_valid_data

@@ -697,7 +723,8 @@ def get_subset_responses_and_weights(self, subset, transformers):
"""
if subset not in self.subset_response_dict:
if subset in ('train', 'valid', 'train_valid'):
dataset = self.combined_training_data()
for fold, (train, valid) in enumerate(self.train_valid_dsets):
Review comment (Collaborator): If you are just looking for new compounds in each fold, can you just concatenate all train_valid_dsets and then call set(ids) or drop_duplicates() or something? Might make the code more efficient than multiple for loops, but I'm not sure if it is actually easier based on the way the datasets are stored.

Reply (Collaborator): These datasets are NumpyDatasets and contain an X matrix, y matrix, w matrix, and ids. I'd have to put them into a data frame, call drop_duplicates, and then put it back into a NumpyDataset. However, I think I can get rid of that loop on 726. That's not necessary.

dataset = self.combined_training_data()
elif subset == 'test':
dataset = self.test_dset
else:
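When SMOTE or undersampling is enabled, the k folds no longer share an identical train+valid union, so combined_training_data walks the remaining folds and appends only compounds whose ids have not been seen yet; as the review thread above notes, a compound removed from one fold's training set still appears in the fold where it sits in the validation set. A stand-alone sketch of that merge, using plain numpy arrays rather than DeepChem NumpyDataset objects and omitting weights for brevity:

import numpy as np

def merge_folds(folds):
    # folds: list of (X, y, ids) tuples, one per train/valid pair.
    # Keep each compound id exactly once, mirroring the loop added to
    # combined_training_data above.
    X, y, ids = folds[0]
    for fold_X, fold_y, fold_ids in folds[1:]:
        seen = set(ids)
        new_idx = [i for i, cid in enumerate(fold_ids) if cid not in seen]
        X = np.concatenate((X, fold_X[new_idx]), axis=0)
        y = np.concatenate((y, fold_y[new_idx]), axis=0)
        ids = np.concatenate((ids, fold_ids[new_idx]))
    return X, y, ids

# Fold 2 re-samples compound "b" and introduces "d"; only "d" is appended.
fold1 = (np.ones((3, 2)), np.zeros(3), np.array(["a", "b", "c"]))
fold2 = (np.ones((2, 2)), np.zeros(2), np.array(["b", "d"]))
_, _, merged_ids = merge_folds([fold1, fold2])
assert list(merged_ids) == ["a", "b", "c", "d"]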