Adding seed for reproducibility and sampling methods #344
base: 1.7.0
For undersampling, this looks like it assumes that K-fold undersampling samples the entire non-test dataset. What if that isn't the case? Is this assumption enforced elsewhere?
I don't think a compound can ever be wiped entirely out of existence due to undersampling. Undersampling is only applied to the training set of each fold.
And since every compound has a 'turn' in the validation set, that compound must appear at least once.
This isn't tested, but do we need to test it anywhere? Even if a compound were dropped entirely, I think that would be ok, since that's what happens when undersampling is used without k-fold validation.
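The coverage argument above can be checked with a rough sketch. This is not the PR's code: `kfold_indices` and `undersample` are hypothetical helpers, the compound IDs and labels are made up, and undersampling is modeled here as randomly dropping majority-class training rows until the classes are balanced.

```python
import random

def kfold_indices(n, k, rng):
    """Split range(n) into k disjoint validation folds (hypothetical helper)."""
    order = list(range(n))
    rng.shuffle(order)
    return [order[i::k] for i in range(k)]

def undersample(indices, labels, rng):
    """Randomly drop majority-class rows until both classes have equal size."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = sorted((pos, neg), key=len)
    return minority + rng.sample(majority, len(minority))

rng = random.Random(0)
ids = [f"cmpd_{i}" for i in range(20)]           # made-up compound IDs
labels = [1 if i < 4 else 0 for i in range(20)]  # 4 actives, 16 inactives

seen = set()
for valid_idx in kfold_indices(len(ids), k=5, rng=rng):
    valid_set = set(valid_idx)
    train_idx = [i for i in range(len(ids)) if i not in valid_set]
    train_idx = undersample(train_idx, labels, rng)  # training split only
    seen.update(ids[i] for i in train_idx)
    seen.update(ids[i] for i in valid_idx)           # validation is untouched

# Every compound is in exactly one validation fold, so nothing goes missing.
all_covered = seen == set(ids)
```

Under these assumptions `all_covered` holds no matter how aggressive the undersampling is, because the validation folds alone already partition the non-test compounds.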
If you are just looking for new compounds in each fold, can you just concatenate all train_valid_dsets and then call set(ids) or drop_duplicates() or something? Might make the code more efficient than multiple for loops, but I'm not sure if it is actually easier based on the way the datasets are stored.
These datasets are NumpyDatasets and contain an X matrix, y matrix, w matrix, and ids. I'd have to put them into a data frame, call drop_duplicates, and then put it back into a NumpyDataset.
However, I think I can get rid of that loop on line 726; it isn't necessary.
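One way to collect the distinct compound IDs without a round trip through a data frame might be a plain set union over each dataset's `ids`. The sketch below is only illustrative: `FoldDataset` is a hypothetical stand-in that models nothing but the `ids` field, `unique_ids` is an invented helper name, and `train_valid_dsets` is assumed to be a list of (train, valid) pairs.

```python
from dataclasses import dataclass
from itertools import chain

@dataclass
class FoldDataset:
    """Hypothetical stand-in for a NumpyDataset; only the ids field is modeled."""
    ids: list

def unique_ids(train_valid_dsets):
    """Return the set of compound IDs present in any train or valid split."""
    return set(chain.from_iterable(
        dset.ids for pair in train_valid_dsets for dset in pair))

# Made-up folds; compound "d" appears in only one training split.
train_valid_dsets = [
    (FoldDataset(["a", "b"]), FoldDataset(["c"])),
    (FoldDataset(["a", "c"]), FoldDataset(["b"])),
    (FoldDataset(["b", "d"]), FoldDataset(["a"])),
]
combined = unique_ids(train_valid_dsets)
```

This only deduplicates the IDs. If the deduplicated X, y, and w rows are needed as well, the IDs would still have to be mapped back to row indices, which is where the data frame approach might be simpler.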