Have more balanced classes for training #8

Open
AlessioMilanese opened this issue Apr 17, 2020 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@AlessioMilanese
Member

At the moment, when we train a node, we take all possible genes from the positive and negative classes. This can result in an unbalanced training set, for example:

[2020-04-16 18:17:28,615]    TRAIN:"1729712 Candidatus Fermentibacteria":Find genes
[2020-04-16 18:17:28,639]       SEL_GENES:"1729712 Candidatus Fermentibacteria": 3 positive, 33086 negative
[2020-04-16 18:17:28,639]          TRAIN:"1729712 Candidatus Fermentibacteria":Train classifier

Here we have 3 positive genes and 33,086 negative genes.

We need to improve the function find_training_genes in create_db.py.

@AlessioMilanese
Member Author

Partially solved in ba7aeae, where we do the following:

  1. Limit the number of positive samples to 500 (sub-sample if there are more).
  2. Limit the number of negative samples to 1,000 (sub-sample if there are more).
  3. Sub-sample the negative samples if there are more than 20 times as many negative as positive samples; this threshold is reduced to 3 times if there is only one sibling (line 346).
  4. We want at least 5 times as many negative as positive samples. If there are fewer, we pick additional negatives from outside the siblings: we randomly choose 5 positive samples, find the most similar samples outside the siblings, and add those to the negative samples we already have. Note (line 363): at kingdom level it is not possible to add samples from outside the siblings (and possible_neg = 0).
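
For discussion, the four rules above can be sketched roughly as follows. This is not the code from ba7aeae: the function name `balance_training_genes`, the `find_similar_outside` callback, the constant names, and the order in which the rules are applied are all assumptions made for illustration. In particular, the issue does not say how many outside samples get added in step 4, so the sketch assumes "just enough to reach the 5x ratio".

```python
import random

# Thresholds taken from the comment above; the names are hypothetical.
MAX_POS = 500
MAX_NEG = 1000
MAX_NEG_RATIO = 20
MAX_NEG_RATIO_ONE_SIBLING = 3
MIN_NEG_RATIO = 5
N_SEEDS = 5


def balance_training_genes(positives, negatives, n_siblings,
                           find_similar_outside=None, rng=None):
    """Return (positives, negatives) after applying the four rules.

    find_similar_outside is a hypothetical callback
    (seed_positives, n_wanted) -> list of gene IDs from outside the
    siblings. Pass None when no such genes exist (e.g. kingdom level).
    """
    rng = rng or random.Random()
    pos = list(positives)
    neg = list(negatives)

    # Rule 1: cap the positive samples.
    if len(pos) > MAX_POS:
        pos = rng.sample(pos, MAX_POS)

    # Rule 2: cap the negative samples.
    if len(neg) > MAX_NEG:
        neg = rng.sample(neg, MAX_NEG)

    # Rule 3: upper bound on the negative/positive ratio.
    max_ratio = MAX_NEG_RATIO_ONE_SIBLING if n_siblings == 1 else MAX_NEG_RATIO
    max_neg = max_ratio * len(pos)
    if len(neg) > max_neg:
        neg = rng.sample(neg, max_neg)

    # Rule 4: lower bound on the ratio, topped up from outside the siblings.
    missing = MIN_NEG_RATIO * len(pos) - len(neg)
    if missing > 0 and pos and find_similar_outside is not None:
        seeds = rng.sample(pos, min(N_SEEDS, len(pos)))
        seen = set(pos) | set(neg)
        extra = [g for g in find_similar_outside(seeds, missing)
                 if g not in seen]
        neg.extend(extra[:missing])

    return pos, neg
```

Under these assumptions, the Candidatus Fermentibacteria example from the log (3 positive, 33086 negative, more than one sibling) would end up with 3 positive and 60 negative samples: the negatives are first capped at 1,000 and then reduced to 20 x 3.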

Can we do better?
