Classifier with too many sequences #7

Closed
AlessioMilanese opened this issue Apr 16, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@AlessioMilanese
Member

When running htc create_db with ~30k sequences, I get the following error:

Traceback (most recent call last):
  File "./htc", line 327, in <module>
    status = main()
  File "./htc", line 286, in main
    create_db.create_db(args.aligned_sequences, args.taxonomy, args.verbose, args.output, args.use_cm_align, args.template_al)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 714, in create_db
    classifiers = train_all_classifiers(alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 356, in train_all_classifiers
    train_node_iteratively(node, sibilings, all_classifiers, alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 327, in train_node_iteratively
    train_node_iteratively(child, sibilings_child, all_classifiers, alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 339, in train_node_iteratively
    all_classifiers, alignment, node)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 314, in train_classifier
    clf.fit(X, y)
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py", line 1532, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
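
The `ValueError` above comes from scikit-learn's input validation, which rejects any feature matrix containing NaN or infinite values. A minimal sketch of how one might find the offending rows before calling `clf.fit(X, y)` follows. The helper name `rows_with_non_finite` and the toy data are hypothetical and are not taken from `create_db.py`.

```python
import math

def rows_with_non_finite(X, row_names):
    """Return the names of rows in X that contain NaN or +/-inf."""
    bad = []
    for name, row in zip(row_names, X):
        # math.isfinite is False for NaN, +inf and -inf.
        if not all(math.isfinite(v) for v in row):
            bad.append(name)
    return bad

# Toy feature matrix (hypothetical): the second row carries a NaN.
X = [[1.0, 0.0], [float("nan"), 1.0], [0.0, 1.0]]
names = ["gene_a", "gene_b", "gene_c"]
bad_rows = rows_with_non_finite(X, names)
print(bad_rows)
```

If `X` is a NumPy array, `~np.isfinite(X).all(axis=1)` gives the same row mask in vectorised form.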

And in the log:

[2020-04-16 18:17:28,615]    TRAIN:"1729712 Candidatus Fermentibacteria":Find genes
[2020-04-16 18:17:28,639]       SEL_GENES:"1729712 Candidatus Fermentibacteria": 3 positive, 33086 negative
[2020-04-16 18:17:28,639]          TRAIN:"1729712 Candidatus Fermentibacteria":Train classifier

In particular, there are 33,086 negative labels, and training the classifier fails. A related issue is that the classes are heavily imbalanced (3 positive versus 33,086 negative).
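
To give a sense of how skewed these counts are, here is a rough sketch of the "balanced" class-weight heuristic, `n_samples / (n_classes * count_per_class)`, applied to the numbers from the log. This only illustrates one common way to compensate for imbalance and is not a description of what htc does. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Label counts taken from the log line: 3 positive, 33086 negative.
y = [1] * 3 + [0] * 33086
weights = balanced_class_weights(y)
print(weights)
```

scikit-learn's `LogisticRegression` accepts such a dictionary through its `class_weight` parameter, or computes the same weights itself with `class_weight="balanced"`. Whether reweighting is the right remedy here would need to be evaluated.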

@AlessioMilanese AlessioMilanese self-assigned this Apr 16, 2020
@AlessioMilanese AlessioMilanese added the bug Something isn't working label Apr 16, 2020
@AlessioMilanese
Member Author

The problem is not the number of sequences; it is that the input contained NaN values (NAs).
These arise because some genes are present in the taxonomy but not in the alignment. This is now fixed in 778ef63.
More balanced classes would still be desirable, so that is tracked separately in issue #8.
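
For readers who hit the same error, the mismatch between taxonomy and alignment can be checked with a sketch along these lines. It assumes the taxonomy provides a list of gene IDs and the alignment is a mapping from gene ID to an encoded sequence. Both data structures and the name `split_by_alignment` are hypothetical, and this is not necessarily how 778ef63 handles it.

```python
def split_by_alignment(taxonomy_genes, alignment):
    """Partition taxonomy gene IDs by whether they have an aligned sequence."""
    present = [g for g in taxonomy_genes if g in alignment]
    missing = [g for g in taxonomy_genes if g not in alignment]
    return present, missing

# Toy inputs (hypothetical): gene_b is in the taxonomy but was never aligned.
taxonomy_genes = ["gene_a", "gene_b", "gene_c"]
alignment = {"gene_a": [1, 0, 0, 1], "gene_c": [0, 1, 1, 0]}

present, missing = split_by_alignment(taxonomy_genes, alignment)
if missing:
    print(f"Skipping {len(missing)} gene(s) absent from the alignment: {missing}")

# Build the feature matrix only from genes that have a sequence,
# so that no row is filled with missing values.
X = [alignment[g] for g in present]
```

Reporting the skipped genes, rather than dropping them silently, makes it easier to spot an inconsistent taxonomy file.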


1 participant