Classifier with too many sequences #7

AlessioMilanese · 2020-04-16T18:49:14Z

When running htc create_db with ~30k sequences, I get the following error:

Traceback (most recent call last):
  File "./htc", line 327, in <module>
    status = main()
  File "./htc", line 286, in main
    create_db.create_db(args.aligned_sequences, args.taxonomy, args.verbose, args.output, args.use_cm_align, args.template_al)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 714, in create_db
    classifiers = train_all_classifiers(alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 356, in train_all_classifiers
    train_node_iteratively(node, sibilings, all_classifiers, alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 327, in train_node_iteratively
    train_node_iteratively(child, sibilings_child, all_classifiers, alignment, full_taxonomy)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 339, in train_node_iteratively
    all_classifiers, alignment, node)
  File "/Users/milanese/Dropbox/PhD/bin/htc/bin/create_db.py", line 314, in train_classifier
    clf.fit(X, y)
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py", line 1532, in fit
    accept_large_sparse=solver != 'liblinear')
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 719, in check_X_y
    estimator=estimator)
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 542, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/Users/milanese/miniconda3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

And in the log:

[2020-04-16 18:17:28,615]    TRAIN:"1729712 Candidatus Fermentibacteria":Find genes
[2020-04-16 18:17:28,639]       SEL_GENES:"1729712 Candidatus Fermentibacteria": 3 positive, 33086 negative
[2020-04-16 18:17:28,639]          TRAIN:"1729712 Candidatus Fermentibacteria":Train classifier

In particular, there are 33,086 negative labels against only 3 positive ones, and training the classifier fails. A related issue is that the classes are heavily unbalanced.
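To put a number on the imbalance, here is a minimal sketch of the "balanced" weighting heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn documents for class_weight="balanced". The helper name balanced_class_weights is made up for illustration and is not part of htc; the label counts are the ones from the log above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Hypothetical helper, not part of htc. Rare classes get large weights,
    frequent classes get weights close to 0.5 in the two-class case.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Label counts taken from the SEL_GENES log line: 3 positive, 33086 negative.
labels = [1] * 3 + [0] * 33086
weights = balanced_class_weights(labels)

for label in sorted(weights):
    print(label, weights[label])
```

With these counts the positive class is weighted several thousand times more than the negative class, which shows how skewed the training set is. Whether reweighting or subsampling the negatives is the better remedy for htc is left open here.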


AlessioMilanese · 2020-04-17T10:01:00Z

The problem is not the number of sequences: the input contained some NA values.
They arise because some genes are present in the taxonomy but not in the alignment. This is fixed in 778ef63.
Note that it would still be good to have more balanced classes, so we opened issue #8.
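For illustration, a minimal sketch of how a gene that is listed in the taxonomy but absent from the alignment can end up as a row of NaN values, and how filtering on the alignment avoids it. This is not the code from 778ef63; the data layout (a dict from gene ID to an encoded row) and the names split_by_alignment and has_non_finite are assumptions made for this example.

```python
import math


def split_by_alignment(taxonomy_genes, alignment):
    """Return (kept, missing): gene IDs with and without an aligned row."""
    kept = [g for g in taxonomy_genes if g in alignment]
    missing = [g for g in taxonomy_genes if g not in alignment]
    return kept, missing


def has_non_finite(rows):
    """True if any value in any row is NaN or infinite."""
    return any(not math.isfinite(v) for row in rows for v in row)


# Hypothetical toy data: gene_c is in the taxonomy but not in the alignment.
alignment = {"gene_a": [1.0, 0.0, 1.0], "gene_b": [0.0, 1.0, 1.0]}
taxonomy_genes = ["gene_a", "gene_b", "gene_c"]

# Naive lookup: the missing gene is filled with NaN, which is the kind of
# input that makes clf.fit(X, y) raise "Input contains NaN, infinity ...".
nan_row = [float("nan")] * 3
X_naive = [alignment.get(g, nan_row) for g in taxonomy_genes]

# Filtered lookup: only genes that actually have an aligned row are used.
kept, missing = split_by_alignment(taxonomy_genes, alignment)
X_clean = [alignment[g] for g in kept]

print("missing from alignment:", missing)
print("naive matrix has non-finite values:", has_non_finite(X_naive))
print("clean matrix has non-finite values:", has_non_finite(X_clean))
```

Reporting the missing gene IDs, rather than dropping them silently, makes a mismatch between the taxonomy file and the alignment easier to spot.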

AlessioMilanese self-assigned this Apr 16, 2020

AlessioMilanese added the "bug" label (Something isn't working) Apr 16, 2020

AlessioMilanese closed this as completed Apr 17, 2020
