Home
Containers needed:
- MongoDB Container
- Redis Container
- Web Frontend (Angular) Container
- ETL Container
- ML Container
- Preprocessor Container
The ETL input will look like:
{ 'text': 'text to classify', 'uuid': 'a uuid mapping back to the file used' }
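A minimal sketch of building and checking that message, assuming it travels as a JSON string. The helper names `make_etl_message` and `parse_etl_message` are hypothetical, and pushing the string onto the Redis queue is left out.

```python
import json
import uuid as uuidlib


def make_etl_message(text, file_uuid=None):
    """Build the ETL message as a JSON string (hypothetical helper)."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    # Assumption: generate a random UUID4 when the caller does not supply one.
    file_uuid = file_uuid or str(uuidlib.uuid4())
    return json.dumps({"text": text, "uuid": file_uuid})


def parse_etl_message(raw):
    """Decode a message and check that both expected keys are present."""
    message = json.loads(raw)
    missing = {"text", "uuid"} - set(message)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return message
```

The consumer side can call `parse_etl_message` on whatever it pops off the queue, so malformed messages fail early instead of reaching MongoDB.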
The Preprocessor Container
... takes text, uuid off of the Redis queue
... writes data in this format to MongoDB Container, collection 'unlabeled_skills'
{
  description: string, // preprocessed skill string
  not_skill: 0 (int), // counter for number of "not a skill" labels
  is_a_skill: 0 (int), // counter for number of "is a skill" labels
  uuid: string, // uuid of the text, maps back to a dataset
  details: {
    source_date: date, // date this preprocessed skill was created
    preprocessor_id: string // version id of the preprocessor
  }
}
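The document above could be assembled roughly as follows. This is a sketch only: `preprocess` is a placeholder (whitespace collapsing and lowercasing are assumptions, not the real preprocessing), and `PREPROCESSOR_ID` and `make_unlabeled_skill` are hypothetical names.

```python
from datetime import datetime, timezone

PREPROCESSOR_ID = "preprocessor-v0"  # hypothetical version id


def preprocess(text):
    """Placeholder preprocessing (assumption): collapse whitespace, lowercase."""
    return " ".join(text.split()).lower()


def make_unlabeled_skill(message, preprocessor_id=PREPROCESSOR_ID, now=None):
    """Turn a {'text', 'uuid'} message into an 'unlabeled_skills' document."""
    return {
        "description": preprocess(message["text"]),
        "not_skill": 0,    # label counters start at zero
        "is_a_skill": 0,
        "uuid": message["uuid"],
        "details": {
            "source_date": now or datetime.now(timezone.utc),
            "preprocessor_id": preprocessor_id,
        },
    }
```

The returned dict has the shape a MongoDB driver expects for an insert into 'unlabeled_skills'.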
The Machine Learning Container
... samples the MongoDB Container, collection 'unlabeled_skills'
... upserts the following into the document
{
  ...
  details: {
    ...
    predicted_probability: float, // probability that description is a skill*
    date_predicted: date, // date this prediction was made
    oracle_importance: float, // oracle active learning importance (not confidence)
    oracle_id: string, // version of the oracle (vowpal wabbit)
    predicted_count: int, // number of times predicted
    taught: bool // indicator of whether the oracle used this example to teach itself*
  }
}
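One way to express that update is with MongoDB's `$set` and `$inc` operators on dotted `details.*` paths, so existing fields in `details` are left alone. The sketch below only builds the query and update documents. Matching on `_id`, the `ORACLE_ID` value, and the function name are assumptions.

```python
from datetime import datetime, timezone

ORACLE_ID = "vw-oracle-v0"  # hypothetical oracle version id


def build_prediction_update(doc_id, probability, importance, taught,
                            oracle_id=ORACLE_ID, now=None):
    """Return (query, update) documents for one prediction."""
    query = {"_id": doc_id}  # assumption: documents are matched by _id
    update = {
        # Dotted paths touch only these fields, not the rest of `details`.
        "$set": {
            "details.predicted_probability": float(probability),
            "details.date_predicted": now or datetime.now(timezone.utc),
            "details.oracle_importance": float(importance),
            "details.oracle_id": oracle_id,
            "details.taught": bool(taught),
        },
        # $inc creates the counter at 1 if it does not exist yet.
        "$inc": {"details.predicted_count": 1},
    }
    return query, update
```

With PyMongo, the pair could then be passed as `collection.update_one(query, update)`.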
The front end does biased sampling over oracle_importance (for now, picking at random from the top N is enough).
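A sketch of the "random from top N" variant, assuming the candidate documents are already loaded as dicts. The function name and the `n`, `k`, and `rng` parameters are hypothetical.

```python
import random


def sample_top_n(docs, n=10, k=1, rng=None):
    """Pick k documents at random from the n highest oracle_importance values."""
    rng = rng or random
    # Skip documents the oracle has not scored yet.
    scored = [d for d in docs if "oracle_importance" in d.get("details", {})]
    top = sorted(scored,
                 key=lambda d: d["details"]["oracle_importance"],
                 reverse=True)[:n]
    return rng.sample(top, min(k, len(top)))
```

Passing a seeded `random.Random` as `rng` makes the selection reproducible, which can help when debugging what the front end shows.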
- vowpal wabbit note: use --link=logistic, which applies 1/(1+exp(-x)), together with --loss_function=logistic. Alternatively, we can use --link=glf1 for [-1, +1] limits. See vowpal wabbit predicting probabilities.
- oracle note: examples that have taught: false and not_skill/is_a_skill ratios can be used for hold-out testing of the accuracy of the oracle. This is oracle metadata that is periodically updated and stored, available via an API call and maybe displayed as a line chart in the front end.
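The two notes can be sketched in plain Python. `logistic_link` is the 1/(1+exp(-x)) mapping named above, and `glf1_link` is the same curve rescaled to [-1, +1]. `holdout_accuracy` is one hypothetical way to score the oracle. It assumes the label counters are named `not_skill` and `is_a_skill` as in the schema, treats the majority label as ground truth, and uses a 0.5 threshold on predicted_probability.

```python
import math


def logistic_link(x):
    """1 / (1 + exp(-x)), written to avoid overflow for very negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def glf1_link(x):
    """Logistic curve rescaled from [0, 1] to [-1, +1]."""
    return 2.0 * logistic_link(x) - 1.0


def holdout_accuracy(docs, threshold=0.5):
    """Accuracy over labeled examples the oracle did not train on."""
    correct = total = 0
    for doc in docs:
        details = doc.get("details", {})
        if details.get("taught") is not False:
            continue  # only examples explicitly marked taught: false
        if "predicted_probability" not in details:
            continue
        yes, no = doc.get("is_a_skill", 0), doc.get("not_skill", 0)
        if yes == no:
            continue  # unlabeled or tied, so there is no majority label
        predicted = details["predicted_probability"] >= threshold
        correct += int(predicted == (yes > no))
        total += 1
    return correct / total if total else None
```

Storing the returned value with a timestamp on each run would give the series for the line chart mentioned above.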