Home

Containers needed:

MongoDB Container
Redis Container
Web Frontend (Angular) Container
ETL Container
ML Container
Preprocessor Container

The ETL input will look like:

{ 'text': 'text to classify', 'uuid': 'a uuid mapping back to the file used' }
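
A minimal sketch of how such a record might be serialized for the queue, assuming JSON as the wire format (this page does not specify one; the single-quoted form above reads as a Python-style literal rather than valid JSON). The helper names `encode_message` and `decode_message` are hypothetical, and pushing the string onto Redis is left out.

```python
import json


def encode_message(text: str, file_uuid: str) -> str:
    """Serialize one ETL record as a JSON string (assumed wire format)."""
    return json.dumps({"text": text, "uuid": file_uuid})


def decode_message(raw: str) -> dict:
    """Parse a queued record and check that both expected keys are present."""
    message = json.loads(raw)
    missing = {"text", "uuid"} - set(message)
    if missing:
        raise ValueError(f"message is missing keys: {sorted(missing)}")
    return message


if __name__ == "__main__":
    raw = encode_message("text to classify", "a uuid mapping back to the file used")
    print(decode_message(raw)["text"])
```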

The Preprocessor Container

... takes text, uuid messages off the Redis queue

... writes data in this format to MongoDB Container, collection 'unlabeled_skills'

{
description: string, // preprocessed skill string
not_skill: 0 (int), // counter for number of "not a skill" labels
is_a_skill: 0 (int), // counter for number of "is a skill" labels
uuid: string, // uuid of the text, maps back to a dataset
details: {
          source_date: date, // date this preprocessed skill was created
          preprocessor_id: string // version id of the preprocessor
         }
}
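
As a rough illustration, the mapping from a queue message to this document layout could look like the sketch below. The `preprocess` step (whitespace collapsing and lowercasing), the `PREPROCESSOR_ID` value, and the function names are placeholders, not the project's actual preprocessing; writing the result to MongoDB is omitted.

```python
from datetime import datetime, timezone

PREPROCESSOR_ID = "preprocessor-0.1"  # hypothetical version label


def preprocess(text: str) -> str:
    """Placeholder cleaning step (assumption): collapse whitespace, lowercase."""
    return " ".join(text.split()).lower()


def build_unlabeled_skill(message: dict) -> dict:
    """Map a {'text', 'uuid'} message onto the 'unlabeled_skills' layout."""
    return {
        "description": preprocess(message["text"]),
        "not_skill": 0,   # no "not a skill" labels yet
        "is_a_skill": 0,  # no "is a skill" labels yet
        "uuid": message["uuid"],
        "details": {
            "source_date": datetime.now(timezone.utc),
            "preprocessor_id": PREPROCESSOR_ID,
        },
    }


if __name__ == "__main__":
    print(build_unlabeled_skill({"text": "  Python   Programming ", "uuid": "u1"}))
```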

The Machine Learning Container

... samples the MongoDB Container, collection 'unlabeled_skills'

... upserts the following into the document

{
 ...
details: {
          ...
          predicted_probability: float, // probability description is a skill*
          date_predicted: date, // date this prediction was made
          oracle_importance: float, // oracle active learning importance (not confidence)
          oracle_id: string, // version of the oracle (vowpal wabbit)
          predicted_count: int, // number of times predicted
          taught: bool, // indicator if the oracle used this example to teach itself or not*
         }
}
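
One way to express this write is as a MongoDB update document that sets the prediction fields under `details` and increments `predicted_count`. The sketch below only builds that update; the function name and the `ORACLE_ID` value are hypothetical, and how the target document is matched (for example by `uuid`) is an assumption this page does not settle.

```python
from datetime import datetime, timezone

ORACLE_ID = "vw-oracle-0.1"  # hypothetical version label


def build_prediction_update(probability: float, importance: float, taught: bool) -> dict:
    """Build a $set/$inc update for the prediction fields under 'details'."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return {
        "$set": {
            "details.predicted_probability": probability,
            "details.date_predicted": datetime.now(timezone.utc),
            "details.oracle_importance": importance,
            "details.oracle_id": ORACLE_ID,
            "details.taught": taught,
        },
        # Dotted paths leave the existing preprocessor fields in 'details' intact.
        "$inc": {"details.predicted_count": 1},
    }


if __name__ == "__main__":
    print(build_prediction_update(0.83, 2.4, taught=False))
```

With a driver such as pymongo, a dict like this would typically be passed as the update argument of `update_one`.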

The front end does biased sampling over oracle_importance when choosing documents to label (for now, picking at random from the top N is enough).
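
The "random from top N" shortcut can be sketched as follows, assuming the candidate documents are already loaded as dicts. The function name and the defaults for `top_n` and `k` are illustrative only.

```python
import heapq
import random


def sample_for_labeling(docs, top_n=100, k=1, rng=None):
    """Pick k documents uniformly at random from the top_n by oracle_importance."""
    rng = rng or random.Random()
    top = heapq.nlargest(
        top_n, docs, key=lambda d: d["details"]["oracle_importance"]
    )
    return rng.sample(top, min(k, len(top)))


if __name__ == "__main__":
    docs = [
        {"uuid": str(i), "details": {"oracle_importance": float(i)}}
        for i in range(10)
    ]
    print(sample_for_labeling(docs, top_n=3, k=2))
```

Sampling within the top N rather than always taking the single highest score keeps several labelers from being handed the same document at once.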

vowpal wabbit note: use --link=logistic, which applies 1/(1+exp(-x)) to the raw prediction, together with --loss_function=logistic. Alternatively, --link=glf1 gives [-1, +1] limits. See vowpal wabbit predicting probabilities.
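
For reference, the two link functions can be written out directly. The logistic form is the one quoted in the note; the glf1 form, 2/(1+exp(-x)) - 1, is given here from memory of vowpal wabbit's definition and should be checked against its documentation. This is a plain Python sketch, not vowpal wabbit's own code.

```python
import math


def logistic(x: float) -> float:
    """1 / (1 + exp(-x)), arranged so large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def glf1(x: float) -> float:
    """Logistic rescaled to the interval [-1, +1]: 2 / (1 + exp(-x)) - 1."""
    return 2.0 * logistic(x) - 1.0


if __name__ == "__main__":
    for raw in (-5.0, 0.0, 5.0):
        print(raw, logistic(raw), glf1(raw))
```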
oracle note: examples that have taught: false, together with their not_skill/is_a_skill ratios, can be used for hold-out testing of the oracle's accuracy. This is oracle metadata, periodically updated and stored, available via an API call and maybe displayed as a line chart in the front end.
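
A possible shape for that hold-out check is sketched below. It assumes the human label is the majority of the two counters (ties and unlabeled documents are skipped) and that a prediction counts as "skill" at a probability of 0.5 or more; both rules, and the function name, are assumptions rather than settled design.

```python
def holdout_accuracy(docs, threshold=0.5):
    """Accuracy of the oracle on labeled documents it was not taught with.

    Returns None when no document qualifies.
    """
    correct = 0
    total = 0
    for doc in docs:
        details = doc.get("details", {})
        if details.get("taught") or "predicted_probability" not in details:
            continue  # used for teaching, or never predicted
        yes, no = doc["is_a_skill"], doc["not_skill"]
        if yes == no:
            continue  # no labels yet, or a tie between labelers
        human_says_skill = yes > no
        oracle_says_skill = details["predicted_probability"] >= threshold
        correct += human_says_skill == oracle_says_skill
        total += 1
    return correct / total if total else None


if __name__ == "__main__":
    sample = [
        {"is_a_skill": 3, "not_skill": 0,
         "details": {"taught": False, "predicted_probability": 0.9}},
        {"is_a_skill": 0, "not_skill": 2,
         "details": {"taught": False, "predicted_probability": 0.7}},
    ]
    print(holdout_accuracy(sample))
```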
