Gregory Mundy edited this page Aug 25, 2017 · 6 revisions

Containers needed:

  • MongoDB Container
  • Redis Container
  • Web Frontend (Angular) Container
  • ETL Container
  • ML Container
  • Preprocessor Container

The ETL input will look like:

{ 'text': 'text to classify', 'uuid': 'a uuid mapping back to the file used' }
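
A minimal sketch of building one such message in Python. It assumes the record is serialized as JSON before it goes onto the queue, which this page does not specify; the helper name make_etl_message is hypothetical.

```python
import json
import uuid


def make_etl_message(text, file_uuid=None):
    """Serialize one ETL record as JSON (the JSON encoding is an assumption)."""
    # The uuid normally identifies the source file; generate one only as a fallback.
    if file_uuid is None:
        file_uuid = str(uuid.uuid4())
    return json.dumps({"text": text, "uuid": file_uuid})


message = make_etl_message("text to classify", "a uuid mapping back to the file used")
print(message)
```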

The Preprocessor Container

... takes text and uuid off the Redis queue

... writes data in this format to the MongoDB Container, collection 'unlabeled_skills'

{
  description: string, // preprocessed skill string
  not_skill: 0 (int), // counter for number of "not a skill" labels
  is_a_skill: 0 (int), // counter for number of "is a skill" labels
  uuid: string, // uuid of the text, maps back to a dataset
  details: {
    source_date: date, // date this preprocessed skill was created
    preprocessor_id: string // version id of the preprocessor
  }
}
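
As a rough illustration, the mapping from a queue message to this document could look like the sketch below. The preprocess step, the PREPROCESSOR_ID value, and the JSON encoding of the message are all assumptions made for the example; the real preprocessing logic is not described here.

```python
import json
from datetime import datetime, timezone

PREPROCESSOR_ID = "preprocessor-0.1"  # hypothetical version id


def preprocess(text):
    """Placeholder cleanup (an assumption): collapse whitespace and lowercase."""
    return " ".join(text.split()).lower()


def to_unlabeled_skill(raw_message, now=None):
    """Turn one JSON queue message into a document for 'unlabeled_skills'."""
    message = json.loads(raw_message)
    return {
        "description": preprocess(message["text"]),
        "not_skill": 0,   # label counters start at zero
        "is_a_skill": 0,
        "uuid": message["uuid"],
        "details": {
            "source_date": now or datetime.now(timezone.utc),
            "preprocessor_id": PREPROCESSOR_ID,
        },
    }


doc = to_unlabeled_skill('{"text": "  Python   Programming ", "uuid": "abc-123"}')
print(doc["description"])  # python programming
```

The resulting dict has the same shape as the schema above, so it could be handed to a MongoDB driver's insert call unchanged.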

The Machine Learning Container

... samples the MongoDB Container, collection 'unlabeled_skills'

... upserts the following into the document

{
  ...
  details: {
    ...
    predicted_probability: float, // probability that description is a skill*
    date_predicted: date, // date this prediction was made
    oracle_importance: float, // oracle active learning importance (not confidence)
    oracle_id: string, // version of the oracle (vowpal wabbit)
    predicted_count: int, // number of times predicted
    taught: bool // whether the oracle used this example to teach itself*
  }
}
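
One way to express this write is a MongoDB update document that sets the prediction fields by dotted path and increments the counter, so the existing details fields are left alone. This is a sketch only: the function name and the ORACLE_ID value are hypothetical, and how the document is selected is left open.

```python
from datetime import datetime, timezone

ORACLE_ID = "vw-oracle-0.1"  # hypothetical version id


def prediction_update(probability, importance, taught, now=None):
    """Build a MongoDB update document for the prediction fields."""
    return {
        # Dotted paths update single fields inside 'details' without replacing it.
        "$set": {
            "details.predicted_probability": float(probability),
            "details.date_predicted": now or datetime.now(timezone.utc),
            "details.oracle_importance": float(importance),
            "details.oracle_id": ORACLE_ID,
            "details.taught": bool(taught),
        },
        # $inc creates the counter with value 1 if the field does not exist yet.
        "$inc": {"details.predicted_count": 1},
    }


update = prediction_update(probability=0.83, importance=2.5, taught=False)
# With pymongo this could be applied as, for example:
#   collection.update_one({"_id": doc_id}, update)
print(update["$inc"])
```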

The front end then does biased sampling over oracle_importance (for now, picking at random from the top N is enough).
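
The "random from top N" shortcut can be sketched as follows, assuming the documents have already been fetched into a list of dicts. The function name and the default values for top_n and k are illustrative only.

```python
import random


def sample_for_labeling(docs, top_n=100, k=1, rng=random):
    """Pick k documents uniformly at random from the top_n by oracle_importance."""
    ranked = sorted(
        docs,
        key=lambda d: d["details"]["oracle_importance"],
        reverse=True,
    )
    pool = ranked[:top_n]
    # Never ask for more items than the pool holds.
    return rng.sample(pool, min(k, len(pool)))


docs = [{"uuid": str(i), "details": {"oracle_importance": float(i)}} for i in range(10)]
picked = sample_for_labeling(docs, top_n=3, k=2)
print([d["uuid"] for d in picked])
```

Passing a seeded random.Random instance as rng makes the selection reproducible for testing.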

  • vowpal wabbit note: use --link=logistic, which applies 1/(1+exp(-x)), together with --loss_function=logistic. Alternatively, --link=glf1 limits predictions to [-1, +1]. See "vowpal wabbit predicting probabilities".

  • oracle note: examples that have taught: false, together with their not_skill/is_a_skill ratios, can be used for hold-out testing of the oracle's accuracy. This oracle metadata is periodically updated and stored, available via an API call, and maybe displayed as a line chart in the front end.
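
Both notes can be illustrated in plain Python. The two link functions mirror what the vowpal wabbit options compute; the hold-out check is only one possible reading of the oracle note, assuming that the majority of the label counters is treated as the true label and that 0.5 is the decision threshold. Neither assumption comes from this page.

```python
import math


def logistic_link(x):
    """--link=logistic: maps a raw score into (0, 1)."""
    # No overflow guard: very negative x would need special handling.
    return 1.0 / (1.0 + math.exp(-x))


def glf1_link(x):
    """--link=glf1: maps a raw score into (-1, +1)."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def holdout_accuracy(docs, threshold=0.5):
    """Accuracy on untaught, labeled docs; None when no doc qualifies."""
    correct = total = 0
    for doc in docs:
        votes_yes, votes_no = doc["is_a_skill"], doc["not_skill"]
        details = doc["details"]
        # Skip examples the oracle trained on and ones without a majority label.
        if details.get("taught") or votes_yes == votes_no:
            continue
        # Skip examples that have not been predicted yet.
        if "predicted_probability" not in details:
            continue
        human_says_skill = votes_yes > votes_no  # majority vote (assumption)
        oracle_says_skill = details["predicted_probability"] >= threshold
        correct += human_says_skill == oracle_says_skill
        total += 1
    return correct / total if total else None
```

Recording the returned value with a timestamp on each periodic run would give the series needed for the line chart.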
