MegaHAL
MegaHAL is a learning chatterbot. It is trained on question/answer pairs (where the answers are known to be suitable for the questions) or on individual sentences. This document describes how it works. The goals are to:
- Demonstrate that performance improves as the available data grows.
- Generate reasonable conversation transcripts.
- Work across all human languages.
- Experiment with how performance varies with aggressiveness of pruning.
Models are mappings of contexts to distributions. Contexts are of fixed, predetermined length, and may contain a mixture of norms, puncs and words in arbitrary orders (that is, they're not just markovian). Distributions are maps of IDs to counts, together with a total count. The model maintains an iteration counter that is incremented each time it is updated, as does each distribution; this allows the age of a particular distribution to be calculated. Models can self-prune, during which they perform the following operations (a data-structure sketch follows the list):
- Very old distributions may be removed entirely.
- Rare entries in a distribution may be removed entirely.
- Similar distributions may be merged.
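A minimal Python sketch of these structures, with all class and method names invented for illustration (the actual implementation differs, and the merging of similar distributions is left out):

```python
from collections import defaultdict

class Distribution:
    """Counts of observed IDs for one context, plus bookkeeping for pruning."""
    def __init__(self, iteration):
        self.counts = defaultdict(int)   # ID -> count
        self.total = 0                   # sum of all counts
        self.iteration = iteration       # iteration at which this was last updated

    def observe(self, symbol_id, iteration):
        self.counts[symbol_id] += 1
        self.total += 1
        self.iteration = iteration

class Model:
    """Maps fixed-length contexts (tuples of IDs) to distributions."""
    def __init__(self, context_length):
        self.context_length = context_length
        self.contexts = {}               # context tuple -> Distribution
        self.iteration = 0               # incremented on every update

    def update(self, context, symbol_id):
        assert len(context) == self.context_length
        self.iteration += 1
        dist = self.contexts.setdefault(context, Distribution(self.iteration))
        dist.observe(symbol_id, self.iteration)

    def prune(self, max_age, min_count):
        """Drop very old distributions and rare entries (merging not shown)."""
        for context in list(self.contexts):
            dist = self.contexts[context]
            if self.iteration - dist.iteration > max_age:
                del self.contexts[context]                     # too old
                continue
            for symbol_id in list(dist.counts):
                if dist.counts[symbol_id] < min_count:
                    dist.total -= dist.counts.pop(symbol_id)   # too rare
```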
A sentence is decomposed into three arrays of integer IDs: puncs, norms and words.
The IDs are assigned by a dictionary. Efficiency aside, dealing with arrays of IDs is preferable because the problem immediately becomes more abstract: there is no temptation to look inside the string representation of each word for meaning.
Puncs contain word separators (whitespace and punctuation), such as `"?`. Words contain alphanumeric words, such as `Hello`. Norms contain normalised versions of words, such as `HELLO`. All subsequent processing uses the norms exclusively.
The decomposition is done in a language-neutral fashion, using the UTF-8 encoding, and falling back to character segmentation when word boundaries don't exist or cannot reliably be determined.
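A rough illustration of the decomposition, assuming a hypothetical `Dictionary` class for ID assignment; the regex split is only a stand-in for the real language-neutral, UTF-8-aware segmentation:

```python
import re

class Dictionary:
    """Assigns a stable integer ID to each distinct string, with reverse lookup."""
    def __init__(self):
        self.ids = {}       # string -> ID
        self.strings = []   # ID -> string
    def id_for(self, token):
        if token not in self.ids:
            self.ids[token] = len(self.strings)
            self.strings.append(token)
        return self.ids[token]
    def string_for(self, token_id):
        return self.strings[token_id]

def decompose(sentence, punc_dict, word_dict, norm_dict):
    # Alternate separators and words; real MegaHAL segments UTF-8 text and
    # falls back to character segmentation when word boundaries are unclear.
    parts = re.split(r'(\w+)', sentence)
    seps, toks = parts[0::2], parts[1::2]
    puncs = [punc_dict.id_for(s) for s in seps]           # separators (may be empty strings)
    words = [word_dict.id_for(w) for w in toks]           # words as written
    norms = [norm_dict.id_for(w.upper()) for w in toks]   # normalised words
    return puncs, norms, words

punc_dict, word_dict, norm_dict = Dictionary(), Dictionary(), Dictionary()
puncs, norms, words = decompose("Hello, world?", punc_dict, word_dict, norm_dict)
# puncs has one more entry than words: separators surround every word.
```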
Given a novel array of norms, we wish to display a correctly punctuated and capitalised sentence to the user. We do this as follows (a top-level sketch follows the list):
- Generate a puncs array that represents the most likely sequence of punctuation between consecutive elements in the norms array.
- Generate a words array that represents the most likely capitalisation of elements in the norms array, given the surrounding context of puncs and norms.
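At the top level the rewriting step might look like the sketch below, where `generate_puncs` and `generate_words` stand in for the two steps above and the dictionaries are the hypothetical ones from the earlier sketch (all names are illustrative, not the gem's API):

```python
def rewrite(norms, punc_model, word_model, punc_dict, word_dict):
    """Turn a novel array of norm IDs into a displayable sentence."""
    punc_ids = generate_puncs(norms, punc_model)             # most likely separators
    word_ids = generate_words(norms, punc_ids, word_model)   # most likely capitalisation
    # Interleave the results: punc, word, punc, word, ..., punc.
    out = []
    for punc_id, word_id in zip(punc_ids, word_ids):
        out.append(punc_dict.string_for(punc_id))
        out.append(word_dict.string_for(word_id))
    out.append(punc_dict.string_for(punc_ids[-1]))
    return "".join(out)
```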
Generating the puncs array occurs in several steps (sketched in code after the list):
- Determine, for each slot to be filled, a distribution over candidate punctuation elements given the surrounding context of norms. From these distributions we can generate many different puncs arrays and calculate the probability of each.
- Use a language model to measure the likelihood of each generation being the best one. This takes into account both local and long-distance context so that brackets, quotation characters and so on are more likely to be correctly matched.
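One way to realise these two steps, sketched with hypothetical helpers: `slot_distribution` returns a distribution over candidate separators for one slot, and `sequence_score` stands in for the language model of the second step. The exhaustive search over combinations is for clarity only:

```python
import itertools, math

def generate_puncs(norms, punc_model, top_k=3):
    # Step 1: for each of the len(norms) + 1 slots, a distribution over
    # candidate separators given the surrounding norms.
    slot_candidates = []
    for slot in range(len(norms) + 1):
        dist = slot_distribution(punc_model, norms, slot)         # {punc_id: probability}
        best = sorted(dist.items(), key=lambda kv: -kv[1])[:top_k]
        slot_candidates.append(best)

    # Step 2: score each combination with the language model, which can also
    # reward matched brackets and quotes over long distances.
    best_seq, best_score = None, -math.inf
    for combo in itertools.product(*slot_candidates):
        punc_ids = [punc_id for punc_id, _ in combo]
        local = sum(math.log(p) for _, p in combo)                # per-slot probabilities
        score = local + sequence_score(punc_ids)                  # long-distance terms
        if score > best_score:
            best_seq, best_score = punc_ids, score
    return best_seq
```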
This second model is complicated. We can use a markovian predictor to give us local context, but incorporating long-distance information is something I need to think about more.
Four predictors are used, each of which uses one of the surrounding context IDs from the puncs and norms arrays. Predictions are blended and the maximum selected.
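A schematic of that blending and selection for a single slot (for example, choosing the capitalisation of one word), assuming each predictor is a simple conditional table keyed on one surrounding context ID; the uniform blending weights are a guess, not taken from the source:

```python
from collections import defaultdict

def blend_and_select(predictors, context_ids):
    """Blend four single-context predictors and select the maximum.

    predictors:  four dicts, each mapping one surrounding context ID to a
                 {candidate_id: probability} distribution.
    context_ids: the four surrounding punc/norm IDs for the current slot.
    """
    blended = defaultdict(float)
    for predictor, context_id in zip(predictors, context_ids):
        for candidate_id, prob in predictor.get(context_id, {}).items():
            blended[candidate_id] += prob / len(predictors)   # uniform blend (assumed)
    # The maximum of the blended prediction is selected.
    return max(blended, key=blended.get) if blended else None
```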