MegaHAL
MegaHAL is a learning chatterbot. It is trained on question/answer pairs (where the answers are known to be suitable for the questions) or on individual sentences. This document describes how it works. The goals are to:
- Demonstrate that performance improves as the available data grows.
- Generate reasonable conversation transcripts.
- Work across all human languages.
- Experiment with how performance varies with aggressiveness of pruning.
Models are mappings of contexts to distributions. Contexts are of fixed, predetermined length, and may contain a mixture of norms, puncs and words in arbitrary orders (that is, they're not just markovian). Distributions are maps of IDs to counts, together with a total count. The model maintains an iteration counter that is incremented each time it is updated, as does each distribution; this allows the age of a particular distribution to be calculated. Models can self-prune, during which they perform the following operations (a data-structure sketch follows the list):
- Very old distributions may be removed entirely.
- Rare entries in a distribution may be removed entirely.
- Similar distributions may be merged.
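A minimal Python sketch of these structures, with all class and method names invented for illustration (the actual implementation differs, and the merging of similar distributions is left out):

```python
from collections import defaultdict

class Distribution:
    """Counts of observed IDs for one context, plus bookkeeping for pruning."""
    def __init__(self, iteration):
        self.counts = defaultdict(int)   # ID -> count
        self.total = 0                   # sum of all counts
        self.iteration = iteration       # iteration at which this was last updated

    def observe(self, symbol_id, iteration):
        self.counts[symbol_id] += 1
        self.total += 1
        self.iteration = iteration

class Model:
    """Maps fixed-length contexts (tuples of IDs) to distributions."""
    def __init__(self, context_length):
        self.context_length = context_length
        self.contexts = {}               # context tuple -> Distribution
        self.iteration = 0               # incremented on every update

    def update(self, context, symbol_id):
        assert len(context) == self.context_length
        self.iteration += 1
        dist = self.contexts.setdefault(context, Distribution(self.iteration))
        dist.observe(symbol_id, self.iteration)

    def prune(self, max_age, min_count):
        """Drop very old distributions and rare entries (merging not shown)."""
        for context in list(self.contexts):
            dist = self.contexts[context]
            if self.iteration - dist.iteration > max_age:
                del self.contexts[context]                     # too old
                continue
            for symbol_id in list(dist.counts):
                if dist.counts[symbol_id] < min_count:
                    dist.total -= dist.counts.pop(symbol_id)   # too rare
```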
A sentence is decomposed into three arrays of integer IDs: puncs, norms and words.
The IDs are assigned by a dictionary. Efficiency aside, dealing with arrays of IDs is preferable because the problem immediately becomes more abstract: there is no temptation to look inside the string representation of each word for meaning.
Puncs contain word separators (whitespace and punctuation), such as `"?`. Words contain alphanumeric words, such as `Hello`. Norms contain normalised versions of words, such as `HELLO`. All subsequent processing uses the norms exclusively.
The decomposition is done in a language-neutral fashion, using the UTF-8 encoding, and falling back to character segmentation when word boundaries don't exist or cannot reliably be determined.
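A rough illustration of the decomposition, assuming a hypothetical `Dictionary` class for ID assignment; the regex split is only a stand-in for the real language-neutral, UTF-8-aware segmentation:

```python
import re

class Dictionary:
    """Assigns a stable integer ID to each distinct string, with reverse lookup."""
    def __init__(self):
        self.ids = {}       # string -> ID
        self.strings = []   # ID -> string
    def id_for(self, token):
        if token not in self.ids:
            self.ids[token] = len(self.strings)
            self.strings.append(token)
        return self.ids[token]
    def string_for(self, token_id):
        return self.strings[token_id]

def decompose(sentence, punc_dict, word_dict, norm_dict):
    # Alternate separators and words; real MegaHAL segments UTF-8 text and
    # falls back to character segmentation when word boundaries are unclear.
    parts = re.split(r'(\w+)', sentence)
    seps, toks = parts[0::2], parts[1::2]
    puncs = [punc_dict.id_for(s) for s in seps]           # separators (may be empty strings)
    words = [word_dict.id_for(w) for w in toks]           # words as written
    norms = [norm_dict.id_for(w.upper()) for w in toks]   # normalised words
    return puncs, norms, words

punc_dict, word_dict, norm_dict = Dictionary(), Dictionary(), Dictionary()
puncs, norms, words = decompose("Hello, world?", punc_dict, word_dict, norm_dict)
# puncs has one more entry than words: separators surround every word.
```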
Given a novel array of norms, we wish to display a correctly punctuated and capitalised sentence to the user. We do this as follows (a top-level sketch follows the list):
- Generate a puncs array that represents the most likely sequence of punctuation between consecutive elements in the norms array.
- Generate a words array that represents the most likely capitalisation of elements in the norms array, given the surrounding context of puncs and norms.
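At the top level the rewriting step might look like the sketch below, where `generate_puncs` and `generate_words` stand in for the two steps above and the dictionaries are the hypothetical ones from the earlier sketch (all names are illustrative, not the gem's API):

```python
def rewrite(norms, punc_model, word_model, punc_dict, word_dict):
    """Turn a novel array of norm IDs into a displayable sentence."""
    punc_ids = generate_puncs(norms, punc_model)             # most likely separators
    word_ids = generate_words(norms, punc_ids, word_model)   # most likely capitalisation
    # Interleave the results: punc, word, punc, word, ..., punc.
    out = []
    for punc_id, word_id in zip(punc_ids, word_ids):
        out.append(punc_dict.string_for(punc_id))
        out.append(word_dict.string_for(word_id))
    out.append(punc_dict.string_for(punc_ids[-1]))
    return "".join(out)
```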
Generating the puncs array occurs in several steps (sketched in code after the list):
- Determine, for each slot to be filled, a distribution over candidate punctuation elements given the surrounding context of norms. From these distributions we can generate many different puncs arrays and calculate the probability of each.
- Use a language model to measure the likelihood of each generation being the best one. This takes into account both local and long-distance context so that brackets, quotation characters and so on are more likely to be correctly matched.
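One way to realise these two steps, sketched with hypothetical helpers: `slot_distribution` returns a distribution over candidate separators for one slot, and `sequence_score` stands in for the language model of the second step. The exhaustive search over combinations is for clarity only:

```python
import itertools, math

def generate_puncs(norms, punc_model, top_k=3):
    # Step 1: for each of the len(norms) + 1 slots, a distribution over
    # candidate separators given the surrounding norms.
    slot_candidates = []
    for slot in range(len(norms) + 1):
        dist = slot_distribution(punc_model, norms, slot)         # {punc_id: probability}
        best = sorted(dist.items(), key=lambda kv: -kv[1])[:top_k]
        slot_candidates.append(best)

    # Step 2: score each combination with the language model, which can also
    # reward matched brackets and quotes over long distances.
    best_seq, best_score = None, -math.inf
    for combo in itertools.product(*slot_candidates):
        punc_ids = [punc_id for punc_id, _ in combo]
        local = sum(math.log(p) for _, p in combo)                # per-slot probabilities
        score = local + sequence_score(punc_ids)                  # long-distance terms
        if score > best_score:
            best_seq, best_score = punc_ids, score
    return best_seq
```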
This second model is complicated. We can use a markovian predictor to give us local context, but incorporating long-distance information is something I need to think about more.
Four predictors are used, each of which uses one of the surrounding context IDs from the puncs and norms arrays. Predictions are blended and the maximum selected.
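A schematic of that blending and selection for a single slot (for example, choosing the capitalisation of one word), assuming each predictor is a simple conditional table keyed on one surrounding context ID; the uniform blending weights are a guess, not taken from the source:

```python
from collections import defaultdict

def blend_and_select(predictors, context_ids):
    """Blend four single-context predictors and select the maximum.

    predictors:  four dicts, each mapping one surrounding context ID to a
                 {candidate_id: probability} distribution.
    context_ids: the four surrounding punc/norm IDs for the current slot.
    """
    blended = defaultdict(float)
    for predictor, context_id in zip(predictors, context_ids):
        for candidate_id, prob in predictor.get(context_id, {}).items():
            blended[candidate_id] += prob / len(predictors)   # uniform blend (assumed)
    # The maximum of the blended prediction is selected.
    return max(blended, key=blended.get) if blended else None
```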