
The goal of this work is to design an architecture for autoregressive modelling with an inductive bias towards learning temporally compressed representations, one that retains the benefits of Transformers while preserving long-range interactions.

## Perceptual Module (Fast Stream)

The fast stream has a high-capacity short-term memory that reacts quickly to sensory input. It is modelled with Transformers.

## Temporal Latent Bottleneck (Slow Stream)

The slow stream has a long-term memory that updates at a slower rate and summarizes the most important information in the input sequence.

## Implementation

- Divide the input into fixed-size chunks (a minimal chunking sketch follows this list).
- The fast stream operates within each chunk.
- The slow stream consolidates and aggregates information across chunks.
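
The chunking step can be illustrated with a minimal NumPy sketch; the batch size, sequence length `T`, chunk size `K`, and model width below are arbitrary illustrative values, and the sketch assumes `T` is divisible by `K` (a real implementation would pad otherwise). The per-chunk (fast) and cross-chunk (slow) processing is sketched under Methodology below.

```python
import numpy as np

batch, T, d_model, K = 2, 128, 64, 16      # illustrative sizes (assumptions)
X = np.random.randn(batch, T, d_model)     # toy input sequence

# Split the length-T sequence into T // K chunks of K tokens each:
# (batch, T, d_model) -> (batch, T // K, K, d_model)
chunks = X.reshape(batch, T // K, K, d_model)
print(chunks.shape)                         # (2, 8, 16, 64)
```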

## Information Asymmetry

The fast and slow streams induce an information asymmetry:

| Fast Stream | Slow Stream |
| --- | --- |
| fine grained | coarse grained |
| local information | distant information |

The fast and slow streams interact with each other through a bottleneck of attention: all information exchanged between the streams must pass through cross-attention over the bottleneck's small set of latent vectors.
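
A minimal Keras-based sketch of this bottleneck is shown below; the layer sizes, tensor shapes, and variable names (`read_attn`, `write_attn`, `tlb_state`) are illustrative assumptions, not the paper's code. The fast stream reads the compressed state through one cross-attention, and the slow stream writes to its state through another, so all exchange passes through the $N$ latent vectors.

```python
import tensorflow as tf

d_model, num_heads = 64, 4                  # illustrative sizes (assumptions)
read_attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
write_attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)

chunk_tokens = tf.random.normal((2, 16, d_model))   # fast stream: K tokens of one chunk
tlb_state = tf.random.normal((2, 8, d_model))       # slow stream: N latent state vectors

# Read: the fast stream queries the slow stream's compressed state.
conditioned = read_attn(query=chunk_tokens, value=tlb_state, key=tlb_state)

# Write: the slow stream queries the fast stream's chunk representation.
new_state = write_attn(query=tlb_state, value=conditioned, key=conditioned)

print(conditioned.shape, new_state.shape)           # (2, 16, 64) (2, 8, 64)
```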

## Methodology

1. Given an input sequence $X = [x_{0}, x_{1}, \dots, x_{T}]$, it is divided into chunks of fixed size $K$. Each chunk is referred to as $X_{l}$, where $l = 0, 1, \dots, \lfloor\frac{T}{K}\rfloor$:

$$ X \to \{ X_{0}, X_{1}, \dots, X_{\lfloor T/K \rfloor} \} $$

2. Each chunk is processed by the perceptual module $\mathcal{F}$ (fast stream). While processing a chunk, the perceptual module is also conditioned on information from the temporal latent bottleneck $\mathcal{G}$ (slow stream):

$$ \bar{X}_{l} = \mathcal{F}(X_{l}, \mathcal{I}_{l}) $$
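
A sketch of one possible perceptual block is given below, assuming a standard Transformer layer: self-attention within the chunk, cross-attention to the bottleneck state $\mathcal{I}_{l}$, and a feed-forward network. The function name `perceptual_module` and all sizes are illustrative assumptions, not the paper's reference code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def perceptual_module(chunk, state, d_model=64, num_heads=4):
    """One fast-stream block: X_bar_l = F(X_l, I_l). Layers are created inline
    for brevity; a real model would reuse layer instances across calls."""
    # Self-attention: tokens within the chunk attend to each other (local, fine grained).
    x = layers.MultiHeadAttention(num_heads, d_model // num_heads)(chunk, chunk)
    x = layers.LayerNormalization()(chunk + x)

    # Cross-attention: condition the chunk on the slow stream's state I_l.
    y = layers.MultiHeadAttention(num_heads, d_model // num_heads)(x, state, state)
    y = layers.LayerNormalization()(x + y)

    # Position-wise feed-forward network.
    z = layers.Dense(4 * d_model, activation="relu")(y)
    z = layers.Dense(d_model)(z)
    return layers.LayerNormalization()(y + z)      # X_bar_l

chunk = tf.random.normal((2, 16, 64))   # X_l: one chunk of K = 16 tokens
state = tf.random.normal((2, 8, 64))    # I_l: N = 8 bottleneck vectors
print(perceptual_module(chunk, state).shape)       # (2, 16, 64)
```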

3. The temporal latent bottleneck is recurrent in nature and has a hidden state of its own, $\mathcal{I}$, which is a set of $N$ vectors. After each chunk, this state is updated using the fast stream's output $\bar{X}_{l}$:

$$ \mathcal{I}_{l+1} = \mathcal{G}(\bar{X}_{l}, \mathcal{I}_{l}) $$
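
A matching sketch of the recurrent update is below; as above, the function name `temporal_latent_bottleneck` and the sizes are illustrative assumptions. The state vectors act as queries that cross-attend to the fast stream's output, compressing the chunk into the $N$ slow-stream vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_latent_bottleneck(x_bar, state, d_model=64, num_heads=4):
    """One slow-stream update: I_{l+1} = G(X_bar_l, I_l)."""
    # The state (queries) cross-attends to the chunk representation (keys/values).
    update = layers.MultiHeadAttention(num_heads, d_model // num_heads)(state, x_bar, x_bar)
    return layers.LayerNormalization()(state + update)

x_bar = tf.random.normal((2, 16, 64))   # X_bar_l produced by the fast stream
state = tf.random.normal((2, 8, 64))    # I_l: N = 8 latent vectors
print(temporal_latent_bottleneck(x_bar, state).shape)   # (2, 8, 64) -> I_{l+1}
```

Over the whole sequence the two sketches alternate: for each chunk $l$, the fast stream computes $\bar{X}_{l} = \mathcal{F}(X_{l}, \mathcal{I}_{l})$, and the slow stream then produces $\mathcal{I}_{l+1}$, which conditions the next chunk.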