The goal of this work is to design an architecture for autoregressive modelling that has an inductive bias towards learning temporally compressed representations, retaining the benefits of Transformers while preserving long-range interactions.
The fast stream has a high-capacity short-term memory that reacts quickly to sensory input. It is modelled with Transformers.
The slow stream has a long-term memory which updates at a slower rate and summarizes the most important information in the input sequence.
- Divide the input into fixed-size chunks (see the chunking sketch after this list).
- The fast stream operates within each chunk.
- The slow stream consolidates and aggregates information across chunks.
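Chunking is essentially a reshape of the token sequence. Below is a minimal NumPy sketch, assuming a batch of token embeddings of shape `(batch, T, d)` and an illustrative chunk size `K`; the function name and shapes are assumptions for this example, not taken from the paper's code.

```python
import numpy as np

def chunk_sequence(x: np.ndarray, chunk_size: int) -> np.ndarray:
    """Split a (batch, T, d) sequence into (batch, num_chunks, chunk_size, d).

    Assumes T is a multiple of chunk_size; in practice the sequence
    would be padded up to the next multiple.
    """
    batch, seq_len, dim = x.shape
    num_chunks = seq_len // chunk_size
    return x.reshape(batch, num_chunks, chunk_size, dim)

# Example: a sequence of length T = 512 split into chunks of size K = 64.
x = np.random.randn(2, 512, 128)
chunks = chunk_sequence(x, chunk_size=64)
print(chunks.shape)  # (2, 8, 64, 128)
```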
The fast and slow streams induce an information asymmetry:
| Fast Stream | Slow Stream |
| --- | --- |
| fine grained | coarse grained |
| local information | distant information |
The fast and slow streams interact with each other through a bottleneck of attention.
- Given an input sequence $X = [x_{0}, x_{1}, \dots, x_{T}]$, it is divided into chunks of fixed size $K$. Each chunk is referred to as $X_{l}$, where $l = 0, 1, \dots, \lfloor\frac{T}{K}\rfloor$.
- Each chunk is processed by a perceptual module $\mathcal{F}$ (fast stream). Note: while processing a chunk, the perceptual module is also conditioned on information from the temporal latent bottleneck $\mathcal{G}$ (slow stream).
- The temporal latent bottleneck is recurrent in nature and has a hidden state of its own, $\mathcal{I}$, which is a set of $N$ vectors (a sketch of one chunk-level step follows the equation below).
$$\bar{X}_{l+1} = \mathcal{G}(\bar{X}_{l}, \mathcal{I}_{l})$$
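To make the interaction concrete, here is a minimal PyTorch sketch of one chunk-level step: the fast stream $\mathcal{F}$ applies self-attention within the chunk and cross-attends to (reads from) the bottleneck state, and the slow stream $\mathcal{G}$ updates its $N$ state vectors by cross-attending to the fast stream's output. The module layout, names, and the use of `nn.MultiheadAttention` are illustrative assumptions; the paper's exact block composition (feed-forward layers, number of layers, update frequency) is not reproduced here.

```python
import torch
import torch.nn as nn

class TLBStep(nn.Module):
    """One chunk-level step: the fast stream processes a chunk conditioned on
    the temporal latent bottleneck state, then the slow stream updates that state."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Fast stream F: self-attention within the chunk, plus cross-attention
        # that reads from the bottleneck state.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_state = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Slow stream G: the bottleneck state cross-attends to the chunk output.
        self.write_state = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, chunk: torch.Tensor, state: torch.Tensor):
        # chunk: (batch, K, dim) -- tokens of the current chunk X_l
        # state: (batch, N, dim) -- bottleneck state I_l (a set of N vectors)
        h, _ = self.self_attn(chunk, chunk, chunk)     # fine-grained, local
        h = self.norm(h + chunk)
        r, _ = self.read_state(h, state, state)        # condition F on I_l
        x_bar = self.norm(h + r)                       # fast-stream output for chunk l
        u, _ = self.write_state(state, x_bar, x_bar)   # G summarizes the chunk
        new_state = self.norm(state + u)               # updated bottleneck state
        return x_bar, new_state

# Example: T = 512 tokens, chunk size K = 64, N = 8 state vectors, dim = 128.
step = TLBStep(dim=128)
x = torch.randn(2, 512, 128)
state = torch.zeros(2, 8, 128)           # initial bottleneck state
outputs = []
for chunk in x.split(64, dim=1):         # fast stream operates within each chunk
    x_bar, state = step(chunk, state)    # slow stream carries state across chunks
    outputs.append(x_bar)
y = torch.cat(outputs, dim=1)            # (2, 512, 128)
```

The loop at the end reflects the asymmetry in the table above: the fast stream only ever sees one chunk of fine-grained, local tokens, while the slow stream's $N$ vectors are the only channel through which information travels across chunks.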