Links between literature on addressing transformer complexity and work on recurrent neural works
This post addresses some similarities in the ways the Deep Learning field addressed two seemingly different problems, namely
A) recent work on neural networks with self-attentation (“transformers”) which addresses one of their biggest shortcomings, the quadratic computational complexity with respect to the set of input items, and
B) work decoupling capacity and memory size in recurrent neural networks.
The content of this post in a nutshell:
Capacity and representation size in recurrent neural networks
In their most classical version, recurrent neural networks keep a recurrent memory h, which is updated at each time instant given an observation x and two parameter matrices U and W:
More modern variants like LSTMs or GRUs differ in that additional gates and non-linearities are added, but they do not change the essence of what will be discussed here. As can be seen in a graphical version of the equation below, RNNs tightly couple two different properties of the model, which we would like to adjust separately: the size of the hidden state h, and the capacity of the network:
In particular, the size of the transition weight matrix U is a square of the dimension of the state h. For vanilla RNNs, the capacity of the model therefore can’t be decreased without decreasing state dimensions, although we sometimes would like to decouple these two properties: the dimension of the state governs the amount of information we need to propagate over time, whereas the capacity governs the complexity of the mathematical function approximating the dynamic of the system. There is no reason why these two properties should be related.
Several solutions have been proposed for this problem:
(i) low rank decompositions of the transition matrix U, or similar mathematical constraints put on its structure, for instance .
(ii) Remove a part of the hidden state and transfer it into an “external” representation, which is not replaced in its entirety at each iteration, but rather updated sparsely through attention, which decreases the capacity necessary for updates. This is the strategy employed by Memory Networks , Neural Turing Machines  and their variants and follow-ups.
The two solutions share similar goals but achieve them differently. The first approach (i) keeps the mathematical structure of recurrent networks and adds constraints on one of the weight matrices, which is a smaller change to the method. In contrast, (ii) fundamentally changes how the model is structured by outsourcing memory to an external component.
Transformers and complexity
Transformers  are neural networks built on query, key, value attention using dot-product similarity. Without going into details of how transformers work, the expensive part is the calculation of pairwise similarities between items of a set
or, expressed as a softmax over an attention matrix A,
where X is the set of items, each item represented as vector, and W and V are two weight matrices, which project linearly project the input vectors into a space in which the dot-product similarity is meaningful (in the sense of the task to learn). It is easy to see that this operation is O(N²), which creates troubles in applications where the number of items is big, like images , videos  etc. One strategy is to sparsify interactions and allow each layer to attend to a subset of items only , which solves the problem externally and does not touch on the concept of transformers itself.
Two of the most popular attempts to address this problem bear some resemblance to the two different ways (i) and (ii) for decoupling capacity and memory representation in RNNs. However, instead of focusing on presentation size, here the focus is on the calculations and interactions between data items:
a) instead of performing the full attention calculations for all item pairs, involving a full attention matrix A, the matrix can be decomposed using a matrix low rank decomposition, which can lead to a more efficient computation [10, 11] and others. The underlying hypothesis is that the attention operation itself is mostly low rank and can be described by a couple of singular values. This is somewhat similar to decomposing the RNN update equation, and thus enforcing a low rank of the dynamic of a dynamic system.
b) A recent paper  proposes an alternative approach, somewhat similar to (ii), which externalizes the problem. Instead of allowing all possible pairwise interactions of input items, the interactions are transferred into an external (latent) representation. The “Perceiver” , see figure below, allows self-attention only in this low dimensional latent space where attention operations are cheaper. The high-dimensional input space only interacts with the latent representation, never with itself (no self-attention in input space).
The comparison might be a stretch, but it is still interesting. As in the first problem (decoupling capacity and representation size in RNNs), the two solutions (a) and (b) share similar goals but achieve them differently. The first approach (a) keeps the mathematical structure of transformers and adds constraints on the shape of the attention matrix, which is a smaller change to the method. In contrast, (b) fundamentally changes how the model is structured by outsourcing interactions to an external component.
 Natalia Neverova, Christian Wolf, Griffin Lacey, Lex Fridman, Deepak Chandra, Brandon Barbello and Graham W. Taylor. Learning Human Identity from Motion Patterns. IEEE Access (4):1810–1820, 2016.
 Cijo Jose, Moustapha Cisse, Francois Fleuret, Kronecker Recurrent Units, ICML 2018.
 Jason Weston, Sumit Chopra, Antoine Bordes, Memory Networks, ICLR 2015.
 Alex Graves, Greg Wayne, Ivo Danihelka, Neural Turing Machines, arXiv:1410.5401, 2014.
 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, Attention Is All You Need, NeurIPS 2017.
 Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He, Non-local Neural Networks, CVPR 2018.
 Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
 Brendan Duke, Abdalla Ahmed, Christian Wolf, Parham Aarabi and Graham W. Taylor, SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation, CVPR 2021.
 Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Joshua M. Susskind, An Attention Free Transformer, openreview 2020 (ICLR 2020 reject).
 Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh, Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention, arXiv:2102.03902, 2021.
 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma, Linformer: Self-Attention with Linear Complexity, arxiv:2006.04768, 2020.
 Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira, Perceiver: General Perception with Iterative Attention, arXiv:2103.03206, 2021.