What is translation equivariance, and why do we use convolutions to get it?
Multi-layer perceptrons (MLPs) are standard neural networks with fully connected layers, where each input unit is connected to each output unit. We rarely use this kind of layer when working with images, at least not in the early stages of processing. Why is this so? Let’s imagine the following situation, where an input image of size 400x400 pixels with 3 color channels (red, green, blue) is flattened into a vector of 480 000 input values and then fed into an MLP with a single hidden layer of 500 hidden units.
First, the number of parameters in the first trainable layer is extremely high: each of the 480 000 input values is connected to each of the 500 hidden units, requiring 480 000 × 500 = 240 million parameters (not counting bias values). For comparison, current state-of-the-art computer vision models have around 20–30 million trainable parameters, although bigger models exist. More importantly, these parameters are usually distributed over a much larger number of layers, not concentrated in a single layer.
Second, what happens if the object of interest changes its position in the image? Let us suppose that we would like to recognize monkeys in images, so the object of interest here is a monkey.
If the monkey moves from the left part of the image to the right part, different pixels of the image contain the monkey, which in turn means that different trainable parameters are responsible for firing when the monkey appears. As a consequence, the same reasoning (“fire when monkey appears”) needs to be performed, and thus learned, separately by different subsets of parameters corresponding to different locations. This is a tremendous waste of computation and of representational capacity, since the network has to relearn the same reasoning once per location. It also hurts the generalization performance of the model.
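The size of this first layer can be checked with a few lines of arithmetic. This is only a minimal sketch; the variable names are ours, and the bias values are counted separately from the weights.

```python
# Parameter count of the first fully connected layer in the example above.
height, width, channels = 400, 400, 3
hidden_units = 500

inputs = height * width * channels   # length of the flattened input vector
weights = inputs * hidden_units      # one weight per (input value, hidden unit) pair
biases = hidden_units                # one bias per hidden unit

print(f"{inputs:,} inputs, {weights:,} weights, {biases} biases")
# prints: 480,000 inputs, 240,000,000 weights, 500 biases
```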
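The effect can be illustrated on a toy 1D signal. The sketch below is a hypothetical example, not a trained network: the weights of a single fully connected unit are set by hand so that the unit responds to a pattern at one position, and the same pattern is then presented at another position.

```python
def dense_unit(x, weights):
    """One hidden unit of a fully connected layer (no bias, no non-linearity)."""
    return sum(xi * wi for xi, wi in zip(x, weights))

def embed(pattern, position, length=12):
    """Place the pattern into an otherwise empty signal at the given position."""
    x = [0.0] * length
    x[position:position + len(pattern)] = pattern
    return x

pattern = [1.0, 2.0, 1.0]
# Hypothetical weights that have "learned" to respond to the pattern at position 2 only.
weights = embed(pattern, 2)

left = dense_unit(embed(pattern, 2), weights)    # pattern where the unit expects it
right = dense_unit(embed(pattern, 8), weights)   # same pattern at another position
print(left, right)  # prints: 6.0 0.0
```

To detect the pattern at position 8 as well, a second unit with its own set of weights would be needed, which is exactly the duplication described above.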
Our real objective is to find a neural mapping from images to predictions that is invariant to the geometric transformations to which objects can be subjected, for instance translations, rotations and projective distortions. For humans and animals, who are articulated “objects”, this would also include quite complex deformations. Let us focus on translations only for the moment.
Translation invariance requires that the output of a mapping or network does not change when the input is translated. This can be achieved approximately, up to a certain amount of translation, but not directly.
We will first investigate a related property, which is translation equivariance.
What is an equivariant mapping?
In short, an equivariant mapping is a mapping which preserves the algebraic structure of a transformation. As a particular case, a translation equivariant mapping is a mapping which, when the input is translated, produces an output that is translated in the same way, as illustrated in the following animation:
When the input image at the left (a digit taken from the MNIST dataset and embedded in a larger canvas) is translated by a certain amount, the output feature map is translated by the same amount.
Formally this property is illustrated by the figure below:
Input image X1, showing the digit “4”, is translated to the right, which gives input image X2. F1 and F2 are the feature maps calculated from X1 and X2, respectively, by a translation equivariant mapping ϕ. In this case, the feature map F2, obtained by passing X2 through ϕ, is identical to the feature map obtained by applying to F1 the same translation T that was applied to X1 to get X2.
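In symbols, and reusing the names from the description above, the property can be sketched as follows. The notation is ours: T is assumed to denote the translation acting on images and on feature maps alike, and the second line states translation invariance for contrast.

```latex
\begin{align*}
F_2 = \phi(X_2) = \phi\bigl(T(X_1)\bigr) &= T\bigl(\phi(X_1)\bigr) = T(F_1) && \text{(equivariance)}\\
\phi\bigl(T(X_1)\bigr) &= \phi(X_1) && \text{(invariance)}
\end{align*}
```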
The animation above has been created by training a 4-layer convolutional neural network on the original MNIST dataset of digits of size 28x28 pixels. The network has the following architecture:
Each layer has kernel size 5x5. The convolutional block is followed by a single fully connected layer, the output layer. Here is the PyTorch code of the model:
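import torch
import torch.nn.functional as F

class Net(torch.nn.Module):  # class name is ours; the original listing omits it
    def __init__(self):
        super().__init__()
        # Four 5x5 convolutions with stride 1, no padding, no pooling in between
        self.conv1 = torch.nn.Conv2d(1, 20, 5, 1)
        self.conv2 = torch.nn.Conv2d(20, 40, 5, 1)
        self.conv3 = torch.nn.Conv2d(40, 40, 5, 1)
        self.conv4 = torch.nn.Conv2d(40, 5, 5, 1)
        self.fc1 = torch.nn.Linear(6*6*5, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))  # now 24x24
        x = F.relu(self.conv2(x))  # now 20x20
        x = F.relu(self.conv3(x))  # now 16x16
        x = F.relu(self.conv4(x))  # now 12x12
        x = F.max_pool2d(x, 2, 2)  # now 6x6
        x = x.view(-1, 6*6*5)
        return self.fc1(x)  # assumed: class scores (logits) from the output layer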
Training: The model was first trained with cross-entropy loss until convergence. It is not the best model for MNIST; it achieved a validation error of ~1%. In particular, unlike more common models, it does not perform any pooling between the convolutional layers, which was done on purpose here.
The feature map shown in the animation above has been produced by the 4th convolutional layer, just before the pooling layer. For the illustration, we performed a k-means clustering of the scalar activations on a batch of 50 images and assigned a color code to each cluster.
Testing/animation: while the model was trained on images of size 28x28, to produce the animations we fed larger images (38x38) into the network in order to be able to move digits over the input image. It is interesting to note that this does not require a change of the convolutional part of the model, as convolutions are independent of the size of the input image. It would, however, require a change of the fully connected output layer, which was not used for the visualization above.
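The feature map sizes for both input sizes can be worked out with the usual output-size formula for convolutions. The sketch below is plain Python and does not need PyTorch; the helper names `conv_out` and `feature_map_size` are ours, and the defaults mirror the model above (5x5 kernels, stride 1, no padding, one 2x2 max pooling at the end).

```python
def conv_out(size, kernel=5, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_size(size, num_convs=4, pool=2):
    """Side length after the four convolutions, and after the 2x2 max pooling."""
    for _ in range(num_convs):
        size = conv_out(size)
    return size, size // pool

for side in (28, 38):
    before_pool, after_pool = feature_map_size(side)
    fc_inputs = after_pool * after_pool * 5   # conv4 has 5 output channels
    print(side, before_pool, after_pool, fc_inputs)
# prints:
# 28 12 6 180
# 38 22 11 605
```

For 38x38 inputs the convolutional layers run unchanged and produce 22x22 feature maps, whereas a fully connected layer sized for 6*6*5 = 180 inputs would no longer match the 605 values that come out of the pooling layer.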
As we can see, a stack of convolutional layers produces an equivariant mapping. In what follows, we will try to find out why convolutions have been chosen (by Y. LeCun, and others independently) to obtain this property.
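The claim can also be checked numerically on a small example. The following sketch is not the trained MNIST model: it assumes a toy two-layer network with random 3x3 kernels, written in plain Python so that it runs without PyTorch. As in `torch.nn.Conv2d`, the "convolution" is computed as a cross-correlation without padding.

```python
import random

def conv2d_valid(image, kernel):
    """Cross-correlate a 2D list with a 2D kernel, without padding ('valid' mode)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def place(patch, canvas_size, top, left):
    """Embed a small patch into an empty square canvas at position (top, left)."""
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    for i, row in enumerate(patch):
        for j, v in enumerate(row):
            canvas[top + i][left + j] = v
    return canvas

random.seed(0)
patch = [[random.random() for _ in range(4)] for _ in range(4)]   # stands in for a digit
k1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
k2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]

def network(image):
    """Toy stack of two convolutional layers, each followed by a ReLU."""
    return relu(conv2d_valid(relu(conv2d_valid(image, k1)), k2))

dy, dx = 2, 3                                       # translation applied to the input
out_a = network(place(patch, 16, 4, 4))
out_b = network(place(patch, 16, 4 + dy, 4 + dx))

# Shifting out_a by (dy, dx) should reproduce out_b wherever both are defined.
max_diff = max(abs(out_b[i + dy][j + dx] - out_a[i][j])
               for i in range(len(out_a) - dy)
               for j in range(len(out_a[0]) - dx))
print(max_diff)
```

The comparison is restricted to the region where both feature maps are defined, since the translated feature map has no values for the rows and columns that moved in from outside the canvas.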
Why do we use convolutions to get translation equivariance?
Let’s first imagine that we do not know what convolutions are. Our only objective is to add shift-equivariance to a neural network, since we just learned about this interesting property. We thus search for a neural network layer which is shift-equivariant.
Let’s also add the additional constraint of linearity. This is of course not a real requirement, since neural network layers can very well be non-linear, as long as they are differentiable (and even non-differentiable ones have traditionally been used, but this leads to approximations). However, the class of non-linear operators is extremely large, has many different functional forms, and it is difficult to find a class which can be easily parameterized such that it can represent universal functions (equivariant ones in our case). In classical MLPs, this problem is solved by restricting individual layers to be linear, followed by pointwise non-linearities (ReLU, tanh, sigmoid etc.). The universal approximation theorem shows that this class of functions can approximate any continuous function on a compact domain arbitrarily well, under mild assumptions on the non-linearity.
The following derivation is taken from the book “Digital Image Processing”, by Bernd Jähne, Springer 1997. Yes, the book is quite old (but excellent), completely unrelated to neural networks and machine learning … yet very relevant as we will see.
We thus suppose the existence of a new neural network layer, given as an operator ϕ on images / feature maps f, and which has the following properties:
- It is linear, i.e. given scalars α and β and images f and f’,
  $$\phi(\alpha f + \beta f') = \alpha\,\phi(f) + \beta\,\phi(f')$$
- It is shift-equivariant, i.e. given a shift operator ${}^{m,n}S$, which shifts an image f by m pixels vertically and n pixels horizontally,
  $$\phi\bigl({}^{m,n}S\,f\bigr) = {}^{m,n}S\,\phi(f)$$
- It has an impulse response h, which is not really a constraint. It just means that we know the result when the operator is applied to a Dirac impulse ${}^{0,0}p$ centered at the origin (0,0):
  $$h = \phi\bigl({}^{0,0}p\bigr)$$
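We can first express the image/signal f as a linear combination of Dirac impulses ${}^{m,n}p$ at the different pixel locations (m,n), each weighted by the pixel value f(m,n). In the discrete case (which images are) this is always possible without losing any precision:
$$f = \sum_{m}\sum_{n} f(m,n)\;{}^{m,n}p$$
We use the linearity property of the operator to move it into the sums:
$$\phi(f) = \phi\Bigl(\sum_{m}\sum_{n} f(m,n)\;{}^{m,n}p\Bigr) = \sum_{m}\sum_{n} f(m,n)\;\phi\bigl({}^{m,n}p\bigr)$$
We can then replace the Dirac impulses at different positions with a single Dirac impulse at position (0,0), which is shifted, for each pixel, by the amount (m,n) corresponding to the position of that pixel, i.e. ${}^{m,n}p = {}^{m,n}S\;{}^{0,0}p$:
$$\phi(f) = \sum_{m}\sum_{n} f(m,n)\;\phi\bigl({}^{m,n}S\;{}^{0,0}p\bigr)$$
We can use the required shift-equivariance property of our operator ϕ to move it over the shift operator S:
$$\phi(f) = \sum_{m}\sum_{n} f(m,n)\;{}^{m,n}S\,\phi\bigl({}^{0,0}p\bigr)$$
Then, $\phi({}^{0,0}p)$ is nothing other than the “impulse response”, i.e. the output of our operator when applied to a Dirac impulse at the origin, which we defined as being “h”. This is just a change of notation:
$$\phi(f) = \sum_{m}\sum_{n} f(m,n)\;{}^{m,n}S\,h$$
We can now express the result of the shift operator by integrating its effect into the position arguments of h, since $({}^{m,n}S\,h)(x,y) = h(x-m,\,y-n)$:
$$\phi(f)(x,y) = \sum_{m}\sum_{n} f(m,n)\;h(x-m,\,y-n)$$
Last, we can perform a change of variables, namely m’=x-m and n’=y-n, and we get
$$\phi(f)(x,y) = \sum_{m'}\sum_{n'} f(x-m',\,y-n')\;h(m',n')$$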
And this is … a convolution operator, which convolves the image f with a filter kernel h equal to the impulse response of our operator.
This means that the only linear and shift-equivariant operators are convolutions, and this is the reason we put convolutions into neural networks.
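The result can be verified numerically on a small example. In the sketch below, `operator` is a hypothetical black box that we only know to be linear and shift-equivariant; images are assumed to be periodic (indices wrap around) so that shifts are well defined at the border, which is a simplification not made in the text. We measure the impulse response h and check that convolving with h reproduces the operator.

```python
import random

N = 6  # images are N x N and treated as periodic (an assumption that avoids border effects)

def operator(f):
    """A 'black box' that is linear and shift-equivariant: a weighted sum of shifted copies."""
    return [[0.5 * f[x][y] + 0.3 * f[(x - 1) % N][y] - 0.2 * f[x][(y - 2) % N]
             for y in range(N)] for x in range(N)]

def convolve(f, h):
    """Circular convolution: sum over (m', n') of f(x - m', y - n') * h(m', n')."""
    return [[sum(f[(x - m) % N][(y - n) % N] * h[m][n]
                 for m in range(N) for n in range(N))
             for y in range(N)] for x in range(N)]

# Impulse response: apply the operator to a Dirac impulse at the origin.
dirac = [[1.0 if (x, y) == (0, 0) else 0.0 for y in range(N)] for x in range(N)]
h = operator(dirac)

random.seed(0)
f = [[random.random() for _ in range(N)] for _ in range(N)]

direct = operator(f)               # apply the black box directly
via_convolution = convolve(f, h)   # convolve with the measured impulse response

max_diff = max(abs(a - b)
               for row_a, row_b in zip(direct, via_convolution)
               for a, b in zip(row_a, row_b))
print(max_diff)  # zero up to floating-point rounding
```

Nothing about the inside of `operator` is used except its response to a single impulse, which is the point of the derivation: for a linear, shift-equivariant operator, h contains all there is to know.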
How about translation invariance?
While convolutions are translation equivariant and not invariant, an approximate translation invariance can be achieved in neural networks by combining convolutions with spatial pooling operators.
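As a rough illustration, the sketch below applies max pooling to the output of a 1D convolution. The signal, the kernel and the two pooling variants (a global maximum and non-overlapping windows of size 4) are our own choices for the example, not taken from the model above.

```python
def conv1d_valid(signal, kernel):
    """1D cross-correlation without padding."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + a] * kernel[a] for a in range(len(kernel)))
            for i in range(n)]

def global_max_pool(fmap):
    return max(fmap)

def max_pool(fmap, size):
    """Max pooling over non-overlapping windows of the given size."""
    return [max(fmap[i:i + size]) for i in range(0, len(fmap) - size + 1, size)]

pattern = [1.0, 3.0, 2.0]
kernel = [0.5, 1.0, 0.5]

def embed(shift, length=16):
    """Place the pattern into an empty signal, starting at position 4 + shift."""
    signal = [0.0] * length
    signal[4 + shift:4 + shift + len(pattern)] = pattern
    return signal

features = [conv1d_valid(embed(s), kernel) for s in range(4)]   # shifts 0..3
global_outputs = [global_max_pool(f) for f in features]
local_outputs = [max_pool(f, 4) for f in features]

print(global_outputs)   # prints: [4.5, 4.5, 4.5, 4.5]
for out in local_outputs:
    print(out)
# prints:
# [2.5, 4.5, 0.0]
# [0.5, 4.5, 0.0]
# [0.0, 4.5, 1.0]
# [0.0, 4.5, 3.5]
```

The global maximum does not change as long as the pattern stays inside the signal. With local windows, the strongest response stays in the same output cell for these small shifts while the neighbouring cells change, which is why the invariance is only approximate.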