Does padding matter in deep learning models?
This article presents two experiments that show how padding influences deep learning models.
Experiment 1
Convolution is shift-equivariant: shifting the input image by 1 pixel also shifts the output image by 1 pixel (see Fig. 1). If we apply global average pooling to the output (here we simply sum over all pixel values, which differs from averaging only by a constant factor), we get a shift-invariant model: no matter how we shift the input image, the output stays the same. In PyTorch terms the model looks like this: y = torch.sum(conv(x), dim=(2, 3)) with input x and output y.
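A minimal sketch of such a model in PyTorch might look as follows (the single channel, the 3×3 kernel size, and the use of “same” padding are illustrative assumptions; the article only fixes the conv-plus-sum structure):

```python
import torch
import torch.nn as nn

class PixelClassifier(nn.Module):
    """One convolution followed by global sum pooling."""
    def __init__(self):
        super().__init__()
        # a single 3x3 kernel with zero "same" padding (sizes chosen for illustration)
        self.conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding="same")

    def forward(self, x):
        # sum over the spatial dimensions -> one scalar logit per image
        return torch.sum(self.conv(x), dim=(2, 3))
```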
Is it possible to use this model to detect the absolute position of a pixel in the image? For a shift-invariant model like the one described, it should not be possible. Let’s train this model to classify images containing a single white pixel: it should output 1 if the pixel is in the upper-left corner, and 0 otherwise. Training quickly converges, and testing the binary classifier on some images shows that it is perfectly able to detect the pixel position (see Fig. 2).
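A possible setup for this toy task, reusing the PixelClassifier sketch above (the image size, optimizer, learning rate, and loss are illustrative assumptions, not taken from the article):

```python
import torch
import torch.nn.functional as F

def make_batch(batch_size=64, size=16):
    """Black images with a single white pixel; label 1 iff the pixel is in the upper-left corner."""
    x = torch.zeros(batch_size, 1, size, size)
    y = torch.zeros(batch_size, 1)
    for i in range(batch_size):
        if torch.rand(()) < 0.5:                      # positive example: pixel at (0, 0)
            r, c = 0, 0
            y[i] = 1.0
        else:                                         # negative example: pixel anywhere else
            r, c = torch.randint(0, size, (2,)).tolist()
            if (r, c) == (0, 0):
                r = 1
        x[i, 0, r, c] = 1.0
    return x, y

model = PixelClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(500):
    x, y = make_batch()
    loss = F.binary_cross_entropy_with_logits(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```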
How does the model learn to classify absolute pixel positions? This is only possible due to the type of padding we use:
- Fig. 3 shows the convolution kernel after some epochs of training
- When using “same” padding (which is used in many popular models), the kernel center is moved over all image pixels (implicitly assuming pixels outside the image have a value of 0)
- This means the right column and bottom row of the kernel will never “touch” the upper-left pixel in the image (otherwise the kernel center would have to move out of the image)
- However, the right column and/or bottom row of the kernel touches all other pixels when moved over the image
- This difference in how pixels are treated is exploited by our model
- Only the positive (yellow) kernel values are applied to the upper-left white pixel, producing only positive values and therefore a positive sum
- For all other pixel positions, the strongly negative kernel values (blue, green) are applied as well, which gives a negative sum (the sketch below illustrates this)
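To see this effect numerically, we can convolve single-pixel images with a fixed kernel whose last row and column are negative, mimicking the trained kernel in Fig. 3 (the concrete kernel values and the 8×8 image size below are made up for illustration):

```python
import torch
import torch.nn.functional as F

# 3x3 kernel: positive upper-left block, strongly negative last row/column (illustrative values)
kernel = torch.tensor([[[[ 1.0,  1.0, -2.0],
                         [ 1.0,  1.0, -2.0],
                         [-2.0, -2.0, -2.0]]]])

corner = torch.zeros(1, 1, 8, 8)
corner[0, 0, 0, 0] = 1.0          # white pixel in the upper-left corner
interior = torch.zeros(1, 1, 8, 8)
interior[0, 0, 4, 4] = 1.0        # white pixel in the interior

sum_corner = F.conv2d(corner, kernel, padding="same").sum()
sum_interior = F.conv2d(interior, kernel, padding="same").sum()

# only the upper-left 2x2 block of the kernel ever touches the corner pixel -> positive sum (4.0);
# the interior pixel is also hit by the negative values -> negative sum (-6.0)
print(sum_corner.item(), sum_interior.item())
```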
Even though the model should be shift-invariant, it is not. The problem occurs close to the image border and is caused by the type of padding used.
Experiment 2
Does the impact of an input pixel on the output depend on its absolute location? Let’s test this by again using a black image with only a single white pixel in it. This image is fed into a neural network consisting of a single convolution layer (all kernel weights are set to 1, and the bias term to 0). The impact of an input pixel is measured by summing over the pixel values of the output image. “Valid” padding means that the complete kernel stays within the boundaries of the input image, while “same” padding was already defined above.
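A sketch of this measurement (the 3×3 kernel size matches the kernel discussed below; the 8×8 image size is an arbitrary choice for illustration):

```python
import torch
import torch.nn.functional as F

def impact_map(size=8, padding="valid"):
    """Impact of each input pixel: sum over the output image when only that pixel is set to 1."""
    kernel = torch.ones(1, 1, 3, 3)               # all kernel weights set to 1, no bias
    impact = torch.zeros(size, size)
    for r in range(size):
        for c in range(size):
            img = torch.zeros(1, 1, size, size)
            img[0, 0, r, c] = 1.0                 # single white pixel at (r, c)
            impact[r, c] = F.conv2d(img, kernel, padding=padding).sum()
    return impact
```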
Fig. 4 shows the impact of each input pixel. For “valid” padding the results look as follows:
- There is only a single position for which the kernel touches a corner point of the image, which is reflected by the value of 1 for the corner pixels
- For each edge pixel, there are 3 positions where the 3×3 kernel touches that pixel
- And for a pixel in general position (away from the boundary), there are 9 kernel positions at which the kernel touches that pixel
The impact of pixels near the boundary on the output is much lower than that of pixels in the center, which might make the model fail when relevant image details are close to the boundary. For “same” padding the effect is less severe, but there are still fewer “paths” from a boundary pixel to the output than from a central one.
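Running the impact_map sketch from above for both padding modes reproduces these counts (the printed values follow directly from the 3×3 kernel geometry):

```python
valid = impact_map(size=8, padding="valid")
same = impact_map(size=8, padding="same")

# corner, edge, and interior pixel for "valid" padding: 1, 3, and 9 kernel positions
print(valid[0, 0].item(), valid[0, 4].item(), valid[4, 4].item())   # 1.0 3.0 9.0
# with "same" padding the drop towards the boundary is milder: 4, 6, and 9
print(same[0, 0].item(), same[0, 4].item(), same[4, 4].item())      # 4.0 6.0 9.0
```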
A final experiment (see Fig. 5) shows what happens when starting with a 28×28 input image (e.g., an image from the MNIST dataset) and feeding it into a neural network with 5 convolution layers (e.g., a simple MNIST classifier might look like this). Especially for “valid” padding there are now large image regions that are almost completely ignored by the model.
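The corresponding measurement for a stack of convolution layers could look like this (again all-ones 3×3 kernels with zero bias; the choice of five layers comes from the experiment above, the remaining details are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def impact_map_deep(size=28, padding="valid", num_layers=5):
    """Impact map as before, but the single-pixel image passes through several conv layers."""
    kernel = torch.ones(1, 1, 3, 3)
    impact = torch.zeros(size, size)
    for r in range(size):
        for c in range(size):
            out = torch.zeros(1, 1, size, size)
            out[0, 0, r, c] = 1.0                 # single white pixel at (r, c)
            for _ in range(num_layers):
                out = F.conv2d(out, kernel, padding=padding)
            impact[r, c] = out.sum()
    return impact
```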
Conclusion
The two experiments have shown that the choice of padding matters, and that a bad choice might result in a low-performing model. For more details, see the following papers, which also propose solutions to these problems: