- Title: Multi-Scale Context Aggregation by Dilated Convolutions
- Authors: Fisher Yu, Vladlen Koltun
- Link: https://arxiv.org/abs/1511.07122
- Tags: Neural Network, dilated convolution, dense prediction, segmentation, VGG
- Year: 2016
-
What
- They describe dilated convolutions, a variation of convolutions whose receptive field is structured differently.
- They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling).
-
How
- One can imagine the input of a convolutional layer as a 3d-grid. Each cell is a "pixel" generated by a filter.
- Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other.
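- In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding, but the gaps are between the input cells of one filter application and not between output positions, so the output resolution is not reduced.)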
- Normal convolutions are simply 1-dilated convolutions (skipping 0 cells).
- One can use a 3x3 1-dilated convolution and then a 3x3 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing.
- Doubling the dilation factor per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field will still be part of the computation of at least one convolution.
- They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.)
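The identity initialization mentioned above can be sketched as follows. This is a minimal NumPy illustration and not the authors' code: it assumes the simplest case, where a layer has as many output channels as input channels and an odd kernel size, and the helper names `identity_kernel` and `dilated_conv2d_same` are made up for this example. Each filter gets a single 1 at its center for the matching channel, so the layer initially passes its input through unchanged, regardless of the dilation.

```python
import numpy as np


def identity_kernel(channels, kernel_size=3):
    """Weights of shape (out_ch, in_ch, k, k) that copy each input channel to the output."""
    w = np.zeros((channels, channels, kernel_size, kernel_size))
    center = kernel_size // 2
    for c in range(channels):
        w[c, c, center, center] = 1.0  # only the center tap of the matching channel is active
    return w


def dilated_conv2d_same(x, w, dilation=1):
    """Zero-padded dilated 2D correlation. x: (in_ch, H, W), w: (out_ch, in_ch, k, k), k odd."""
    out_ch, in_ch, k, _ = w.shape
    pad = (k // 2) * dilation
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding on the spatial axes
    _, H, W = x.shape
    y = np.zeros((out_ch, H, W))
    for a in range(k):
        for b in range(k):
            # Input window seen by filter tap (a, b); taps are `dilation` cells apart.
            patch = xp[:, a * dilation:a * dilation + H, b * dilation:b * dilation + W]
            # Mix input channels with the (out_ch, in_ch) weights of this tap.
            y += np.einsum("oi,ihw->ohw", w[:, :, a, b], patch)
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
y = dilated_conv2d_same(x, identity_kernel(4), dilation=2)
print(np.allclose(x, y))  # True: the freshly initialized layer is an identity mapping
```

Starting from an identity mapping means a newly added layer does not disturb the features it receives, and training then moves the weights away from that starting point.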
Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). Stronger blue colors mean that the value has been used in more different convolutions.
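The receptive field sizes described above can be checked with a small sketch. It assumes 3x3 kernels with stride 1 and works along one axis only (the 2D receptive field is then that size squared); the function names `dilated_conv1d` and `receptive_field_sizes` are made up for this example.

```python
import numpy as np


def dilated_conv1d(x, w, dilation=1):
    """Unpadded 1D dilated correlation: y[i] = sum_j w[j] * x[i + j * dilation]."""
    k = len(w)
    span = (k - 1) * dilation + 1  # number of input cells spanned by one output cell
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])


def receptive_field_sizes(dilations, kernel_size=3):
    """Receptive field along one axis after each layer of a stride-1 stack."""
    sizes, rf = [], 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k - 1) * dilation
        sizes.append(rf)
    return sizes


print(receptive_field_sizes([1, 1]))        # [3, 5]  two normal convolutions
print(receptive_field_sizes([1, 2, 4, 8]))  # [3, 7, 15, 31]  doubling dilation

# Empirical check: push a single impulse through a 1-dilated and a 2-dilated layer
# and count how many output cells it reaches.
x = np.zeros(33)
x[16] = 1.0
ones = np.ones(3)
y = dilated_conv1d(dilated_conv1d(x, ones, dilation=1), ones, dilation=2)
print(np.count_nonzero(y))  # 7
```

With dilations 1, 2, 4, ... the size after layer n is 2^(n+1) - 1, which is the exponential growth shown in the images, while the number of weights per layer stays the same.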
-
Results
- They took a VGG net, removed the last two pooling layers and replaced the convolutions after them with dilated ones (weights can be kept).
- They then used the network to segment images.
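- Their results on the Pascal VOC 2012 segmentation benchmark were better than those of previous methods.
- They also added another network with more dilated convolutions (a "context module") after the VGG one, i.e. it takes the VGG network's output as its input, again improving the results.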
Their performance on a segmentation task compared to two competing methods. Here they used only the adapted VGG16 (pooling layers removed, convolutions replaced by dilated convolutions), without the additional network.
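Why the pretrained weights can be kept when pooling is removed can be illustrated in 1D. The sketch below is a simplification, not the authors' procedure: plain subsampling by 2 stands in for a stride-2 pooling layer, random numbers stand in for pretrained weights, and `conv1d_valid` is a made-up helper. It shows that dilating the same filter by 2 on the full-resolution signal reproduces the low-resolution outputs at every second position and also fills in the positions in between.

```python
import numpy as np


def conv1d_valid(x, w, dilation=1):
    """Unpadded 1D dilated correlation: y[i] = sum_j w[j] * x[i + j * dilation]."""
    k = len(w)
    span = (k - 1) * dilation + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(len(x) - span + 1)
    ])


rng = np.random.default_rng(0)
x = rng.normal(size=32)
w = rng.normal(size=3)  # stand-in for pretrained filter weights

# Original pipeline: subsample by 2 (stand-in for pooling), then a normal convolution.
low_res = conv1d_valid(x[::2], w)

# Adapted pipeline: no subsampling, same weights, dilation 2.
full_res = conv1d_valid(x, w, dilation=2)

print(len(low_res), len(full_res))          # 14 28
print(np.allclose(low_res, full_res[::2]))  # True
```

With real max pooling the two pipelines are not exactly equivalent, since removing the pooling also removes the max operation. The point is only that the filters keep seeing inputs at the spacing they were trained on, while the output is twice as dense.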