Very Deep Convolutional Networks For Large-Scale Image Recognition
This paper examines a family of neural networks, each of which is built on a base consisting of two kinds of layers:
- 3 x 3 convolutions with stride 1
- 2 x 2 max-pools with stride 2
The number of channels starts at 64 and doubles after pooling stages until it reaches 512. The number of convolutional layers ranges from 8 to 16.
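A minimal sketch (my own, not the authors' code) of how these two layer types shape the feature map: a padded 3x3 convolution with stride 1 preserves spatial size, and a 2x2 max-pool with stride 2 halves it, while channels follow the 64 → 128 → 256 → 512 progression. Assuming the paper's 224x224 input crop:

```python
# Trace spatial size and channel count through the conv/pool base.
# 3x3 conv, stride 1, padding 1: spatial size unchanged.
# 2x2 max-pool, stride 2: spatial size halved.

def conv3x3_out(hw):
    return hw          # stride 1, padding 1 preserves size

def pool2x2_out(hw):
    return hw // 2     # stride 2 halves size

def trace_base(input_hw=224, stage_channels=(64, 128, 256, 512, 512)):
    """Return (spatial_size, channels) after each conv+pool stage."""
    hw, trace = input_hw, []
    for c in stage_channels:
        hw = conv3x3_out(hw)   # one or more 3x3 convs per stage
        hw = pool2x2_out(hw)   # one 2x2 max-pool per stage
        trace.append((hw, c))
    return trace

print(trace_base())  # [(112, 64), (56, 128), (28, 256), (14, 512), (7, 512)]
```

The halving sequence 224 → 112 → 56 → 28 → 14 → 7 is why the base ends in a 7x7x512 feature map regardless of how many 3x3 convs sit in each stage.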
They also tried injecting
- A local response normalization layer into one of the six networks.
- A couple of 1x1 convolutional layers into another of the six networks.
The top of each network is two fully connected layers with 4096 channels each, followed by a fully connected layer with 1000 channels and a softmax that chooses one of the thousand classes.
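The softmax at the top can be sketched in plain Python (a numerically stable version, not the authors' implementation); it turns the 1000 final-layer logits into class probabilities:

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy 3-class example: the largest logit gets the largest probability.
probs = softmax([2.0, 1.0, 0.1])
print(max(range(len(probs)), key=probs.__getitem__))  # -> 0
```

In the real network the input to this softmax is the 1000-channel output of the last fully connected layer.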
Every layer uses ReLU.
The networks were trained with the multinomial logistic regression objective using mini-batch gradient descent with
- batch size = 256
- momentum = 0.9
- weight decay = 5*10^-4
- dropout of 0.5 for the first two fully connected layers
The learning rate started at 0.01 and was decreased by a factor of ten each time validation accuracy stopped improving; this happened three times, and training stopped after 74 epochs. The authors hypothesize that the depth and small filters acted as implicit regularization.
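The schedule above is a reduce-on-plateau rule. A hedged sketch under my own assumptions (the paper does not specify patience or tie-breaking): start at 0.01 and divide by 10 whenever validation accuracy fails to improve, which after three reductions ends at 1e-5:

```python
class PlateauSchedule:
    """Cut the learning rate by `factor` when val accuracy plateaus."""

    def __init__(self, lr=1e-2, factor=10.0):
        self.lr = lr
        self.factor = factor
        self.best = float("-inf")

    def step(self, val_acc):
        if val_acc <= self.best:
            self.lr /= self.factor   # no improvement: decay
        else:
            self.best = val_acc      # improvement: keep current lr
        return self.lr

sched = PlateauSchedule()
for acc in [0.3, 0.5, 0.5, 0.6, 0.6, 0.6]:
    lr = sched.step(acc)
print(lr)  # 1e-05 after three decays, matching the paper's endpoint
```

Modern frameworks ship this logic ready-made (e.g. PyTorch's `ReduceLROnPlateau`); the sketch just makes the three-decays-to-1e-5 arithmetic concrete.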
A second technique they used was incrementally adding layers. They trained the shallowest network (8 convolutional layers) first. Then, to train the next-deepest network (10 convolutional layers), they initialized its first four convolutional layers and its fully connected layers with their values from the shallower network.
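The transfer rule above can be sketched as a parameter-copying step (layer names like `conv1`/`fc1` are hypothetical labels of mine, not the paper's):

```python
def init_deeper(shallow, deeper_names, n_conv_from_shallow=4):
    """Build a parameter dict for the deeper net.

    shallow: dict name -> weights from the trained shallower net
    deeper_names: layer names of the deeper net, in order
    """
    params, conv_seen = {}, 0
    for name in deeper_names:
        if name.startswith("conv"):
            conv_seen += 1
            if conv_seen <= n_conv_from_shallow and name in shallow:
                params[name] = shallow[name]            # transferred
            else:
                params[name] = f"random_init({name})"   # trained from scratch
        else:
            params[name] = shallow[name]  # all FC layers come from the shallow net
    return params

shallow = {f"conv{i}": f"w{i}" for i in range(1, 9)}
shallow.update({f"fc{i}": f"f{i}" for i in range(1, 4)})
deeper = [f"conv{i}" for i in range(1, 11)] + ["fc1", "fc2", "fc3"]

params = init_deeper(shallow, deeper)
print(params["conv4"], params["conv5"])  # w4 random_init(conv5)
```

Only the first four conv layers and the classifier are reused; intermediate conv layers of the deeper net start from random initialization.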
They augmented their dataset via random cropping, horizontal flipping, and RGB color shifts.
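A plain-Python sketch of those three augmentations on a nested-list image (the paper's actual crop sizes and RGB-shift distribution, sampled along RGB principal components, are not reproduced here):

```python
import random

def random_crop(img, size):
    # img: H x W x 3 nested lists
    h, w = len(img), len(img[0])
    top = random.randrange(h - size + 1)
    left = random.randrange(w - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]

def random_hflip(img, p=0.5):
    # Flip each row left-to-right with probability p.
    return [row[::-1] for row in img] if random.random() < p else img

def rgb_shift(img, shift):
    # Add a per-channel offset to every pixel.
    return [[[c + d for c, d in zip(px, shift)] for px in row] for row in img]

img = [[[x, y, 0] for y in range(4)] for x in range(4)]
out = rgb_shift(random_hflip(random_crop(img, 2)), (1, 0, 0))
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 3
```

Each transform preserves the H x W x 3 layout, so they compose in any order before being fed to the network.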
They trained on the ILSVRC-2012 dataset (1.3 million images, 1000 classes). They found several things:
- Injecting a local response normalization layer into the shallowest model (8 convolutional layers) led to no improvement.
- The deeper networks performed better than the shallower ones.
- Adding 1x1 layers helps, but adding yet more 3x3 layers helps more.
- Scale jittering helps during training even if the test set isn't jittered.
- TODO: multi-crop
They also found [as usual] that averaging these networks' predictions improves performance beyond any individual network.
The performance of an ensemble of two of these nets is in line with state-of-the-art ensemble learners, and they achieved the best performance (at the time) for a single network.
The authors note that a stack of two 3x3 layers has a 5x5 receptive field, while a stack of three 3x3 layers has a 7x7 receptive field. There are some benefits to stacking 3x3s:
- We get to use three ReLUs, which "makes the decision function more discriminative".
- We can decrease the number of parameters: three 3x3 layers with C channels have 27C^2 weights, while a 7x7 layer with C channels has 49C^2 weights.
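The parameter counts above are quick to verify. Counting weights only (no biases), a k x k conv from C channels to C channels has k*k*C*C weights:

```python
def conv_weights(k, c_in, c_out):
    # Weight count of a k x k convolution (biases excluded).
    return k * k * c_in * c_out

C = 512
stack_3x3 = 3 * conv_weights(3, C, C)  # three 3x3 layers -> 27 * C^2
single_7x7 = conv_weights(7, C, C)     # one 7x7 layer    -> 49 * C^2

print(stack_3x3, single_7x7)  # 7077888 12845056
```

For C = 512 the 3x3 stack uses about 45% fewer weights than the single 7x7 layer while covering the same 7x7 receptive field.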