Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).


The big idea is that each layer takes a concatenation of all preceding layers as its input. Hence, if each layer has $C$ output channels, layer $i$ has $iC$ input channels. If we let $L$ be the number of layers, this naively increases the number of weights from $LC^2$ to $\frac{L(L+1)}{2}C^2$.
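That counting argument can be checked numerically. The sketch below compares the naive weight count of a plain stack of layers against the dense-concatenation version, ignoring kernel size and biases (the $C^2$ here is just channels-in times channels-out):

```python
def plain_params(L, C):
    # Each of the L layers maps C input channels to C output channels:
    # L * C^2 weights in total.
    return L * C * C

def dense_params(L, C):
    # Layer i sees the concatenation of all i preceding outputs, so it
    # has i*C input channels and C output channels: sum of i*C^2.
    return sum(i * C * C for i in range(1, L + 1))

# The sum matches the closed form L(L+1)/2 * C^2:
print(plain_params(10, 16))  # 2560
print(dense_params(10, 16))  # 14080 == 10*11//2 * 16*16
```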

The intention is to

  1. ensure that adding an additional layer never makes the network worse.
  2. avoid vanishing gradients.

The authors note: "Recent variations of ResNets [13] show that many layers contribute very little and can in fact be randomly dropped during training. This makes the state of ResNets similar to (unrolled) recurrent neural networks [21], but the number of parameters of ResNets is substantially larger because each layer has its own weights. Our proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved."

From this background knowledge, it makes intuitive sense that most channels in most layers are just passing information forward. This is fine, but the existence of these channels dramatically increases the number of weights to learn. The authors find that since DenseNet doesn't need channels that only forward information, each layer can make do with just 12 output channels and still outperform the state-of-the-art results of the time. This, along with DenseNet's relatively few layers, means that it actually contains far fewer weights than most of its competitors.

While concatenating all earlier layers to form the input to the next layer is the primary idea behind DenseNet, there are some other issues to consider:

  • Downsampling: At the end of the day, we want to reduce the number of "pixels" in the image before feeding it into the last layer of the network. To accomplish this, the authors construct DenseNet out of smaller "dense blocks". The output of one dense block gets downsampled by a "transition layer" before being passed into the next dense block.
  • Bottleneck Layers: To reduce the computing cost and the number of parameters, they introduce 1x1 convolutions between layers that reduce the number of channels to only $4k$ (where $k$ is the growth rate, the number of output channels each layer adds).
  • Compression: Between dense blocks, the transition layers further reduce the number of channels by a compression factor $\theta$ (they use $\theta = 0.5$).
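The channel bookkeeping inside one block is easy to trace by hand. Here is a minimal sketch of how the channel count evolves through a dense block and its transition layer, assuming the paper's growth rate $k = 12$ and compression $\theta = 0.5$; the starting channel count of 24 is just an illustrative choice:

```python
def dense_block_channels(c_in, num_layers, k):
    # Each layer in a dense block contributes k new channels, which are
    # concatenated onto everything that came before it.
    return c_in + num_layers * k

def transition_channels(c_in, theta=0.5):
    # The transition layer's 1x1 convolution compresses the channel
    # count by a factor theta (the paper's DenseNet-BC uses theta = 0.5).
    return int(theta * c_in)

c = dense_block_channels(24, 6, 12)  # 24 + 6*12 = 96
c = transition_channels(c)           # 48 channels enter the next block
print(c)
```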

At the end of the day, they tried four different architectures that differed only in the number of channels at various layers. They all contain

  1. an initial 7x7 convolutional layer with a stride of 2,
  2. a 3x3 max pooling layer with a stride of 2,
  3. four dense blocks with (three) transition layers between them,
  4. and, finally, a softmax classification layer.
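The spatial sizes implied by this layout can be walked through with the standard convolution output-size formula. The sketch below assumes a 224x224 ImageNet-style input, padding of 3 and 1 on the initial convolution and pooling layers, and 2x2 stride-2 average pooling in each transition layer (as in the paper's ImageNet configuration):

```python
def conv_out(size, kernel, stride, pad):
    # Standard output-size formula for a convolution or pooling layer.
    return (size + 2 * pad - kernel) // stride + 1

size = 224                       # ImageNet-style input resolution
size = conv_out(size, 7, 2, 3)   # initial 7x7 conv, stride 2 -> 112
size = conv_out(size, 3, 2, 1)   # 3x3 max pool, stride 2 -> 56
for _ in range(3):               # three transition layers between blocks
    size = conv_out(size, 2, 2, 0)   # 2x2 avg pool, stride 2
print(size)                      # 7x7 feature map before classification
```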

Sections 4-6