Review — DLA: Deep Layer Aggregation
Deep Layer Aggregation
DLA, by UC Berkeley
2018 CVPR, Over 600 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Semantic Segmentation
- Compounding and aggregating representations improves inference of what and where.
- Deep layer aggregation structures iteratively and hierarchically merge the feature hierarchy to make networks with better accuracy and fewer parameters.
Outline
- Conventional Aggregation
- Proposed Aggregation Types
- DLA: Network Architecture
- Experimental Results
1. Conventional Aggregation
- Aggregation (Green) is defined as the combination of different layers.
- Layers are grouped into blocks, which are then grouped into stages by their feature resolution. The authors focus on aggregating the blocks and stages (Black).
1.1. (a) No Aggregation
- Conventionally, blocks are simply stacked one after another to form a network.
1.2. (b) Shallow Aggregation
- Skip connections, which are commonly used for tasks like segmentation and detection, provide a form of aggregation, but the aggregation is shallow: each earlier part is merged in only a single step.
2. Proposed Aggregation Types
- Among the proposed types, only (c) Iterative Deep Aggregation (IDA) and (f) Hierarchical Deep Aggregation (HDA) are used in the experimental results.
2.1. Iterative Deep Aggregation (IDA)
- (c) Iterative Deep Aggregation: aggregates iteratively by reordering the skip connections of (b) such that the shallowest parts are aggregated the most for further processing.
- The iterative deep aggregation function I, for a series of layers x1, …, xn with increasingly deep and semantically rich information, is defined recursively in terms of the aggregation node N (see the reconstruction below).
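A LaTeX reconstruction of the IDA recursion from the paper (notation may differ slightly from the original):

I(x_1, \dots, x_n) = \begin{cases} x_1 & \text{if } n = 1 \\ I\big(N(x_1, x_2), x_3, \dots, x_n\big) & \text{otherwise} \end{cases}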
2.2. Tree-Structured Aggregation
- (d) Tree-Structured Aggregation: aggregates hierarchically through a tree structure of blocks to better span the feature hierarchy of the network across different depths.
2.3. Reentrant Aggregation, and Hierarchical Deep Aggregation (HDA)
- (e) Reentrant Aggregation and (f) Hierarchical Deep Aggregation (HDA) are refinements of (d): they deepen aggregation by routing intermediate aggregations back into the network, and improve efficiency by merging successive aggregations at the same depth.
- (e): propagates the aggregation of all previous blocks, instead of the preceding block alone, to better preserve features.
- (f): for efficiency, aggregation nodes at the same depth are merged (combining the parent and left child).
- The HDA function Tn, with depth n, is built from the aggregation node N together with functions R and L, where B represents a convolutional block (see the reconstruction below).
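The HDA recursion from the paper, reconstructed in LaTeX (again, the notation is hedged and may differ slightly from the original):

T_n(x) = N\big(R^n_{n-1}(x), R^n_{n-2}(x), \dots, R^n_1(x), L^n_1(x), L^n_2(x)\big)

L^n_2(x) = B\big(L^n_1(x)\big), \qquad L^n_1(x) = B\big(R^n_1(x)\big)

R^n_m(x) = \begin{cases} T_m(x) & \text{if } m = n - 1 \\ T_m\big(R^n_{m+1}(x)\big) & \text{otherwise} \end{cases}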
3. DLA: Network Architecture
- Deep layer aggregation (DLA) learns to better extract the full spectrum of semantic and spatial information from a network.
- Iterative connections join neighboring stages to progressively deepen and spatially refine the representation.
- Hierarchical connections cross stages with trees that span the spectrum of layers to better propagate features and gradients.
- In base aggregation, the aggregation node N combines its inputs with a convolution followed by batch normalization and an activation function σ.
- If residual connections are added, the deepest input is also added back into the node output (see the reconstruction below).
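Both node equations, reconstructed in LaTeX from the paper (notation hedged):

N(x_1, \dots, x_n) = \sigma\big(\mathrm{BatchNorm}\big(\textstyle\sum_i W_i x_i + b\big)\big)

and, with a residual connection from the deepest input x_n,

N(x_1, \dots, x_n) = \sigma\big(\mathrm{BatchNorm}\big(\textstyle\sum_i W_i x_i + b\big) + x_n\big)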
- DLA makes no requirements of the internal structure of the blocks and stages. DLA connects across stages with IDA, and within and across stages with HDA.
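To make the node concrete, here is a minimal PyTorch sketch of such an aggregation node (the layer choices and names are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class AggregationNode(nn.Module):
    """Minimal sketch of an aggregation node: a 1x1 convolution over the
    concatenated inputs, batch normalization, a nonlinearity, and an optional
    residual connection from the deepest (last) input."""

    def __init__(self, in_channels, out_channels, residual=False):
        super().__init__()
        # A 1x1 conv over the concatenation realizes sum_i W_i x_i
        # (the bias term is absorbed by batch normalization).
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.residual = residual

    def forward(self, *xs):
        out = self.bn(self.conv(torch.cat(xs, dim=1)))
        if self.residual:
            out = out + xs[-1]  # assumes the last input already has out_channels channels
        return self.relu(out)


# Usage sketch: merge two 64-channel feature maps of the same resolution.
node = AggregationNode(in_channels=128, out_channels=64, residual=True)
y = node(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))
```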
- For segmentation, the conversion from classification DLA to fully convolutional DLA is simple and no different than for other architectures.
- IDA for interpolation increases both depth and resolution by projection and upsampling.
- Stages are fused from shallow to deep to make a progressively deeper and higher resolution decoder.
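A hedged PyTorch sketch of this iterative, shallow-to-deep fusion (channel widths, bilinear upsampling, and all module names are illustrative assumptions rather than the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDAUp(nn.Module):
    """Sketch of IDA used as a decoder: project every stage to a common channel
    width, upsample deeper (lower-resolution) stages, and merge them iteratively
    from shallow to deep."""

    def __init__(self, stage_channels, out_channels):
        super().__init__()
        # 1x1 projections to a shared channel width
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1, bias=False) for c in stage_channels]
        )
        # aggregation node that merges the running output with the next stage
        self.node = nn.Sequential(
            nn.Conv2d(2 * out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, stages):
        # stages: feature maps ordered shallow (high resolution) -> deep (low resolution)
        x = self.proj[0](stages[0])
        for proj, deeper in zip(self.proj[1:], stages[1:]):
            y = F.interpolate(proj(deeper), size=x.shape[-2:],
                              mode="bilinear", align_corners=False)
            x = self.node(torch.cat([x, y], dim=1))
        return x


# Usage sketch with three stages at strides 8, 16 and 32.
ida = IDAUp(stage_channels=[128, 256, 512], out_channels=128)
out = ida([torch.randn(1, 128, 64, 64),
           torch.randn(1, 256, 32, 32),
           torch.randn(1, 512, 16, 16)])
```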
4. Experimental Results
4.1. Classification
- Different DLA models, varying in depth and block type, are built on the architecture above.
- DLA-34 and ResNet-34 both use basic blocks, but DLA-34 has about 30% fewer parameters and ~1 point of improvement in top-1 error rate.
- DLA-X-102 has nearly half the number of parameters of ResNeXt-101, yet the error rate difference is only 0.2%.
- Compared with DenseNet, DLA achieves higher accuracy with lower memory usage because the aggregation node fan-in is logarithmic in the total number of convolutional blocks in HDA.
- Compared to SqueezeNet, which shares a similar block design, DLA is more accurate with the same number of parameters.
4.2. Fine-grained Recognition
- DLAs improve or rival the state-of-the-art without further annotations or specific modules for fine-grained recognition.
- DLAs are competitive with VGGNet and ResNet while having only several million parameters. However, they are not better than the state-of-the-art on Birds, although this dataset has fewer instances per class, so further regularization might help.
4.3. Semantic Segmentation
- Surprisingly, DLA-34 performs very well on Cityscapes and is as accurate as DLA-102.
- Test evaluation follows the same multi-scale fashion as RefineNet, with image scales of [0.5, 0.75, 1, 1.25, 1.5] and summing the predictions (see the sketch after this list). DLA improves on RefineNet by 2+ points and outperforms FCN by a large margin.
- Higher depth and resolution help. DLA is state-of-the-art, outperforming SegNet, DeepLabv1, DilatedNet and FSO.
- Iterative Deep Aggregation (IDA) is later used in SqueezeNext.
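As referenced above, a minimal sketch of the multi-scale test-time protocol (the model interface and bilinear resizing are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def multi_scale_predict(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Run the segmentation network on rescaled copies of the image, resize the
    class logits back to the input resolution, sum them, and take the
    per-pixel argmax."""
    h, w = image.shape[-2:]
    total = None
    with torch.no_grad():
        for s in scales:
            scaled = F.interpolate(image, scale_factor=s,
                                   mode="bilinear", align_corners=False)
            logits = model(scaled)  # expected shape: (N, num_classes, h*s, w*s)
            logits = F.interpolate(logits, size=(h, w),
                                   mode="bilinear", align_corners=False)
            total = logits if total is None else total + logits
    return total.argmax(dim=1)  # per-pixel class indices
```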
Reference
[2018 CVPR] [DLA]
Deep Layer Aggregation