Brief Review — DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs

Revitalized DenseNet (RDNet) Surpasses Swin Transformer, ConvNeXt, and DeiT-III, and Matches MogaNet

Sik-Ho Tsang

DenseNets Reloaded: Paradigm Shift Beyond ResNets and ViTs
RDNet, by NAVER Cloud AI and NAVER AI Lab
2024 ECCV (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2023
[Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2] [MobileViTv2] [ConvNeXt V2] [SwiftFormer] [OpenCLIP] 2024 [FasterViT] [CAS-ViT] [TinySaver]
==== My Other Paper Readings Are Also Over Here ====

  • DenseNets are revisited with architectural adjustments, block redesigns, and improved training recipes, aiming at model widening and better memory efficiency.
  • The resulting RDNet ultimately surpasses Swin Transformer, ConvNeXt, and DeiT-III, and matches MogaNet.

Outline

  1. RDNet
  2. Results

1. RDNet

1.1. Modern Training Setup

  • (It is better to understand DenseNet first before reading this story.)

1.1.1. Going wider and shallower

  • The network is widened by increasing the growth rate (GR) while reducing its depth. Specifically, GR is vastly increased from 32 to 120.
  • The number of blocks per stage is reduced from (6, 12, 48, 32) to a much smaller (3, 3, 12, 3) to adjust the depth (a small config sketch follows this list).
  • Training time and memory usage drop by around 35% and 18%, respectively. The marked increase in GFLOPs to 11.1 is addressed by the later changes.
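To make the change concrete, here is a minimal, hypothetical config sketch in plain Python contrasting the original DenseNet-201-style setting with the widened, shallower one (the names StageConfig, baseline, and widened are illustrative, not from the paper):

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    growth_rate: int          # channels each dense block adds via concatenation
    blocks_per_stage: tuple   # number of dense blocks in each of the four stages

# DenseNet-201-style baseline vs. the widened, shallower setting described above
baseline = StageConfig(growth_rate=32,  blocks_per_stage=(6, 12, 48, 32))
widened  = StageConfig(growth_rate=120, blocks_per_stage=(3, 3, 12, 3))
```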

1.1.2. Improved feature mixers

  • Layer Normalization (LN) replaces Batch Normalization (BN); post-activation is adopted; a depthwise convolution with a kernel size of 7 is used; and the numbers of normalizations and activations are reduced (a block sketch follows below).
  • This design improves accuracy by a large margin (+0.9%p) while slightly increasing computational costs.
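As a rough illustration of such a block, here is a minimal PyTorch sketch, not the official RDNet code; the class names LayerNorm2d and FeatureMixer and the exact layer ordering are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """Channel-wise LayerNorm for NCHW feature maps (assumed helper)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.permute(0, 2, 3, 1)        # NCHW -> NHWC
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)     # NHWC -> NCHW

class FeatureMixer(nn.Module):
    """Sketch of a mixer in the spirit described above: 7x7 depthwise conv,
    LN instead of BN, post-activation, one norm and one activation per block."""
    def __init__(self, in_chs: int, inter_chs: int, growth_rate: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(in_chs, in_chs, kernel_size=7, padding=3, groups=in_chs),  # depthwise, k=7
            LayerNorm2d(in_chs),                                                 # norm after conv
            nn.Conv2d(in_chs, inter_chs, kernel_size=1),
            nn.GELU(),                                                           # post-activation
            nn.Conv2d(inter_chs, growth_rate, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(x)  # new features, later concatenated with the block input
```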

1.1.3. Larger intermediate channel dimensions

  • A large input dimension for the depthwise convolution is crucial. The intermediate tensor size within the block is enlarged beyond the input dimension (e.g., the expansion ratio, ER, is tuned to 6).
  • GR can then be halved, e.g. from 120 to 60 (see the channel arithmetic below).
  • This achieves both a 21% faster training speed and a 0.4%p improvement in accuracy.
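For illustration only, assuming the expansion ratio scales the growth rate (the paper's exact formula may differ), the channel arithmetic works out as follows:

```python
# Illustrative channel arithmetic (assumption: ER scales the growth rate)
growth_rate = 60                               # halved from 120
expansion_ratio = 6
inter_chs = expansion_ratio * growth_rate      # 360-wide intermediate tensor inside the block
# e.g. mixer = FeatureMixer(in_chs=240, inter_chs=inter_chs, growth_rate=growth_rate)
# where 240 is just an example input width at some point inside a stage
```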

1.1.4. More transition layers

  • Transition layers are used within each stage, not only after each stage: one is inserted after every three blocks, with a stride of 1.
  • These in-stage transition layers focus on dimension reduction rather than downsampling (illustrated below).
  • This change often improves accuracy.
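A toy bookkeeping loop for one stage; the 0.5 compression factor and the 120-channel starting width are assumptions, used only to show how in-stage transitions keep the concatenated width in check:

```python
growth_rate, chs = 60, 120
for i in range(12):                    # e.g. the 12-block stage
    chs += growth_rate                 # each dense block concatenates growth_rate new channels
    if (i + 1) % 3 == 0:               # stride-1 transition after every three blocks
        chs = chs // 2                 # dimension reduction only; spatial size unchanged
    print(f"after block {i + 1}: {chs} channels")
```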

1.1.5. Patchification stem

  • Image patches are used as inputs via the stem: a patch size of 4 with a stride of 4 (a minimal stem sketch follows).
  • This yields a notable acceleration in computational speed without loss of accuracy.
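A minimal stem sketch; the 96-channel width is an illustrative choice, not a value from the paper:

```python
import torch.nn as nn

stem = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=4, stride=4),  # non-overlapping 4x4 patches
    # a channel-wise LayerNorm (as in the mixer sketch above) would typically follow
)
# A 224x224 image becomes a 56x56 feature map in a single step,
# which is where the reported speed-up comes from.
```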

1.1.6. Refined transition layers

  • The average pooling is removed, and the convolution's kernel size and stride are adjusted so that it performs the downsampling itself; LN replaces BN (see the sketch below).
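A hedged sketch of such a transition, reusing the hypothetical LayerNorm2d helper from the mixer sketch above; the kernel size and stride of 2 are assumed values consistent with "adjusting the kernel size and stride":

```python
import torch.nn as nn

def refined_transition(in_chs: int, out_chs: int) -> nn.Sequential:
    """Sketch only: no average pooling; the convolution itself downsamples
    (kernel size and stride of 2 are assumptions), and channel-wise LN replaces BN."""
    return nn.Sequential(
        LayerNorm2d(in_chs),                                  # helper from the mixer sketch
        nn.Conv2d(in_chs, out_chs, kernel_size=2, stride=2),  # strided conv does the downsampling
    )
```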

1.1.7. Channel re-scaling

  • Channel re-scaling is required because the concatenated features have diverse variances (sketched below).
  • It brings a slight +0.2%p improvement.
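One plausible way to realize this is a LayerScale-style learnable per-channel scale; the sketch below is an assumption about the mechanism, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ChannelRescale(nn.Module):
    """Assumed LayerScale-style mechanism: a learnable per-channel scale that
    compensates for the differing variances of concatenated features."""
    def __init__(self, chs: int, init_value: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(init_value * torch.ones(chs))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gamma.view(1, -1, 1, 1)  # scale each channel of the NCHW tensor
```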

1.2. Revitalized DenseNet (RDNet)

Revitalized DenseNet (RDNet)
  • A family of RDNets is constructed with different settings of GR and number of blocks (B), as shown above (a connectivity sketch follows).
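Putting the pieces together, the sketch below shows the dense-connectivity pattern RDNet keeps from DenseNet (concatenation rather than residual addition), reusing the hypothetical FeatureMixer from earlier and omitting in-stage transitions and channel re-scaling for brevity:

```python
import torch
import torch.nn as nn

class DenseStage(nn.Module):
    """Minimal sketch of DenseNet-style connectivity kept by RDNet: each block's
    output is concatenated with its input, so the width grows by the growth rate
    per block. FeatureMixer is the hypothetical block sketched earlier."""
    def __init__(self, in_chs: int, growth_rate: int, num_blocks: int, inter_chs: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            FeatureMixer(in_chs + i * growth_rate, inter_chs, growth_rate)
            for i in range(num_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            x = torch.cat([x, block(x)], dim=1)  # concatenation, not residual addition
        return x
```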

2. Results

2.1. ImageNet

ImageNet
ImageNet

While RDNets fall slightly behind in accuracy, they make up for it significantly in speed.

  • For example, RDNet-S is competitive with lighter models such as SMT-S or MogaNet-S. Notably, RDNets do not require large memory, in line with their design goal, and achieve further efficiency.
ImageNet

RDNets surpass competitors in accuracy, with reasonable memory usage and faster speeds.

2.2. Zero-Shot ImageNet

Zero-Shot ImageNet

Following the training protocol of ConvNeXt-OpenCLIP to train CLIP, RDNet performs better.

2.3. Downstream Tasks

ADE20K

RDNet exhibits strong performance, demonstrating its effectiveness on dense prediction tasks.

COCO

RDNet exhibits competitive performance on COCO.

  • (There are still a lot of experiments not yet mentioned, please feel free to read the paper directly.)

