Review — DynaMixer: A Vision MLP Architecture with Dynamic Mixing
DynaMixer: A Vision MLP Architecture with Dynamic Mixing, by Data Platform, Tencent; Graduate School at Shenzhen, Tsinghua University; School of Electrical and Computer Engineering, Peking University; and Tencent AI Lab,
2022 ICML (Sik-Ho Tsang @ Medium)
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III] [GhostNetV2] 2023 [Vision Permutator (ViP)] [ConvMixer]
- Existing MLP-like models fuse tokens through static fusion operations, lacking adaptability to the contents of the tokens to be mixed.
- An efficient MLP-like network architecture, dubbed DynaMixer, is proposed. It relies on dynamic information fusion: the mixing matrices are generated dynamically from the contents of all the tokens to be mixed. To reduce the time complexity and improve the robustness, a dimensionality reduction technique and a multi-segment fusion mechanism are adopted.
1.1. Overall Architecture
- DynaMixer has a standard framework, which consists of a patch embedding layer, several mixer layers, a global average pooling layer, and a classifier head.
- Apart from the layer-normalization layers and skip connections, the mixer layer contains a DynaMixer block and a channel-MLP block, which are responsible for fusing token information and channel information, respectively.
- The channel-MLP block is just a feed-forward layer, as in the Transformer.
1.2. DynaMixer Operation
- The principle of the design is to generate a dynamic mixing matrix P given a set of input tokens X of size N×D by considering their contents, where N is the number of tokens, and D is the feature dimensionality.
- To do this, X is simply flattened into a vector, from which the mixing matrix P is generated.
- Once P is obtained, the tokens are mixed by Y=PX to obtain the output tokens Y.
However, this process needs too many parameters, since N×D is usually large.
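The naive operation above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the single weight `w` mapping the flattened tokens to N×N logits, the row-wise softmax, and the toy sizes are all hypothetical choices.

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def naive_dynamic_mixing(x, w):
    # x: (N, D) tokens.
    # w: (N*D, N*N) hypothetical weight mapping the flattened tokens to N*N logits.
    n, _ = x.shape
    logits = x.reshape(-1) @ w                # flatten X, then one linear map
    p = softmax_rows(logits.reshape(n, n))    # row-wise softmax is an assumption
    return p @ x, p                           # Y = P X

rng = np.random.default_rng(0)
N, D = 4, 6  # toy sizes
x = rng.standard_normal((N, D))
w = rng.standard_normal((N * D, N * N)) * 0.1
y, p = naive_dynamic_mixing(x, w)
print(y.shape, w.size)  # (4, 6) 384
```

Even at this toy size the weight holds N·D·N² = 384 values, which hints at why the unreduced form scales poorly.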
1.3. Dimensionality Reduction
- Thus, dimensionality reduction is performed first to reduce the number of parameters.
- The reduced tokens ^X have a smaller size of N×d, where d<<D is quite small, say 1 or 2.
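A quick back-of-the-envelope count shows how much the reduction saves. The sketch assumes the reduction is a single D×d linear projection and that the mixing logits come from one linear map on the flattened tokens; biases are ignored, and N = 32, D = 192 are hypothetical values chosen only for illustration.

```python
def naive_params(n, dim):
    # Linear map from the flattened N*D vector to N*N mixing logits.
    return (n * dim) * (n * n)

def reduced_params(n, dim, d):
    # D x d projection, then a linear map from the flattened N*d vector to N*N logits.
    return dim * d + (n * d) * (n * n)

n, dim = 32, 192  # hypothetical token count and feature size
for d in (1, 2):
    print(d, naive_params(n, dim), reduced_params(n, dim, d))
# 1 6291456 32960
# 2 6291456 65920
```

Under these assumptions the reduced form needs roughly 100 to 200 times fewer parameters than the naive one.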
1.4. Multi-Segment Fusion Mechanism
- To improve the robustness of the model, the features are divided into S segments, the mixing operation is performed separately on each segment, and the mixed results are combined to obtain the final result.
- The combination uses the concatenation operation, written [;], and a feature fusion matrix Wo of size D×D.
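Written out, the multi-segment fusion described above plausibly takes the following form. The zero-based segment index, the superscript notation for the per-segment tokens and mixing matrices, and the equal segment width D/S are assumptions made here for illustration.

```latex
Y = \left[\, P^{(0)} X^{(0)} ;\; P^{(1)} X^{(1)} ;\; \dots ;\; P^{(S-1)} X^{(S-1)} \,\right] W_o,
\qquad
X^{(s)} \in \mathbb{R}^{N \times D/S},\quad
P^{(s)} \in \mathbb{R}^{N \times N},\quad
W_o \in \mathbb{R}^{D \times D}
```

Each segment gets its own mixing matrix, so a poorly estimated P for one segment affects only a 1/S slice of the features before Wo fuses them.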
1.5. Weights Sharing Among Segments
- To reduce the number of parameters, the matrices W^(s,i) are shared among all segments.
- Together, these steps make up the whole DynaMixer operation.
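Putting reduction, segments, and weight sharing together, the whole operation might look like the NumPy sketch below. It is a sketch under assumptions rather than the authors' implementation: the names `w_d`, `w_mix`, and `w_o`, the per-segment reduction packed into one D×(S·d) weight, the row-wise softmax, and the absence of biases are all hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamixer_op(x, w_d, w_mix, w_o, num_segments):
    # x: (N, D) tokens; w_d: (D, S*d) reduction weight (assumed layout);
    # w_mix: (N*d, N*N) weight shared by all segments; w_o: (D, D) fusion matrix.
    n, dim = x.shape
    s = num_segments
    d = w_d.shape[1] // s
    # Reduce every token to d features per segment, then flatten per segment: (S, N*d).
    x_hat = (x @ w_d).reshape(n, s, d).transpose(1, 0, 2).reshape(s, n * d)
    # The shared linear map turns each flattened segment into an N x N mixing matrix.
    p = softmax((x_hat @ w_mix).reshape(s, n, n), axis=-1)
    # Split the features into S segments: (S, N, D/S), and mix each one.
    x_seg = x.reshape(n, s, dim // s).transpose(1, 0, 2)
    y_seg = p @ x_seg
    # Concatenate the segments along the feature axis and fuse them.
    return y_seg.transpose(1, 0, 2).reshape(n, dim) @ w_o

rng = np.random.default_rng(0)
N, D, S, d = 8, 12, 3, 2  # hypothetical sizes
x = rng.standard_normal((N, D))
w_d = rng.standard_normal((D, S * d)) * 0.1
w_mix = rng.standard_normal((N * d, N * N)) * 0.1
w_o = rng.standard_normal((D, D)) * 0.1
y = dynamixer_op(x, w_d, w_mix, w_o, num_segments=S)
print(y.shape)  # (8, 12)
```

Because `w_mix` is reused for every segment, its size does not grow with S; only the small reduction weight does.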
1.6. DynaMixer Block
- The DynaMixer block consists of three components: row mixing, column mixing, and channel mixing.
- In row mixing, the parameters of DynaMixer operations are shared among all rows.
- Similarly, in column mixing, the parameters are shared among all columns.
- The channel mixing is simply a linear transformation on features.
- The outputs of the three components are denoted as Yh, Yw and Yc.
- The three mixing results are simply summed to obtain the final result, i.e. Yh + Yw + Yc.
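The block structure above can be sketched as follows. This is an illustration under assumptions: `mix_rows` and `mix_cols` are hypothetical callables standing in for DynaMixer operations, and the demo uses a uniform-averaging stand-in (`mean_mixer`) rather than a learned mixing matrix.

```python
import numpy as np

def dynamixer_block(x, mix_rows, mix_cols, w_c):
    # x: (H, W, D) grid of tokens.
    # mix_rows / mix_cols map (N, D) tokens to (N, D) tokens; the same callable
    # (i.e. the same parameters) is reused for every row / every column.
    h, w, _ = x.shape
    y_h = np.stack([mix_rows(x[i]) for i in range(h)], axis=0)     # row mixing
    y_w = np.stack([mix_cols(x[:, j]) for j in range(w)], axis=1)  # column mixing
    y_c = x @ w_c                                                  # channel mixing (linear)
    return y_h + y_w + y_c

def mean_mixer(tokens):
    # Stand-in for a DynaMixer operation: a uniform mixing matrix (demo assumption).
    n = tokens.shape[0]
    return np.full((n, n), 1.0 / n) @ tokens

rng = np.random.default_rng(0)
H, W, D = 4, 5, 6  # hypothetical sizes
x = rng.standard_normal((H, W, D))
w_c = rng.standard_normal((D, D)) * 0.1
y = dynamixer_block(x, mean_mixer, mean_mixer, w_c)
print(y.shape)  # (4, 5, 6)
```

Sharing one mixer across all rows (and one across all columns) keeps the parameter count independent of the grid height and width.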
1.7. Model Variants
- There are three versions of DynaMixer, denoted as “DynaMixer-S”, “DynaMixer-M”, and “DynaMixer-L”, according to the model sizes.
- The input image size is 224×224, and the input patch size is 7×7.
- All the proposed models have two stages, and each starts with a patch embedding layer. The patch size for the second stage is 2×2.
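The token grid sizes implied by these patch sizes follow from simple arithmetic. The helper below is a hypothetical illustration that assumes non-overlapping patches (stride equal to the patch size).

```python
def grid_after_patch_embedding(height, width, patch):
    # Non-overlapping patches: stride equal to patch size is an assumption.
    assert height % patch == 0 and width % patch == 0
    return height // patch, width // patch

stage1 = grid_after_patch_embedding(224, 224, 7)   # first-stage token grid
stage2 = grid_after_patch_embedding(*stage1, 2)    # second-stage token grid
print(stage1, stage2)  # (32, 32) (16, 16)
```

Under that assumption, row and column mixing in the first stage operate on 32 tokens at a time, and on 16 tokens in the second stage.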
2.1. MLP-Like Models on ImageNet
The DynaMixer-S model, with 26M parameters, achieves a top-1 accuracy of 82.7%, which already surpasses most of the existing MLP-like models of all sizes and is even better than gMLP-B with 73M parameters.
With the number of parameters increased to 57M, DynaMixer-M obtains an accuracy of 83.7%, which is superior to all other MLP-like models.
Further expanding the network to 97M parameters, DynaMixer-L can achieve top-1 accuracy of 84.3%, which is a new state-of-the-art among the MLP-like models.
2.2. SOTA Comparisons on ImageNet
The proposed models still achieve the best performance among models with a similar number of parameters.
- Specifically, the proposed models achieve better performance than Swin Transformer, which is the state-of-the-art Transformer-based model.
- DynaMixer-S achieves 82.7% accuracy with slightly fewer parameters than Swin-T, which achieves 81.3%.
- For medium-sized and large-sized models, the proposed models are still better than Swin-S and Swin-B. The proposed model is also better than CrossFormer.
Moreover, DynaMixer-S is also faster than ResMLP B24 with a higher top-1 accuracy.
2.3. Ablation Studies & Downstream CIFAR
- Different settings are tried to demonstrate the effectiveness of the model design. (Please read the paper directly for more details.)