Review — GhostNetV2: Enhance Cheap Operation with Long-Range Attention
GhostNetV2 Improves GhostNetV1 With DFC Attention
GhostNetV2: Enhance Cheap Operation with Long-Range Attention,
GhostNetV2, by Peking University, Huawei Noah’s Ark Lab, and University of Sydney,
2022 NeurIPS (Sik-Ho Tsang @ Medium)
Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====
- A hardware-friendly attention mechanism, Decoupled Fully Connected (DFC) Attention, is proposed based on fully-connected layers, which can not only execute fast on common hardware but also capture the dependence between long-range pixels.
- By revisiting the expressiveness bottleneck of GhostNet, a new GhostNetV2 architecture is presented for mobile applications that can aggregate local and long-range information simultaneously.
Outline
- Brief Review of GhostNetV1
- GhostNetV2
- Results
1. Brief Review of GhostNetV1
1.1. Ghost Module in GhostNetV1
- Given an input feature X with height H, width W, and C channels, a typical Ghost module can replace a standard convolution in two steps.
- Firstly, a 1×1 convolution is used to generate the intrinsic features, i.e., Y' = X ∗ F_1×1,
- where ∗ denotes convolution, F_1×1 is the point-wise convolutional filter, and Y' has a channel number C'out that is usually smaller than that of the original output features, i.e., C'out < Cout.
- Then, cheap operations (e.g., depth-wise convolution) are used to generate more features based on the intrinsic features.
- The two parts of features are concatenated along the channel dimension, i.e., Y = Concat([Y', Y' ∗ Fdp]),
- where Fdp is the depth-wise convolutional filter, and Y is the output feature with a channel size of Cout. A minimal code sketch of this module is given below.
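To make the two-step construction concrete, here is a minimal PyTorch-style sketch of a Ghost module. The class name, the ratio argument, and the BatchNorm/ReLU placement are illustrative assumptions rather than the official implementation.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal sketch of a Ghost module (illustrative, not the official code)."""
    def __init__(self, c_in, c_out, ratio=2, dw_kernel=3):
        super().__init__()
        assert c_out % ratio == 0, "assume c_out is divisible by ratio for simplicity"
        c_prime = c_out // ratio        # intrinsic channels C'out < Cout
        c_cheap = c_out - c_prime       # channels produced by the cheap operation
        # Step 1: 1x1 (point-wise) convolution generates the intrinsic features Y'
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_prime, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_prime),
            nn.ReLU(inplace=True),
        )
        # Step 2: cheap operation (depth-wise convolution) generates more features from Y'
        self.cheap = nn.Sequential(
            nn.Conv2d(c_prime, c_cheap, dw_kernel, padding=dw_kernel // 2,
                      groups=c_prime, bias=False),
            nn.BatchNorm2d(c_cheap),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_prime = self.primary(x)                    # Y' = X * F_1x1
        y_cheap = self.cheap(y_prime)                # cheap features from Y'
        return torch.cat([y_prime, y_cheap], dim=1)  # Y = Concat([Y', Y' * Fdp])

# e.g., GhostModule(16, 32)(torch.randn(1, 16, 56, 56)).shape -> torch.Size([1, 32, 56, 56])
```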
1.2. GhostNetV1 Bottleneck
- (a) Left: A block of GhostNetV1 is constructed by stacking two Ghost modules.
- (a) Right: Similar to MobileNetV2, it is also an inverted bottleneck, i.e., the first Ghost module acts as an expansion layer to increase the number of channels, and the second Ghost module reduces the number of channels to match the shortcut path (see the sketch below).
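Under the same assumptions as the Ghost module sketch above, a stride-1 GhostNetV1 bottleneck can be sketched as two stacked Ghost modules with a shortcut. The official block also handles stride-2 down-sampling and an optional SE module, which are omitted here.

```python
class GhostBottleneckV1(nn.Module):
    """Sketch of a stride-1 GhostNetV1 bottleneck (illustrative only)."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.ghost1 = GhostModule(c_in, c_mid)    # expansion: increase the channel number
        self.ghost2 = GhostModule(c_mid, c_out)   # projection: match the shortcut channels
        # assumed 1x1 projection when the shortcut widths differ (the official code differs)
        self.shortcut = (nn.Identity() if c_in == c_out
                         else nn.Conv2d(c_in, c_out, kernel_size=1, bias=False))

    def forward(self, x):
        return self.ghost2(self.ghost1(x)) + self.shortcut(x)
```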
1.3. Problem of Ghost Module in GhostNetV1
- Though the Ghost module can reduce the computational cost significantly, the representation ability is inevitably weakened.
In GhostNetV1, the spatial information is captured only by the cheap operation (usually a 3×3 depth-wise convolution) for half of the features, while the remaining features are produced by a 1×1 point-wise convolution without any interaction with other pixels.
2. GhostNetV2
2.1. Desired Attention Properties
- A desired attention is expected to have the following properties:
- Long-range: It is vital to capture the long-range spatial information for attention to enhance the representation ability, as a light-weight CNN (e.g., MobileNet [13], GhostNet [8]) usually adopts small convolution filters (e.g., 1×1 convolution) to save computational cost.
- Deployment-efficient: The attention module should be extremely efficient to avoid slowing the inference down. Expensive transformations with high FLOPs or hardware-unfriendly operations are not preferred.
- Concept-simple: The attention module should be conceptually simple, so that the model generalizes well to diverse tasks.
2.2. Decoupled Fully Connected (DFC) Attention
- Though self-attention operations, as in ViT, Swin Transformer, or MobileViT, can model the long-range dependence well, they are NOT deployment-efficient.
Fully-connected (FC) layers with fixed weights are simpler and easier to implement.
- Given a feature Z of size H×W×C, it can be seen as HW tokens zi of size C, i.e., Z = {z11, z12, …, zHW}. A direct implementation of an FC layer to generate the attention map can be formulated as: a_hw = Σ_{h',w'} F_{hw,h'w'} ⊙ z_{h'w'},
- where ⊙ is element-wise multiplication, F is the learnable weights in the FC layer, and A = {a11, a12, …, aHW} is the generated attention map.
- It is much simpler than the typical self-attention. However, its computational process still requires quadratic complexity of O(H²×W²).
- The feature’s 2D shape naturally provides a perspective to reduce the computation of FC layers, i.e., decomposing the above equation into two FC layers that aggregate features along the horizontal and vertical directions, respectively. It can be formulated as: a'_hw = Σ_{h'} F^H_{h,h'w} ⊙ z_{h'w}, followed by a_hw = Σ_{w'} F^W_{w,hw'} ⊙ a'_{hw'},
- where F^H and F^W are the transformation weights. The two equations are applied to the features sequentially, capturing the long-range dependence along the two directions, respectively.
- Two depth-wise convolutions with kernel sizes 1×K_H and K_W×1 are sequentially applied to the input feature. When implemented with convolutions, the theoretical complexity of DFC attention is O(K_H×H×W + K_W×H×W) (a code sketch is given below).
The attention mechanism in MobileViT [23] only adds about 20% theoretical FLOPs, but requires 2× inference time on a mobile device. The large difference between theoretical and practical complexity shows that it is necessary to design a hardware-friendly attention mechanism for fast implementation on mobile devices.
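As a concrete illustration of the decoupled aggregation described above, here is a minimal PyTorch-style sketch that implements the horizontal and vertical FC layers as two depth-wise convolutions. The kernel sizes, layer order, and the BatchNorm are assumptions for illustration, not the official code.

```python
class DFCAttention(nn.Module):
    """Sketch of DFC attention: decoupled horizontal/vertical aggregation
    via two depth-wise convolutions (illustrative only)."""
    def __init__(self, channels, k_h=5, k_w=5):
        super().__init__()
        # horizontal aggregation: depth-wise conv with a 1 x K_W kernel
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k_w),
                                    padding=(0, k_w // 2), groups=channels, bias=False)
        # vertical aggregation: depth-wise conv with a K_H x 1 kernel
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k_h, 1),
                                  padding=(k_h // 2, 0), groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, z):
        # sequential aggregation: O(K_H*H*W + K_W*H*W) instead of O(H^2*W^2)
        a = self.horizontal(z)
        a = self.vertical(a)
        return self.bn(a)   # attention logits A (sigmoid is applied by the caller)
```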
2.3. Enhanced Ghost Module
- The input feature X is sent to two branches, i.e., one is the Ghost module to produce output feature Y, and the other is the DFC module to generate attention map A.
- A 1×1 convolution is used to convert the module’s input X into the DFC attention’s input Z, from which the attention map is computed.
- The final output O of the module is the element-wise product of the two branches’ outputs, i.e., O = Sigmoid(A) ⊙ V(X), where V denotes the Ghost module (a sketch is given below).
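A minimal sketch of this two-branch design, reusing the GhostModule and DFCAttention sketches above (the wiring and layer names are assumptions; the down-sampling for efficiency is added in the next subsection):

```python
class EnhancedGhostModule(nn.Module):
    """Sketch of the enhanced Ghost module: O = Sigmoid(A) * V(X) (illustrative only)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.ghost = GhostModule(c_in, c_out)                          # feature branch V(X)
        self.to_z = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)  # 1x1 conv: X -> Z
        self.dfc = DFCAttention(c_out)                                 # attention branch

    def forward(self, x):
        y = self.ghost(x)                           # Y = V(X)
        a = torch.sigmoid(self.dfc(self.to_z(x)))   # A = Sigmoid(DFC(Z))
        return y * a                                # O = A (element-wise) Y
```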
2.4. Feature Downsampling
- DFC attention increases the computation. To reduce the complexity, the feature is down-sampled both horizontally and vertically, so that all the operations in DFC attention are conducted on the smaller feature.
- By default, the width and height are both scaled to half of their original lengths, so the attention operates on H/2 × W/2 = HW/4 positions, which reduces the FLOPs of DFC attention by about 75%.
- The produced attention map is then up-sampled to the original size to match the feature size of the Ghost branch. Average pooling and bilinear interpolation are simply used for down-sampling and up-sampling, respectively.
- Also, since directly applying the sigmoid (or hard sigmoid) function on the large feature map incurs longer latency, the sigmoid function is applied on the down-sampled features to accelerate practical inference. Though the values of the attention map may then not be strictly limited to the range (0, 1), the authors empirically find that its impact on the final performance is negligible (see the sketch below).
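The down-sampling, sigmoid-on-the-small-map, and up-sampling steps can be sketched as a wrapper around the DFC branch, reusing the DFCAttention sketch above. The 2× scale factor, average pooling, and bilinear interpolation follow the description; everything else is an assumption.

```python
import torch.nn.functional as F

class DownsampledDFCBranch(nn.Module):
    """Sketch of the deployment-friendly DFC branch (illustrative only)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.to_z = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)  # X -> Z
        self.dfc = DFCAttention(c_out)

    def forward(self, x):
        h, w = x.shape[-2:]
        # down-sample by 2x in both directions: attention runs on ~HW/4 positions (~75% fewer FLOPs)
        z = F.avg_pool2d(self.to_z(x), kernel_size=2, stride=2)
        # sigmoid on the small map is cheaper than on the full-resolution map
        a = torch.sigmoid(self.dfc(z))
        # up-sample back to match the Ghost branch's feature size
        return F.interpolate(a, size=(h, w), mode='bilinear', align_corners=False)
```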
2.5. GhostNetV2 Bottleneck
- Similar to the GhostNetV1 bottleneck, a DFC attention branch is placed in parallel with the first Ghost module to enhance the expanded features.
- Then, the enhanced features are sent to the second Ghost module to produce the output features (a sketch follows below).
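Putting the pieces together, a stride-1 GhostNetV2 bottleneck can be sketched as follows, with the down-sampled DFC branch enhancing the expanded features of the first Ghost module (again illustrative; stride-2 and SE variants are omitted):

```python
class GhostBottleneckV2(nn.Module):
    """Sketch of a stride-1 GhostNetV2 bottleneck (illustrative only)."""
    def __init__(self, c_in, c_mid, c_out):
        super().__init__()
        self.ghost1 = GhostModule(c_in, c_mid)          # expansion branch V(X)
        self.attn = DownsampledDFCBranch(c_in, c_mid)   # parallel DFC attention branch
        self.ghost2 = GhostModule(c_mid, c_out)         # projection back to the shortcut width
        self.shortcut = (nn.Identity() if c_in == c_out
                         else nn.Conv2d(c_in, c_out, kernel_size=1, bias=False))

    def forward(self, x):
        enhanced = self.ghost1(x) * self.attn(x)        # enhance the expanded features
        return self.ghost2(enhanced) + self.shortcut(x)
```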
3. Results
3.1. ImageNet
GhostNetV2 achieves significantly higher performance with lower computational cost.
- For example, GhostNetV2 achieves 75.3% top-1 accuracy with only 167M FLOPs, significantly outperforming GhostNetV1 (74.5%) at a similar computational cost.
Owing to the deployment efficiency of DFC attention, GhostNetV2 also achieves a good trade-off between accuracy and practical speed.
- For example, with similar inference latency (e.g., 37 ms), GhostNetV2 achieves 75.3% top-1 accuracy, which is obviously better than GhostNetV1 with 74.5% top-1 accuracy.
3.2. MS COCO
With different input resolutions, GhostNetV2 shows obvious superiority over GhostNetV1. The proposed DFC attention can effectively endow the Ghost module with a large receptive field, and thus constructs a more powerful and efficient block.
3.3. ADE20K
GhostNetV2 also achieves significantly higher performance than GhostNetV1.
3.4. Ablation Studies & Visualizations
When applying attention to MobileNetV2, the proposed DFC attention achieves higher performance than existing attention methods, e.g., SENet, CBAM, and Coordinate Attention (CA).
- Table 6: Increasing the kernel size to capture the longer range information can significantly improve the performance.
- Table 7: DFC attention can improve performance when implementing it on any stage.
- Table 8: With similar computational costs, enhancing the expanded features brings 1.4% top-1 accuracy improvement, which is much higher than enhancing the output feature.
- Table 9: Though the sigmoid and hard sigmoid functions bring obvious performance improvement, directly implementing them on the large feature maps incurs long latency. Implementing the sigmoid before up-sampling is much more efficient while resulting in similar accuracy.
- Table 10: Max pooling is slightly more efficient than average pooling (37.5 ms vs. 38.4 ms), and bilinear interpolation is faster than bicubic interpolation (37.5 ms vs. 39.9 ms). The authors state that max pooling for down-sampling and bilinear interpolation for up-sampling are chosen by default. (But in the methodology part and also on GitHub, the authors mention using average pooling, so it should probably be average pooling.)
- Figure 6: In the lower layers, the decoupled attention shows some cross-shaped patterns, indicating that patches along the vertical/horizontal lines participate more. As the depth increases, the attention map diffuses and becomes more similar to full attention.