# Review — GhostNetV2: Enhance Cheap Operation with Long-Range Attention

## GhostNetV2, Improves GhostNetV1 With DFC Attention

GhostNetV2: Enhance Cheap Operation with Long-Range Attention, GhostNetV2, by Peking University, Huawei Noah’s Ark Lab, and University of Sydney, 2022 NeurIPS (Sik-Ho Tsang @ Medium)

Image Classification: 1989 … 2022: [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III]; 2023: [Vision Permutator (ViP)]


- A hardware-friendly attention mechanism, **Decoupled Fully Connected (DFC) Attention**, is proposed based on fully-connected layers. It can not only execute fast on common hardware but also capture the dependence between long-range pixels.
- By revisiting the expressiveness bottleneck of GhostNet, a new **GhostNetV2** architecture is presented for mobile applications that can **aggregate local and long-range information simultaneously.**

# Outline

1. **Brief Review of GhostNetV1**
2. **GhostNetV2**
3. **Results**

# 1. Brief Review of **GhostNetV1**

## 1.1. Ghost Module in GhostNetV1

- Given an **input feature** $X$ with height $H$, width $W$, and channel number $C$, a typical **Ghost module** can **replace a standard convolution in two steps.**
- Firstly, a **1×1 convolution** is used to **generate the intrinsic feature**, i.e.:

$$Y' = X * F_{1\times 1}$$

- where $*$ denotes convolution, $F_{1\times 1}$ is the point-wise convolution filter, and the intrinsic feature $Y'$ has a **channel size usually smaller than that of the original output features, i.e.,** $C'_{out} < C_{out}$.
- Then, **cheap operations (e.g., depth-wise convolution)** are used to **generate more features** based on the intrinsic features.
- The **two parts of features** are **concatenated** along the channel dimension, i.e.:

$$Y = \mathrm{Concat}([Y', \; Y' * F_{dp}])$$

- where $F_{dp}$ is the **depth-wise convolutional filter**, and $Y$ is the **output** feature, which has a **channel size** of $C_{out}$.
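To make the saving concrete, here is a rough multiply-accumulate (MAC) count for the two steps above, compared with a plain 1×1 convolution producing the same number of output channels. This is only a sketch under stated assumptions: half of the output channels are intrinsic (`ratio=2`), the cheap operation is a 3×3 depth-wise convolution, and biases, batch norm and activations are ignored. The function names and the example sizes are hypothetical, not taken from the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k=1):
    """MACs of a dense k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k


def ghost_module_macs(h, w, c_in, c_out, dw_kernel=3, ratio=2):
    """MACs of a Ghost module: 1x1 primary conv + depth-wise cheap operation.

    Assumption: c_out // ratio channels are intrinsic (C'_out); the remaining
    channels are generated by a dw_kernel x dw_kernel depth-wise convolution.
    """
    c_intrinsic = c_out // ratio
    primary = h * w * c_in * c_intrinsic                         # Y' = X * F_1x1
    cheap = h * w * (c_out - c_intrinsic) * dw_kernel * dw_kernel  # Y' * F_dp
    return primary + cheap


if __name__ == "__main__":
    dense = standard_conv_macs(14, 14, 64, 128)
    ghost = ghost_module_macs(14, 14, 64, 128)
    print(dense, ghost, ghost / dense)
```

With these example sizes the Ghost module needs a bit more than half of the MACs of the dense 1×1 convolution, because the depth-wise part costs only 9 MACs per generated output value.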

## 1.2. GhostNetV1 Bottleneck

- **(a) Left**: A block of GhostNetV1 is constructed by **stacking two Ghost modules**.
- **(a) Right**: Similar to MobileNetV2, it is also an **inverted bottleneck**, i.e., the **first Ghost module** acts as an expansion layer to **increase the number of output channels**, and the **second Ghost module reduces the number of channels** to match the shortcut path.

## 1.3. Problem of Ghost Module in GhostNetV1

- Though the Ghost module can reduce the computational cost significantly, the **representation ability is inevitably weakened.**

In GhostNetV1, the **spatial information** is only captured by the cheap operations (usually implemented by **3×3 depth-wise convolution**) for **half of the features**. The **remaining features** are just produced by **1×1 point-wise convolution**, without any interaction with other pixels.

# 2. **GhostNetV2**

## 2.1. Desired Attention Properties

- A **desired attention** is expected to have the following properties:
  - **Long-range:** It is vital to **capture the long-range spatial information** for attention to enhance the representation ability, as a light-weight CNN (e.g., MobileNet [13], GhostNet [8]) usually adopts small convolution filters (e.g., 1×1 convolution) to save computational cost.
  - **Deployment-efficient:** The attention module should be extremely efficient to **avoid slowing the inference down**. Expensive transformations with high FLOPs or hardware-unfriendly operations are not preferred.
  - **Concept-simple:** The attention module should be conceptually simple, to keep the model’s generalization on diverse tasks.

## 2.1. Decoupled Fully Connected (DFC) Attention

- Though **self-attention** operations, as in **ViT**, **Swin Transformer**, or **MobileViT**, **can model the long-range dependence well**, they are **NOT deployment-efficient.**

**Fully-connected (FC) layers** with fixed weights are **simpler** and **easier** to implement.

- Given **a feature** $Z$ of **size** $H \times W \times C$, it can be **seen as $HW$ tokens** $z_i$ of size $C$, i.e., $Z = \{z_{11}, z_{12}, \ldots, z_{HW}\}$. A **direct implementation of an FC layer** to generate the attention map is formulated as:

$$a_{hw} = \sum_{h', w'} F_{hw, h'w'} \odot z_{h'w'}$$

- where $\odot$ is element-wise multiplication, $F$ is the learnable weights in the **FC layer**, and $A = \{a_{11}, a_{12}, \ldots, a_{HW}\}$ is the generated attention map.
- It is much **simpler than the typical self-attention**. However, its computational process **still requires quadratic complexity** of $O(H^2 W^2)$.
- The feature’s 2D shape naturally provides a perspective to reduce the computation of FC layers, i.e., **decomposing the above equation into two FC layers** and **aggregating features along the horizontal** and **vertical** directions, respectively. It can be formulated as:

$$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h, h'w} \odot z_{h'w}, \quad h = 1, \ldots, H, \; w = 1, \ldots, W$$

$$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w, hw'} \odot a'_{hw'}, \quad h = 1, \ldots, H, \; w = 1, \ldots, W$$

- where $F^H$ and $F^W$ are transformation weights. The above two equations are applied to the features **sequentially, capturing the long-range dependence along the two directions, respectively.**
- **Two depth-wise convolutions** with **kernel sizes** $1 \times K_H$ and $K_W \times 1$ are **sequentially** applied on the input feature. When implemented with convolution, the theoretical **complexity** of DFC attention is $O(K_H \times H \times W + K_W \times H \times W)$.
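The two-step decomposition can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: it handles a single channel stored as a nested list, the weight layouts `f_h[h][h2][w]` and `f_w[w][h][w2]` are an indexing convention chosen here, and the cost helpers simply count multiplications per channel. The kernel size 5 in the example is an arbitrary, hypothetical choice.

```python
def dfc_attention(z, f_h, f_w):
    """Decoupled FC attention on one channel of an H x W feature map.

    z   : H x W nested list
    f_h : vertical FC weights, indexed f_h[h][h2][w]
    f_w : horizontal FC weights, indexed f_w[w][h][w2]
    """
    H, W = len(z), len(z[0])
    # Vertical step: position (h, w) aggregates over its own column.
    a_vert = [[sum(f_h[h][h2][w] * z[h2][w] for h2 in range(H))
               for w in range(W)] for h in range(H)]
    # Horizontal step: position (h, w) aggregates over its own row of a_vert.
    return [[sum(f_w[w][h][w2] * a_vert[h][w2] for w2 in range(W))
             for w in range(W)] for h in range(H)]


def full_fc_cost(H, W):
    """Multiplications per channel for the direct FC attention: (HW)^2."""
    return (H * W) ** 2


def decoupled_fc_cost(H, W):
    """Multiplications per channel for the two decoupled FC layers."""
    return H * W * (H + W)


def dfc_conv_cost(H, W, k_h, k_w):
    """Multiplications per channel for the depth-wise conv implementation."""
    return k_h * H * W + k_w * H * W


if __name__ == "__main__":
    z = [[1, 2, 3], [4, 5, 6]]
    H, W = 2, 3
    ones_h = [[[1] * W for _ in range(H)] for _ in range(H)]
    ones_w = [[[1] * W for _ in range(H)] for _ in range(W)]
    # With all-ones weights every output equals the sum of all inputs,
    # which shows that each position sees the whole feature map.
    print(dfc_attention(z, ones_h, ones_w))
    print(full_fc_cost(14, 14), decoupled_fc_cost(14, 14), dfc_conv_cost(14, 14, 5, 5))
```

The all-ones example is a quick sanity check of the receptive field: after the vertical and horizontal steps, every output position depends on every input position, even though no single step looks beyond one column or one row.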

The **attention** mechanism in MobileViT [23] **only adds about 20% theoretical FLOPs**, but requires **2× inference time** on a mobile device. The **large difference between theoretical and practical complexity** shows that it is **necessary to design a hardware-friendly attention** mechanism **for fast implementation on mobile devices.**

## 2.2. Enhanced Ghost Module

- The **input feature** $X$ is sent to two branches, i.e., **one** is the **Ghost module** to **produce output feature** $Y$, and the **other** is the **DFC module** to **generate attention map** $A$.
- Just as a typical self-attention uses linear layers to transform the input feature into query and key for calculating attention maps, **a 1×1 convolution is used to convert the module’s input** $X$ into DFC’s input $Z$.
- The **final output** $O$ of the module is the **product of the two branches’ outputs**:

$$O = \mathrm{Sigmoid}(A) \odot Y$$
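As a small illustration of how the two branches are combined, the sketch below gates a Ghost-branch output with an attention map, element by element, after squashing the attention values with a sigmoid. It works on a single channel stored as nested lists; the function name `enhanced_ghost_output` is hypothetical and the inputs are made-up numbers.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def enhanced_ghost_output(ghost_out, attn):
    """Element-wise product of the Ghost-branch output and the gated attention.

    ghost_out, attn : nested lists of the same H x W shape (one channel each).
    """
    return [[sigmoid(a) * y for y, a in zip(y_row, a_row)]
            for y_row, a_row in zip(ghost_out, attn)]


if __name__ == "__main__":
    ghost_out = [[2.0, 4.0], [6.0, 8.0]]
    attn = [[0.0, 0.0], [5.0, -5.0]]
    print(enhanced_ghost_output(ghost_out, attn))
```

An attention value of 0 halves the corresponding feature, a large positive value passes it almost unchanged, and a large negative value suppresses it.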

## 2.3. Feature Downsampling

- DFC attention increases the computation. To reduce the complexity, **the feature’s size is reduced by down-sampling it both horizontally and vertically**, so that all the **operations in DFC attention** can be conducted on the **smaller features.**
- By default, the width and height are both scaled to half of their original lengths, which **reduces the FLOPs of DFC attention by 75%.**
- The produced feature map is then upsampled to the original size to match the feature’s size in the Ghost branch. **Average pooling** and **bilinear interpolation** are naively used for downsampling and upsampling, respectively.
- Also, since directly implementing the sigmoid (or hard sigmoid) function on the full-size feature would incur longer latency, **the sigmoid function is also applied on the downsampled features to accelerate practical inference**. Though the values of the attention map may then not be strictly limited to the range (0,1), the authors empirically find that the impact on the final performance is negligible.
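The downsample, attend, sigmoid, upsample order described above can be sketched as follows. This is a minimal single-channel illustration under several assumptions: the map has even height and width, the DFC attention itself is passed in as a hypothetical callable `dfc` (identity by default), and the bilinear upsampling uses a half-pixel-center convention that may differ from what a given framework does. The cost helper reuses the convolution complexity formula from Section 2.1 with a hypothetical kernel size of 5.

```python
import math


def avg_pool_2x2(x):
    """2x2 average pooling with stride 2 (assumes even height and width)."""
    return [[(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4.0
             for j in range(0, len(x[0]), 2)] for i in range(0, len(x), 2)]


def upsample_bilinear(x, factor=2):
    """Bilinear upsampling with half-pixel centers and edge clamping."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h * factor):
        sy = min(max((i + 0.5) / factor - 0.5, 0.0), h - 1)
        y0 = int(sy)
        y1, fy = min(y0 + 1, h - 1), sy - y0
        row = []
        for j in range(w * factor):
            sx = min(max((j + 0.5) / factor - 0.5, 0.0), w - 1)
            x0 = int(sx)
            x1, fx = min(x0 + 1, w - 1), sx - x0
            top = x[y0][x0] * (1 - fx) + x[y0][x1] * fx
            bot = x[y1][x0] * (1 - fx) + x[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out


def attention_gate(z, dfc=lambda t: t):
    """Downsample, apply attention, apply sigmoid on the small map, upsample."""
    small = dfc(avg_pool_2x2(z))
    gated = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in small]
    return upsample_bilinear(gated, factor=2)


def dfc_conv_cost(h, w, k_h, k_w):
    """Multiplications per channel of the two depth-wise convolutions."""
    return k_h * h * w + k_w * h * w


if __name__ == "__main__":
    gate = attention_gate([[0.0] * 4 for _ in range(4)])
    print(len(gate), len(gate[0]), gate[0][0])
    saving = 1 - dfc_conv_cost(7, 7, 5, 5) / dfc_conv_cost(14, 14, 5, 5)
    print(saving)
```

Halving both height and width leaves a quarter of the positions, which is where the 75% reduction comes from; applying the sigmoid before upsampling also means it is evaluated on a quarter of the values.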

## 2.4. GhostNetV2 Bottleneck

- Similar to the idea of the GhostNetV1 bottleneck, **a DFC attention branch is placed in parallel with the first Ghost module to enhance the expanded features**.
- Then, the enhanced features are sent to the second Ghost module for producing output features.

# 3. Results

## 3.1. ImageNet

GhostNetV2 achieves **significantly higher performance** with **lower computational cost.**

- For example, **GhostNetV2 achieves 75.3% top-1 accuracy with only 167M FLOPs**, which **significantly outperforms GhostNetV1 (74.5%)** with similar computational cost (167M FLOPs).

Owing to the deployment efficiency of DFC attention, GhostNetV2 also achieves a **good trade-off between accuracy and practical speed.**

- For example, with **similar inference latency (e.g., 37 ms)**, **GhostNetV2** achieves 75.3% top-1 accuracy, which is **obviously better than GhostNetV1** with 74.5% top-1 accuracy.

## 3.2. MS COCO

With **different input resolutions**, GhostNetV2 shows obvious superiority to **GhostNetV1**. The proposed **DFC attention** can effectively endow a **large receptive field** to the Ghost module, and then **construct a more powerful and efficient block.**

## 3.3. ADE20K

GhostNetV2 also achieves **significantly higher performance** than GhostNetV1.

## 3.4. Ablation Studies & Visualizations

Applying attention on MobileNetV2, the proposed **DFC attention achieves higher performance than existing attention methods**, e.g., SENet, CBAM, and Coordinate Attention (CA).

- **Table 6: Increasing the kernel size** to capture longer-range information can **significantly improve the performance.**
- **Table 7: DFC attention can improve performance** when implemented on any stage.
- **Table 8**: With similar computational costs, **enhancing the expanded features brings 1.4% top-1 accuracy improvement**, which is much higher than enhancing the output feature.
- **Table 9**: Though sigmoid and hard sigmoid functions bring obvious performance improvement, directly implementing them on the large feature maps incurs long latency. **Implementing sigmoid before up-sampling is much more efficient** but results in **similar accuracy.**
- **Table 10**: Max pooling is slightly more efficient than average pooling (37.5 ms vs. 38.4 ms), and bilinear interpolation is faster than the bicubic one (37.5 ms vs. 39.9 ms). Authors said that the **max pooling for down-sampling and bilinear interpolation for up-sampling are chosen by default. (But in the methodology part and also GitHub, authors mentioned to use average pooling. So it should be average pooling?)**
- **Figure 6**: In low layers, the decoupled attention shows some cross-shaped patterns, indicating patches from the vertical/horizontal lines participate more. **As the depth increases, the pattern of the attention map diffuses and becomes more similar to the full attention.**