Review — ACmix: On the Integration of Self-Attention and Convolution

ACmix, a MIXed model that enjoys the benefits of both self-Attention and Convolution, while having minimum computational overhead compared to pure convolution or self-attention counterparts.

A sketch of ACmix (Right).
  • ACmix, a mixed model that enjoys the benefits of both self-Attention and Convolution, is proposed.

Outline

  1. Decomposition of Convolution and Self-Attention
  2. ACmix
  3. Results

1. Decomposition of Convolution and Self-Attention

  • Convolution and self-attention are each decomposed into two stages, after which the decomposed operations are combined into a single module.

1.1. Decomposition of Convolution

Convolution
  • The standard convolution with a k×k kernel is first rewritten as the summation of feature maps computed from the individual kernel positions (p, q).
  • With the definition of a Shift operation, each per-position feature map g^(p,q)_ij becomes a 1×1 convolution of the input (using the kernel weights at position (p, q)) followed by a shift.
  • As a result, the standard convolution can be summarized as two stages: Stage I projects the input with 1×1 convolutions, and Stage II shifts and sums the projected feature maps. The equations are reconstructed after this list.
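
The original post shows these equations as images; below is a best-effort LaTeX reconstruction in the paper's notation (a transcription aid, not a verbatim copy). Here f_ij and g_ij are the input and output features at pixel (i, j), and K_{p,q} are the kernel weights at position (p, q).

    Standard convolution:
        g_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor}

    Summation over kernel positions:
        g_{ij} = \sum_{p,q} g^{(p,q)}_{ij}, \qquad g^{(p,q)}_{ij} = K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor}

    Shift operation:
        \tilde{f}_{ij} = \mathrm{Shift}(f, \Delta x, \Delta y)_{ij} = f_{i+\Delta x,\; j+\Delta y}

    Two-stage form:
        Stage I:  \tilde{g}^{(p,q)}_{ij} = K_{p,q}\, f_{ij}   (a 1×1 convolution per kernel position)
        Stage II: g_{ij} = \sum_{p,q} \mathrm{Shift}\big(\tilde{g}^{(p,q)},\, p-\lfloor k/2 \rfloor,\, q-\lfloor k/2 \rfloor\big)_{ij}   (shift and summation)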

1.2. Decomposition of Self-Attention

Self-Attention
  • Consider a standard self-attention module with N heads. The output at each pixel is the concatenation (denoted ||) of the N head outputs, where each head aggregates the values in a local region weighted by softmax-normalized query-key products.
  • Multi-head self-attention can likewise be decomposed into two stages: Stage I computes the query, key, and value projections (1×1 convolutions), and Stage II computes the attention weights and aggregates the values. The equations are reconstructed after this list.
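
As above, a best-effort reconstruction of the formulas that appeared as images, in the paper's notation; N_k(i, j) denotes the local k×k region centred at pixel (i, j) and d is the per-head feature dimension.

    Standard N-head self-attention:
        g_{ij} = \big\Vert_{l=1}^{N} \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{A}\big(q^{(l)}_{ij}, k^{(l)}_{ab}\big)\, v^{(l)}_{ab}

    Projections and attention weights:
        q^{(l)}_{ij} = W^{(l)}_q f_{ij}, \quad k^{(l)}_{ij} = W^{(l)}_k f_{ij}, \quad v^{(l)}_{ij} = W^{(l)}_v f_{ij}
        \mathrm{A}\big(q^{(l)}_{ij}, k^{(l)}_{ab}\big) = \operatorname{softmax}_{\mathcal{N}_k(i,j)}\!\Big( q^{(l)\top}_{ij} k^{(l)}_{ab} \,\big/\, \sqrt{d} \Big)

    Two-stage form:
        Stage I:  the q/k/v projections above (1×1 convolutions)
        Stage II: the weight computation and aggregation, g_{ij} = \Vert_{l=1}^{N} \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{A}\big(q^{(l)}_{ij}, k^{(l)}_{ab}\big)\, v^{(l)}_{ab}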

2. ACmix

2.1. Integration of Self-Attention and Convolution

ACmix
  • ACmix also comprises two stages.
  • At Stage I, the input feature is projected by three 1×1 convolutions and each projection is reshaped into N pieces, yielding a rich set of 3×N intermediate feature maps.
  • At Stage II, these intermediate features are reused by two paths: a self-attention path and a convolution path.
  • For the self-attention path, the three groups of feature maps serve as queries, keys, and values, following the traditional multi-head self-attention module.
  • For the convolution path with kernel size k, a light fully connected layer generates k² feature maps, which are then shifted and aggregated as in the convolution decomposition above.
  • Finally, the outputs of the two paths are added together, with their strengths controlled by two learnable scalars: Fout = α · Fatt + β · Fconv. A code sketch of this structure follows below.
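
To make the two-stage structure concrete, here is a minimal PyTorch-style sketch. It is my own illustration, not the authors' code: the attention path is simplified to full attention over all positions instead of the paper's window attention, and the shift-and-sum aggregation of the convolution path is stood in for by a depthwise k×k convolution. All names (ACmixSketch, proj_q, fc, dwc) are made up for this sketch.

    import torch
    import torch.nn as nn

    class ACmixSketch(nn.Module):
        """Illustrative two-stage ACmix-like block (simplified, unofficial)."""

        def __init__(self, channels, heads=4, kernel_size=3):
            super().__init__()
            assert channels % heads == 0
            self.heads = heads
            self.head_dim = channels // heads
            # Stage I: three 1x1 convolutions shared by both paths (q/k/v projections).
            self.proj_q = nn.Conv2d(channels, channels, 1)
            self.proj_k = nn.Conv2d(channels, channels, 1)
            self.proj_v = nn.Conv2d(channels, channels, 1)
            # Convolution path: a light fully connected (1x1) layer over the projected
            # maps, then a depthwise k x k convolution standing in for the paper's
            # fixed-kernel shift + summation.
            self.fc = nn.Conv2d(3 * channels, channels, 1)
            self.dwc = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels)
            # Learnable path weights: F_out = alpha * F_att + beta * F_conv.
            self.alpha = nn.Parameter(torch.tensor(1.0))
            self.beta = nn.Parameter(torch.tensor(1.0))

        def forward(self, x):
            b, c, h, w = x.shape
            q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)

            # Self-attention path (simplified to full attention over all positions,
            # whereas the paper uses local / window attention).
            def heads_first(t):  # (b, c, h, w) -> (b, heads, h*w, head_dim)
                return t.view(b, self.heads, self.head_dim, h * w).transpose(2, 3)
            attn = heads_first(q) @ heads_first(k).transpose(2, 3) / self.head_dim ** 0.5
            attn = attn.softmax(dim=-1)
            f_att = (attn @ heads_first(v)).transpose(2, 3).reshape(b, c, h, w)

            # Convolution path: reuses the same Stage I intermediate features.
            f_conv = self.dwc(self.fc(torch.cat([q, k, v], dim=1)))

            return self.alpha * f_att + self.beta * f_conv

    x = torch.randn(2, 64, 16, 16)
    print(ACmixSketch(64)(x).shape)  # torch.Size([2, 64, 16, 16])

The learnable α and β are exactly the combination studied in the ablation (Fout = α · Fatt + β · Fconv); everything else fills in structure that the paper implements more carefully (window attention, explicit shift kernels, multiple convolution groups).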

2.2. Improved Shift and Summation

Practical improvements on shift operations. (a) Simple implementation with tensor shifts. (b) Fast implementation with carefully designed group convolution kernels. (c) Further adaptations with learnable kernels and multiple convolution groups.
  • Although tensor shifts are theoretically lightweight, shifting tensors in various directions in practice breaks data locality and is hard to vectorize.
  • Instead, depthwise convolution with fixed kernels is applied to realize the shift.
  • Taking Shift(f, −1, −1) as an example, the shifted feature, the fixed kernel that realizes it, and the corresponding depthwise convolution output are reconstructed after this list.
  • On this basis, several adaptations are further introduced to enhance the flexibility of the module, namely learnable (rather than fixed) kernels and multiple convolution groups, as in (c) above.
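
A reconstruction of the referenced formulas (shown as images in the original post; notation follows the paper). For Shift(f, −1, −1), the shifted feature is

    \tilde{f}_{ij} = \mathrm{Shift}(f, -1, -1)_{ij} = f_{i-1,\; j-1}

and a fixed 3×3 depthwise kernel with a single non-zero entry realizes it:

    K = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad
    f^{\mathrm{dwc}}_{ij} = \sum_{p,q \in \{0,1,2\}} K_{p,q}\, f_{i+p-1,\; j+q-1} = f_{i-1,\; j-1} = \tilde{f}_{ij}

A quick PyTorch check of this equivalence (my own snippet, not from the paper; zero padding at the borders is assumed):

    import torch
    import torch.nn.functional as F

    c, h, w = 4, 8, 8
    x = torch.randn(1, c, h, w)

    # Fixed 3x3 depthwise kernel: a single 1 at the top-left corner
    # reproduces Shift(f, -1, -1), i.e. out[i, j] = x[i - 1, j - 1].
    kernel = torch.zeros(c, 1, 3, 3)
    kernel[:, :, 0, 0] = 1.0
    out = F.conv2d(x, kernel, padding=1, groups=c)

    # Reference: an explicit tensor shift with zero padding at the borders.
    ref = torch.zeros_like(x)
    ref[:, :, 1:, 1:] = x[:, :, :-1, :-1]

    print(torch.allclose(out, ref))  # True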

2.3. Computational Cost of ACmix

FLOPs and Parameters for different modules at two stages.
  • The computational cost and the number of learnable parameters at Stage I are the same as those of self-attention and lighter than those of traditional convolution.
  • At Stage II, ACmix introduces additional overhead from the light fully connected layer and the group convolution; this overhead is linear in the channel size C and minor compared with Stage I.

2.4. Generalization to Other Attention Modes

  • The proposed ACmix is independent of the exact self-attention formulation and can be readily applied to different variants. Specifically, the attention weights of these variants can be summarized in a general form, sketched below.
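
A hedged sketch of that general form (reconstructed from the decomposition above rather than copied verbatim from the paper): the Stage II aggregation keeps the same shape, and only the definition of the attention weight A(·,·) changes across variants,

    g_{ij} = \big\Vert_{l=1}^{N} \sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{A}\big(q^{(l)}_{ij}, k^{(l)}_{ab}\big)\, v^{(l)}_{ab}

where A may be, for example, the scaled dot-product softmax above, a dot product with an added relative position bias (as in Swin Transformer), or weights produced from the whole local patch (as in the patch-wise attention of SAN). Since Stage I, the 1×1 q/k/v projections, is untouched, ACmix's shared projections and convolution path carry over directly.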

3. Results

3.1. ImageNet

Comparisons of FLOPs and parameters against accuracy on the ImageNet classification task.

3.2. ADE20K

ADE20K segmentation with Transformer-based models.

3.3. COCO

Left: COCO object detection with ResNet-based models. Middle: COCO object detection with Transformer-based models. Right: Practical inference speed on COCO.

3.4. Ablation Study

Left: Ablation study on methods of combining the two paths, Fout = α · Fatt + β · Fconv. Middle: Ablation study on shift module implementations based on Swin-Transformer-T. Right: |α|, |β| and log(|α/β|) from different layers of SAN-ACmix and Swin-ACmix.

