Review — ACMix: On the Integration of Self-Attention and Convolution

ACmix, a mixed model that enjoys the benefits of both self-attention and convolution.

Sik-Ho Tsang
6 min read · Feb 3


A sketch of ACmix (Right).

On the Integration of Self-Attention and Convolution,
ACMix, by Tsinghua University, Huawei Technologies Ltd., and Beijing Academy of Artificial Intelligence,
2022 CVPR, Over 40 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Convolution, Self-Attention

  • ACmix, a mixed model that enjoys the benefits of both self-attention and convolution, is proposed.


  1. Decomposition of Convolution and Self-Attention
  2. ACmix
  3. Results

1. Decomposition of Convolution and Self-Attention

  • After decomposition of convolution and self-attention, they are mixed/combined as one module.

1.1. Decomposition of Convolution

  • The standard convolution, with kernel $K \in \mathbb{R}^{k \times k}$ (channel indices omitted for clarity), is: $g_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\, j+q-\lfloor k/2 \rfloor}$.
  • It is rewritten as the summation of the feature maps from different kernel positions: $g_{ij} = \sum_{p,q} g^{(p,q)}_{ij}$,
  • where: $g^{(p,q)}_{ij} = K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\, j+q-\lfloor k/2 \rfloor}$.
  • To further simplify the formulation, the Shift operation is defined as $\tilde{f} = \mathrm{Shift}(f, \Delta x, \Delta y)$ with $\tilde{f}_{ij} = f_{i+\Delta x,\, j+\Delta y}$.
  • Then, $g^{(p,q)}_{ij}$ is rewritten as: $g^{(p,q)} = \mathrm{Shift}\big(K_{p,q} f,\; p-\lfloor k/2 \rfloor,\; q-\lfloor k/2 \rfloor\big)$.
  • As a result, the standard convolution can be summarized as two stages:

At the first stage, the input feature map is linearly projected w.r.t. the kernel weight at a certain position (p, q). This is the same as a standard 1×1 convolution.

In the second stage, the projected feature maps are shifted according to their kernel positions and finally aggregated together.
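The two-stage view can be checked numerically. Below is a minimal NumPy sketch (single channel, stride 1, zero padding; the function names are my own) comparing a direct k×k convolution against the 1×1-projection-then-shift decomposition:

```python
import numpy as np

def conv2d(f, K):
    """Direct k x k convolution, stride 1, zero padding (single channel)."""
    k = K.shape[0]
    r = k // 2
    H, W = f.shape
    fp = np.pad(f, r)
    out = np.zeros((H, W))
    for p in range(k):
        for q in range(k):
            # g_ij = sum_{p,q} K[p,q] * f[i+p-r, j+q-r]
            out += K[p, q] * fp[p:p + H, q:q + W]
    return out

def shift(f, dx, dy):
    """Shift(f, dx, dy): out[i, j] = f[i+dx, j+dy], zero outside the border."""
    H, W = f.shape
    r = max(abs(dx), abs(dy), 1)
    fp = np.pad(f, r)
    return fp[r + dx:r + dx + H, r + dy:r + dy + W]

def conv2d_two_stage(f, K):
    """Stage I: per-position 1x1 projection; Stage II: shift and aggregate."""
    k = K.shape[0]
    r = k // 2
    out = np.zeros_like(f)
    for p in range(k):
        for q in range(k):
            g_pq = K[p, q] * f                 # Stage I: 1x1 projection
            out += shift(g_pq, p - r, q - r)   # Stage II: shift, then sum
    return out
```

With a scalar channel, the "1×1 convolution" degenerates to multiplication by K[p, q]; with C channels it would be a matrix product per position, but the decomposition is identical.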

1.2. Decomposition of Self-Attention

  • Consider a standard self-attention module with $N$ heads. The output at pixel $(i,j)$ is computed as: $g_{ij} = \big\Vert_{l=1}^{N}\Big(\sum_{a,b \in \mathcal{N}_k(i,j)} \mathrm{A}\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big)\, v_{ab}^{(l)}\Big)$,
  • where $\Vert$ is the concatenation of the outputs of the $N$ attention heads, $\mathcal{N}_k(i,j)$ is a local $k \times k$ region around $(i,j)$, and the attention weights are computed as: $\mathrm{A}\big(q_{ij}^{(l)}, k_{ab}^{(l)}\big) = \operatorname{softmax}_{\mathcal{N}_k(i,j)}\big(q_{ij}^{(l)\top} k_{ab}^{(l)} / \sqrt{d}\big)$, with $d$ the feature dimension per head.
  • Multi-head self-attention can likewise be decomposed into two stages: Stage I projects the input as $q_{ij}^{(l)} = W_q^{(l)} f_{ij}$, $k_{ij}^{(l)} = W_k^{(l)} f_{ij}$, $v_{ij}^{(l)} = W_v^{(l)} f_{ij}$; Stage II computes the attention weights and aggregates the values.

1×1 convolutions are first conducted in Stage I to project the input feature into queries, keys, and values.

Stage II then comprises the calculation of the attention weights and the aggregation of the value matrices, i.e., gathering local features.
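As a quick illustration, here is a minimal NumPy sketch of the two stages of multi-head self-attention (global attention over a flattened feature map rather than the paper's local window; the function names are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(f, Wq, Wk, Wv, n_heads):
    """f: (L, C) flattened feature map; Wq/Wk/Wv: (C, C) projection weights."""
    L, C = f.shape
    d = C // n_heads
    # Stage I: 1x1 convolutions (per-pixel linear projections) -> q, k, v
    q = (f @ Wq).reshape(L, n_heads, d)
    k = (f @ Wk).reshape(L, n_heads, d)
    v = (f @ Wv).reshape(L, n_heads, d)
    # Stage II: attention weights and aggregation of values, per head
    heads = []
    for h in range(n_heads):
        A = softmax(q[:, h] @ k[:, h].T / np.sqrt(d))  # (L, L) attention weights
        heads.append(A @ v[:, h])                      # weighted sum of values
    return np.concatenate(heads, axis=-1)              # "||": concat the N heads
```

Note that all the heavy learned computation (the three projections) sits in Stage I, while Stage II contains no learned parameters — the observation ACmix builds on.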

2. ACmix

2.1. Integration of Self-Attention and Convolution

  • ACmix also comprises two stages.
  • At Stage I, the input feature is projected by three 1×1 convolutions and reshaped into N pieces each, obtaining a rich set of intermediate features containing 3×N feature maps.
  • At Stage II, there are self-attention path and convolution path.
  • For the self-attention path, the corresponding three feature maps serve as queries, keys, and values, following the traditional multi-head self-attention modules.
  • For the convolution path with kernel size k, a light fully connected layer is adopted to generate k² feature maps, followed by the shift and aggregation operations.
  • Finally, the outputs from both paths are added together, with their strengths controlled by two learnable scalars: $F_{out} = \alpha \cdot F_{att} + \beta \cdot F_{conv}$.
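Putting the pieces together, here is a hypothetical, unbatched, single-head NumPy sketch of the ACmix block (global attention instead of the local window, and the fc-layer shape `W_fc: (3C, k*k*C)` is my own simplification):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shift2d(x, dx, dy):
    """out[i, j] = x[i+dx, j+dy] (per channel), zero outside the border."""
    H, W = x.shape[:2]
    out = np.zeros_like(x)
    i0, i1 = max(0, -dx), min(H, H - dx)
    j0, j1 = max(0, -dy), min(W, W - dy)
    out[i0:i1, j0:j1] = x[i0 + dx:i1 + dx, j0 + dy:j1 + dy]
    return out

def acmix_block(f, Wq, Wk, Wv, W_fc, alpha, beta, k=3):
    """f: (H, W, C); Wq/Wk/Wv: (C, C); W_fc: (3C, k*k*C)."""
    H, W, C = f.shape
    # Stage I: three 1x1 convolutions, shared by both paths
    q, key, v = f @ Wq, f @ Wk, f @ Wv
    inter = np.concatenate([q, key, v], axis=-1)        # 3C intermediate maps

    # Stage II, self-attention path
    L = H * W
    A = softmax(q.reshape(L, C) @ key.reshape(L, C).T / np.sqrt(C))
    F_att = (A @ v.reshape(L, C)).reshape(H, W, C)

    # Stage II, convolution path: light fc -> k^2 maps -> shift and aggregate
    g = (inter @ W_fc).reshape(H, W, k * k, C)
    r = k // 2
    F_conv = np.zeros_like(f)
    for p in range(k):
        for s in range(k):
            F_conv += shift2d(g[:, :, p * k + s], p - r, s - r)

    # outputs of both paths combined with the learnable scalars alpha, beta
    return alpha * F_att + beta * F_conv
```

The key design choice is that the expensive 1×1 projections are computed once and reused by both paths; only the cheap Stage II operations are duplicated.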

2.2. Improved Shift and Summation

Practical improvements on shift operations. (a) Simple implementation with tensor shifts. (b) Fast implementation with carefully designed group convolution kernels. (c) Further adaptations with learnable kernels and multiple convolution groups.
  • Although tensor shifts are theoretically lightweight, shifting tensors in various directions breaks data locality in practice and is difficult to vectorize.
  • Instead, a depthwise convolution with fixed kernels is applied to implement the shift.
  • Take $\mathrm{Shift}(f, -1, -1)$ as an example; the shifted feature is computed as: $\tilde{f}_{ij} = f_{i-1,\, j-1}$.
  • A fixed kernel with a single 1 at the matching position can then be used for the shift: $K = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}$.
  • The corresponding output can be formulated as: $(d_c \ast f)_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-1,\, j+q-1} = f_{i-1,\, j-1} = \tilde{f}_{ij}$.

Therefore, with carefully designed kernel weights for specific shift directions, the convolution outputs are equivalent to the simple tensor shifts, and this modification gives the module higher computational efficiency.
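This equivalence is easy to verify. A small NumPy check (my own function names) compares a plain tensor shift against a convolution whose fixed kernel has a single 1 at the position matching the shift direction — one channel of the depthwise convolution described above:

```python
import numpy as np

def tensor_shift(f, dx, dy):
    """Shift(f, dx, dy): out[i, j] = f[i+dx, j+dy], zero outside the border."""
    H, W = f.shape
    out = np.zeros_like(f)
    i0, i1 = max(0, -dx), min(H, H - dx)
    j0, j1 = max(0, -dy), min(W, W - dy)
    out[i0:i1, j0:j1] = f[i0 + dx:i1 + dx, j0 + dy:j1 + dy]
    return out

def shift_kernel(dx, dy, k=3):
    """Fixed k x k kernel: a single 1 at position (dx + k//2, dy + k//2)."""
    K = np.zeros((k, k))
    K[dx + k // 2, dy + k // 2] = 1.0
    return K

def conv_with_kernel(f, K):
    """k x k convolution, stride 1, zero padding (one depthwise channel)."""
    k = K.shape[0]
    r = k // 2
    H, W = f.shape
    fp = np.pad(f, r)
    out = np.zeros((H, W))
    for p in range(k):
        for q in range(k):
            out += K[p, q] * fp[p:p + H, q:q + W]
    return out
```

Unlike the scattered memory accesses of explicit shifts, the convolution form reads contiguous windows, which is what makes it amenable to vectorized implementations.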

  • On this basis, several adaptations are additionally introduced to enhance the flexibility of the module.

The convolution kernel is then made learnable, with the shift kernels as initialization. This improves the model capacity.

2.3. Computational Cost of ACmix

FLOPs and Parameters for different modules at two stages.
  • The computational cost and training parameters at Stage I are the same as self-attention and lighter than traditional convolution.
  • At Stage II, ACmix introduces additional computation overhead with the light fully connected layer and the group convolution. The computational complexity is linear in the channel size C and minor compared with Stage I.
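A back-of-envelope check of the Stage I claim (multiply-accumulates per spatial position with assumed channel count; biases and the Stage II aggregation are ignored):

```python
# MACs per spatial position, C input and C output channels (assumed values)
C, k = 64, 3
stage1_macs = 3 * C * C       # Stage I: three 1x1-conv projections (q, k, v)
conv_macs = k * k * C * C     # a standard k x k convolution, for comparison
print(conv_macs / stage1_macs)  # 3.0 -> Stage I is 3x lighter for k = 3
```

The ratio is k²/3, so the saving grows with the kernel size.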

2.4. Generalization to Other Attention Modes

  • The proposed ACmix is independent of the exact self-attention formulation and can be readily adopted for its variants (e.g., window attention in Swin or patchwise attention in SAN): only the computation of the attention weights in Stage II changes, while Stage I stays the same.

3. Results

3.1. ImageNet

Comparisons of FLOPs and parameters against accuracy on the ImageNet classification task.

For ResNet-ACmix models, the proposed model outperforms all baselines with comparable FLOPs or parameters.

For SAN-ACmix, PVT-ACmix and Swin-ACmix, the proposed models achieve consistent improvements.

3.2. ADE20K

ADE20K segmentation with Transformer-based models.

Backbones are pretrained on ImageNet-1K. It is shown that ACmix achieves improvements under all settings.

3.3. COCO

Left: COCO Object detection with ResNet-based models. Middle: COCO Object detection with Transformer-based models. Right: Practical inference speed on COCO.

Left & Middle: ACmix consistently outperforms the baselines with similar parameters or FLOPs.

Right: Compared with PVT-S, the proposed model achieves 1.3× the FPS with comparable mAP. For the larger models, the superiority is more distinct.

3.4. Ablation Study

Left: Ablation study on combining methods of two paths. Fout = α · Fatt + β · Fconv. Middle: Ablation study of shift modules implementations based on Swin-Transformer-T. Right: |α|, |β| and log(|α/β|) from different layers of SAN-ACmix and Swin-ACmix.

Left: The combination of convolution and self-attention modules consistently outperforms models with a single path. Using learned parameters imposes higher flexibility for ACmix.

Middle: By substituting the tensor shifts with group convolutions, inference speed is greatly boosted. Also, using learnable convolution kernels and carefully-designed initialization enhance model flexibility and contribute to the final performance.

Right: α and β practically reflect the model’s bias towards convolution or self-attention at different depths. Convolution can serve as good feature extractors at the early stages of the Transformer models. At the middle stage of the network, the model tends to leverage the mixture of both paths with an increasing bias towards convolution. At the last stage, self-attention shows superiority over convolution.


