Review — HaloNet: Scaling Local Self-Attention for Parameter Efficient Visual Backbones

HaloNet, Localized Window for Self-Attention

  • A new self-attention model family, HaloNets, is developed.
  • A strided self-attention layer, a natural extension of strided convolutions, is introduced.
  • To deal with the computational cost at larger resolutions, where global attention is infeasible, the fairly general principle of local processing is followed: a spatially restricted form of self-attention is used.


  1. Convolution, Self-Attention, and SASA
  2. HaloNet: Model Architecture
  3. Experimental Results

1. Convolution, Self-Attention, and SASA

  • A general form of a local 2D pooling function computes an output at location (i, j) by aggregating over a neighborhood N(i, j): y_ij = Σ_{(a,b) ∈ N(i,j)} f(i, j, a, b) · x_ab,
  • where f(i, j, a, b) is a function that returns a weight matrix applied to the input x_ab.
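As a concrete illustration, the general pooling form can be written as a deliberately naive loop; `local_pool` and its argument names are illustrative, not from the paper, and numpy is assumed:

```python
import numpy as np

def local_pool(x, f, k):
    """Naive loop over the general local pooling form:
    y[i, j] = sum over (a, b) in the k x k neighborhood N(i, j)
    of f(i, j, a, b) @ x[a, b], clipped at image boundaries."""
    H, W, _ = x.shape
    r = k // 2
    c_out = f(0, 0, 0, 0).shape[0]
    y = np.zeros((H, W, c_out))
    for i in range(H):
        for j in range(W):
            for a in range(max(0, i - r), min(H, i + r + 1)):
                for b in range(max(0, j - r), min(W, j + r + 1)):
                    y[i, j] += f(i, j, a, b) @ x[a, b]
    return y
```

Plugging in different choices of f recovers convolution (weights that depend only on the relative offset) or attention (content-dependent weights), as the next two subsections describe.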

1.1. Convolution

  • For convolution, f(i, j, a, b) = W_{i−a, j−b}: the weight depends only on the relative position of (a, b) with respect to (i, j), not on the absolute location.

1.2. Self-Attention

  • For self-attention, WQ, WK, and WV are learned linear transformations that are shared across all spatial locations, and respectively produce queries, keys, and values when used to transform x: q_ij = WQ·x_ij, k_ab = WK·x_ab, v_ab = WV·x_ab, with y_ij = Σ_{(a,b) ∈ N(i,j)} softmax_ab(q_ijᵀ k_ab) · v_ab.

1.3. SASA

  • In SASA, self-attention is computed within the local window N(i, j), a k×k window centered around (i, j), just like a convolution.
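A minimal sketch of SASA at a single location, assuming numpy; the relative position embeddings, multiple heads, and batching of the actual paper are omitted, and all names here are illustrative:

```python
import numpy as np

def sasa_pixel(x, Wq, Wk, Wv, i, j, k):
    """Single-location SASA: the query at (i, j) attends over the
    k x k window N(i, j) centered on it (clipped at boundaries).
    Wq, Wk, Wv are shared across all spatial locations."""
    r = k // 2
    q = Wq @ x[i, j]                              # query, shape (d,)
    window = x[max(0, i - r):i + r + 1,
               max(0, j - r):j + r + 1]           # N(i, j)
    flat = window.reshape(-1, x.shape[-1])
    keys, values = flat @ Wk.T, flat @ Wv.T       # (n, d) each
    logits = keys @ q
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                            # softmax over N(i, j)
    return attn @ values                          # weighted sum of values
```

Unlike convolution, the mixing weights here depend on the content of x, not only on relative offsets.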

1.4. Computational Cost

2. HaloNet: Model Architecture

HaloNet local self-attention architecture
  • A compromise solution can be achieved by leveraging the idea that neighboring pixels share most of their neighborhood.
  • The FLOPs can be controlled by varying the number of pixels that form a block. We name this strategy blocked local self-attention.
  • The two extremes discussed above are a special case of blocked local self-attention. Global attention corresponds to setting the block size to be the entire spatial extent, while the per-pixel extraction corresponds to setting the block size to be 1.

2.1. Blocked Local Self-Attention

  • For an image with height H=4, width W=4, and c channels at stride 1, blocking chops the image up into an (H/b, W/b) grid of non-overlapping (b, b) blocks.
  • Each block behaves as a group of query pixels and a haloing operation combines a band of h pixels around them (with padding at boundaries) to obtain the corresponding shared neighborhood block of shape (H/b, W/b, b+2h, b+2h, c) from which the keys and values are computed.
  • H/b×W/b attention operations then run in parallel for each of the query blocks and their corresponding neighborhoods.
  • Another perspective is that blocked local self-attention is only translationally equivariant to shifts of size b. SASA used the same blocking strategy, but with h = ⌊k/2⌋.
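The blocking and haloing steps above can be sketched with numpy (zero padding at the boundaries; function and variable names are illustrative, and a real implementation would use a vectorized gather rather than Python loops):

```python
import numpy as np

def block_and_halo(x, b, h):
    """Blocked local self-attention, extraction steps: split an
    (H, W, c) map into non-overlapping (b, b) query blocks, then
    gather each block's haloed neighborhood via zero padding.

    Returns queries of shape (H//b, W//b, b, b, c) and neighborhoods
    of shape (H//b, W//b, b + 2h, b + 2h, c)."""
    H, W, c = x.shape
    nh, nw = H // b, W // b
    queries = x.reshape(nh, b, nw, b, c).transpose(0, 2, 1, 3, 4)
    xp = np.pad(x, ((h, h), (h, h), (0, 0)))
    side = b + 2 * h
    neighborhoods = np.empty((nh, nw, side, side, c), dtype=x.dtype)
    for u in range(nh):
        for v in range(nw):
            neighborhoods[u, v] = xp[u * b:u * b + side,
                                     v * b:v * b + side]
    return queries, neighborhoods
```

Keys and values are then computed from the neighborhoods, and the H/b × W/b attention operations run in parallel over the block axes.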
Scaling behavior of self-attention mechanisms. f is the number of heads, b is the size of the block, c is the total number of channels, and h is the size of the halo
  • The above table compares different attention approaches.
The attention downsampling layer subsamples the queries but keeps the neighborhood the same as in the stride=1 case
  • Another difference from SASA is HaloNet’s implementation of downsampling: attention followed by post-attention strided average pooling is replaced by a single strided attention layer that subsamples queries, analogous to strided convolutions.
  • This change does not impact accuracy while also reducing the FLOPs 4× in the downsampling layers.
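A back-of-the-envelope count illustrates the 4× saving: subsampling the query grid with stride 2 quarters the number of queries while the key/value neighborhood stays fixed. The helper name is hypothetical, and only the q·k and attn·v matmuls are counted (projections and softmax are ignored):

```python
def attn_flops(b, h, d, stride=1):
    """Rough per-block attention cost: (#queries) x (#neighborhood
    positions) x d x 2 matmuls. A strided attention layer keeps
    every `stride`-th query in the (b, b) block but still attends
    over the full (b + 2h)^2 neighborhood, so FLOPs shrink by
    a factor of stride**2."""
    n_queries = (b // stride) ** 2
    n_neighborhood = (b + 2 * h) ** 2
    return 2 * n_queries * n_neighborhood * d

# stride-2 query subsampling gives exactly a 4x reduction
assert attn_flops(8, 3, 64, stride=2) * 4 == attn_flops(8, 3, 64)
```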
Optimizations improve performance
  • Taken together, the speedups produced by these improvements are significant as seen above, with up to 2× improvements in step time.

2.2. HaloNet Variants

HaloNet model family specification
  • In multiscale architectures, deeper layers have smaller spatial dimensions and more channels; HaloNet also takes advantage of this structure.
  • HaloNet leverages the ResNet design of stacking multiple residual bottleneck blocks together, as tabulated above, with a few minor modifications:
  1. Adding a final 1×1 convolution before the global average pooling for larger models, following EfficientNet.
  2. Modifying the bottleneck block width factor, which is traditionally fixed at 4.
  3. Modifying the output width multiplier of the spatial operation, which is traditionally fixed at 1.
  4. Changing the number of blocks in the third stage from 4 to 3 for computational reasons because attention is more expensive in the higher resolution layers.
  5. The number of heads is fixed for each of the four stages to (4, 8, 8, 8) because heads are more expensive at higher resolutions.
  • To summarize, the scaling dimensions in HaloNet are: image size s, query block size b, halo size h, attention output width multiplier rv, bottleneck output width multiplier rb, number of bottleneck blocks in the third group l3, and final 1×1 conv width df. The attention neighborhoods range from 14×14 (b=8, h=3) to 18×18 (b=14, h=2).
Configurations of HaloNet models, each of which matches a model from the EfficientNet family in terms of parameters
  • Finally, the HaloNet variants from H0 to H7 are established.
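The scaling dimensions listed above can be collected into a small config object; the field names follow the paper's symbols, but the example values below are illustrative, not an official H0–H7 configuration (except that b=8, h=3 matches the paper's smallest 14×14 neighborhood):

```python
from dataclasses import dataclass

@dataclass
class HaloNetConfig:
    s: int     # image size
    b: int     # query block size
    h: int     # halo size
    rv: float  # attention output width multiplier
    rb: float  # bottleneck output width multiplier
    l3: int    # number of bottleneck blocks in the third group
    df: int    # final 1x1 conv width

    def neighborhood_side(self) -> int:
        # each (b, b) query block attends over a (b + 2h)^2 window
        return self.b + 2 * self.h

# illustrative values only, not taken from the paper's tables
cfg = HaloNetConfig(s=256, b=8, h=3, rv=1.0, rb=3.0, l3=7, df=1024)
```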

3. Experimental Results

3.1. Comparison with EfficientNet

HaloNets can match EfficientNets on the accuracy vs. parameter trade-off
  • The proposed best model, H7, achieves 84.9% top-1 ImageNet validation accuracy and 74.7% top-1 accuracy on ImageNet-V2.

3.2. Transfer of Convolutional Components to Self-Attention

HaloNet improves more than ResNet with regularizations, but does not improve significantly with architectural modules that strongly benefit ResNet.
  • Starting from a baseline model, label smoothing (LS) from Inception-v3, RandAugment (RA), Squeeze-and-Excitation (SE), and SiLU/Swish-1 (SiLU/Sw1) are added in turn.

3.3. Increasing Image Sizes Improves Accuracy

The accuracy gap between HaloNet-50 and ResNet-50 is maintained with increasing image sizes. The HaloNet experiments are annotated with block size (b), halo size (h)

3.4. Relaxing Translational Equivariance

Relaxing translational equivariance improves accuracies

3.5. Convolution-Attention Hybrids Improve the Speed-Accuracy Tradeoff

Replacing attention layers with convolutions in stages 1 and 2 exhibits the best speed vs. accuracy tradeoff

3.6. Window and Halo Size

Increasing window sizes improves accuracy up to a point

3.7. Transfer from ImageNet-21k

HaloNet models pretrained on ImageNet-21k perform well when finetuned on ImageNet

3.8. Detection and Instance Segmentation

Accuracies on object detection and instance segmentation

3.9. Pure Attention Based HaloNet is Slower

Pure attention based HaloNet models are currently slower to train than EfficientNet models
  • However, pure self-attention based HaloNets are currently slower to train than the corresponding EfficientNets and require further optimizations for large batch training.


