Review — FocalNet: Focal Modulation Networks
FocalNet, Focal Modulation Replaces Self-Attention
Focal Modulation Networks,
FocalNet, by Microsoft Research,
2022 NeurIPS (Sik-Ho Tsang @ Medium)
- Focal Modulation Network (FocalNet) is proposed, where self-attention (SA) is completely replaced by a focal modulation module that is more effective and efficient for modeling token interactions.
- Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges at different granularity levels, (ii) gated aggregation to selectively aggregate context features for each visual token (query) based on its content, and (iii) modulation or element-wise affine transformation to fuse the aggregated features into the query vector.
Outline
- Focal Modulation Network (FocalNet)
- Results
1. Focal Modulation Network (FocalNet)
1.1. Fig. 2(a) Self-Attention (SA):
- Conceptually, Self-Attention (SA) works as follows:
- For each visual token (query) xi, a feature representation yi is generated via an interaction T1 with its surroundings X (e.g., neighboring tokens), followed by an aggregation M1 over the contexts: yi = M1(T1(xi, X), X).
1.2. Fig. 2(b) Focal Modulation:
- In this paper, Focal Modulation is proposed to replace SA: the context features are first aggregated using M2 at each location i, then the query interacts with the aggregated feature using T2 to fuse the contexts and form yi: yi = T2(M2(i, X), xi).
- Specifically, in this study, the above focal modulation is formulated as: yi = q(xi) ⊙ m(i, X),
- where q(.) is a query projection function, m(., .) is the context aggregation function whose output is called the modulator, and ⊙ is the element-wise multiplication operator. That means the interaction operator T2 is implemented using a simple q(.) and ⊙ (a toy comparison with SA is sketched below).
- The paper provides intuitions/requirements for each component of the above conceptual equations, e.g., translation invariance, explicit input-dependency, spatial- and channel-specific modulation, and decoupled feature granularity. (Please feel free to read the paper directly.)
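To make the ordering difference concrete, below is a toy PyTorch sketch (not the paper's implementation) that contrasts the two formulations for a single query token. The mean-pooled context here is only a stand-in for the real modulator m(i, X), which is described in the next section.

```python
import torch
import torch.nn as nn

C, N = 16, 8                       # channel dim, number of context tokens
x_i = torch.randn(C)               # query token xi
X = torch.randn(N, C)              # surrounding tokens (contexts)

# (a) Self-attention: interaction first (T1 -> attention scores), then aggregation (M1).
scores = torch.softmax(X @ x_i / C ** 0.5, dim=0)   # T1: query-context interaction
y_sa = scores @ X                                   # M1: aggregation over contexts

# (b) Focal modulation: aggregation first (M2 -> modulator), then a light interaction (T2).
m_i = X.mean(dim=0)                # M2: stand-in for the modulator m(i, X)
q = nn.Linear(C, C)                # q(.): query projection
y_fm = q(x_i) * m_i                # T2: element-wise modulation, yi = q(xi) ⊙ m(i, X)
```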
1.3. Fig. 2(c) Context Aggregation via M2
- The aggregation procedure consists of two steps:
- Hierarchical contextualization to extract contexts from local to global ranges at different levels of granularity, and;
- Gated aggregation to condense all the context features at different granularity levels into a single feature vector, called the modulator.
1.3.1. Fig. 2(c) Hierarchical Contextualization
- Given the input feature map X, it is first projected into a new feature space with a linear layer: Z^0 = f_z(X).
- Then, a hierarchical representation of contexts is obtained using a stack of L depth-wise convolutions, a design originating from MobileNetV1.
- At focal level l ∈ {1, …, L}, the output Z^l is derived by: Z^l = f_a^l(Z^(l−1)) = GeLU(Conv_dw(Z^(l−1))),
- where f_a^l is the contextualization function at the l-th level, implemented via a depth-wise convolution Conv_dw with kernel size k^l followed by a GeLU activation.
- Furthermore, to capture the global context, a global average pooling (Avg-Pool) is applied on the L-th level feature map: Z^(L+1) = Avg-Pool(Z^L).
- In total, (L+1) feature maps {Z^l}, l = 1, …, L+1, are obtained, which collectively capture short- and long-range contexts at different levels of granularity.
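As a concrete illustration, here is a minimal PyTorch sketch of hierarchical contextualization. It assumes channel-last inputs of shape (B, H, W, C); the class name, default kernel sizes, and layer names are illustrative choices, not taken from the official implementation.

```python
import torch
import torch.nn as nn

class HierarchicalContextualization(nn.Module):
    def __init__(self, dim=96, focal_levels=3, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.f_z = nn.Linear(dim, dim)               # Z^0 = f_z(X)
        self.focal_layers = nn.ModuleList([
            nn.Sequential(                           # f_a^l: depth-wise conv + GeLU
                nn.Conv2d(dim, dim, kernel_size=k, padding=k // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for k in kernel_sizes[:focal_levels]
        ])

    def forward(self, x):                            # x: (B, H, W, C)
        z = self.f_z(x).permute(0, 3, 1, 2)          # to (B, C, H, W) for depth-wise convs
        contexts = []
        for layer in self.focal_layers:              # Z^l = GeLU(Conv_dw(Z^(l-1)))
            z = layer(z)
            contexts.append(z)
        # Global context: Z^(L+1) = Avg-Pool(Z^L), broadcast back over the spatial grid
        contexts.append(z.mean(dim=(2, 3), keepdim=True).expand_as(z))
        return contexts                              # (L+1) context maps, each (B, C, H, W)
```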
1.3.2. Fig. 2(c) Gated Aggregation
- A gating mechanism is used to control how much to aggregate from the feature maps at different levels.
- A linear layer is used to obtain spatial- and level-aware gating weights G = f_g(X), with one gating map per focal level, i.e., (L+1) maps in total.
- A weighted sum through element-wise multiplication is then used to obtain a single feature map Z^out, which has the same size as the input X: Z^out = Σ_{l=1}^{L+1} G^l ⊙ Z^l.
- Up to this point, all the aggregation is spatial. To model the communication across different channels, another linear layer h(.) is used to obtain the modulator: M = h(Z^out).
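Continuing the sketch, a minimal gated aggregation step could look as follows. It consumes the (L+1) context maps produced above and returns the modulator; the class and layer names are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    def __init__(self, dim=96, focal_levels=3):
        super().__init__()
        self.f_g = nn.Linear(dim, focal_levels + 1)   # G = f_g(X): one gate map per level
        self.h = nn.Linear(dim, dim)                  # h(.): channel communication

    def forward(self, x, contexts):                   # x: (B, H, W, C); contexts: (L+1) maps of (B, C, H, W)
        gates = self.f_g(x).permute(0, 3, 1, 2)       # (B, L+1, H, W), spatial- and level-aware
        z_out = sum(g.unsqueeze(1) * z                # Z^out = sum_l G^l ⊙ Z^l
                    for g, z in zip(gates.unbind(dim=1), contexts))
        return self.h(z_out.permute(0, 2, 3, 1))      # modulator M = h(Z^out): (B, H, W, C)
```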
1.3.3. All Together
- The paper summarizes the whole procedure with pseudo code; a composed sketch is given below.
- Finally, given the implementation of M2 as described above, focal modulation can be rewritten at the token level as: yi = q(xi) ⊙ h(Σ_{l=1}^{L+1} g_i^l · z_i^l), where g_i^l and z_i^l are the gating value and the context feature at location i of G^l and Z^l, respectively.
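Putting the pieces together, the block below composes the two sketches above with the query projection q(.) and the element-wise modulation. The final output projection is an extra assumption on my part, so treat this as a sketch of the idea rather than the official FocalNet code.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    def __init__(self, dim=96, focal_levels=3):
        super().__init__()
        self.q = nn.Linear(dim, dim)                   # q(.): query projection
        self.context = HierarchicalContextualization(dim, focal_levels)
        self.aggregate = GatedAggregation(dim, focal_levels)
        self.proj = nn.Linear(dim, dim)                # output projection (assumed)

    def forward(self, x):                              # x: (B, H, W, C)
        contexts = self.context(x)                     # (L+1) hierarchical context maps
        modulator = self.aggregate(x, contexts)        # m(i, X) for every token i
        return self.proj(self.q(x) * modulator)        # yi = q(xi) ⊙ m(i, X)

# Quick shape check
block = FocalModulation(dim=96)
print(block(torch.randn(2, 14, 14, 96)).shape)         # torch.Size([2, 14, 14, 96])
```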
1.4. Model Variants
- Four variants are made to match the complexities of Swin and Focal Transformers.
2. Results
2.1. ImageNet
- Table 2: FocalNets outperform conventional CNNs (e.g., ResNet and its augmented versions), MLP architectures such as MLP-Mixer and gMLP, and Transformer architectures such as DeiT and PVT.
- FocalNets with small receptive fields (SRF) achieve consistently better performance than Swin Transformer, with similar model size, FLOPs, and throughput.
- Table 3: Overlapped patch embedding improves the performance for models of all sizes.
- Table 4: Going deeper but thinner improves the performance of FocalNets significantly.
- Table 5: With ImageNet-22K pretraining, they consistently outperform Swin Transformers, indicating that FocalNets are equally or more scalable and data-efficient.
- FocalNets achieve much better performance than their ViT counterparts, with a relatively small reduction in inference speed (18% for the tiny model and 10% for the small and base models). At the tiny scale, FocalNet outperforms ViT by 1.9%; at the base scale, it surpasses ViT by 0.6%.
2.2. MS COCO
- Compared with Swin Transformers, FocalNets improve the box mAP (APb) by 2.2, 1.5, and 1.9 under the 1× schedule for the tiny, small, and base models, respectively.
- Remarkably, the 1× performance of FocalNet-T/B (45.9/48.8) rivals Swin-T/B (46.0/48.5) trained with the 3× schedule.
- Table 8: With 3 other detection frameworks, FocalNets bring clear gains to all three detection methods over the previous SoTA methods.
2.3. ADE20K
- FocalNet outperforms Swin and Focal Transformers significantly under all settings.
2.4. Comparison with ConvNeXt
- FocalNets outperform ConvNeXt in most cases across the board.
- (There are also ablation studies and visualizations in the paper; please feel free to read the paper directly.)