# Brief Review — MobileViTv2: Separable Self-attention for Mobile Vision Transformers

## MobileViTv2 in 2023 TMLR, Improves MobileViTv1 in 2022 ICLR

Separable Self-attention for Mobile Vision Transformers, by Apple Inc.

MobileViTv2, 2023 TMLR, Over 170 Citations (Sik-Ho Tsang @ Medium)

Image Classification: [Vision Permutator (ViP)] [ConvMixer] [CrossFormer++] [FastViT] [EfficientFormerV2]

1989 … 2023

==== My Other Paper Readings Are Also Over Here ====

- It’s been a while since the last story about an image classification paper.
- **MobileViTv1** uses **multi-headed self-attention (MHA)** in **Transformers**, which requires **O(*k*²)** time complexity with respect to the number of tokens *k*.
- This paper introduces a **separable self-attention method with linear complexity, i.e. O(*k*)**.

# Outline

1. **MobileViTv2**
2. **Results**

# 1. MobileViTv2

## 1.1. MHA in Transformer

- MHA in Transformer **feeds an input** *x*, comprising *k* *d*-dimensional token (or patch) embeddings, **to 3 branches, namely query** *Q*, **key** *K*, **and value** *V*. Each branch (*Q*, *K*, and *V*) is comprised of *h* linear layers (heads). **The dot-product between the outputs of the linear layers in** *Q* **and** *K* is then computed simultaneously for all *h* heads, and is **followed by a softmax operation** *σ* to **produce an attention (or context-mapping) matrix** *a*. **The outputs of the** *h* **heads are concatenated** to produce a tensor with *k* *d*-dimensional tokens, which is then **fed to another linear layer with weights** *WO* to **produce the output of MHA**, *y*. Mathematically:

*y* = Concat(*a*₁·*xWV*¹, …, *aₕ*·*xWVʰ*)·*WO*, with *aᵢ* = *σ*(⟨*xWQⁱ*, *xWKⁱ*⟩) for each head *i*

- where the symbol ⟨·, ·⟩ denotes the dot-product operation.
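The O(*k*²) cost comes from the (*k*, *k*) attention matrix formed in every head. A minimal NumPy sketch of standard MHA (with the usual √d scaling; the variable names and toy shapes are illustrative, not the paper's code):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, Wq, Wk, Wv, Wo):
    """Multi-headed self-attention over x of shape (k, d).

    Wq, Wk, Wv: lists of h per-head weight matrices, each (d, d // h).
    Wo: output projection of shape (d, d).
    Each head forms a (k, k) attention matrix `a` -> O(k^2) cost.
    """
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        q, kk, v = x @ wq, x @ wk, x @ wv               # (k, d/h) each
        a = softmax(q @ kk.T / np.sqrt(q.shape[-1]))    # (k, k) attention matrix
        heads.append(a @ v)                             # (k, d/h)
    return np.concatenate(heads, axis=-1) @ Wo          # (k, d)

# Toy shapes: k = 4 tokens, d = 8 dims, h = 2 heads.
rng = np.random.default_rng(0)
k, d, h = 4, 8, 2
x = rng.standard_normal((k, d))
Wq = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wk = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wv = [rng.standard_normal((d, d // h)) for _ in range(h)]
Wo = rng.standard_normal((d, d))
y = mha(x, Wq, Wk, Wv, Wo)
print(y.shape)  # (4, 8)
```

Note how the `(k, k)` matrix makes memory and compute grow quadratically with sequence length, which is the bottleneck this paper targets.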

## 1.2. MHA in Linformer

- In **Linformer**, MHA in (a) is extended by introducing token projection layers, which **project** the *k* tokens to a pre-defined number of tokens *p*, thus reducing the complexity from O(*k*²) to O(*k*).
- However, it **still uses costly operations** (e.g., batch-wise matrix multiplication) **for computing self-attention**.
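A sketch of the Linformer idea, shown single-head for brevity: fixed projection matrices (here called `E` and `F`, an illustrative naming) compress the *k* keys and values down to *p* tokens, so the attention matrix is (*k*, *p*) rather than (*k*, *k*):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(x, Wq, Wk, Wv, E, F):
    """Single-head Linformer-style attention over x of shape (k, d).

    E, F: (p, k) token-projection matrices; with p fixed, the
    attention matrix is (k, p), i.e. linear in k.
    """
    q = x @ Wq                                     # (k, d)
    kp = E @ (x @ Wk)                              # (p, d) projected keys
    vp = F @ (x @ Wv)                              # (p, d) projected values
    a = softmax(q @ kp.T / np.sqrt(q.shape[-1]))   # (k, p) attention matrix
    return a @ vp                                  # (k, d)

rng = np.random.default_rng(0)
k, d, p = 16, 8, 4
x = rng.standard_normal((k, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
E, F = rng.standard_normal((p, k)), rng.standard_normal((p, k))
y = linformer_attention(x, Wq, Wk, Wv, E, F)
print(y.shape)  # (16, 8)
```

Even so, the `q @ kp.T` and `a @ vp` steps are still batch-wise matrix multiplications, which is the cost the proposed method avoids.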

## 1.3. Separable self-attention in Proposed MobileViTv2

- Similar to MHA, **the input** *x* is processed using **3 branches**, i.e., **input** *I*, **key** *K*, **and value** *V*.
- **The input branch** *I* maps each *d*-dimensional token in *x* to a scalar using a linear layer with weights *WI*. The weights *WI* serve as the **latent node** *L*, as in Fig. 4(b). This linear projection is an inner-product operation that computes the distance between the latent token *L* and *x*, resulting in a *k*-dimensional vector. A **softmax** operation is then applied to this *k*-dimensional vector to **produce context scores** *cs*.
- **The context scores** *cs* are used to **compute a context vector** *cv*. Specifically, the input *x* is linearly projected to a *d*-dimensional space using key branch *K* with weights *WK* to produce an output *xK*. **The context vector** *cv* is then computed as **a weighted sum of** *xK* as:

*cv* = Σᵢ *cs*(*i*)·*xK*(*i*), summing over the *k* tokens

- As seen, the context vector *cv* is analogous to the attention matrix *a* in the sense that it also encodes the information from all tokens in the input *x*, but it is cheap to compute. The contextual information encoded in *cv* is shared with all tokens in *x*.

- In branch *V*, **the input** *x* is linearly projected with **weights** *WV*, followed by **ReLU**, to **produce an output** *xV*.
- **The contextual information in** *cv* is then propagated to *xV* via a broadcasted element-wise multiplication operation. The resultant output is then **fed to** another linear layer with **weights** *WO* to **produce the final output** *y*. Mathematically:

*y* = (*cv* ∗ *xV*)·*WO*, where ∗ denotes the broadcasted element-wise multiplication

## 1.4. Model Variants

**The width of the MobileViTv2 network is uniformly scaled using a width multiplier** *α* ∈ {0.5, 2.0}. This is in contrast to MobileViTv1, which trains three specific architectures (XXS, XS, and S) for mobile devices.
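A minimal sketch of uniform width scaling, using hypothetical per-stage widths and the divisor-rounding convention common in mobile architectures such as MobileNet (the exact rounding rule used by MobileViTv2 may differ):

```python
def scale_width(base_channels, alpha, divisor=8):
    """Scale a channel count by width multiplier alpha, rounding to a
    hardware-friendly multiple of `divisor`. This rounding convention
    is an assumption borrowed from MobileNet-style models."""
    return max(divisor,
               int(base_channels * alpha + divisor / 2) // divisor * divisor)

base = [32, 64, 128, 256]                        # hypothetical per-stage widths
print([scale_width(c, 0.5) for c in base])       # narrower variant
print([scale_width(c, 2.0) for c in base])       # wider variant
```

One multiplier thus yields a whole family of models from a single architecture definition, instead of hand-designing XXS/XS/S variants.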

# 2. Results

## 2.1. Inference Time of Self-Attention Blocks

A 3× improvement in inference speed is obtained by the proposed MobileViTv2, with similar performance on the ImageNet-1k dataset.

## 2.2. Comparison With MobileViTv1

The proposed separable self-attention is fast and efficient compared to MobileViTv1.

## 2.3. SOTA Comparisons on Image Classification

MobileViTv2 bridges the latency gap between CNN- and ViT-based models on mobile devices while maintaining performance with similar or fewer parameters.

## 2.4. Downstream Tasks

MobileViTv2 delivers competitive performance at different complexities while having significantly fewer parameters and FLOPs.

MobileViTv2 delivers competitive performance to models with different capacities, further validating the effectiveness of the proposed separable self-attention method.