Review — Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

Axial-DeepLab, for Both Image Classification & Segmentation

Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation,
Axial-DeepLab, by Johns Hopkins University, and Google Research,
2020 ECCV, Over 400 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Panoptic Segmentation, Instance Segmentation, Semantic Segmentation
  • Conventional 2D self-attention, which has very high computational complexity, is factorized into two 1D self-attentions.
  • A position-sensitive self-attention design is proposed.
  • Combining both yields the position-sensitive axial-attention layer.
  • By stacking the position-sensitive axial-attention layers, Axial-DeepLab models are formed for image classification and dense prediction.


  1. Position-Sensitive Axial-Attention Layer
  2. Axial-DeepLab
  3. Results

1. Position-Sensitive Axial-Attention Layer

1.1. Conventional Self-Attention

  • Given an input feature map $x$ with height $h$, width $w$, and channels $d_{in}$, the output at position $o=(i, j)$, $y_o$, is computed by pooling over the projected input as:
      $$y_o = \sum_{p \in \mathcal{N}} \mathrm{softmax}_p(q_o^\top k_p)\, v_p$$
  • where $\mathcal{N}$ is the whole location lattice, and queries $q_o = W_Q x_o$, keys $k_o = W_K x_o$, values $v_o = W_V x_o$ are all linear projections of the input $x_o$.
  • However, global self-attention is extremely expensive to compute (O(h²w²)), because every position attends to every other position. The axial-attention layer, described later, is used to reduce this complexity.
  • Another drawback is that the global pooling does not exploit positional information, which is critical for capturing spatial structures or shapes in vision tasks. Position-sensitive self-attention addresses this issue.
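
The two drawbacks above can be seen on a toy example. The following is a minimal sketch in plain Python, not the paper's implementation: the dictionary-based feature map, the helper names (dot, matvec, softmax, global_self_attention), the 2×3 input, and the identity projection matrices are all assumptions made for illustration.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(W, x):
    """Apply a projection matrix W (list of rows) to vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_self_attention(x, WQ, WK, WV):
    """x maps each position (i, j) to its input vector x_o."""
    positions = sorted(x)
    q = {p: matvec(WQ, x[p]) for p in positions}
    k = {p: matvec(WK, x[p]) for p in positions}
    v = {p: matvec(WV, x[p]) for p in positions}
    y = {}
    for o in positions:
        # One score per (o, p) pair, so (h*w)^2 scores in total.
        weights = softmax([dot(q[o], k[p]) for p in positions])
        y[o] = [sum(wt * v[p][c] for wt, p in zip(weights, positions))
                for c in range(len(v[o]))]
    return y

# Toy 2x3 feature map with d_in = 2 and identity projections (illustrative values).
h, w = 2, 3
x = {(i, j): [float(i + 1), float(j)] for i in range(h) for j in range(w)}
eye = [[1.0, 0.0], [0.0, 1.0]]
y = global_self_attention(x, eye, eye, eye)
num_scores = (h * w) ** 2  # 36 query-key scores for only 6 positions

# Swap the contents of two other positions: the output at (0, 0) does not change
# (up to rounding), because the pooling never looks at where a feature sits.
x_swapped = dict(x)
x_swapped[(0, 1)], x_swapped[(1, 2)] = x[(1, 2)], x[(0, 1)]
y_swapped = global_self_attention(x_swapped, eye, eye, eye)
```

The score count grows as (hw)², which is the O(h²w²) cost, and the swap shows that the output at a position is blind to the spatial arrangement of the other features.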

1.2. Position-Sensitive Self-Attention

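  • SASA restricts attention to a local region and adds a learnable relative positional encoding $r_{p-o}$ to the attention logits:
      $$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \mathrm{softmax}_p(q_o^\top k_p + q_o^\top r_{p-o})\, v_p$$
  • where $\mathcal{N}_{m \times m}(o)$ is the local m×m square region around $o=(i, j)$.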
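
As a rough illustration of the SASA-style formulation, the sketch below computes the output at one position from its m×m window, with the logit for each neighbor p taken as q_o·k_p + q_o·r_{p-o}. The function name sasa_local_attention, the dictionary layout, the toy values, and the choice to skip window positions that fall outside the feature map are all assumptions of this sketch, not details taken from SASA or the paper.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sasa_local_attention(q, k, v, r, o, m, h, w):
    """Output at position o; r maps an offset (di, dj) to the vector r_{p-o}."""
    i, j = o
    half = m // 2
    # Assumption: window positions outside the feature map are simply skipped.
    window = [(i + di, j + dj)
              for di in range(-half, half + 1)
              for dj in range(-half, half + 1)
              if 0 <= i + di < h and 0 <= j + dj < w]
    # Content term plus query-dependent positional term.
    logits = [dot(q[o], k[p]) + dot(q[o], r[(p[0] - i, p[1] - j)])
              for p in window]
    weights = softmax(logits)
    return [sum(wt * v[p][c] for wt, p in zip(weights, window))
            for c in range(len(v[o]))]

# Toy 3x3 map with identical content everywhere (illustrative values).
h, w, m = 3, 3, 3
grid = [(i, j) for i in range(h) for j in range(w)]
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
q = {p: [1.0] for p in grid}
k = {p: [0.0] for p in grid}
v = {p: [float(p[1])] for p in grid}  # value = column index

r_zero = {d: [0.0] for d in offsets}                            # no positional signal
r_right = {d: [5.0 if d == (0, 1) else 0.0] for d in offsets}   # favor the right neighbor

y_plain = sasa_local_attention(q, k, v, r_zero, (1, 1), m, h, w)
y_right = sasa_local_attention(q, k, v, r_right, (1, 1), m, h, w)
```

With all-zero encodings the attention is uniform and the output is the mean column index of the window; with the encoding that favors offset (0, 1), the output moves toward the right neighbor's value even though the content at every position is identical.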

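In this paper, learnable relative positional encodings $r^q_{p-o}$, $r^k_{p-o}$, $r^v_{p-o}$ are added for the query, key, and value, respectively:

$$y_o = \sum_{p \in \mathcal{N}_{m \times m}(o)} \mathrm{softmax}_p(q_o^\top k_p + q_o^\top r^q_{p-o} + k_p^\top r^k_{p-o})\,(v_p + r^v_{p-o})$$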
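
A minimal sketch of this position-sensitive variant follows, assuming the logit for a neighbor p is q_o·k_p + q_o·r^q_{p-o} + k_p·r^k_{p-o} and the pooled value is v_p + r^v_{p-o}. The function name position_sensitive_attention, the dictionary layout, the boundary handling, and the toy encodings are hypothetical choices for illustration, not the authors' code.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def position_sensitive_attention(q, k, v, rq, rk, rv, o, m, h, w):
    """Output at o; rq, rk, rv map an offset (di, dj) to an encoding vector."""
    i, j = o
    half = m // 2
    # Assumption: window positions outside the feature map are simply skipped.
    window = [(i + di, j + dj)
              for di in range(-half, half + 1)
              for dj in range(-half, half + 1)
              if 0 <= i + di < h and 0 <= j + dj < w]
    logits, values = [], []
    for p in window:
        d = (p[0] - i, p[1] - j)
        # Content term, query-dependent bias, key-dependent bias.
        logits.append(dot(q[o], k[p]) + dot(q[o], rq[d]) + dot(k[p], rk[d]))
        # The pooled value also carries a positional component.
        values.append([vc + rc for vc, rc in zip(v[p], rv[d])])
    weights = softmax(logits)
    return [sum(wt * val[c] for wt, val in zip(weights, values))
            for c in range(len(v[o]))]

# Toy 3x3 map with identical content everywhere (illustrative values).
h, w, m = 3, 3, 3
grid = [(i, j) for i in range(h) for j in range(w)]
offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
q = {p: [1.0] for p in grid}
k = {p: [1.0] for p in grid}
v = {p: [0.0, 0.0] for p in grid}

rq = {d: [0.0] for d in offsets}
rk = {d: [4.0 if d == (-1, 0) else 0.0] for d in offsets}  # key-side bias toward the row above
rv = {d: [float(d[0]), float(d[1])] for d in offsets}      # value encoding = the offset itself

y = position_sensitive_attention(q, k, v, rq, rk, rv, (1, 1), m, h, w)
```

In this toy setup the key-dependent term concentrates the attention on the neighbor directly above, and because the value encoding carries the offset, the output is close to (-1, 0): the layer reports where it attended, not only what it attended to.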




