# Review — CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation

## CoTr, Deformable Transformer Used In Between Encoder & Decoder

--

CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation,CoTr, by Northwestern Polytechnical University, and The University of Adelaide,2021 MICCAI, Over 150 Citations(Sik-Ho Tsang @ Medium)

Medical Imaging, Medical Image Analysis, Image Segmentation, Transformer

4.2. Biomedical Image Segmentation[Expanded U-Net] [3-D RU-Net] [nnU-Net] [TransUNet]

2015 … 2021

My Other Previous Paper Readings Are Also Over Here

**Convolutional neural network and a****Transformer****(CoTr)**is proposed, where the**CNN**is constructed to**extract feature representations**and an efficient**deformable****Transformer****(DeTrans)**is built to**model the long-range dependency**on the extracted feature maps.- Different from the vanilla Transformer which treats all image positions equally,
**DeTrans pays attention only to a small set of key positions**by introducing the deformable self-attention mechanism.

# Outline

**CoTr Design Difference****CoTr Model Architecture****Results**

**1. CoTr Design Difference**

**(a) CNN**: encoder is composed of multiple stacked convolutional layers.**(b)****SETR**: encoder is purely formed from self-attention layers, i.e., Transformer.**(c)****TransUNet****and (d) CoTr**encoders are both the hybrid of CNN and Transformer.- Differently,
**TransUNet****only processes the low-resolution feature maps from the last stage**due to the high computation and spatial complexities.

CoTris able toprocess the multi-scale and high-resolution feature maps.

**2. CoTr Model Architecture**

## 2.1. CNN-encoder

- The CNN-encoder,
contains a*FCNN*(·)**Conv-IN-****LeakyReLU****block**and**three stages of 3D residual blocks**. - The
**Conv-IN-****LeakyReLU****block**contains a 3D convolutional layer followed by an instance normalization (IN) [18] and leaky rectified linear unit (LeakyReLU) activation. - The
**numbers of 3D residual blocks**in three stages are**three, three, and two**, respectively.

Given an input image

xwith a height ofH, a width ofW, and a depth (i.e., number of slices) ofD,thefeature maps produced bycan be formally expressed as:FCNN(·)

- where
indicates the*L***number of feature levels**,*Θ*denotes the parameters of the CNN-encoder, and*C*denotes the number of channels.

## 2.2. DeTrans-Encoder

- The CNN-encoder cannot capture the long-range dependency of pixels effectively,
**DeTrans-encoder**is used to introduce the**multi-scale deformable self-attention (MS-DMSA)**mechanism for efficient**long-range contextual modeling**. - The DeTrans-encoder is a composition of
**an input-to-sequence layer**and*LD*stacked deformable**Transformer****(DeTrans) layers**.

## 2.2.1. Input-to-Sequence Transformation

**Feature maps {**where l from 1 to*fl*}*L*is**flattened**into a 1D sequence, yet flattening the features leads to**losing the spatial information.**

The

3D positional encoding sequence using sine and cosine functions with different frequenciesis added to the flatten features to compute thepositional coordinates of each dimension pos:

- where
**# ∈ {**indicates each of three dimensions, and:*D*,*H*,*W*}

For each feature level

l,arePED,PEH, andPEWconcatenatedas the3D positional encodingandplcombine it with the flattenedto form the input sequence of DeTrans-encoder.flvia element-wise summation

## 2.2.2. MS-DMSA Layer

- The self-attention layer in the
**conventional****Transformer****look over all possible locations**in the feature map. It has the drawback of**slow convergence**and**high computational complexity**. - (It is better to read Deformable DETR before here.)

FollowingDeformable DETR, MS-DMSA layer focuses only on a small set of key sampling locations around a reference location, instead of all locations.

- where
is the*K***number of sampled key points**,is a*Ψ*(·)**linear projection layer**,**Λ(**is the*zq*)*ilqk*∈ [0, 1]**attention weights**,**Δ**∈*pilqk**R*³ is the**sampling offset**of the*k*-th sampling point in the*l*-th feature level, andto the*σl*(·) re-scales ˆ*pq**l*-th level feature. **Λ(***zq*)*ilqk***Δ**are obtained via*pilqk***linear projection over the query feature**.*zq*- Then, the
**MS-DMSA**layer can be formulated as:

- where
is the*H***number of attention heads**, andis a*Φ*(·)**linear projection layer**that weights and aggregates the feature representation of all attention heads.

## 2.2.3. DeTrans Layer

- The DeTrans layer is
**composed of a MS-DMSA layer and a feed forward network**, each being followed by the layer norm. - The
**skip connection**strategy is employed in each sub-layer to avoid gradient vanishing. - The DeTrans-encoder is constructed by
**repeatedly stacking DeTrans layers**.

## 2.3. Decoder

- The decoder is a
**pure CNN**architecture, which**progressively upsamples the feature maps to the input resolution**. - Besides, the
**skip connections between encoder and decoder**are also added to**keep more low-level details**for better segmentation. - Deep supervision strategy is used by adding
**auxiliary losses**to the**decoder outputs with different scales.** - The loss function is the
**sum of the Dice loss and cross-entropy loss**.

## 2.4. Details

- The
**hidden size**in**MS-DMSA**and**feed forward network**to**384**and**1536**respectively. ,*LD*=6, and*H*=6.*K*=4- Besides,
**two variants**of CoTr with small CNN-encoders are formed, denoted as**CoTr∗**and**CoTr†**. - In
**CoTr∗**, there is**only one 3D residual block in each stage of CNN-encoder**. - In
**CoTr†**, the**number of 3D residual blocks in each stage**of CNN-encoder is**two**.

# 3. Results

## 3.1. SOTA Comparison

- First, although the Transformer architecture is not limited by the type of input images, the
**ViT-B/16 pre-trained on 2D natural images does not work well on 3D medical images**. The**suboptimal performance**may be attributed to the domain shift between 2D natural images and 3D medical images. - Second, ‘CoTr w/o CNN-encoder’ has about 22M parameters and outperforms the SETR with about 100M parameters. It is believed that a
**lightweight****Transformer****more friendly for medical image segmentation tasks**, where there is usually a small training dataset. - Third,
**CoTr∗****with comparable parameters****significantly outperforms ‘CoTr w/o CNN-encoder’**, improving the average Dice over 11 organs by 4%.

It suggests that the

hybrid CNN-Transformerencoderhasdistinct advantages over the pureTransformerencoderin medical image segmentation.

- The same CNN-encoder is used and decoder but
**DeTrans-encoder is replaced with ASPP (****DeepLabv3****), PP (****PSPNet****), and****Non-Local****modules**, respectively.

CoTr elevates consistently the segmentation performanceover ‘CoTr w/o DeTrans’ on all organs andimproves the average Dice by 1.4%.

- The original
**2D****TransUNet****extended**to a**3D**version by using 3D CNN-encoder and decoder is as done in CoTr.

CoTr steadily beatsTransUNetin the segmentation of all organs, particularly for the gallbladder and pancreas segmentation.

## 3.2. Ablation Study

**(a-c)**: In the DeTrans-encoder, there are**three hyper-parameters, i.e.,**, which represent the*K*,*H*, and*LD***number of sampled key points, heads, and stacked DeTrans layers**, respectively.- To investigate the impact of their settings on the segmentation,
is set to*K***1, 2, and 4**,is set to*H***2, 4, and 6**, andis set to*LD***2, 4, and 6**.

It shows that

increasing the number of.K,H, orLDcan improve the segmentation performance

**(d)**: To demonstrate the performance gain resulted from the multi-scale strategy, it is also attempted to train**CoTr with single-scale feature maps from the last stage**.- Using
**multi-scale feature maps**instead of single-scale feature maps can effectively**improve the average Dice by 1.2%.**

Authors mentioned CoTr can be extended to deal with other tasks (e.g., brain structure or tumor segmentation) in the future.