# Review — Recurrent U-Net for Resource-Constrained Segmentation

**Recurrent U-Net (R-UNet) for Budget Concerned Application**

Recurrent U-Net for Resource-Constrained Segmentation, Recurrent U-Net (R-UNet), by CVLab, EPFL, 2019 ICCV, Over 80 Citations (Sik-Ho Tsang @ Medium)

Semantic Segmentation, U-Net

# Outline

1. **Recurrent U-Net (R-UNet)**
2. **Results**

# 1. Recurrent U-Net (R-UNet)

The goal is to operate in resource-constrained environments and keep the model relatively simple.

## 1.1. (a) & (b) Overall Architecture

- **U-Net** is used as the **base architecture**.
- The first convolutional unit has 8 feature channels, and, following the original U-Net strategy, the channel number doubles after every pooling layer in the encoder.
- The decoder relies on transposed convolutions.
- **Group normalization** is used in all convolutional layers, because the batch size is small.
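
To make the channel schedule concrete, here is a minimal Python sketch of the encoder widths implied by the description above. The function name `encoder_channels` and the default of four pooling layers (as in the original U-Net) are assumptions for illustration; the text only fixes the first width at 8 and the doubling rule.

```python
from typing import List


def encoder_channels(base_channels: int = 8, num_poolings: int = 4) -> List[int]:
    """Channel count of each encoder stage when the width doubles after every pooling layer.

    num_poolings=4 is an assumed depth, not a value stated in the review.
    """
    return [base_channels * 2 ** stage for stage in range(num_poolings + 1)]


print(encoder_channels())  # [8, 16, 32, 64, 128]
```

Starting from 8 channels instead of the 64 used by the original U-Net is one reason the model stays small.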

Recursions are integrated on 1) the predicted segmentation mask *s* and 2) multiple internal states of the network.

- The **former** can be achieved by **simply concatenating**, at each recurrent iteration *t*, the previous segmentation mask *s_{t−1}* to the input image, and passing the resulting concatenated tensor through the network.
- For the **latter**, a subset of the encoding and decoding layers of the U-Net is replaced with **a recurrent unit**. There are **two variants** of its internal mechanism, as below.
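
The first kind of recursion (feeding the previous mask back in with the image) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: tensors are nested lists, the initial mask is all zeros, and `toy_segment` is a hypothetical stand-in for the segmentation network, not the actual model.

```python
from typing import Callable, List

Channel = List[List[float]]  # one H x W plane
Tensor = List[Channel]       # C x H x W as nested lists


def recurrent_refine(image: Tensor,
                     segment: Callable[[Tensor], Channel],
                     num_iterations: int = 3) -> List[Channel]:
    """Run the network num_iterations times, each time on image + previous mask."""
    height, width = len(image[0]), len(image[0][0])
    mask = [[0.0] * width for _ in range(height)]  # assumed initial mask: all zeros
    masks = []
    for _ in range(num_iterations):
        stacked = image + [mask]   # channel-wise concatenation of image and s_{t-1}
        mask = segment(stacked)    # new prediction s_t
        masks.append(mask)
    return masks


def toy_segment(stacked: Tensor) -> Channel:
    """Hypothetical stand-in network: per-pixel mean over all input channels."""
    channels = len(stacked)
    height, width = len(stacked[0]), len(stacked[0][0])
    return [[sum(stacked[c][y][x] for c in range(channels)) / channels
             for x in range(width)] for y in range(height)]


image = [[[1.0, 1.0], [1.0, 1.0]]]  # a single-channel 2 x 2 "image"
masks = recurrent_refine(image, toy_segment)
print([m[0][0] for m in masks])  # [0.5, 0.75, 0.875]
```

Each pass sees one extra input channel compared with a plain U-Net, so the only architectural change for this form of recursion is the width of the first convolution.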

## 1.2. (c) Dual-gated Recurrent Unit (DRU)

- Inspired by the Gated Recurrent Unit (GRU), **the recurrent unit** used here, which replaces multiple encoding and decoding layers of the segmentation network, is **similar to GRU**, in order to **preserve the underlying motivation of GRUs**.
- Specifically, at **iteration *t***, given the **activations *e^l_t*** and the **previous hidden state *h_{t−1}***, the aim is to **produce a candidate update *ĥ*** for the hidden state and combine it with the previous one according to how reliable the different elements of this previous hidden state tensor are.
- **Left**: To determine this reliability, an **update gate** defined by a tensor *z* is used. It is computed with *f_z*(·), an **encoder-decoder network** with the same architecture as the portion of the **U-Net** that is replaced with the proposed recurrent unit.
- **Bottom & Bottom Right**: Similarly, the **candidate update** *ĥ* is obtained with *f_h*(·), a network with the same architecture as *f_z*(·) but a separate set of parameters. Here, **⊙** denotes the **element-wise product**, and *r* is **a reset tensor** that allows **masking parts of the input used to compute *ĥ***. The reset tensor is the output of a sigmoid.
- When the sigmoid output is 1, there is no masking. When the sigmoid output is 0, the input is close to 0 (masked).
- Given these different tensors, the **new hidden state *h_t*** is computed.
- Finally, **the output of the recurrent unit**, which corresponds to the activations of the ***l*th decoding layer**, is predicted with *f_s*(·), a **simple convolutional block**.
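
Written out, the update rules described in the list above can be sketched as follows. The arguments of each network, the name *f_r* for the network producing the reset tensor, the tanh on the candidate, and the symbol *d^l_t* for the unit's output are assumptions based on the standard GRU formulation; the text above fixes only the roles of *z*, *r*, *ĥ*, *h_t*, *f_z*, *f_h*, and *f_s*.

$$
\begin{aligned}
z &= \sigma\big(f_z(e^l_t)\big), \\
r &= \sigma\big(f_r(h_{t-1})\big), \\
\hat{h} &= \tanh\big(r \odot f_h(e^l_t)\big), \\
h_t &= z \odot h_{t-1} + (1 - z) \odot \hat{h}, \\
d^l_t &= f_s(h_t).
\end{aligned}
$$

Under this form, elements where *z* is close to 1 keep the previous hidden state, and elements where *z* is close to 0 are overwritten by the candidate.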

Since it relies on two gates, *r* and *z*, it is called the Dual-gated Recurrent Unit (DRU).
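
The gating logic of the DRU can be illustrated with a small element-wise sketch in plain Python. It assumes a GRU-style update (update gate from the current activations, reset gate from the previous hidden state, tanh candidate); the callables `f_z`, `f_h`, `f_r`, and `f_s` are hypothetical stand-ins for the sub-networks, and `f_r` in particular is a name introduced here for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Net = Callable[[Vector], Vector]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dru_step(e_t: Vector, h_prev: Vector,
             f_z: Net, f_h: Net, f_r: Net, f_s: Net) -> Tuple[Vector, Vector]:
    """One assumed DRU update on flattened tensors; returns (h_t, output)."""
    z = [sigmoid(v) for v in f_z(e_t)]     # update gate: how much of h_prev to keep
    r = [sigmoid(v) for v in f_r(h_prev)]  # reset gate: masks the candidate's input
    h_hat = [math.tanh(ri * v) for ri, v in zip(r, f_h(e_t))]  # candidate update
    h_t = [zi * hp + (1.0 - zi) * hh for zi, hp, hh in zip(z, h_prev, h_hat)]
    return h_t, f_s(h_t)


identity = lambda v: list(v)               # toy stand-in for every sub-network
keep_everything = lambda v: [1000.0] * len(v)  # drives the update gate to 1

h_t, out = dru_step([0.5, -0.5], [0.2, 0.4],
                    keep_everything, identity, identity, identity)
print(h_t)  # [0.2, 0.4]
```

With the update gate saturated at 1, the hidden state passes through unchanged, which is the "reliable previous state" case described above.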

## 1.3. (d) Single-gated Recurrent Unit (SRU)

- **DRU may become memory-intensive** depending on the choice of *l*.
- **SRU** has a structure similar to that of the DRU, but **without the reset tensor *r***.
- SRU comes at **very little loss in segmentation accuracy**.
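
Assuming the DRU candidate has the GRU-like form tanh(*r* ⊙ *f_h*(*e^l_t*)), dropping the reset tensor would leave the following candidate, with the update gate and hidden-state combination unchanged. The tanh and the argument *e^l_t* are assumptions; the text above only states that *r* is removed.

$$
\hat{h} = \tanh\big(f_h(e^l_t)\big)
$$

Without *r*, the network that would compute it is no longer needed, which is where the memory saving comes from.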

## 1.4. Training

- The **cross-entropy loss** is used.
- **Supervision at each iteration** of the recurrence is used. Thus, the **overall loss** is a weighted sum of the per-iteration losses, where *N* represents the **number of recursions**, and is **set to 3**.
- The weighting parameter is set to either *α*=1, so that all iterations have **equal importance**, or *α*=0.4, to **put more emphasis on the final prediction**.
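
The effect of *α* can be checked numerically. The sketch below assumes per-iteration weights of the form `w_t = α^(N−t)`, normalized to sum to one, which matches the behavior described above (equal weights for *α*=1, larger weight on later iterations for *α*<1); the exact formula and the function names are assumptions, not taken from the text.

```python
from typing import List


def iteration_weights(num_recursions: int = 3, alpha: float = 1.0) -> List[float]:
    """Assumed weights w_t proportional to alpha**(N - t), normalized to sum to 1."""
    raw = [alpha ** (num_recursions - t) for t in range(1, num_recursions + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def overall_loss(per_iteration_losses: List[float], alpha: float) -> float:
    """Weighted sum of the per-iteration (e.g. cross-entropy) losses."""
    weights = iteration_weights(len(per_iteration_losses), alpha)
    return sum(w * loss for w, loss in zip(weights, per_iteration_losses))


print([round(w, 3) for w in iteration_weights(3, 1.0)])  # [0.333, 0.333, 0.333]
print([round(w, 3) for w in iteration_weights(3, 0.4)])  # [0.103, 0.256, 0.641]
```

Under this assumption, *α*=0.4 puts roughly 64% of the total weight on the third and final prediction.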

# 2. Results

## 2.1. Datasets

- Existing **hand-segmentation benchmarks** are used. However, they are relatively small, with at most 4,800 images in total.
- **A larger dataset** is acquired by the authors themselves. Because this work was initially motivated by an **augmented virtuality project** whose goal is to allow someone to type on a keyboard while wearing a head-mounted display, **50 people are asked to type on 9 keyboards** while wearing an HTC Vive, resulting in a total of **12,536 annotated frames**.
- A **20/20/60%** split is used for **train/validation/test** to set up a **challenging scenario**.
- **Retina Vessels**, **Roads**, and **Cityscapes** are also used.
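
For a sense of scale, the 20/20/60% split of the 12,536 keyboard frames can be worked out as below. This assumes the split is applied per frame with simple rounding; the actual per-split counts (and whether the split is done per person or per keyboard) are not stated here, and `split_counts` is a hypothetical helper.

```python
from typing import List, Tuple


def split_counts(total: int, percentages: Tuple[int, ...]) -> List[int]:
    """Integer split sizes for the given percentages; leftovers go to the last split."""
    counts = [total * p // 100 for p in percentages]
    counts[-1] += total - sum(counts)
    return counts


print(split_counts(12536, (20, 20, 60)))  # [2507, 2507, 7522]
```

Only about 2,500 frames would be available for training, which is what makes the scenario challenging.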

## 2.2. Hand Segmentation

- **Ours-SRU(*l*)** denotes the SRU variant for a given *l*, e.g., *l*=3.
- **U-Net-B** uses batch normalization and **U-Net-G** uses group normalization.
- **Rec-Last** adds a recurrent unit after a convolutional segmentation network, an approach proposed for processing sequential data.
- **Rec-Middle** uses the recurrent unit to **replace the bottleneck between the U-Net encoder and decoder**, instead of adding it at the end of the network.
- **Rec-Simple** is a recursive refinement process, which **concatenates the segmentation mask with the input image** and feeds the result into the network.
- **U-Net-VGG16** and **DRU-VGG16** replace the U-Net and DRU encoders, respectively, with a **pretrained VGG-16 backbone**.
- Similarly for **U-Net-ResNet50** and **DRU-ResNet50**.

Overall, among the light models, the recurrent methods usually outperform the one-shot ones. Besides, among the recurrent ones, Ours-DRU(4) and Ours-SRU(0) clearly dominate, with Ours-DRU(4) usually outperforming Ours-SRU(0) by a small margin.

Ours-DRU(4) is better than the heavy RefineNet model on 4 out of the 5 datasets, despite RefineNet representing the current state of the art.

The DRU-VGG16 model, by using a pretrained deep backbone, yields the overall best performance.

- DRU-VGG16 outperforms Ours-DRU, e.g., by 0.02 mIoU points on KBH. This, however, comes at a cost.

To be precise, DRU-VGG16 has 41.38M parameters. This is more than 100 times larger than Ours-DRU(4), which has only 0.36M parameters. Moreover, DRU-VGG16 runs only at 18 fps, while Ours-DRU(4) reaches 61 fps.
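
The size and speed gap can be quantified directly from the figures quoted above. The snippet is plain arithmetic on those numbers; the variable and function names are illustrative only.

```python
dru_vgg16 = {"params_m": 41.38, "fps": 18}
ours_dru4 = {"params_m": 0.36, "fps": 61}


def frame_time_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given frame rate."""
    return 1000.0 / fps


param_ratio = dru_vgg16["params_m"] / ours_dru4["params_m"]
speedup = ours_dru4["fps"] / dru_vgg16["fps"]

print(round(param_ratio, 1))  # 114.9
print(round(speedup, 2))      # 3.39
print(round(frame_time_ms(18), 1), round(frame_time_ms(61), 1))  # 55.6 16.4
```

So the 0.02 mIoU gain on KBH costs roughly 115 times the parameters and more than three times the per-frame latency.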

## 2.3. Retina Vessel

DRU yields the best mIoU, mPrec and mF1 scores. Interestingly, on this dataset, it even outperforms the larger DRU-VGG16 and DeepLabv3+, which performs comparatively poorly on this task.

- This may be due to the availability of **only limited data**, which leads to **overfitting** for such a very deep network.

Even the tiny vessel branches in the retina, which are ignored by the human annotators, could be correctly segmented by the proposed algorithm.

## 2.4. Road

The proposed methods also outperform all the baselines by a clear margin on this task, with or without ImageNet pretraining.

- In particular, Ours-DRU(4) yields an mIoU 8 percentage points (pp) higher than U-Net-G, and DRU-VGG16 is 5pp higher than U-Net-VGG16. This verifies that **the recurrent strategy helps.**

## 2.5. Cityscapes

Ours-DRU is consistently better than U-Net-G and than the best recurrent baseline, i.e., Rec-Last.

- Furthermore, **doubling the number of channels** of the U-Net backbone **increases accuracy**, and so does using a pretrained VGG-16 as encoder.

Ultimately, the DRU-VGG16 model yields accuracy comparable to the state-of-the-art DeepLabv3, even though the latter uses a ResNet101 backbone.

It is **practical** for **real-time application**, reaching 55 frames per second (fps) when segmenting 230×306 images on an NVIDIA TITAN X with **only 12 GB of memory**.
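
The mIoU, mPrec, and mF1 scores quoted throughout Section 2 build on standard overlap statistics between a predicted and a ground-truth mask. As a rough illustration for a single binary mask, the sketch below computes IoU, precision, recall, and F1 from flat lists of 0/1 labels. The helper name `binary_metrics` and the mask representation are assumptions, and how these values are averaged into the "m" scores is not specified here.

```python
from typing import Dict, List


def binary_metrics(pred: List[int], target: List[int]) -> Dict[str, float]:
    """IoU, precision, recall and F1 for one foreground class on flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(scores["iou"])  # 0.5
```

In this toy case there are 2 true positives, 1 false positive, and 1 false negative, so IoU is 2/4 while precision, recall, and F1 are all 2/3.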

## Reference

[2019 ICCV] [Recurrent U-Net (R-UNet)]

Recurrent U-Net for Resource-Constrained Segmentation
