Review — Recurrent U-Net for Resource-Constrained Segmentation

Recurrent U-Net (R-UNet) for Budget-Constrained Applications

Speed vs. accuracy. Each circle represents a model's performance in terms of frames per second and mIoU accuracy on the Keyboard Hand dataset, using a Titan X (Pascal) GPU.
  • Very deep networks are not always easy to train without very large training datasets and tend to be relatively slow to run on standard GPUs.
  • A novel recurrent U-Net architecture is proposed that preserves the compactness of the original U-Net, while substantially increasing its performance.


  1. Recurrent U-Net (R-UNet)
  2. Results

1. Recurrent U-Net (R-UNet)

Recurrent U-Net (R-UNet).

1.1. (a) & (b) Overall Architecture

  • U-Net is used as the base architecture.
  • The first convolutional unit has 8 feature channels, and, following the original U-Net strategy, the channel number doubles after every pooling layer in the encoder.
  • The decoder relies on transposed convolutions.
  • Group normalization is used in all convolutional layers, since the batch sizes are small.
  • Recurrence is introduced in two ways: by feeding the previous prediction back into the network, and by making part of the network itself recurrent.
  • The former is achieved by simply concatenating, at each recurrent iteration t, the previous segmentation mask st−1 to the input image, and passing the resulting concatenated tensor through the network.
  • For the latter, a subset of the encoding and decoding layers of the U-Net is replaced with a recurrent unit, whose internal mechanism comes in two variants, described below.
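The mask-feedback mechanism can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation: `net`, the tensor shapes, and the zero-initialized mask s0 are illustrative assumptions standing in for the U-Net.

```python
import numpy as np

def recurrent_segment(image, net, n_iters=3):
    """Feed the previous segmentation mask back in with the image.

    `net` is a stand-in for the U-Net: it maps a (C+1, H, W) tensor
    (image channels plus the previous mask s_{t-1}) to an (H, W) soft mask.
    """
    _, h, w = image.shape
    mask = np.zeros((h, w))                              # s_0: empty initial mask
    preds = []
    for _ in range(n_iters):
        x = np.concatenate([image, mask[None]], axis=0)  # concatenate [image, s_{t-1}]
        mask = net(x)                                    # s_t = net([image, s_{t-1}])
        preds.append(mask)
    return preds                                         # one prediction per iteration
```

During training, supervision is applied to every element of `preds`, as described in the Training section below.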

1.2. (c) Dual-gated Recurrent Unit (DRU)

Dual-gated Recurrent Unit (DRU)
  • Inspired by the Gated Recurrent Unit (GRU), the recurrent unit here replaces multiple encoding and decoding layers of the segmentation network while remaining similar to a GRU, so as to preserve the underlying motivation of GRUs.
  • Specifically, at iteration t, given the activations elt of the lth encoding layer and the previous hidden state ht−1, the aim is to produce a candidate update ĥ for the hidden state and to combine it with the previous one according to how reliable the different elements of the previous hidden-state tensor are.
  • Left: To determine this reliability, an update gate, defined by a tensor z, is used:
  • where fz(·) denotes an encoder-decoder network with the same architecture as the portion of the U-Net that is replaced with the proposed recurrent unit.
  • Bottom & Bottom Right: Similarly, the candidate update ĥ is obtained as:
  • where fh(·) is a network with the same architecture as fz(·) but a separate set of parameters, ⊙ denotes the element-wise product, and r is a reset tensor that allows parts of the input used to compute ĥ to be masked. It is computed as:
  • When the sigmoid output is 1, there is no masking; when it is 0, the corresponding input is driven to 0 (masked).
  • Given these different tensors, the new hidden state ht is computed as:
  • Finally, the output of the recurrent unit, which corresponds to the activations of the lth decoding layer, is predicted as:
  • where fs(·) is a simple convolutional block.
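Collecting the bullets above, the DRU update can be written in GRU-style notation. This is a reconstruction from the description: the name $f_r$ for the reset network, the concatenation order, and the convex-combination convention for $h^t$ are assumptions.

```latex
% Update gate: how reliable is each element of the previous hidden state?
z^t = \sigma\big(f_z([\,h^{t-1},\, e_l^t\,])\big)
% Reset tensor: masks parts of the input used for the candidate update
r^t = \sigma\big(f_r([\,h^{t-1},\, e_l^t\,])\big)
% Candidate update for the hidden state
\hat{h}^t = \tanh\big(f_h(r^t \odot [\,h^{t-1},\, e_l^t\,])\big)
% New hidden state: gated blend of the old state and the candidate
h^t = z^t \odot h^{t-1} + (1 - z^t) \odot \hat{h}^t
% Output of the recurrent unit = activations of the l-th decoding layer
d_l^t = f_s(h^t)
```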

1.3. (d) Single-gated Recurrent Unit (SRU)

Single-gated Recurrent Unit (SRU)
  • DRU may become memory-intensive depending on the choice of l.
  • SRU has a structure similar to that of the DRU, but without the reset tensor r:
  • SRU comes at very little loss in segmentation accuracy.
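One SRU iteration can be sketched as below. This is a NumPy sketch under assumptions: the gating convention is borrowed from the GRU, and the toy callables passed in for fz, fh, and fs stand in for what the paper implements as encoder-decoder sub-networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sru_step(e_t, h_prev, f_z, f_h, f_s):
    """One SRU iteration: a single update gate z, no reset tensor r."""
    x = np.concatenate([h_prev, e_t], axis=0)    # joint input: hidden state + activations
    z = sigmoid(f_z(x))                          # update gate in (0, 1)
    h_cand = np.tanh(f_h(x))                     # candidate hidden state (no reset masking)
    h_new = z * h_prev + (1.0 - z) * h_cand      # gated blend (convention assumed)
    return h_new, f_s(h_new)                     # new hidden state, decoder activations
```

Dropping r removes one encoder-decoder sub-network per iteration, which is where the memory saving relative to the DRU comes from.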

1.4. Training

Recurrent U-Net Flow (Figure from Author’s GitHub)
  • The cross-entropy loss is used.
  • Supervision is applied at each iteration of the recurrence. Thus, the overall loss is defined as:
  • where N denotes the number of recursions and is set to 3.
  • Either α = 1 is used, so that all iterations have equal importance, or α = 0.4, to put more emphasis on the final prediction.
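Assuming the per-iteration weights take the form α^(N−t), so that the final prediction always gets weight 1, the weighted loss can be sketched as:

```python
def recurrent_loss(ce_losses, alpha):
    """Weighted sum of per-iteration cross-entropy losses.

    Assumed weighting: w_t = alpha ** (N - t) for t = 1..N, so alpha = 1
    weighs all iterations equally and alpha < 1 emphasizes the last one.
    """
    n = len(ce_losses)
    return sum(alpha ** (n - t) * loss
               for t, loss in enumerate(ce_losses, start=1))
```

For N = 3 and α = 0.4, the weights are 0.16, 0.4, and 1.0 for the three iterations.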

2. Results

2.1. Datasets

Hand-segmentation benchmark datasets.
  • Hand-segmentation benchmarks as above are used. However, they are relatively small, with at most 4,800 images in total.
Samples of Keyboard Hand (KBH) dataset.
Properties of new KBH dataset.
  • A larger dataset was acquired by the authors themselves. Because this work was initially motivated by an augmented virtuality project, whose goal is to let someone type on a keyboard while wearing a head-mounted display, 50 people were asked to type on 9 keyboards while wearing an HTC Vive, resulting in a total of 12,536 annotated frames, as shown above.
  • A 20/20/60% train/validation/test split is used to set up a challenging scenario.
  • Retina Vessels, Roads, and Cityscapes are also used.

2.2. Hand Segmentation

Example predictions on hand segmentation datasets.
Comparing against the state of the art.
  • Ours-SRU(l) denotes variants with different choices of l, e.g., l = 3.
  • U-Net-B uses batch normalization and U-Net-G uses group normalization.
  • Rec-Last adds a recurrent unit after a convolutional segmentation network to process sequential data.
  • Rec-Middle instead uses the recurrent unit to replace the bottleneck between the U-Net encoder and decoder, rather than adding it at the end of the network.
  • Rec-Simple performs a recursive refinement process, which concatenates the segmentation mask with the input image and feeds it back into the network.
  • U-Net-VGG16 and DRU-VGG16 replace the U-Net and DRU encoders, respectively, with a pretrained VGG-16 backbone.
  • Similarly for U-Net-ResNet50 and DRU-ResNet50.
  • DRU-VGG16 outperforms Ours-DRU, e.g., by 0.02 mIoU points on KBH. This, however, comes at a much higher computational cost.

2.3. Retina Vessel

Retina vessel segmentation results.
  • This may be due to the availability of only limited data, which causes such very deep networks to overfit.
Recursive refinement. Retina, hand and road images; segmentation results after 1, 2, and 3 iterations; ground truth.

2.4. Road

Road segmentation results.
  • In particular, Ours-DRU(4) yields an mIoU 8 percentage points (pp) higher than U-Net-G, and DRU-VGG16 is 5 pp higher than U-Net-VGG16. This verifies that the recurrent strategy helps.

2.5. Cityscapes

Cityscapes Validation Set with Resolution 1024×2048.
  • Furthermore, doubling the number of channels of the U-Net backbone increases accuracy, and so does using a pretrained VGG-16 as encoder.

