Brief Review — CornerNet-Lite: Efficient Keypoint-Based Object Detection

Designed CornerNet-Saccade & CornerNet-Squeeze, Outperforms YOLOv3 & CornerNet

Sik-Ho Tsang
4 min read · Jul 5, 2023
CornerNet-Saccade & CornerNet-Squeeze, Two Efficient Object Detectors Based on CornerNet

CornerNet-Lite: Efficient Keypoint-Based Object Detection
CornerNet-Lite, by Princeton University
2020 BMVC, Over 200 Citations (Sik-Ho Tsang @ Medium)


CornerNet-Lite is a combination of two efficient variants of CornerNet:

  • CornerNet-Saccade, which uses an attention mechanism to eliminate the need for exhaustively processing all pixels of the image.
  • CornerNet-Squeeze, which introduces a new compact backbone architecture.

Outline

  1. CornerNet-Saccade
  2. CornerNet-Squeeze
  3. Results

1. CornerNet-Saccade


1.1. Estimating Object Locations

The first step in CornerNet-Saccade is to obtain possible object locations in an image. Downsized full images are used to predict attention maps, which indicate both the locations and the coarse scales of the objects at the locations.

  • For a downsized image, CornerNet-Saccade predicts 3 attention maps, one for small objects, one for medium objects and one for large objects.
  • The feature maps are obtained from the backbone network in CornerNet-Saccade, which is an Hourglass Network. The feature maps from the upsampling layers in the Hourglass are used to predict the attention maps.
  • The attention maps are predicted by applying a 3×3 Conv-ReLU module followed by a 1×1 Conv-Sigmoid module to each feature map.
  • During testing, only locations whose scores are above a threshold t = 0.3 are processed. During training, the center of each bounding box on the corresponding attention map is set as positive and all other locations as negative. Focal loss (RetinaNet) with α = 2 is used.
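The test-time thresholding and the training loss described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary layout of the attention maps, and the toy scores are all hypothetical. The loss also assumes that α = 2 plays the role of the focusing exponent (as in CornerNet's notation), which the text above does not state explicitly.

```python
import math

THRESHOLD = 0.3  # test-time threshold t quoted in the text


def candidate_locations(attention_maps, threshold=THRESHOLD):
    """Return (scale, row, col, score) for every cell scoring above threshold.

    attention_maps: dict mapping a scale name to a 2D list of sigmoid
    scores in [0, 1] (hypothetical layout, one map per object scale).
    """
    found = []
    for scale, grid in attention_maps.items():
        for r, row in enumerate(grid):
            for c, score in enumerate(row):
                if score > threshold:
                    found.append((scale, r, c, score))
    # Highest-scoring locations first
    found.sort(key=lambda item: item[3], reverse=True)
    return found


def focal_loss(p, is_positive, exponent=2.0):
    """Binary focal loss for one attention-map cell.

    Assumption: the exponent corresponds to the "alpha = 2" in the text.
    """
    eps = 1e-12  # avoids log(0)
    if is_positive:
        return -((1.0 - p) ** exponent) * math.log(p + eps)
    return -(p ** exponent) * math.log(1.0 - p + eps)


# Toy 2x2 attention maps with made-up scores
maps = {
    "small": [[0.05, 0.90], [0.10, 0.20]],
    "medium": [[0.40, 0.10], [0.00, 0.00]],
    "large": [[0.00, 0.00], [0.00, 0.25]],
}
locations = candidate_locations(maps)
```

With these toy scores, only one "small" and one "medium" cell pass the threshold, so only those two locations would be examined at higher resolution.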

1.2. Detecting Objects

  • Based on the locations obtained from the attention maps, we can determine different zoom-in scales for different object sizes.
  • At each possible location (x, y), the downsized image is enlarged by a factor s_i, where i ∈ {s, m, l} depending on the coarse object scale (small, medium or large).
  • Bounding boxes are also output at the end of the Hourglass Network.
  • Soft-NMS [2] is applied to remove redundant boxes.
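For readers unfamiliar with Soft-NMS, here is a minimal pure-Python sketch of the Gaussian variant. It decays the scores of overlapping boxes instead of deleting them outright. The `sigma` and `score_threshold` values are illustrative assumptions, not settings taken from the paper, and the helper names are hypothetical.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in selection order."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the currently highest-scoring box
        best_index = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_box, best_score = remaining.pop(best_index)
        kept.append((best_box, best_score))
        # Decay the scores of the others according to their overlap with it
        rescored = []
        for box, score in remaining:
            overlap = iou(best_box, box)
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((box, score))
        remaining = rescored
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = soft_nms(boxes, scores)
```

In this toy example the second box overlaps the first heavily, so its score is reduced and it drops below the distant third box in the final ranking, but it is not discarded.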

Left: After cropping, the bounding boxes which touch the crop boundary are removed.

Right: For highly overlapping boxes, a strategy similar to NMS is applied.

  • The object locations are then ranked, prioritizing locations from bounding boxes over locations from the attention maps. The best object location is kept and the locations that are close to the best location are removed.
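The ranking step can be illustrated with a small greedy routine. This is only a sketch under stated assumptions: the Euclidean distance test, the `min_distance` parameter, and the `(x, y, score)` tuple layout are hypothetical choices made for illustration, since the text above does not say how closeness is measured.

```python
import math


def select_locations(box_locations, attention_locations, min_distance):
    """Greedily keep locations, preferring those derived from bounding boxes.

    Each location is an (x, y, score) tuple. Box-derived locations are
    ranked ahead of attention-map locations; within each group, a higher
    score ranks first. A location is dropped if it lies closer than
    min_distance (assumed Euclidean) to one that was already kept.
    """
    by_score = lambda loc: loc[2]
    ranked = (sorted(box_locations, key=by_score, reverse=True)
              + sorted(attention_locations, key=by_score, reverse=True))
    kept = []
    for x, y, score in ranked:
        far_enough = all(
            math.hypot(x - kx, y - ky) >= min_distance for kx, ky, _ in kept
        )
        if far_enough:
            kept.append((x, y, score))
    return kept


box_locs = [(100, 100, 0.6)]
attention_locs = [(105, 102, 0.95), (300, 200, 0.5)]
selected = select_locations(box_locs, attention_locs, min_distance=50)
```

Here the attention location at (105, 102) is removed even though its score is higher, because a box-derived location sits a few pixels away and takes priority.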

During training, the same training losses in CornerNet are applied to train the network to predict corner heatmaps, embeddings and offsets.

1.3. Backbone

A new Hourglass backbone network, Hourglass-54, is proposed that works better in CornerNet-Saccade. The new Hourglass network consists of 3 Hourglass modules and has a depth of 54 layers, while Hourglass-104 in CornerNet consists of 2 Hourglass modules and has a depth of 104.

  • Each Hourglass module in Hourglass-54 has fewer parameters and is shallower than the one in Hourglass-104.
  • A few further small modifications are also made to the network. Please feel free to read the paper directly.

2. CornerNet-Squeeze

In contrast to CornerNet-Saccade, CornerNet-Squeeze focuses on reducing the amount of processing per pixel.

2.1. Ideas From SqueezeNet and MobileNet

A modified fire module, which originated in SqueezeNet, is used in CornerNet-Squeeze instead of the residual block. Furthermore, inspired by the success of MobileNets, the 3×3 standard convolution in the second layer is replaced with a 3×3 depth-wise separable convolution, which further improves inference time.

  • Again, a few further small modifications are also made to the network. Please feel free to read the paper directly.
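To give a rough sense of why swapping a standard 3×3 convolution for a depth-wise separable one reduces cost, the weight counts can be compared directly. This is a back-of-the-envelope sketch: the channel width of 128 is an assumed value for illustration, biases are ignored, and it does not reproduce the exact layer configuration of CornerNet-Squeeze.

```python
def standard_conv_weights(in_channels, out_channels, kernel_size=3):
    """Weights in a standard k x k convolution (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_weights(in_channels, out_channels, kernel_size=3):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = kernel_size * kernel_size * in_channels  # one filter per channel
    pointwise = in_channels * out_channels               # 1 x 1 channel mixing
    return depthwise + pointwise


CHANNELS = 128  # assumed width, chosen only for illustration
standard = standard_conv_weights(CHANNELS, CHANNELS)
separable = depthwise_separable_weights(CHANNELS, CHANNELS)
reduction = standard / separable

print(f"standard 3x3:   {standard} weights")
print(f"separable 3x3:  {separable} weights")
print(f"reduction:      {reduction:.1f}x")
```

At this assumed width the separable version needs roughly 8 times fewer weights, and the number of multiply-adds per output pixel shrinks by the same ratio.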

3. Results

3.1. Training Efficiency

  • CornerNet-Saccade is trained on only four 1080Ti GPUs with a total of 44GB GPU memory while CornerNet requires ten Titan X (PASCAL) GPUs with a total of 120GB GPU memory.

The memory usage is reduced by more than 60% (from 120GB to 44GB, i.e. about 63%).

3.2. Performance Analysis of Hourglass-54 in CornerNet-Saccade.


CornerNet-Saccade with Hourglass-54 (42.6% AP) is more accurate than with Hourglass-104 (41.4%).

  • Hourglass-54 produces better bounding boxes (38.2% AP) when combined with saccade than when combined with CornerNet (37.2% AP).

3.3. Ablation Study on CornerNet-Squeeze


The proposed squeezed Hourglass network achieves better performance and efficiency than existing networks.

3.4. Comparison With YOLOv3 & CornerNet

CornerNet-Squeeze is faster and more accurate than YOLOv3.

CornerNet-Saccade is more accurate than CornerNet with multi-scale testing, and it is 6 times faster.

