Brief Review — Detecting Heads using Feature Refine Net and Cascaded Multi-scale Architecture

SCUT-HEAD Dataset is Proposed

Sik-Ho Tsang
5 min read · Oct 31, 2024

Detecting Heads using Feature Refine Net and Cascaded Multi-scale Architecture
FRN, SCUT-HEAD, by South China University of Technology
2018 ICPR, Over 80 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2024
[YOLOv9] [YOLOv10] [RT-DETR]
==== My Other Paper Readings Are Also Over Here ====

  • A Feature Refine Net (FRN) and a cascaded multi-scale architecture are proposed for head detection.
  • A new large dataset named SCUT-HEAD is collected and labeled, which includes 4405 images with 111251 heads annotated.

Outline

  1. Feature Refine Net (FRN)
  2. SCUT-HEAD Dataset
  3. Results

1. Feature Refine Net (FRN)

Feature Refine Net (FRN)
  • FRN is based on R-FCN with ResNet-50 as backbone.
  • Two modified R-FCNs, named the local detector and the global detector, are trained for the cascaded multi-scale architecture.

The cascaded multi-scale architecture consists of four stages: (1) a global detector that works on the entire image to detect large heads and obtain the rough locations of small heads; (2) cropping multiple clips that have a high probability of containing small heads; (3) a local detector that works on the clips and yields more accurate small-head detection; and (4) merging both detectors' outputs with non-maximum suppression (NMS).

1.1. FRN

  • Feature Refine Net (FRN) refines the multiple feature maps res3, res4 and res5 from ResNet.
  • Firstly, through channel weighting, each channel of feature maps is multiplied by the corresponding learnable weight.
  • Then, feature decomposition upsampling is used to increase the resolution of res4 and res5 twofold.

Feature maps are then concatenated and undergo Inception-style synthesis, yielding the refined features.

1.2. Channel Weighting

  • Direct usage of features may not be the best choice.
  • Thus, channel weighting is used to select and take advantage of the most useful features.
  • Each weighted channel is computed as F′_i = w_i · F_i, where w_i is a weight parameter optimized by backpropagation.
  • Channel weighting is applied to res3, res4 and res5.
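The channel-weighting step can be sketched in NumPy (a hypothetical illustration, not the paper's code): each of the C channels of a feature map is simply scaled by its own learnable weight w_i.

```python
import numpy as np

def channel_weighting(features, weights):
    """Scale each channel of a (C, H, W) feature map by its weight w_i.

    `weights` is a length-C vector; in training these would be optimized
    by backpropagation (sketch only, not the paper's implementation).
    """
    C, H, W = features.shape
    assert weights.shape == (C,)
    # broadcast w_i over the spatial dimensions of channel i
    return features * weights[:, None, None]

# toy example: 3 channels of 2x2 all-ones features
f = np.ones((3, 2, 2))
w = np.array([0.5, 1.0, 2.0])
out = channel_weighting(f, w)
```

Each channel is uniformly scaled, so the network can learn to suppress unhelpful channels (w_i near 0) and emphasize useful ones.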

1.3. Feature Decomposition Upsampling

  • Every pixel in a feature map is related to a local region of the feature maps at a lower level.
  • Therefore, each pixel is decomposed into an N×N region to upsample a feature map.
  • A mapping matrix M_N×N is used to represent the relationship between the input pixel p and the decomposed N×N region P_N×N.

Because each channel of the feature maps represents a specific feature of an object, different mapping matrices are used for different channels. The relationship between the upsampled feature maps F_C×WN×HN and the input feature maps f_C×W×H can be expressed as F_c(iN+m, jN+n) = M_c(m, n) · f_c(i, j).

Specifically, the weighted feature maps from res4, res5 are upsampled to match the scale of res3. Every pixel is decomposed to a 2×2 region.
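Feature decomposition upsampling can be sketched per channel with a Kronecker product (a hypothetical NumPy illustration of the relation above, not the paper's code): each pixel f_c(i, j) is expanded into the N×N block f_c(i, j) · M_c, here with N = 2 as in the paper.

```python
import numpy as np

def decompose_upsample(f, M):
    """Feature decomposition upsampling (sketch).

    f: input feature maps of shape (C, H, W).
    M: per-channel mapping matrices of shape (C, N, N).
    Each pixel f[c, i, j] is decomposed into the N x N block
    F[c, i*N:(i+1)*N, j*N:(j+1)*N] = f[c, i, j] * M[c].
    """
    C, H, W = f.shape
    # np.kron expands each pixel into its N x N region, channel by channel
    return np.stack([np.kron(f[c], M[c]) for c in range(C)])

f = np.arange(4.0).reshape(1, 2, 2)   # one 2x2 channel: [[0, 1], [2, 3]]
M = np.full((1, 2, 2), 0.25)          # uniform 2x2 mapping matrix
F = decompose_upsample(f, M)          # upsampled to shape (1, 4, 4)
```

With learnable M_c, this generalizes nearest-neighbor upsampling (which is the special case M_c = all-ones).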

1.4. Inception-Style Synthesis

Inception-Style Synthesis
  • After upsampling, the multiple feature maps are concatenated together. Yet, the concatenated feature maps have too many channels and their spatial scale is too large.

Inception-style synthesis is therefore used, which borrows the concept of the Inception module. After Inception-style synthesis, the number of channels decreases from 3584 to 1024 and the spatial scale of the feature maps is reduced by 50%.
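The channel and scale bookkeeping of this stage can be sketched as follows (a hypothetical stand-in: the real module mixes several convolution branches per the paper's figure, while this sketch keeps only a stride-2 1×1 convolution, which in NumPy is a channel-wise matrix multiply after subsampling).

```python
import numpy as np

def inception_synthesis(f, w1x1):
    """Toy stand-in for the Inception-style synthesis stage (sketch).

    Maps the 3584 concatenated channels down to 1024 with a 1x1
    convolution applied at stride 2, halving the spatial scale.
    """
    C_in, H, W = f.shape
    sub = f[:, ::2, ::2]                      # stride 2 halves H and W
    # a 1x1 convolution is a matrix multiply over the channel dimension
    return np.einsum('oc,chw->ohw', w1x1, sub)

f = np.ones((3584, 4, 4))                     # concatenated feature maps
w = np.ones((1024, 3584)) / 3584.0            # averaging weights for the demo
out = inception_synthesis(f, w)               # shape (1024, 2, 2)
```

The output shape (1024, H/2, W/2) matches the 3584 → 1024 channel reduction and 50% scale reduction stated above.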

1.5. Cascaded Multi-Scale Architecture

  • At training stage, global detector and local detector are trained separately.
  • The main difference in training strategies is the dataset.

The global detector is trained on the original dataset, while the local detector is trained on a dataset generated from the original one, aimed at small-head detection.

For each image in the original dataset, a w×w clip centered at each small-head annotation is cropped, preserving the small-head annotations it contains. All clips are resized to f times their original size, yielding the new dataset for the local detector.

  • At the testing stage, the global detector is applied to the original image and produces the coordinates of big heads and the rough locations of small heads. Then, multiple w×w clips are cropped from the input image. The clips are resized to f times their size and used as the input of the local detector, which produces better detections of small heads. Finally, the outputs of both detectors are merged and non-maximum suppression (NMS) is applied to the merged results.
  • Heads with average scale less than 20px are treated as small heads.
  • Considering the computational complexity, f is set to 3.
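The testing-stage merge can be sketched in plain Python (a hypothetical sketch; the box format, scores, and NMS threshold are assumptions, not the paper's code). Local boxes predicted on a clip resized f times larger are scaled down by f and shifted by the clip's top-left corner before a greedy NMS over the union of both detectors' outputs.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def merge_detections(global_dets, local_dets, clip_origin, f=3, iou_thresh=0.5):
    """Merge global detections with local detections mapped back from a clip.

    Detections are (x1, y1, x2, y2, score). Local boxes were predicted on a
    clip enlarged f times, so divide by f and offset by the clip origin.
    """
    ox, oy = clip_origin
    mapped = [(ox + x1 / f, oy + y1 / f, ox + x2 / f, oy + y2 / f, s)
              for (x1, y1, x2, y2, s) in local_dets]
    dets = sorted(global_dets + mapped, key=lambda d: -d[4])
    kept = []
    for d in dets:                 # greedy NMS: keep highest-scoring boxes
        if all(iou(d[:4], k[:4]) < iou_thresh for k in kept):
            kept.append(d)
    return kept

g = [(0, 0, 40, 40, 0.9)]          # big head from the global detector
l = [(30, 30, 60, 60, 0.8)]        # small head found in a 3x-enlarged clip
out = merge_detections(g, l, clip_origin=(100, 100), f=3)
```

Here the local box maps back to (110, 110, 120, 120) in original-image coordinates, and both detections survive NMS since they do not overlap.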

2. SCUT-HEAD Dataset

SCUT-HEAD Dataset
  • The proposed dataset consists of two parts.

PartA includes 2000 images sampled from monitor videos of classrooms in a university with 67321 heads annotated.

PartB includes 2405 images crawled from the Internet with 43930 heads annotated.

  • Both PartA and PartB are divided into training and testing parts. 1500 images of PartA are for training and 500 for testing. 1905 images of PartB are for training and 500 for testing.

3. Results

SOTA Comparisons

The proposed method shows a clear improvement over other methods, especially after applying the cascaded multi-scale architecture.

Ablation Studies

The final design of FRN reaches the best result.

Small Head Detection
  • Small-head detection is improved in two ways.

The first is FRN which combines multiple features at different levels together. The second is the cascaded multi-scale architecture.

Brainwash dataset
  • Brainwash dataset contains 91146 heads annotated in 11917 images.

The proposed method also achieves state-of-the-art performance on this dataset compared with several baselines.
