Brief Review — TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

TPH-YOLOv5, Detects Small & Dense Objects in Drone Images

Sik-Ho Tsang
5 min read · Jun 18, 2023
3 Main Problems of Object Detection in Drone Images

TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios,
TPH-YOLOv5, by Beihang University,
VisDrone 2021 ICCV Workshop, Over 400 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2020 [EfficientDet] [CSPNet] [YOLOv4] [SpineNet] [DETR] [Mish] [PP-YOLO] [Open Images] [YOLOv5] 2021 [Scaled-YOLOv4] [PVT, PVTv1] [Deformable DETR] [HRNetV2, HRNetV2p] [MDETR] [TPH-YOLOv5] 2022 [PVTv2] [Pix2Seq] [MViTv2] [SF-YOLOv5] [GLIP] 2023 [YOLOv7]

  • Object detection in drone-captured scenarios is a recently popular task. Because drones fly at different altitudes, object scale varies drastically. Moreover, high-speed, low-altitude flight introduces motion blur on densely packed objects.
  • Based on YOLOv5, one more prediction head is added to detect different-scale objects.
  • Then, the original prediction heads are replaced with Transformer Prediction Heads (TPH) to explore the prediction potential of the self-attention mechanism.
  • The Convolutional Block Attention Module (CBAM) is also integrated to find attention regions in scenarios with dense objects.
  • For further improvement, a bag of useful strategies is applied, such as data augmentation, multi-scale testing, multi-model ensembling, and an extra classifier.
  • TPH-YOLOv5++ was proposed later.


  1. Brief Review of YOLOv5
  2. TPH-YOLOv5
  3. Results

1. Brief Review of YOLOv5

  • YOLOv5 has four different models including YOLOv5s, YOLOv5m, YOLOv5l and YOLOv5x.
  • Generally, YOLOv5 uses CSPDarknet53 with an SPP layer as the backbone, PANet as the neck, and the YOLO detection head.
  • When the models are trained on the VisDrone2021 dataset [64] with data augmentation (Mosaic from YOLOv4, and MixUp), YOLOv5x performs much better than YOLOv5s, YOLOv5m and YOLOv5l, with an AP gap of more than 1.5%.
  • In addition, according to the features of drone-captured images, the parameters of commonly used photometric distortions and geometric distortions are adjusted.

YOLOv5x is used as the basis for development of TPH-YOLOv5.


2. TPH-YOLOv5

TPH-YOLOv5 Pipeline
TPH-YOLOv5 Architecture

2.1. Prediction Head for Tiny Objects

  • One more prediction head is added for tiny object detection.
  • Combined with the other three prediction heads, the four-head structure eases the negative influence of drastic object scale variance.

As shown in Fig. 3, the added prediction head (head №1) is generated from a low-level, high-resolution feature map, which is more sensitive to tiny objects.
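To make the benefit of a high-resolution head concrete, the minimal sketch below computes the prediction grid of each head and how many grid cells a small object spans. The strides are an assumption, not stated in this review: standard YOLOv5 heads are taken to operate at strides 8, 16 and 32, and the added head at stride 4. The 1536 input size is the one mentioned in Section 3.1.

```python
# Assumed strides: 4 for the added tiny-object head, 8/16/32 for the
# three original YOLOv5 heads. These values are illustrative assumptions.
HEAD_STRIDES = {"head 1 (added)": 4, "head 2": 8, "head 3": 16, "head 4": 32}


def head_grids(input_size, strides=HEAD_STRIDES):
    """Return the side length of each head's prediction grid."""
    return {name: input_size // stride for name, stride in strides.items()}


def cells_spanned(object_px, stride):
    """Return how many grid cells an object of `object_px` pixels covers."""
    return object_px / stride


for name, grid in head_grids(1536).items():
    stride = HEAD_STRIDES[name]
    print(f"{name}: {grid}x{grid} grid, "
          f"an 8 px object spans {cells_spanned(8, stride):.2f} cells")
```

Under these assumptions, an 8-pixel object covers two cells on the added head but only a quarter of a cell on the coarsest head, which is one way to read "more sensitive to tiny objects".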

2.2. Transformer Encoder

Left: Transformer Module; Right: CBAM Module
  • Some convolutional blocks and CSP bottleneck blocks in the original version of YOLOv5 (Fig. 3) are replaced with Transformer encoder blocks (Fig. 4).
  • Compared to the original bottleneck block in CSPDarknet53, the Transformer encoder block is believed to capture global information and abundant contextual information.

On the VisDrone 2021 dataset, Transformer encoder blocks perform better on high-density occluded objects.
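The "global information" claim comes from self-attention: every location of the feature map attends to every other location. Below is a minimal pure-Python sketch of single-head scaled dot-product self-attention. It is a simplification under stated assumptions: the query, key and value projections are identity maps, and the multi-head split, MLP sub-layer, normalization and residual connections of a full Transformer encoder block are omitted. The function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of N feature vectors (one per spatial location), length d.
    Q, K and V are the tokens themselves (identity projections, an
    assumption made to keep the sketch short).
    Returns (outputs, weights); weights[i][j] is how much location i
    attends to location j.
    """
    d = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# A 2x2 feature map with 2 channels, flattened into 4 tokens.
feature_map = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, weights = self_attention(feature_map)
```

Each output vector is a weighted average over all locations, so even a single layer mixes information across the whole feature map, unlike a convolution with a small kernel.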

2.3. Convolutional Block Attention Module (CBAM)

  • As in Fig. 5, given a feature map, CBAM sequentially infers the attention map along two separate dimensions of channel and spatial, and then multiplies the attention map with the input feature map to perform adaptive feature refinement.

CBAM extracts the attention region, which helps TPH-YOLOv5 resist confusing information and focus on useful target objects.
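The sequential refinement can be written compactly. The notation below follows the original CBAM paper rather than this review, so treat it as background: F is the input feature map of shape C×H×W, M_c is the channel attention map (C×1×1), M_s is the spatial attention map (1×H×W), ⊗ is element-wise multiplication with broadcasting, σ is the sigmoid, and f^{7×7} is a convolution with a 7×7 kernel.

```latex
\begin{aligned}
F'     &= M_c(F) \otimes F, \\
F''    &= M_s(F') \otimes F', \\
M_c(F) &= \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big), \\
M_s(F) &= \sigma\big(f^{7\times 7}([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)])\big).
\end{aligned}
```

In M_c the pooling runs over the spatial dimensions and the MLP is shared between the two pooled vectors; in M_s the pooling runs over the channel dimension and the two resulting maps are concatenated before the convolution.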

2.4. MS-Testing and Model Ensemble

  • 5 models are trained.
  • During the inference phase, the MS-Testing strategy is first applied to a single model in three steps: 1) scaling the test image to 1.3 times; 2) reducing that image to 1, 0.83, and 0.67 times, respectively; 3) flipping the images horizontally.

Finally, the six differently scaled images are fed to TPH-YOLOv5, and NMS is used to fuse the test predictions.
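The six inputs are three scales times two flip states. The sketch below enumerates them and maps a predicted box back to original image coordinates so that the predictions can be fused. It assumes the factors 1, 0.83 and 0.67 are applied relative to the 1.3× image, which is one reading of the three steps; the function names are hypothetical.

```python
from itertools import product

BASE_SCALE = 1.3                     # step 1: enlarge the test image
RELATIVE_SCALES = (1.0, 0.83, 0.67)  # step 2: assumed relative to the 1.3x image


def ms_test_variants(width, height):
    """List the six (scale, flip) variants of one test image."""
    variants = []
    for rel, flip in product(RELATIVE_SCALES, (False, True)):
        scale = BASE_SCALE * rel
        variants.append({
            "scale": scale,
            "flip": flip,  # step 3: horizontal flip
            "size": (round(width * scale), round(height * scale)),
        })
    return variants


def to_original(box, variant, width):
    """Map a box (x1, y1, x2, y2) predicted on a variant back to the
    coordinate frame of the original image of the given width."""
    x1, y1, x2, y2 = (v / variant["scale"] for v in box)
    if variant["flip"]:
        # Undo the horizontal flip; x1 and x2 swap roles.
        x1, x2 = width - x2, width - x1
    return (x1, y1, x2, y2)


variants = ms_test_variants(1000, 600)
```

Once all boxes are in the original frame, a standard NMS pass over the pooled predictions removes the duplicates.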

The same MS-Testing operation is performed on each of the 5 models, and the five final predictions are fused by WBF (Weighted Boxes Fusion) to get the final result.
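Unlike NMS, which keeps one box and discards the rest, WBF averages overlapping boxes using their confidence scores as weights. The following is a simplified single-class sketch, not the reference implementation: each box joins the first cluster whose fused box it overlaps enough, and the IoU threshold of 0.55 is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse(cluster):
    """Confidence-weighted average box and mean score of a cluster."""
    total = sum(score for _, score in cluster)
    box = tuple(sum(b[i] * s for b, s in cluster) / total for i in range(4))
    return box, total / len(cluster)


def weighted_boxes_fusion(model_predictions, iou_thr=0.55):
    """model_predictions: one list of (box, score) pairs per model."""
    num_models = len(model_predictions)
    candidates = sorted(
        (p for preds in model_predictions for p in preds),
        key=lambda p: p[1], reverse=True)
    clusters = []
    for box, score in candidates:
        for cluster in clusters:
            if iou(fuse(cluster)[0], box) > iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    results = []
    for cluster in clusters:
        box, score = fuse(cluster)
        # Lower the confidence of boxes that only a few models agree on.
        score *= min(len(cluster), num_models) / num_models
        results.append((box, score))
    return results


model_a = [((0.0, 0.0, 10.0, 10.0), 0.9), ((50.0, 50.0, 60.0, 60.0), 0.8)]
model_b = [((1.0, 1.0, 11.0, 11.0), 0.6)]
fused = weighted_boxes_fusion([model_a, model_b])
```

In this toy example the two overlapping boxes merge into one box pulled toward the more confident prediction, while the box found by only one model keeps its coordinates but has its score halved.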

2.5. Self-Trained Classifier

The precision of some hard categories, such as tricycle and awning-tricycle, is very low.
  • Failure cases on the test-dev dataset are visualized and analyzed, leading to the conclusion that TPH-YOLOv5 has excellent localization ability but poor classification ability.

A training set is constructed by cropping the ground-truth bounding boxes and resizing each image patch to 64×64. Then, ResNet-18 is used as the classifier network for self-training.

The self-trained classifier improves the AP value by around 0.8%~1.0%.
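The dataset construction step can be sketched without any image library. The code below crops each ground-truth box and resizes it to 64×64 with nearest-neighbour sampling. The image is a plain nested list, the function names are hypothetical, and the interpolation method is an assumption; a real pipeline would use an image library and then train ResNet-18 on the resulting patches.

```python
def crop_and_resize(image, box, size=64):
    """Crop box (x1, y1, x2, y2) from `image` (a list of pixel rows) and
    resize the crop to size x size using nearest-neighbour sampling.
    The box must be at least one pixel wide and high."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    patch = []
    for r in range(size):
        src_y = y1 + r * h // size  # source row for output row r
        patch.append([image[src_y][x1 + c * w // size] for c in range(size)])
    return patch


def build_classifier_set(image, annotations, size=64):
    """Turn (box, label) ground-truth pairs into (patch, label) samples."""
    return [(crop_and_resize(image, box, size), label)
            for box, label in annotations]


# Toy 100x100 image whose pixel value encodes its position.
image = [[y * 100 + x for x in range(100)] for y in range(100)]
samples = build_classifier_set(
    image, [((10, 20, 18, 24), "tricycle"), ((40, 40, 70, 60), "awning-tricycle")])
```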

2.6. Pretraining & Fine-Tuning

  • In the training phase, part of the pre-trained YOLOv5x model is reused. Because TPH-YOLOv5 and YOLOv5 share most of the backbone (blocks 0~8) and part of the head (blocks 10~13 and blocks 15~18), many weights can be transferred from YOLOv5x to TPH-YOLOv5.

By sharing these weights, a lot of training time can be saved.
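A partial weight transfer of this kind usually boils down to filtering the pre-trained parameters by block index and shape. The sketch below illustrates the idea with plain dictionaries that map parameter names to shapes, standing in for real tensors. The `model.<block>.` naming scheme, the example shapes and the function names are assumptions for illustration, and the block ranges are read as inclusive.

```python
# Blocks shared between YOLOv5x and TPH-YOLOv5 according to the text.
TRANSFERABLE_BLOCKS = (set(range(0, 9)) | set(range(10, 14))
                       | set(range(15, 19)))


def block_index(param_name):
    """Extract the block index from a name such as 'model.3.conv.weight'
    (assumed naming scheme); return None if the name does not match."""
    parts = param_name.split(".")
    if len(parts) > 1 and parts[0] == "model" and parts[1].isdigit():
        return int(parts[1])
    return None


def select_transferable(pretrained, target):
    """Return the pre-trained entries that can be copied into the target:
    the block must be shared and the shape must match exactly."""
    return {
        name: shape
        for name, shape in pretrained.items()
        if block_index(name) in TRANSFERABLE_BLOCKS
        and target.get(name) == shape
    }


# Hypothetical parameter shapes, for illustration only.
yolov5x = {
    "model.0.conv.weight": (80, 3, 6, 6),
    "model.9.cv1.conv.weight": (640, 1280, 1, 1),
    "model.12.conv.weight": (640, 640, 1, 1),
    "model.17.conv.weight": (320, 320, 3, 3),
}
tph_yolov5 = {
    "model.0.conv.weight": (80, 3, 6, 6),
    "model.9.cv1.conv.weight": (640, 1280, 1, 1),
    "model.12.conv.weight": (640, 640, 1, 1),
    "model.17.conv.weight": (160, 320, 3, 3),
}
copied = select_transferable(yolov5x, tph_yolov5)
```

Here block 9 is skipped because it is not in the shared ranges, and block 17 is skipped because its shape differs; everything that remains would be initialized from YOLOv5x, while the rest trains from scratch.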

  • The model is then trained for 65 epochs.

3. Results

3.1. VisDrone 2021 Dataset

When the input image size is set to 1536, 622 of the 342391 labels (about 0.18%) are smaller than 3 pixels. As shown in Fig. 7, these small objects are hard to recognize.

  • When gray squares are used to cover these small objects and the model is trained on the processed dataset, the mAP improves by 0.2. This illustrates how difficult the dataset is.

3.2. SOTA Comparisons

SOTA Comparisons
  • Because the VisDrone2021 competition server limits the number of submissions, only the results of 4 models on testset-challenge and the final result of the 5-model ensemble are obtained.

A good score of 39.18 is obtained on testset-challenge, which is much higher than VisDrone2020’s best score of 37.37.

TPH-YOLOv5 ranks fifth on the VisDrone 2021 leaderboard, with a score only 0.25 lower than the first-place score of 39.43.

3.3. Ablation Studies

Ablation Studies

Each component contributes to the gain in mAP.

3.4. Visualizations

Visualizations of TPH-YOLOv5 on testset-challenge
  • The above figure shows results on large objects, tiny objects, dense objects, and images covering a large area.


