Brief Review — Real-time Multi-Class Helmet Violation Detection Using Few-Shot Data Sampling Technique and YOLOv8

YOLOv8 for Helmet Violation Detection

Sik-Ho Tsang
5 min read · Aug 12, 2024
Helmet Violation Detection

Real-time Multi-Class Helmet Violation Detection Using Few-Shot Data Sampling Technique and YOLOv8
YOLOv8 for Helmet Detection, by Northwestern University and University of Missouri-Columbia
2023 CVPR Workshop, Over 120 Citations (Sik-Ho Tsang @ Medium)

Object Detection
2014 … 2023
[YOLOv7] [YOLOv8] [Lite DETR] 2024 [YOLOv9] [YOLOv10] [RT-DETR]
==== My Other Paper Readings Are Also Over Here ====

  • YOLOv8 is used for detecting helmet violations in real time from video frames. A few-shot data sampling technique is developed so that the model can be trained with fewer annotations.
  • (While this paper does not modify YOLOv8 for helmet detection, it is still worth learning how it samples the video frames for training.)

Outline

  1. Proposed Few-Shot Data Sampling
  2. Proposed Data Augmentation
  3. Results

1. Proposed Few-Shot Data Sampling

Proposed Few-Shot Data Sampling & Data Augmentation

1.1. Few-Shot Data Sampling Framework

Examples of missed annotations
  • The initial ground truth annotations were provided as part of the challenge; however, some annotations were missing, which has a significant impact when training a model.
  • To address this without manually reviewing all 20,000 frames and correcting annotations, a few-shot data sampling framework was developed.
  • This framework was designed to help select the most representative frames of the entire dataset and minimize the need for re-annotation of all 20,000 frames.

1.2. Background Determination

Day, Night, and Fog
  • First, the background of each video is estimated: frames are randomly selected within a 10-second period, and the median of 60 percent of the sampled frames is computed. Random sampling combined with taking the median of a subset of frames negates the impact of short-term changes such as zooms and pixelation (a sketch of this step follows this list).
  • Second, the proposed algorithm categorizes the videos by time of day and weather condition, such as day, night, and fog. This ensures a balanced representation of all video types.
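To make the background step concrete, here is a minimal sketch of median-based background estimation with OpenCV and NumPy; the function name and the parameters `window_sec` and `keep_ratio`, as well as anchoring the window at the start of the video, are illustrative assumptions rather than the authors' code.

```python
import random
import cv2
import numpy as np

def estimate_background(video_path, window_sec=10, keep_ratio=0.6):
    """Estimate a video's background as the pixel-wise median of a
    random subset of frames within a 10-second window."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if fps is unreported
    window = int(window_sec * fps)

    # Randomly sample 60% of the frame indices in the window
    # (window position is an assumption; the paper only says "a 10-second period").
    indices = random.sample(range(window), k=max(1, int(window * keep_ratio)))

    frames = []
    for idx in sorted(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()

    # The pixel-wise median across sampled frames suppresses moving objects
    # and short-term artifacts such as zooms and pixelation.
    return np.median(np.stack(frames), axis=0).astype(np.uint8)
```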

1.3. Day/Night & Weather Categorization

Day and weather condition
  • The proposed algorithm takes the estimated video background and calculates the frequency of each pixel.
  • If the maximum frequency corresponds to a pixel value less than 150, the algorithm classifies the image as night. Otherwise, the algorithm classifies the image as day or foggy.
  • To distinguish between daytime and foggy videos, the skewness of the image frequencies is computed. The algorithm classifies the video as foggy if the absolute skewness is close to zero (a sketch of this rule follows this list). The frequency distributions of the day, night, and fog images are shown above.
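A minimal sketch of this rule, assuming the estimated background has been converted to grayscale; the pixel-value threshold of 150 comes from the text, while the skewness cutoff `fog_skew_eps` is an illustrative assumption.

```python
import numpy as np
from scipy.stats import skew

def categorize(background_gray, night_thresh=150, fog_skew_eps=0.3):
    """Classify an estimated background as 'night', 'fog', or 'day'."""
    hist, _ = np.histogram(background_gray, bins=256, range=(0, 256))
    peak_value = int(np.argmax(hist))  # pixel value with the maximum frequency
    if peak_value < night_thresh:
        return "night"
    # Near-zero skewness of the intensity distribution indicates fog.
    s = skew(background_gray.ravel())
    return "fog" if abs(s) < fog_skew_eps else "day"
```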

1.4. Frame Sampling

  • Lastly, a frame sampling algorithm is developed that selects more frames from the video types identified as underrepresented.
  • With the total number of videos in each category and the fps of each, a sample rate is calculated for each video category (a sketch follows this list).
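The paper does not give the exact formula, so the following is a hedged sketch of one way to derive per-category sample rates so each category contributes a similar number of frames; the counts and the `budget_per_category` value are hypothetical.

```python
def sample_rates(video_counts, frames_per_video, budget_per_category=1500):
    """Hypothetical sketch: underrepresented categories (e.g., fog)
    receive a higher per-video sampling rate."""
    rates = {}
    for category, n_videos in video_counts.items():
        total_frames = n_videos * frames_per_video
        # Fraction of frames to keep so each category contributes
        # roughly the same number of training frames.
        rates[category] = min(1.0, budget_per_category / total_frames)
    return rates

# Illustrative counts only; not the challenge's actual distribution.
print(sample_rates({"day": 70, "night": 20, "fog": 10}, frames_per_video=200))
# {'day': 0.107, 'night': 0.375, 'fog': 0.75} (approximately)
```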

2. Proposed Data Augmentation

2.1. Data Augmentation for Training

  1. Image flipping: flip the image horizontally so the model learns to detect helmets on both sides of the motorcycle.
  2. Rotation: change the viewpoint angle of the helmet.
  3. Scaling: detect helmets of different sizes.
  4. Cropping: detect helmets even when they are partially obscured.
  5. Blurring: detect helmets under poor lighting conditions.
  6. Color manipulation: detect helmets in different lighting conditions (a sketch combining these transforms follows this list).
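The paper does not publish its augmentation settings; the sketch below simply assembles the six listed transforms with the Albumentations library, and every probability and magnitude is an illustrative assumption.

```python
import albumentations as A

# Bounding boxes in YOLO format are transformed together with the image.
train_transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                     # 1. image flipping
        A.Rotate(limit=10, p=0.5),                   # 2. rotation
        A.RandomScale(scale_limit=0.2, p=0.5),       # 3. scaling
        A.RandomSizedBBoxSafeCrop(640, 640, p=0.3),  # 4. cropping (keeps boxes)
        A.Blur(blur_limit=3, p=0.2),                 # 5. blurring
        A.ColorJitter(p=0.3),                        # 6. color manipulation
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```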

2.2. Test Time Augmentation (TTA)

  • TTA involves applying data augmentation techniques, such as rotation, flipping, or cropping, to the test data and then making predictions on each augmented version of the test data.
  • The final prediction is then obtained by averaging the predictions over the augmented versions of the test data (a minimal invocation sketch follows).
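Ultralytics' YOLOv8 ships a built-in TTA mode that runs inference on scaled and flipped copies of the input and merges the resulting predictions, which matches the idea described above; a minimal invocation, with the weights file and image path chosen for illustration:

```python
from ultralytics import YOLO

model = YOLO("yolov8x.pt")  # weights file is illustrative
# augment=True enables test-time augmentation during prediction.
results = model.predict("frame.jpg", augment=True)
```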

3. Results

3.1. Train, Val, Test Sets

Frames with high similarity.
  • An NVIDIA GeForce RTX 3090 GPU is used to train on 4,500 examples. The dataset was split in a 0.7:0.3 ratio between training and validation.
  • The Semantic Clustering by Adopting Nearest Neighbors (SCAN) algorithm [10] is employed to eliminate frames with high similarity.
  • Image sequences did not appear in both the training and validation datasets.
  • Additionally, the estimated backgrounds of each video, along with their augmentations, are added to the training data as negative samples (see the sketch after this list).
  • The 2023 NVIDIA AI City Challenge Track 5 includes 100 unannotated videos for testing. Each testing video is 20 seconds long with a resolution of 1920 × 1080 pixels.
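One way to realize the negative-sample step in a YOLO-format dataset: an image paired with an empty label file contributes only background during training. The directory layout and helper below are hypothetical.

```python
from pathlib import Path
import cv2

def add_negative_sample(background, name, root="dataset/train"):
    """Write an estimated background as a negative sample: the image plus
    an empty YOLO label file (i.e., no objects present)."""
    img_dir = Path(root) / "images"
    lbl_dir = Path(root) / "labels"
    img_dir.mkdir(parents=True, exist_ok=True)
    lbl_dir.mkdir(parents=True, exist_ok=True)
    cv2.imwrite(str(img_dir / f"{name}.jpg"), background)
    (lbl_dir / f"{name}.txt").write_text("")  # empty label file = background only
```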

3.2. Performance

Validation Set

Test time augmentation (TTA) further enhanced the performance of the models. Notably, YOLOv8+TTA demonstrated the highest mAP@.5-.95 score of 0.647.

Test Set

YOLOv8+TTA achieved an overall mAP score of 0.5861 with a competitive inference speed of 95 fps.

3.3. Leaderboard

Leaderboard

The results on the experimental test data ranked 7th on the public leaderboard.
