# Review — FCOS: Fully Convolutional One-Stage Object Detection

## FCOS: Training Without the Use of Anchor Boxes

--

In this paper, **FCOS: Fully Convolutional One-Stage Object Detection**, by The University of Adelaide, is reviewed. In this paper:

FCOS is designed in which it is

anchor box free, as well asproposal free.

- FCOS completely
**avoids the complicated computation related to anchor boxes**such as calculating overlapping during training. - It also
**avoids all hyper-parameters related to anchor boxes**. **FCOS encourages to rethink the need of anchor boxes.**

This is a paper in **2019 ICCV **with over **1000 citations**. (Sik-Ho Tsang @ Medium)

# Outline

**FCOS: Network Architecture****Multi-level Prediction with****FPN****for FCOS****Ablation Study****SOTA Comparison**

**1. FCOS: Network Architecture**

## 1.1. Notations, Inputs, Outputs

- Let
(Size of*Fi**H*×*W*×*C*) be the**feature maps at layer**and*i*of a backbone CNN*s*be the total stride until the layer. **The ground-truth bounding boxes**for an input image are defined as {*Bi*}, wherewhere {*Bi*=(*x*(i)0,*y*(*i*)0,*x*(*i*)1,*y*(*i*)1,*c*(*i*))*x*(i)0,*y*(*i*)0}, {*x*(*i*)1,*y*(*i*)1} are the the coordinates of the**left-top and right-bottom corners**of the bounding box respectively.*c*(*i*) is the**object class**.*C*is the total number of classes. e.g.:*C*=80 in MS COCO.- Specifically, location
**(**is considered as a*x*,*y*)**positive**sample**if it falls into any ground-truth box**and the**class label**of the location is the class label of the ground-truth box.*c****Otherwise**it is a**negative sample**and*c**=0 (background class). **A 4D real vector**, as shown in the first figure at the top of the story. Here*t**=(*l**,*t**,*r**,*b**) being the regression targets for the location*l**,*t**,*r**,*b**are the distances from the location to the four sides of the bounding box.- Formally,
**if location (**can be formulated as (Eq. (1)):*x*,*y*) is associated to a bounding box*Bi*, the training regression targets for the location

- The final layer of our networks
**predicts an 80D vector**and*p*of classification labels**a 4D vector**.*t*= (*l*,*t*,*r*,*b*) bounding box coordinates are trained.*C*binary classifiers**4 convolutional layers are added after the feature maps of the backbone networks respectively for classification and regression branches.**- Since the
**regression targets**are always positive,**exp(**to map any real number to (0,∞) on the top of the regression branch.*x*) is employed

It is worth noting that

FCOS has 9× fewer network output variables than the popular anchor-based detectors [15, 24] (RetinaNet[15])with 9 anchor boxes per location.

## 1.2. Loss Function

- The loss function is (Eq. (2)):

- where
*Lcls*is focal loss in RetinaNet.*Lreg*is the IoU loss.*Npos*denotes the number of positive samples and*λ*being 1 in this paper is the balance weight for*Lreg*. **The summation is calculated over all locations on the feature maps**1{*Fi*.*c***i*>0} is the indicator function, being 1 if*c***i*> 0 and 0 otherwise.

## 1.3. Inference

- Given an input images,
**the image goes through the network**and**obtain the classification scores**on the feature maps*px*,*y*and the regression prediction*tx*,*y*for each location*Fi*. - Following RetinaNet, we choose the
**location with**as*px*,*y*> 0.05**positive samples**and**invert Eq. (1) to obtain the predicted bounding boxes**.

# 2. Multi-level Prediction with FPN for FCOS

## 2.1. Multi-level Prediction with FPN

**If a location falls into multiple bounding boxes**, it is considered as an**ambiguous sample**. We simply choose the bounding box with minimal area as its regression target.

Multi-level Prediction with FPN can reduce the number of ambiguous samples significantly.

- Unlike anchor-based detectors, which assign anchor boxes with different sizes to different feature levels, the range of bounding box regression is directly limited for each level.
- The range
*mi*is the maximum distance that feature level*i*needs to regress. In this work,*m*2,*m*3,*m*4,*m*5,*m*6 and*m*7 are set as 0, 64, 128, 256, 512 and ∞, respectively. - More specifically, we
**firstly compute the regression targets**Next, if a location*l**,*t**,*r**,*b**for each location on all feature levels.**satisfies max(**, it is set as a*l**,*t**,*r**,*b**) >*mi*or max(*l**,*t**,*r**,*b**) <*mi*−1**negative sample**and is thus**not required to regress a bounding box**anymore. - If a location, even with multi-level prediction used, is still assigned to more than one ground-truth boxes,
**the ground-truth box with minimal area is simply chosen as its target.** **The heads are shared between different feature levels**, not only making the detector parameter-efficient but also improving the detection performance.- As different feature maps target for different object sizes, exp(
*si*×*x*) with a trainable scalar*si*to automatically adjust the base of the exponential function for feature level Pi, instead of exp(*x*), improve a bit of accuracy performance.

## 2.2. Center-ness for FCOS

- After using multi-level prediction in FCOS, there is still a performance gap between FCOS and anchor-based detectors. It is due to
**a lot of low-quality predicted bounding boxes produced by locations far away from the center of an object.** **Center-ness**is used to solve this problem (Eq. (3)):

- The
**center-ness**depicts**the normalized distance from the location to the center of the object that the location is responsible for**. **The center-ness ranges from 0 to 1 and is thus trained with binary cross entropy (BCE) loss.**The loss is**added to the loss function Eq. (2).**- When testing,
**the final score**(used for ranking the detected bounding boxes)**is computed by multiplying the predicted center-ness with the corresponding classification score.**Thus the center-ness can down-weight the scores of bounding boxes far from the center of an object.

**3. Ablation Study**

## 3.1. **Multi-level Prediction with ****FPN**

- COCO trainval35k split (115K images) is used for training and minival split (5K images) is used for validation.

With Multi-level Prediction withFPN, the number of ambiguous sample is largely reduced.

## 3.2. Center-ness

- The center-ness+ can also be computed with the predicted regression vector without introducing the extra center-ness branch. But it cannot improve AP.

The center-ness branch can boost AP from 33.5 to 37.1%.

## 3.3. Other Improvements

**“ctr. on reg.”**: moving the center-ness branch to the regression branch.**“ctr. sampling”**: only sampling the central portion of ground-truth boxes as positive samples.**“GIoU”**: penalizing the union area over the circumscribed rectangle’s area in IoU Loss.**“Normalization”**: normalizing the regression targets in Eq. (1) with the strides of FPN levels

The performance of our anchor-free detector can be improved by a large margin by adding other improvements.Based on this, authors encourage the community to

rethink the necessity of anchor boxes in object detection.

- These improvements are not used in SOTA comparison below. It is found out after the initial submission of the paper.

# 4. SOTA Comparison

## 4.1. MS COCO Test-Dev Split

- FCOS with other state-of-the-art object detectors on test-dev split of MS-COCO benchmark.
- With ResNet-101-FPN,
**FCOS outperforms the****RetinaNet****by 2.4% AP.**

This is

the first timethat ananchor-free detector, without any bells and whistlesoutperforms anchor-based detectors by a large margin.

- FCOS also
**outperforms**other classical two-stage anchor-based detectors such as**Faster R-CNN****by a large margin.** - With ResNeXt-64x4d-101-FPN as the backbone, FCOS achieves 43.2% in AP. It
**outperforms**the recent state-of-the-art**anchor-free detector****CornerNet****by a large margin**while being much simpler.

FCOS is more likely to serve as a

strong and simple alternative to current mainstream anchor-based detectors.

- Moreover,
**FCOS with the improvements in Section 3****achieves 44.7%****AP**with single-model and single scale testing, which surpasses previous detectors by a large margin.

## 4.2. Extensions on Region Proposal Networks

FCOS can be also able to

replace the anchor boxes in Region Proposal Networks (RPNs)with FPN.

- Even without the proposed center-ness branch, FCOS already improves both AR100 and AR1k significantly.
- With the proposed center-ness branch, FCOS further boosts AR100 and AR1k respectively to 52.8% and 60.3%, which are 18% relative improvement for AR100 and 3.4% absolute improvement for AR1k over the RPNs with FPN.

## Reference

[2019 ICCV] [FCOS]

FCOS: Fully Convolutional One-Stage Object Detection

## Object Detection

**2014–2017**: …**2018**: [YOLOv3] [Cascade R-CNN] [MegDet] [StairNet] [RefineDet] [CornerNet] [Pelee & PeleeNet]**2019**: [DCNv2] [Rethinking ImageNet Pre-training] [GRF-DSOD & GRF-SSD] [CenterNet] [Grid R-CNN] [NAS-FPN] [ASFF] [Bag of Freebies] [VoVNet/OSANet] [FCOS]**2020**: [EfficientDet] [CSPNet] [YOLOv4] [SpineNet]**2021**: [Scaled-YOLOv4]