Review: CRF-RNN — Conditional Random Fields as Recurrent Neural Networks (Semantic Segmentation)

An Approach Integrating CRFs into an End-to-End Deep Learning Solution

Sik-Ho Tsang
Towards Data Science

--

In this story, CRF-RNN, Conditional Random Fields as Recurrent Neural Networks, by University of Oxford, Stanford University, and Baidu, is reviewed. CRF is one of the most successful graphical models in computer vision. It is found that the Fully Convolutional Network (FCN) outputs very coarse segmentation results. Thus, many approaches, such as DeepLabv1 & DeepLabv2, use CRF as a post-processing step to refine the output semantic segmentation map obtained from the network and produce more fine-grained results. However, the parameters of the CRF are not trained together with the FCN. In other words, the FCN is unaware of the CRF during training, which might limit the network's capability.

In CRF-RNN, the authors propose to formulate the CRF as an RNN so that it can be integrated with the FCN and the whole network can be trained in an end-to-end manner to obtain better results. It is a 2015 ICCV paper with over 1300 citations. (Sik-Ho Tsang @ Medium)

CRF-RNN Live Demos

The authors have also created a live demo:

http://www.robots.ox.ac.uk/~szheng/crfasrnndemo
We can try an image from the internet or upload our own.

Here are my trials; they are quite fun:

Marvel

Cityscape Dataset

Boats & Persons

It is quite accurate, though of course I also tried some images on which CRF-RNN does not work.

Outline

  1. Conditional Random Field (CRF)
  2. CRF as CNN for One Iteration
  3. CRF as RNN for Multiple Iterations
  4. Results

1. Conditional Random Field (CRF)

  • The purpose of the CRF is to refine the coarse output based on the label at each location itself, as well as the labels and positions of neighboring locations.
  • A fully connected pairwise CRF is considered. Fully connected means all locations are connected to each other, as shown in the middle of the figure above. Pairwise means the potentials are defined over pairs of locations.
  • When we talk about a CRF, we are talking about how to minimize an energy function. Here, we need to minimize the energy of a label assignment. I just treat energy as a kind of cost function. By assigning the most probable label to each location, we get lower energy, i.e. lower cost, and thus higher accuracy.
  • The CRF is characterized by a Gibbs distribution of the form P(X = x | I) = exp(−E(x | I)) / Z(I),
  • where I is the input and Xi is the random variable at location i, representing the assigned label. I is dropped from the notation for simplicity. E(x) is the energy function and Z(I) is the partition function, which is just the sum of exp(−E(x)) over all label assignments.
  • This CRF distribution P(X) is approximated by Q(X), a product of independent marginals: Q(X) = ∏i Qi(Xi).
  • 1st Term, Unary Energy Ψu(xi): measures the cost if the label assignment disagrees with the initial classifier. Unary means it takes only the label of a single position into consideration at a time.
  • 2nd Term, Pairwise Energy Ψp(xi, xj): measures the cost if two similar pixels (e.g. neighboring pixels, or pixels with similar color) take different labels; its full form is written out after this list.
  • kG is a Gaussian kernel applied on feature vectors. The feature vectors can contain spatial locations and RGB values, giving e.g. a spatial Gaussian filter and a bilateral filter.
  • And μ is the label compatibility function, which assigns a penalty when the labels are different.
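For reference, the energy function and the pairwise term described above can be written out as follows (reconstructed here from the paper's notation, where f_i denotes the feature vector of pixel i):

```latex
E(x) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),
\qquad
\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} \, k_G^{(m)}(\mathbf{f}_i, \mathbf{f}_j)
```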
End-to-end Trainable CRF-RNN
  • CRF is a very powerful statistical modeling method applied in various pattern recognition tasks such as text sequence classification. I can only present the CRF as used in this paper, and in a very brief way.
  • In brief, the input image goes through the FCN and then the CRF. The CRF considers both the unary energy term and the pairwise energy term, then outputs a more precise segmentation map.
  • This CRF is implemented as a stack of CNN operations, as described below.

2. CRF as CNN for One Iteration

Initialization

  • Ui(l) is the unary potential provided by FCN-8s, which is based on VGG-16.
  • Qi(l) is initialized by applying a softmax over the unary potentials.
  • After initialization, there will be iterations (the while loop) for a sequence of processes.
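As a minimal NumPy sketch of this step (the array layout and function name are my own, not the authors' implementation), the initialization is just a per-pixel softmax over the class scores:

```python
import numpy as np

def initialize_q(unary):
    """Softmax over the unary potentials U_i(l), giving the initial Q_i(l).

    unary: (L, H, W) array of per-class scores from FCN-8s.
    Returns an array of the same shape whose L values sum to 1 at each pixel.
    """
    shifted = unary - unary.max(axis=0, keepdims=True)  # for numerical stability
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum(axis=0, keepdims=True)
```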

Message Passing

  • M Gaussian filters are used.
  • Following [29], two Gaussian kernels are used, one spatial and one bilateral.
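A simplified sketch of this step, showing only the spatial kernel and ignoring the j ≠ i exclusion; the bilateral kernel, which also compares RGB values, is approximated in the paper with the permutohedral lattice and is omitted here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def message_passing(q, spatial_sigma=3.0):
    """Filter each class channel of Q with Gaussian kernels.

    q: (L, H, W) array of current marginals.
    Returns a list of M filtered copies of q; with a bilateral kernel
    added, the list would have M = 2 entries as in the paper.
    """
    spatial = np.stack([gaussian_filter(channel, sigma=spatial_sigma)
                        for channel in q])
    return [spatial]
```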

Weighting Filter Outputs

  • A weighted sum of the M filter outputs from the previous step is taken for each class label l.
  • When each label is considered individually, it can be viewed as 1×1 convolution with M input channels and one output channel.
  • In contrast to [29], individual kernel weights are used for each class label.
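Continuing the sketch, the weighting step is a weighted sum over the M filter outputs with an independent weight per (kernel, class) pair, which is exactly a 1×1 convolution applied per class:

```python
import numpy as np

def weight_filter_outputs(filtered, w):
    """Weighted sum of the M filter outputs, with per-class kernel weights.

    filtered: (M, L, H, W) array of the M filtered copies of Q.
    w: (M, L) array, one learned weight per (kernel, class) pair.
    """
    return np.einsum('mlhw,ml->lhw', filtered, w)
```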

Compatibility Transform

  • A penalty is assigned when different labels are given to similar pixels.
  • e.g.: assigning labels “person” and “bicycle” to nearby pixels should have a lesser penalty than assigning labels “sky” and “bicycle”.
  • Thus, μ(l, l’) is learned from the data.
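In the same sketch, the compatibility transform mixes the per-class maps through the learned matrix μ; it behaves like a 1×1 convolution with L input and L output channels:

```python
import numpy as np

def compatibility_transform(q_checked, mu):
    """Apply the learned label compatibility function.

    q_checked: (L, H, W) array from the weighting step.
    mu: learned (L, L) matrix; mu[l, l2] is the penalty for assigning
    labels l and l2 to similar pixels.
    """
    return np.einsum('lk,khw->lhw', mu, q_checked)
```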

Adding Unary Potentials

  • The output from the Compatibility Transform step is subtracted element-wise from the unary inputs U.

Normalization

  • Another softmax operation.
Fully connected CRFs as a CNN for one mean-field iteration
  • Above is the overview of one mean-field iteration.
  • By repeating the above module, we can have multiple mean-field iterations.
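Chaining the helpers sketched in the previous steps, one full mean-field iteration might look like this (again a simplification with only the spatial kernel, so M = 1 and w has shape (1, L)):

```python
def mean_field_iteration(unary, q, w, mu, spatial_sigma=3.0):
    """One mean-field iteration, chaining the steps sketched above."""
    filtered = np.stack(message_passing(q, spatial_sigma))  # message passing
    q = weight_filter_outputs(filtered, w)                  # weighting filter outputs
    q = compatibility_transform(q, mu)                      # compatibility transform
    q = unary - q                                           # adding unary potentials
    return initialize_q(q)                                  # normalization (softmax)
```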

3. CRF as RNN for Multiple Iterations

CRF as RNN for Multiple Iterations
  • I is the image, U contains the unary potentials from the FCN, and T is the total number of iterations.
  • fθ(U, H1(t), I) is the mean-field iteration described in the previous section, where θ denotes the CRF parameters from that section, i.e. the kernel weights w(m) and the compatibility function μ(l, l′).
  • At t = 0, the first iteration, H1(t) = softmax(U), otherwise H1(t) is the output of the previous mean-field iteration, H2(t-1).
  • H2(t) is the output of the mean-field iteration fθ(U,H1(t),I).
  • The final output is Y = H2(T), i.e. the output once the last iteration has finished.
  • Recurrent Neural Network (RNN) setting is used, i.e. the parameters here are shared among all iterations.
  • During training, T=5 is used to avoid the vanishing/exploding gradient problem.
  • During testing, T=10.
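In this view, the whole CRF-RNN is just the iteration above run T times with shared parameters; a sketch reusing the helpers from the previous section:

```python
def crf_rnn(unary, w, mu, T=10):
    """Run T mean-field iterations with shared parameters (the RNN view).

    The same theta = (w, mu) is reused at every step, just as an RNN
    shares its weights across time steps.
    """
    h = initialize_q(unary)                        # H1(0) = softmax(U)
    for _ in range(T):                             # T = 5 in training, 10 at test time
        h = mean_field_iteration(unary, h, w, mu)  # H2(t) = f_theta(U, H1(t), I)
    return h                                       # final output Y = H2(T)
```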

4. Results

4.1. PASCAL VOC

Mean IU Accuracy on PASCAL VOC 2012 Validation Set
  • With/Without COCO: whether the model is also trained on COCO.
  • Plain FCN-8s: Lowest mean IU accuracy.
  • With CRF but disconnected: the CRF is not trained with the FCN in an end-to-end manner; a higher mean IU accuracy than plain FCN-8s is obtained.
  • End-to-end CRF-RNN: The highest mean IU accuracy is obtained which means end-to-end FCN+CRF is the best solution.
Mean IU Accuracy on PASCAL VOC 2010, 2011, 2012 Test Set
  • CRF-RNN w/o COCO: It outperforms FCN-8s and DeepLab-v1.
  • CRF-RNN with COCO: The results are even better.

4.2. PASCAL Context

Mean IU Accuracy on PASCAL Context Validation Set
  • CRF-RNN: Higher mean IU accuracy than FCN-8s.

4.3. Further Analyses

  • Additional experiments are performed on PASCAL VOC 2012 Validation Set.
  • Using different weights w for different classes increases mean IU by 1.8%.
  • Using T=10 during both training and testing induces a 0.7% drop, which suggests a vanishing gradient effect.
  • Using independent parameters for each iteration instead of shared parameters gives only 70.9% mean IU accuracy, which shows that the recurrent structure is important.

4.4. Qualitative Results

Some Good Results on PASCAL VOC 2012
Comparison with State-of-the-art Approaches

Though CRF-RNN was published in 2015, this paper introduced an important concept to me: converting/approximating a conventional, non-deep-learning approach into a deep-learning-based one and turning it into an end-to-end solution.
