Review — Mantini’s VISAPP’19: Generative Reference Model and Deep Learned Features (Camera Tampering Detection)

Using a Generative Adversarial Network (GAN) and a Siamese Network for Camera Tampering Detection

Feb 27, 2021

In this story, Camera Tampering Detection using Generative Reference Model and Deep Learned Features, by the University of Houston, is reviewed. In this paper:

Using a GAN, a generative model learns the distribution of surveillance camera images under normal operating conditions.
A Siamese network is trained to transform images into a feature space that maximizes the distance between generated images and tampered images. Based on this distance, the surveillance camera image is classified as either normal or tampered.

This is a paper in 2019 VISAPP. (Sik-Ho Tsang @ Medium)

Outline

Proposed Framework
Generator as a Reference Model
Siamese Network as a Feature Extractor
Training Strategy
Experimental Results

1. Proposed Framework

**Proposed framework for camera tampering detection**

The proposed framework consists of

A deconvolutional neural network (generator).
A pair of convolutional neural networks (CNNs) with shared weights (Siamese network).
A fully connected neural network.

The generator takes a vector of random numbers as input and generates an image that represents the surveillance camera's view under normal operating conditions.
The image from the camera at time t and the generated image are fed to the pair of CNNs that share weights. This stage acts as a feature extractor for the generated and test images.
The distance between the transformed features is input to a fully connected neural network. The output is a posterior value estimating the probability of class C given the distance between the two inputs.

2. Generator as a Reference Model

The generator G aims at creating images x* that are visually similar to the training examples.
The discriminator D aims at distinguishing the generated image x* from the original training image x.
D is assumed to assign a high score to an original image and a low score to a generated one.

2.1. The Generator

The input is a matrix of random numbers of size 16×16×256, which is passed through three 2D upsampling layers.
Each upsampling layer is followed by a 2D convolution layer.
Batch Norm and ReLU are used.
The output is a matrix of size 127×127×3.
The goal of the generator is to maximize D(G(y)) (equivalently, to minimize 1-D(G(y))). The generator is trained to optimize the following function:

where V is the loss function.
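
The layer sizes listed for the generator can be sanity-checked with a little shape arithmetic. The sketch below assumes that each upsampling layer doubles the height and width and that each convolution is a padded 3×3 that preserves the size; neither detail is stated in this review. Under those assumptions the three stages give 16→32→64→128, so reaching the stated 127×127 needs one more size-reducing step, for example an unpadded 2×2 convolution (again a hypothetical choice, not something confirmed by the paper).

```python
def upsample_size(size, factor=2):
    """Spatial size after 2D upsampling by an integer factor."""
    return size * factor


def conv_size(size, kernel, stride=1, padding=0):
    """Spatial size after a 2D convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 16
trace = [size]
for _ in range(3):
    size = upsample_size(size)                    # assumed factor of 2
    size = conv_size(size, kernel=3, padding=1)   # assumed size-preserving 3x3 conv
    trace.append(size)

print(trace)                       # [16, 32, 64, 128]
print(conv_size(size, kernel=2))   # 127 with a hypothetical unpadded 2x2 conv
```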

2.2. The Discriminator

**The Discriminator: Network Architecture**

The discriminator consists of four 2D convolution layers.
Batch Norm and Leaky ReLU are used.
25% Dropout is used for each activation.
The discriminator optimizes the following function:

The combined loss function for the generator and the discriminator is:
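
As a reference point, the standard GAN formulation is consistent with the descriptions of the generator and discriminator objectives given here. Assuming the paper follows it, the three objectives can be written as below; the symbols p_data (distribution of real images x) and p_y (distribution of the random input y) are introduced only for this sketch.

```latex
\begin{aligned}
\text{Generator:}\quad & \min_G V(G) = \mathbb{E}_{y \sim p_y}\big[\log\big(1 - D(G(y))\big)\big] \\
\text{Discriminator:}\quad & \max_D V(D) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{y \sim p_y}\big[\log\big(1 - D(G(y))\big)\big] \\
\text{Combined:}\quad & \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{y \sim p_y}\big[\log\big(1 - D(G(y))\big)\big]
\end{aligned}
```

Minimizing log(1 - D(G(y))) pushes D(G(y)) up, which is the "maximize D(G(y))" goal stated for the generator.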

**a) generated image (daytime), b) original image (daytime), c) generated image (nighttime), and d) original image (nighttime).**

The above figure shows images generated by the GAN ((a), (c)) and compares them against the original images ((b), (d)).
The two sets of images are representative of day and night.
Log scaling is applied on the night image.
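
Log scaling brightens dark pixels far more than bright ones, which is why it helps make a night image legible. A minimal sketch, assuming 8-bit intensities and the common mapping 255·log(1+v)/log(256); the exact scaling used for the figure is not given here, and `log_scale` is a hypothetical helper name.

```python
import math


def log_scale(pixels, max_value=255):
    """Map intensities in [0, max_value] through a log curve back to [0, max_value]."""
    norm = math.log(1 + max_value)
    return [round(max_value * math.log(1 + v) / norm) for v in pixels]


dark_row = [0, 5, 20, 60, 255]   # hypothetical night-time intensities
scaled_row = log_scale(dark_row)
print(scaled_row)
```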

3. Siamese Network as a Feature Extractor

**Siamese Network: Network Architecture**

The generator synthesizes reference images (x*).
The images from the camera (x) are compared against the synthesized images (x*) using a distance measure, which is then used to detect tampering.

In the Siamese network, x and x* are transformed into another feature space such that the distance between the transformed features of x and x* is maximal if x is tampered and minimal if x is normal.

The base CNN consists of two convolution layers, each followed by a ReLU activation and a 2D max pooling layer.
The network consists of two parallel convolutional networks sharing the same set of weights.
The distance vector is given as input to a fully connected layer followed by a Dropout and a ReLU activation layer.
It is finally passed through another fully connected layer, whose output is mapped to posterior values of the four classes using a softmax activation.
The four classes represent normal, covered, defocused, and moved status of the camera.
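
The classification head described above can be sketched in plain Python. This is an illustration only: the element-wise absolute difference used as the distance vector, the layer sizes, and the random weights are all assumptions, and dropout is left out because it is inactive at inference time.

```python
import math
import random

CLASSES = ["normal", "covered", "defocused", "moved"]


def dense(vec, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feat_x, feat_ref, w1, b1, w2, b2):
    """Distance vector -> dense + ReLU -> dense -> softmax posteriors."""
    distance = [abs(a - b) for a, b in zip(feat_x, feat_ref)]  # assumed distance measure
    hidden = relu(dense(distance, w1, b1))
    return softmax(dense(hidden, w2, b2))


rng = random.Random(0)
dim, hidden_dim = 8, 6  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


w1, b1 = rand_matrix(hidden_dim, dim), [0.0] * hidden_dim
w2, b2 = rand_matrix(len(CLASSES), hidden_dim), [0.0] * len(CLASSES)

feat_x = [rng.random() for _ in range(dim)]    # stand-in for CNN features of the camera image
feat_ref = [rng.random() for _ in range(dim)]  # stand-in for CNN features of the generated image

posteriors = classify(feat_x, feat_ref, w1, b1, w2, b2)
predicted = CLASSES[posteriors.index(max(posteriors))]
print(dict(zip(CLASSES, posteriors)), predicted)
```

In a trained system the two feature vectors would come from the shared-weight CNN branches rather than from a random number generator.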

4. Training Strategy

The generative adversarial network and the Siamese network are trained separately.
The data is split into two clusters, day and night, to train individual GANs and the Siamese network.
The GANs and the Siamese network are trained for 5 and 10 epochs, respectively.

4.1. Training the GAN

**(a) Day (b) Night (c,e) K=0, (d,f) K=1**

The training data is segmented into multiple clusters based on their color features using K-means. (K=2, presumably day and night images)
At test time, the appropriate GAN is selected based on the image’s distance to each cluster.
Normal images captured are used to train the GAN.
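
Selecting the GAN at test time amounts to a nearest-centroid lookup. A minimal sketch, assuming the color feature is the mean RGB value of the image and that the K-means centroids are already available; the feature choice, the centroid values, and the names `day_gan` and `night_gan` are all hypothetical.

```python
import math


def mean_color(image):
    """Average (R, G, B) over an image given as a list of rows of RGB tuples."""
    pixels = [px for row in image for px in row]
    n = len(pixels)
    return tuple(sum(px[c] for px in pixels) / n for c in range(3))


def nearest_cluster(feature, centroids):
    """Index of the centroid with the smallest Euclidean distance to the feature."""
    distances = [math.dist(feature, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids from K-means with K=2 (0: bright/day, 1: dark/night).
centroids = [(140.0, 150.0, 160.0), (20.0, 22.0, 30.0)]
gan_names = ["day_gan", "night_gan"]  # placeholders for the two trained generators

night_image = [[(10, 12, 18), (25, 20, 30)], [(15, 18, 22), (30, 28, 35)]]
k = nearest_cluster(mean_color(night_image), centroids)
print(gan_names[k])
```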

4.2. Training the Siamese Network

**a) Original, b) Covered, c) Defocused, and d) Moved images.**

Four classes of data are required for training the Siamese network.
Spatial translation, spatial smoothing, and pixel copy operations are applied to synthesize moved, defocused, and covered tampers, respectively.
Synthetic data with a uniform distribution over the four classes is used to train the Siamese network.
(The synthesis procedures should be the same or similar to UHCTD. Please feel free to visit UHCTD.)
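
The three synthesis operations can be illustrated on a small grayscale image stored as a list of rows. This is a sketch under assumptions, not the authors' procedure: the shift amount, the 3×3 box filter, and the choice of copying a single pixel value over the whole frame are hypothetical parameter choices.

```python
def translate(img, dx, fill=0):
    """Moved tamper: shift every row dx pixels to the right, padding with `fill`."""
    width = len(img[0])
    dx = max(0, min(dx, width))
    return [[fill] * dx + row[: width - dx] for row in img]


def box_blur(img):
    """Defocused tamper: 3x3 mean filter; border pixels average their valid neighbours."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def cover(img, src=(0, 0)):
    """Covered tamper: copy the value of one source pixel over the whole frame."""
    value = img[src[0]][src[1]]
    return [[value] * len(row) for row in img]


image = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
]

moved = translate(image, dx=1)
defocused = box_blur(image)
covered = cover(image)
print(moved[0], defocused[1][1], covered[0])
```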

5. Experimental Results

5.1. Performance Evaluation

**Performance comparison. TP: true positives, FP: false positives, TN: true negatives, FN: false negatives, TPR: true positive rate, FPR: false positive rate, Acc: accuracy**
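
The derived rates named in the caption follow from the four counts in the standard way. The counts below are made up purely to exercise the formulas and are not results from the paper.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, accuracy) from confusion counts."""
    tpr = tp / (tp + fn)               # fraction of tampered images detected
    fpr = fp / (fp + tn)               # fraction of normal images flagged as tampered
    acc = (tp + tn) / (tp + fp + tn + fn)
    return tpr, fpr, acc


tpr, fpr, acc = rates(tp=90, fp=5, tn=95, fn=10)  # hypothetical counts
print(f"TPR={tpr:.2f} FPR={fpr:.2f} Acc={acc:.3f}")
```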

Proposed2 adds a simple temporal analysis mechanism to suppress spurious false alarms. It detects a tamper at time t by taking the mode of the class predictions from the preceding instances, going back to time t-n, with n=3.
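
The temporal filter can be sketched as a sliding-window mode over the per-frame predictions. The window is taken here to be the n most recent predictions including the current one; whether the paper's window includes time t is not clear from this summary, so treat that boundary, and the tie-breaking rule, as assumptions.

```python
from collections import Counter, deque


def smooth_predictions(predictions, n=3):
    """Replace each prediction by the mode of the last n predictions seen so far."""
    window = deque(maxlen=n)
    smoothed = []
    for label in predictions:
        window.append(label)
        # most_common breaks ties in favour of the label seen first in the window
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed


raw = ["normal", "normal", "covered", "normal", "normal",
       "moved", "moved", "moved", "normal"]
smoothed = smooth_predictions(raw)
print(smoothed)
```

In this toy sequence the isolated "covered" prediction is suppressed, while the onset of "moved" is reported one frame late. That mirrors the trade-off reported in the results: fewer false alarms at the cost of some detection rate.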
The performance of detecting the three classes of tampering is quantified by dividing the dataset into three sets, each consisting of normal images and only one type of tamper (covered, defocused, or moved).
Results show that Proposed2 performs best, with an overall accuracy of 95%, followed by the proposed method with 91%, Mantini and Shah’s (2017) with 85%, and Lee’s (2014) with 25%.
Lee’s (2014) cannot cope with the complexity of the scene, with a 97% false positive rate.
Mantini and Shah’s (2017) method is capable of detecting defocused images better than other tampers.
Mantini and Shah’s (2017) shows a higher accuracy for covered and defocused images, and produces fewer false positives as well, compared to the proposed system.
The Proposed2 method outperforms Mantini and Shah’s (2017) with respect to accuracy and false positive rate.

The proposed method is highly capable of detecting tampering: it detected 99%, 98%, and 99% of covered, defocused, and moved tampers, respectively, while Proposed2 detected 93%, 91%, and 91%, and Mantini and Shah’s (2017) detected 67%, 86%, and 13%.
Temporal analysis lowers the false positives but affects the system’s capability to detect tampering.

5.2. Confusion Matrix

There is noticeable confusion between the three tampering classes and normal images; these cases correspond to the false alarms.
The false negatives are minimal. Five percent of moved images are classified as covered, and two percent of defocused images are classified as moved.

5.3. Disadvantages

The proposed system requires a large dataset for training.
The proposed method does not formally introduce an online mechanism to update the trained model. So, the performance of the system under extreme weather conditions is unpredictable.

In the future, the authors wish to explore the performance of the system using scene-independent features and to validate whether the learning can transfer to various scenes.

Reference

[2019 VISAPP] [Mantini’s VISAPP’19]
Camera Tampering Detection using Generative Reference Model and Deep Learned Features

Camera Tampering Detection

2016 [Dong’s ICDSP’16] 2019 [VFI-ConvLSTM] [UHCTD] [Mantini’s VISAPP’19]