Reading: MS-ROI — Multi-Structure Region Of Interest (JPEG Image Compression)

Semantic Perceptual Image Compression Using Deep Convolution Networks

Jun 8, 2020

In this story, Multi-Structure Region Of Interest (MS-ROI), by Brandeis University, is briefly presented. In this paper:

A new CNN architecture is proposed for image compression.
The CNN generates a map that highlights semantically salient regions so that they can be encoded at higher quality than background regions.

This is a paper published at DCC 2017. (Sik-Ho Tsang @ Medium)

The CNN architecture consists of several stacked convolutional layers, as shown above.
The fully connected layers that a traditional image-classification CNN places on top to produce the predicted categorical output are removed and replaced with a global average pooling (GAP) layer:

The GAP operation calculates the spatial average of each feature map in the output of the convolutional layer preceding the GAP layer (a three-dimensional tensor), reducing each feature map to a single value.
The resulting vector is fed directly into the final softmax layer, which outputs the model’s prediction.
The weights connecting the GAP layer to the output layer encode the contribution of each feature map to the predicted class — the bigger the contribution of a specific detected visual pattern, the more weight it is given.
A saliency map is obtained by mapping the weights of the final layer back to the last convolutional layer and calculating a weighted sum of the feature maps.
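The GAP step and the weighted sum behind the saliency map can be illustrated with a small pure-Python sketch. The function names (`global_average_pool`, `class_activation_map`) and the toy numbers are hypothetical and chosen only for illustration; in a real model the feature maps and the output-layer weights would come from the trained network.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (H x W nested list) to its spatial mean."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def class_activation_map(feature_maps, class_weights):
    """Weighted sum of the feature maps, using one class's output-layer weights."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Toy input: two 2x2 feature maps from the last convolutional layer.
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
pooled = global_average_pool(feature_maps)  # one value per feature map
cam = class_activation_map(feature_maps, class_weights=[0.5, -1.0])
```

Here `pooled` is the vector that would be passed to the softmax layer, and `cam` keeps the spatial layout of the feature maps, which is what makes it usable as a localization map.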
Rather than picking only the most probable class (the highest activation), the class activations are sorted in ascending order, from the lowest value to the highest.
A weighted sum of the five highest-scoring classes is taken, while the less probable classes are discarded.
By applying a colour-map consisting of a range of cold and warm colours over the obtained greyscale saliency map, the final localization is expressed as a heatmap, which highlights the discriminative ROIs specific to the predicted classes. The most important image regions are represented with the red colour, whereas the least important image regions are represented with the dark blue colour.
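A minimal sketch of how the top-scoring classes could be combined into a single greyscale map follows. The names `top_k_indices` and `multi_structure_map` are hypothetical, and two details are assumptions that the text above does not spell out: each class map is weighted by its class score, and the combined map is min-max normalised to [0, 1].

```python
def top_k_indices(scores, k=5):
    """Indices of the k largest scores, highest first."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    return ascending[::-1][:k]


def multi_structure_map(class_maps, scores, k=5):
    """Combine the activation maps of the top-k classes into one map in [0, 1].

    Assumptions: weights are the class scores; normalisation is min-max.
    """
    height, width = len(class_maps[0]), len(class_maps[0][0])
    combined = [[0.0] * width for _ in range(height)]
    for c in top_k_indices(scores, k):
        for y in range(height):
            for x in range(width):
                combined[y][x] += scores[c] * class_maps[c][y][x]
    flat = [value for row in combined for value in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # avoid dividing by zero on a flat map
    return [[(value - lowest) / span for value in row] for row in combined]


# Toy example: three classes, 1x2 activation maps, keep the top two classes.
scores = [0.1, 0.6, 0.3]
class_maps = [
    [[9.0, 9.0]],  # least probable class, discarded when k=2
    [[1.0, 0.0]],
    [[0.0, 1.0]],
]
top_two = top_k_indices(scores, k=2)
ms_map = multi_structure_map(class_maps, scores, k=2)
```

The colour-map step is purely a visualisation of `ms_map`; the compression stage only needs the greyscale values.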

The MS-ROI map has a saliency value in the range [0, 1] for each pixel.
These saliency values are discretized into k levels.
A range of JPEG quality levels, Ql to Qh, is then defined. Each saliency level is compressed using the Q value from this range that corresponds to that level.

For each 8×8 block of the output image, a quality level Q is chosen corresponding to that block’s saliency level.
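The discretization and the per-block choice of Q can be sketched as below. This is an illustration under stated assumptions, not the paper's exact formula: the mapping from level to Q is assumed to be linear between Ql and Qh, and each block's saliency is assumed to be the mean of its pixels. The function names are hypothetical, and the defaults (five levels, Ql = 30, Qh = 70) are the values used in the experiments.

```python
def saliency_level(s, k):
    """Discretise a saliency value in [0, 1] into one of k levels (0 .. k-1)."""
    return min(int(s * k), k - 1)


def level_to_quality(level, k, q_low, q_high):
    """Assumed linear mapping: level 0 -> q_low, level k-1 -> q_high."""
    if k == 1:
        return q_high
    return round(q_low + level * (q_high - q_low) / (k - 1))


def block_quality_map(saliency, k=5, q_low=30, q_high=70, block=8):
    """Pick one Q per block x block tile from the tile's mean saliency (assumed)."""
    height, width = len(saliency), len(saliency[0])
    qualities = []
    for top in range(0, height, block):
        row = []
        for left in range(0, width, block):
            tile = [
                saliency[y][x]
                for y in range(top, min(top + block, height))
                for x in range(left, min(left + block, width))
            ]
            level = saliency_level(sum(tile) / len(tile), k)
            row.append(level_to_quality(level, k, q_low, q_high))
        qualities.append(row)
    return qualities


# Toy 8x16 saliency map: a non-salient left block and a salient right block.
saliency = [[0.0] * 8 + [1.0] * 8 for _ in range(8)]
q_map = block_quality_map(saliency)
q_levels = [level_to_quality(n, 5, 30, 70) for n in range(5)]
```

With these defaults the five levels map to Q values in steps of ten, so the least salient blocks get Ql and the most salient blocks get Qh.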

The CNN model is trained with the Caltech-256 dataset.
It is tested on the Kodak PhotoCD set (24 images) and the MIT dataset (2,000 images).
The heat-map is discretized into five levels, using JPEG quality levels Q in increments of ten from Ql = 30 to Qh = 70. For all experiments, the file sizes of the standard JPEG image and the JPEG obtained from the CNN model were kept within 1% of each other.
On average, salient regions were compressed at Q = 65 and non-salient regions at Q = 45. The overall Q for the final image generated using the CNN model was 55, whereas Q = 50 was chosen for all standard JPEG samples.