Reading: IDN — Information Distillation Network (Super Resolution)

With Faster Execution Time, Outperforms MemNet, DRRN, LapSRN, DRCN & VDSR

6 min read · Jul 19, 2020

**PSNR Against Execution Time, IDN has higher PSNR with faster processing time**

This story reviews Fast and Accurate Single Image Super-Resolution via Information Distillation Network (IDN), by Xidian University. The main points:

The network has 3 parts: feature extraction block, stacked information distillation blocks and reconstruction block.
By combining an enhancement unit with a compression unit into a distillation block, the local long and short-path features can be effectively extracted.
Execution is fast because each layer uses comparatively few filters and because group convolution is used.

This is a paper in 2018 CVPR with over 100 citations. (Note: The distillation here is not model/knowledge distillation.) (Sik-Ho Tsang @ Medium)

Outline

IDN: Network Architecture
Distillation Block
Experimental Results

1. IDN: Network Architecture

The proposed IDN, as shown in the figure, consists of three parts: a feature extraction block (FBlock), multiple stacked information distillation blocks (DBlocks) and a reconstruction block (RBlock).
x and y denote the input and the output of IDN, respectively.
With respect to FBlock, two 3×3 convolutional layers are utilized to extract the feature maps from the original LR image.

B0 = f(x)

where f represents the feature extraction function and B0 denotes the extracted features.
The next part consists of multiple information distillation blocks chained together. Each block, DBlock, contains an enhancement unit followed by a compression unit.

Bk = Fk(Bk-1), k = 1, ..., n

where Fk denotes the k-th DBlock function, and Bk-1 and Bk are the input and output of the k-th DBlock, respectively.
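Finally, a transposed convolution without an activation function is used as the RBlock.
The whole IDN becomes:

y = R(Fn(Bn-1)) + U(x)

where R and U denote the RBlock and the bicubic interpolation operation, respectively.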

The authors first train the network with the MAE loss and then fine-tune it with the MSE loss.
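To make the two training objectives concrete, here is a minimal sketch of both losses in plain Python. It treats an image as a flat list of pixel values; the function names and the sample values are illustrative assumptions, not the authors' code.

```python
def mae_loss(pred, target):
    """Mean absolute error (L1 loss) over flattened pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse_loss(pred, target):
    """Mean squared error (L2 loss) over flattened pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Hypothetical predicted and ground-truth HR pixels in [0, 1].
pred = [0.5, 0.2, 0.9, 0.4]
target = [0.5, 0.0, 1.0, 0.0]

print("MAE:", mae_loss(pred, target))
print("MSE:", mse_loss(pred, target))
```

MSE squares each residual, so the single large error (0.4) dominates it far more than it dominates MAE.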

2. Distillation Block

2.1. Enhancement Unit

The orange circle represents the slice operation and the purple circle indicates the concatenation operation in the channel dimension.
The enhancement unit can be roughly divided into two modules: the above three convolutions and the below three convolutions.
The above module has three 3×3 convolutions, each followed by a LReLU.
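Let Di denote the feature map dimension (number of channels) of the i-th layer. The channel dimensions in the above module satisfy:

D3 - D1 = D1 - D2 = d

Similarly, the channel dimensions in the below module satisfy:

D6 - D4 = D4 - D5 = d

where D4 = D3 and d is the difference in channel dimension between the layers.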
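As a worked example, the channel widths of all six layers can be derived from D3, d and s. The sketch below takes the paper's relations D3 - D1 = D1 - D2 = d and D6 - D4 = D4 - D5 = d as given, and plugs in the values reported later in this story (D3 = 64, d = 16, s = 4). The function name and dictionary keys are hypothetical.

```python
def enhancement_unit_channels(d3=64, d=16, s=4):
    """Derive per-layer channel counts from D3, d and the slice ratio 1/s."""
    d1 = d3 - d        # from D3 - D1 = d
    d2 = d1 - d        # from D1 - D2 = d
    d4 = d3            # D4 = D3
    d5 = d4 - d        # from D4 - D5 = d
    d6 = d4 + d        # from D6 - D4 = d
    kept = d3 // s     # channels sliced off and kept as short-path information
    passed = d3 - kept # channels fed into the below module
    return {"D1": d1, "D2": d2, "D3": d3, "D4": d4, "D5": d5, "D6": d6,
            "kept": kept, "passed": passed}


channels = enhancement_unit_channels()
print(channels)
```

With these values the above module outputs 48, 32 and 64 channels, and the below module outputs 64, 48 and 80 channels. 16 channels are retained and 48 are passed on.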
The output of the third convolutional layer is sliced into two segments. With Bk-1 as the input of this module:

Pk1 = Ca(Bk-1)

where Ca denotes the chained convolution operations and Pk1 is the output of the above module in the k-th enhancement unit.
The feature maps with D3/s dimensions of Pk1 and the input of the first convolutional layer are concatenated in the channel dimension:

Rk = C(S(Pk1, 1/s), Bk-1)

where C and S represent the concatenation operation and the slice operation, respectively, and Rk is the concatenated result. Specifically, the dimension of Pk1 is D3, so S(Pk1, 1/s) denotes that D3/s dimensions of features are fetched from Pk1.
Moreover, S(Pk1, 1/s) is concatenated with Bk-1 in the channel dimension.

The purpose is to combine the previous information with some current information. It can be regarded as partially retained local short-path information.

The rest of local short-path information is taken as the input of the below module, which mainly further extracts long-path feature maps:

Pk2 = Cb(S(Pk1, 1 - 1/s))

where Pk2 and Cb are the output and the stacked convolution operations of the below module, respectively.
Finally, the input information, the reserved local short-path information and the local long-path information are aggregated:

Pk = Pk2 + C(S(Pk1, 1/s), Bk-1)

where Pk is the output of the enhancement unit.
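The slice, concatenate and aggregate steps can be traced at the level of channel counts. The sketch below is not a real network: `stand_in_convs` is a hypothetical placeholder that only produces the right number of channels, and each channel is a single scalar. It also assumes that the first D3/s channels are the ones kept, that the aggregation is an element-wise addition, that the below module outputs D3 + d channels, and that the compression unit returns to 64 channels.

```python
def stand_in_convs(channels, out_channels):
    """Hypothetical stand-in for stacked convolutions + LReLU.

    Only the channel count is modelled: every output channel is the mean
    of the input channels, so no real filtering takes place.
    """
    mean = sum(channels) / len(channels)
    return [mean] * out_channels


def enhancement_unit(b_prev, d3=64, d=16, s=4):
    """Trace the slice / concatenate / add flow, one scalar per channel."""
    p1 = stand_in_convs(b_prev, d3)          # Pk1 = Ca(Bk-1), D3 channels
    kept, rest = p1[:d3 // s], p1[d3 // s:]  # S(Pk1, 1/s) and S(Pk1, 1 - 1/s)
    short_path = kept + b_prev               # C(S(Pk1, 1/s), Bk-1)
    p2 = stand_in_convs(rest, d3 + d)        # Pk2 = Cb(...), assumed D3 + d channels
    if len(p2) != len(short_path):
        raise ValueError("channel counts must match for element-wise addition")
    return [a + b for a, b in zip(p2, short_path)]


def compression_unit(p, out_channels=64):
    """Stand-in for the 1x1 convolution that reduces the channel count."""
    return stand_in_convs(p, out_channels)


b_prev = [1.0] * 64              # input of the k-th DBlock, 64 channels
p = enhancement_unit(b_prev)
b_next = compression_unit(p)
print(len(p), len(b_next))       # 80 64
```

The addition only works because both branches end up with the same width: 16 kept channels plus 64 input channels on one side, and 80 long-path channels on the other.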

2.2. Compression Unit

The output of the enhancement unit is sent to a 1×1 convolutional layer, which performs dimensionality reduction, i.e. it distills the relevant information for the later network.
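A 1×1 convolution is a linear map across channels applied independently at each pixel, which is why it can reduce dimensionality without touching spatial neighbours. The sketch below illustrates this for a single pixel. The function names, the toy weights and the presence of a bias term are assumptions for illustration, and the 80-to-64 sizes assume the channel configuration described in this story.

```python
def conv1x1(pixel, weights, bias):
    """Apply a 1x1 convolution at one spatial location.

    pixel:   list of C_in channel values
    weights: C_out rows, each holding C_in coefficients
    bias:    list of C_out offsets
    """
    return [sum(w * v for w, v in zip(row, pixel)) + b
            for row, b in zip(weights, bias)]


def conv1x1_param_count(c_in, c_out):
    """Number of weights plus biases of a 1x1 convolution."""
    return c_in * c_out + c_out


# Toy example: reduce 3 channels to 2.
pixel = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0, 0.0],
           [0.5, 0.5, 0.0]]
bias = [0.0, 1.0]
print(conv1x1(pixel, weights, bias))   # [1.0, 2.5]
print(conv1x1_param_count(80, 64))     # 5184
```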

3. Experimental Results

3.1. Training

The training set consists of 91 images from Yang et al. and 200 images from the Berkeley Segmentation Dataset (BSD).
Data augmentation is applied in three ways: (1) rotating the images by 90°, 180° and 270°; (2) flipping the images horizontally; (3) downscaling the images by factors of 0.9, 0.8, 0.7 and 0.6.
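The rotation and flip steps can be sketched on a tiny image stored as a nested list. Whether the authors combine the transformations, as done here, is an assumption. Downscaling is left out because it needs an interpolation routine, and the helper names are hypothetical.

```python
def rotate90(img):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img):
    """Mirror a 2-D image left to right."""
    return [row[::-1] for row in img]


def augment(img):
    """Return the 0/90/180/270 degree rotations, each with and without a flip."""
    variants = []
    current = img
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


img = [[1, 2],
       [3, 4]]
variants = augment(img)
print(len(variants))   # 8
```

If each of these 8 variants were also produced at the original size and at the 4 reduced scales, one source image would yield 40 training images under this assumption.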

Training becomes unstable when larger training samples are combined with a larger learning rate.
Therefore, for example, 15²/43² LR/HR sub-image pairs are generated for the training stage and 26²/76² LR/HR sub-image pairs are used for the fine-tuning phase.
Finally, a 31-layer network is used as IDN.
IDN has 4 DBlocks, and the parameters D3, d and s of enhancement unit in each block are set to 64, 16 and 4 respectively.
To reduce the number of network parameters, grouped convolution, which was introduced in AlexNet and later used in ResNeXt, is applied with 4 groups in the second and fourth layers of each enhancement unit.
In addition, the transposed convolution adopts 17×17 filters for all scaling factors, and the negative slope of LReLU is set to 0.05.
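Several of these settings can be sanity-checked with simple arithmetic. The sketch below counts layers, compares the weights of a standard and a grouped 3×3 convolution, and applies the standard transposed-convolution output-size formula. The 48-to-32 and 48-to-64 channel sizes of the second and fourth layers follow from D3 = 64, d = 16 and s = 4 under the paper's channel relations. The stride equal to the scale factor and the padding of 8 are assumptions, chosen because they reproduce the reported sub-image sizes.

```python
def idn_layer_count(num_dblocks=4):
    """2 convs in FBlock, 6 convs + 1 compression conv per DBlock, 1 in RBlock."""
    return 2 + num_dblocks * (6 + 1) + 1


def conv_weight_count(c_in, c_out, kernel=3, groups=1):
    """Weights of a (grouped) convolution, biases excluded."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel * kernel * (c_in // groups) * c_out


def transposed_conv_output_size(n_in, stride, kernel=17, padding=8):
    """Standard output-size formula (no dilation, no output padding)."""
    return (n_in - 1) * stride - 2 * padding + kernel


print(idn_layer_count())                           # 31
print(conv_weight_count(48, 32),                   # 13824
      conv_weight_count(48, 32, groups=4))         # 3456
print(conv_weight_count(48, 64),                   # 27648
      conv_weight_count(48, 64, groups=4))         # 6912
print(transposed_conv_output_size(15, stride=3))   # 43
print(transposed_conv_output_size(26, stride=3))   # 76
```

Using 4 groups cuts the weights of those two layers by a factor of 4. Under the stated assumptions, the 15²/43² and 26²/76² pairs are consistent with a ×3 scale factor.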

3.2. Testing

**Average PSNR/SSIMs for scale ×2, ×3 and ×4. Red color indicates the best and blue color indicates the second best performance.**

IDN performs favorably against state-of-the-art results on most datasets.
IDN performs worse than MemNet on the Urban100 dataset at the ×3 and ×4 scale factors, while it achieves slightly better performance on the other benchmark datasets.
The authors attribute this to MemNet taking an interpolated LR image as its input, so that more information is fed into the network and the SR process only needs to correct the interpolated image.
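For reference, PSNR is computed from the mean squared error between the ground truth and the reconstruction as 10·log10(MAX² / MSE). The sketch below shows the formula on flattened pixel lists. It does not reproduce the benchmark protocol: the colour channel used and any border cropping are not covered here, and the function name is hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [52.0, 55.0, 61.0, 59.0]
estimate = [54.0, 55.0, 60.0, 58.0]
print(round(psnr(reference, estimate), 2))
```

A higher PSNR means a smaller pixel-wise error, so gains of a fraction of a dB between methods correspond to modest reductions in MSE.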

The information fidelity criterion (IFC) metric is also used. It assesses image quality based on natural scene statistics and correlates well with human perception of image super-resolution.
IDN achieves the best performance and outperforms MemNet by a considerable margin.