Review: Image Transformer

Image Generation and Super Resolution Using Transformer

Sik-Ho Tsang
4 min read · Nov 30, 2021
Left: Super Resolution, Right: Image Generation (Input / Image Transformer Results / Ground Truth)

In this story, Image Transformer, by Google Brain, University of California, and Google AI, is briefly reviewed. In this paper:

  • A Transformer based on self-attention is used for image generation.
  • By restricting the self-attention mechanism to attend to local neighborhoods, the size of images the model can process is significantly increased.

This is a paper in 2018 ICML with over 600 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Image Transformer
  2. Local Self-Attention
  3. Experimental Results

1. Image Transformer

  • For image-conditioned generation, as in the super-resolution models, an encoder-decoder architecture is used.
  • The encoder generates a contextualized, per-pixel-channel representation of the source image.
  • The decoder autoregressively generates an output image of pixel intensities, one channel of one pixel at each time step.
  • For unconditional and class-conditional generation, the Image Transformer is employed in a decoder-only configuration.
  • (Please feel free to read Transformer if interested.)
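
As a rough sketch of this autoregressive decoding loop, the snippet below assumes a hypothetical `model` that maps the channels generated so far to logits over the 256 possible intensities of the next channel; it is an illustration under those assumptions, not the authors' code.

```python
import torch

# Hypothetical sketch: `model` maps the sequence of channel intensities
# generated so far to logits over the 256 values of the next channel.
@torch.no_grad()
def sample_image(model, height=32, width=32, channels=3):
    seq = torch.zeros(1, 0, dtype=torch.long)       # nothing generated yet
    for _ in range(height * width * channels):      # one channel per step
        logits = model(seq)[:, -1]                  # next-channel logits
        nxt = torch.multinomial(torch.softmax(logits, -1), 1)
        seq = torch.cat([seq, nxt], dim=1)          # append sampled intensity
    return seq.view(height, width, channels)
```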

1.1. One Layer of Image Transformer

A slice of one layer of the Image Transformer
  • A slice of one layer of the Image Transformer, recomputing the representation q′ of a single channel of one pixel q by attending to a memory of previously generated pixels m1, m2, … .
  • After performing local self-attention, a two-layer position-wise feed-forward neural network is applied with the same parameters for all positions in a given layer.
  • Both the self-attention and the feed-forward network are followed by dropout and wrapped in a residual connection, with layer normalization applied afterwards.
  • The position encodings pq, p1, … are added only in the first layer.
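
A minimal PyTorch sketch of one such layer (post-norm, with illustrative sizes chosen by assumption, not the authors' implementation) might look like:

```python
import torch
import torch.nn as nn

# Sketch of one layer: self-attention over a memory of previously
# generated positions, then a two-layer position-wise FFN, each followed
# by dropout, a residual connection, and layer normalization.
class ImageTransformerLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(                  # same parameters at all positions
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.drop = nn.Dropout(p_drop)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, q, memory):
        # attend from the query positions to the local memory block
        a, _ = self.attn(q, memory, memory, need_weights=False)
        q = self.norm1(q + self.drop(a))
        # position-wise feed-forward, with the same residual treatment
        q = self.norm2(q + self.drop(self.ffn(q)))
        return q
```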

2. Local Self-Attention

  • There are two local self-attention settings for the Image Transformer: 1D Local Attention and 2D Local Attention.

2.1. 1D Local Attention

  • The input tensor with positional encodings is flattened in raster-scan order.
  • For each query block Q, the memory block M is built from the same positions as Q plus an additional l_m preceding positions, corresponding to pixels that have been generated before, which can result in overlapping memory blocks (see the sketch below).
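
The index bookkeeping can be sketched as follows; the function name and arguments are assumptions, with `l_q` the query-block length and `l_m` the number of extra memory positions:

```python
# Sketch of 1D local attention bookkeeping: flatten the image in
# raster-scan order, split it into query blocks of length l_q, and give
# each a memory block covering the same positions plus l_m earlier ones,
# so neighboring memory blocks overlap.
import numpy as np

def make_1d_blocks(height, width, channels, l_q, l_m):
    n = height * width * channels                      # flattened length
    blocks = []
    for start in range(0, n, l_q):
        q_idx = np.arange(start, min(start + l_q, n))  # query block Q
        m_idx = np.arange(max(0, start - l_m), q_idx[-1] + 1)  # memory M
        blocks.append((q_idx, m_idx))
    return blocks
```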

2.2. 2D Local Attention

  • The input tensor with positional encodings is partitioned into rectangular query blocks contiguous in the original image space.
  • The image is generated one query block after another, ordering the blocks in raster-scan order.
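
A corresponding sketch for the 2D partitioning (again with assumed names): rectangular query blocks, contiguous in image space, visited in raster-scan order over blocks.

```python
# Sketch of 2D local attention partitioning: rectangular query blocks,
# enumerated in raster-scan order over blocks.
def make_2d_query_blocks(height, width, block_h, block_w):
    blocks = []
    for top in range(0, height, block_h):
        for left in range(0, width, block_w):
            ys = range(top, min(top + block_h, height))
            xs = range(left, min(left + block_w, width))
            blocks.append([(y, x) for y in ys for x in xs])
    return blocks
```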

3. Experimental Results

3.1. Image Generation

Image completions from the proposed best conditional generation model
Conditional image generations for all CIFAR-10 categories
  • Unlike GANs, the Image Transformer generates images pixel by pixel as an autoregressive likelihood model, so log-likelihoods can be computed and reported in bits per dimension (a minimal bits/dim computation is sketched below).
  • Left: images from a model that achieves 3.03 bits/dim on the test set.
  • Right: images from the proposed best non-averaged model with 2.99 bits/dim.
  • Both models are able to generate convincing cars, trucks, and ships. Generated horses, planes, and birds also look reasonable.
Bits/dim on CIFAR-10 test and ImageNet validation sets
  • The Image Transformer matches PixelCNN++ on CIFAR-10 and achieves a new state of the art on ImageNet. Increasing the memory block size (bsize) significantly improves performance.
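
As referenced above, bits/dim is the negative log-likelihood per pixel channel converted from nats to bits; a minimal computation, assuming `nll_nats_total` is the summed NLL of one image in nats:

```python
import math

# bits per dimension: total negative log-likelihood in nats, divided by
# the number of dimensions (pixel channels) and converted to base 2.
def bits_per_dim(nll_nats_total, height, width, channels):
    n_dims = height * width * channels
    return nll_nats_total / (n_dims * math.log(2.0))

# e.g., a 32x32x3 CIFAR-10 image with a total NLL of about 6440 nats
# corresponds to roughly 3.02 bits/dim.
print(bits_per_dim(6440.0, 32, 32, 3))
```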

3.2. Super Resolution

Four-fold super-resolution model trained on CIFAR-10; the images look realistic and plausible

3.3. Super Resolution on CelebA

Negative log-likelihood and human eval performance for the Image Transformer on CelebA
Images from our 1D and 2D local attention super-resolution models trained on CelebA, sampled with different temperatures
  • τ is the temperature of the tempered softmax, proposed in Dahl et al., 2017.
  • An 8×8 pixel image is enlarged four-fold to 32×32.
  • 35% to 36% of human evaluators are fooled by the generated images, which is significantly better than the previous state of the art.
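
One common convention for the tempered softmax is to scale the logits by 1/τ before normalizing; a sketch under that assumption (function name and defaults are illustrative):

```python
import numpy as np

# Tempered-softmax sampling: τ < 1 sharpens the distribution,
# τ = 1 recovers the ordinary softmax.
def sample_tempered(logits, tau=0.9, seed=0):
    rng = np.random.default_rng(seed)
    z = np.asarray(logits, dtype=np.float64) / tau
    z -= z.max()                               # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)     # sampled intensity index
```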

This paper utilizes the Transformer, proposed in Attention Is All You Need, for image generation and image super resolution.

Reference

[2018 ICML] [Image Transformer]
Image Transformer

Single Image Super Resolution (SISR)

2014–2016: [SRCNN]
2016: [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN]
2017: [DnCNN] [DRRN] [LapSRN & MS-LapSRN] [MemNet] [IRCNN] [WDRN / WavResNet] [SRDenseNet] [SRGAN & SRResNet] [SelNet] [CNF] [BT-SRN] [EDSR & MDSR] [EnhanceNet]
2018: [MWCNN] [MDesNet] [RDN] [SRMD & SRMDNF] [DBPN & D-DBPN] [RCAN] [ESRGAN] [CARN] [IDN] [ZSSR] [MSRN] [Image Transformer]
2019: [SR+STN] [IDBP-CNN-IA] [SRFBN] [OISR]
2020: [PRLSR] [CSFN & CSFN-M]

My Other Previous Paper Readings
