Brief Review — AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data

AET, Auto-Encoding Transformations, Predicts Transformation Instead of Pixels, for Self-Supervised Learning

Sik-Ho Tsang
6 min read · Oct 14, 2022

AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data,
AET, by Laboratory for MAchine Perception and LEarning (MAPLE), Huawei Cloud, University of Central Florida, and University of Rochester
2019 CVPR, Over 170 Citations (Sik-Ho Tsang @ Medium)
Self-Supervised Learning, Image Classification

  • Given a randomly sampled transformation, Auto-Encoding Transformations (AET) seeks to predict that transformation as accurately as possible at the output end, using only the encoded features.
  • The idea is that, as long as the unsupervised features successfully encode the essential information about the visual structures of original and transformed images, the transformation can be well predicted.


  1. Auto-Encoding Transformations (AET)
  2. Results

1. Auto-Encoding Transformations (AET)

1.1. Overall Framework

An illustrative comparison between AED and AET, where AET attempts to estimate the input transformation rather than the data at the output end
  • A transformation t is sampled from a distribution T. It is applied to an image x drawn from a data distribution X, resulting in the transformed version t(x) of x.

The goal is to learn an encoder E: x ↦ E(x), which extracts the representation E(x) of a sample x. Meanwhile, we wish to learn a decoder D, which gives an estimate t̂ of the input transformation by decoding from the encoded representations of the original and transformed images.

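  • AET now boils down to jointly training the feature encoder E and the transformation decoder D. A loss function ℓ(t, t̂) quantifies the difference between a transformation t and its estimate t̂, and its expectation is minimized over E and D:

    $$\min_{E, D} \; \mathbb{E}_{t \sim T,\, x \sim X} \; \ell(t, \hat{t})$$

  • where the transformation estimate t̂ is decoded from the two encoded images:

    $$\hat{t} = D\big[E(x),\, E(t(x))\big]$$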
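To make the framework concrete, here is a toy sketch in Python. It is not the paper's model: the "image" is a hypothetical set of 2-D points, the transformation family is restricted to translations, and the encoder and decoder (`encode`, `decode`) are hand-written rather than learned. It only illustrates that when the features of x and t(x) keep the relevant structure, the transformation can be decoded from the two feature vectors alone.

```python
import random

def apply_translation(points, t):
    """Apply the transformation t = (dx, dy) to every point of the toy image."""
    dx, dy = t
    return [(px + dx, py + dy) for px, py in points]

def encode(points):
    """Toy encoder E (an assumption): summarize a point set by its centroid."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def decode(feat_x, feat_tx):
    """Toy decoder D (an assumption): estimate t from the two feature vectors."""
    return (feat_tx[0] - feat_x[0], feat_tx[1] - feat_x[1])

def loss(t, t_hat):
    """Squared error between a transformation and its estimate."""
    return sum((a - b) ** 2 for a, b in zip(t, t_hat))

rng = random.Random(0)
x = [(rng.random(), rng.random()) for _ in range(16)]   # sample x from the data
t = (rng.uniform(-0.2, 0.2), rng.uniform(-0.2, 0.2))    # sample t from T
t_hat = decode(encode(x), encode(apply_translation(x, t)))
aet_loss = loss(t, t_hat)
```

In AET proper, E and D are neural networks, and this loss is what drives their joint training. Here the loss is already near zero because the hand-written features were chosen to preserve the translation.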

1.2. AET Family

  • There are three families of transformations with which AET models can be instantiated: parameterized, GAN-induced, and non-parametric transformations.

1.2.1. Parameterized Transformations

  • A family of transformations T = {t_θ | θ ∼ Θ} is parameterized by θ sampled from a distribution Θ. Affine and projective transformations, for example, can each be represented by a parameterized matrix M(θ). The loss function becomes:

    $$\ell(t_\theta, t_{\hat\theta}) = \tfrac{1}{2}\,\big\|M(\theta) - M(\hat\theta)\big\|_2^2$$

Only parameterized transformations are used in the experiments, in two versions: AET-affine and AET-project.

  • For AET-affine, the affine transformation is a composition of a random rotation within [−180°, 180°], a random translation by ±0.2 of the image height and width in the vertical and horizontal directions, a random scaling factor within [0.7, 1.3], and a random shearing within [−30°, 30°].
  • For AET-project, the projective transformation is formed by randomly translating the four corners of an image in both the horizontal and vertical directions by ±0.125 of its height and width, after the image is randomly scaled by a factor within [0.8, 1.2] and rotated by 0°, 90°, 180°, or 270°.
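As a rough illustration of the AET-affine setting, the sketch below samples parameters from the ranges listed above, builds a 3×3 matrix M(θ), and compares two such matrices with half their squared difference. The composition order, the use of a single isotropic scale, the horizontal-only shear, and all function names are assumptions of this sketch, not details confirmed by the paper.

```python
import math
import random

def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sample_affine_params(rng):
    """Draw parameters from the AET-affine ranges quoted in the text."""
    return {
        "rotation": math.radians(rng.uniform(-180.0, 180.0)),
        "scale": rng.uniform(0.7, 1.3),
        "shear": math.radians(rng.uniform(-30.0, 30.0)),
        "tx": rng.uniform(-0.2, 0.2),   # fraction of image width
        "ty": rng.uniform(-0.2, 0.2),   # fraction of image height
    }

def affine_matrix(p):
    """Build M(theta); the composition order is an assumption of this sketch."""
    c, s = math.cos(p["rotation"]), math.sin(p["rotation"])
    rotate = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    scale = [[p["scale"], 0.0, 0.0], [0.0, p["scale"], 0.0], [0.0, 0.0, 1.0]]
    shear = [[1.0, math.tan(p["shear"]), 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    translate = [[1.0, 0.0, p["tx"]], [0.0, 1.0, p["ty"]], [0.0, 0.0, 1.0]]
    return matmul3(translate, matmul3(rotate, matmul3(shear, scale)))

def matrix_loss(m, m_hat):
    """Half the squared entry-wise difference between two matrices."""
    return 0.5 * sum((m[i][j] - m_hat[i][j]) ** 2
                     for i in range(3) for j in range(3))

rng = random.Random(0)
m_true = affine_matrix(sample_affine_params(rng))
m_pred = affine_matrix(sample_affine_params(rng))  # stand-in for a decoder output
print(matrix_loss(m_true, m_true), matrix_loss(m_true, m_pred))
```

In training, `m_pred` would come from the decoder's predicted parameters rather than from a second random draw.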

1.2.2. GAN-Induced Transformations

  • (Not used in the experiments. Skip for a quick read.)
  • A GAN generator G, which transforms an input x over the manifold of real images, can also define the transformation. This is left as future work:

    $$t_z(x) = G(x, z)$$

  • where z is sampled noise. The loss function becomes:

    $$\ell(t_z, t_{\hat z}) = \tfrac{1}{2}\,\|z - \hat z\|_2^2$$

1.2.3. Non-Parametric Transformations

  • (Not used in the experiments. Skip for a quick read.)
  • A non-parametric transformation is difficult to parameterize. Thus, the loss is estimated by measuring the average difference between the transformations of randomly sampled images:

    $$\ell(t, \hat t) = \mathbb{E}_{x \sim X} \; \mathrm{dist}\big(t(x), \hat t(x)\big)$$

  • where dist(·, ·) is a distance between two transformed images.
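The averaging described above can be sketched as a simple Monte Carlo estimate. In this minimal example, images are flattened lists of pixel values, `mse` is one possible choice of dist(·, ·), and `flip` and `brighten` are hypothetical transformations chosen only for illustration.

```python
import random

def nonparametric_loss(t, t_hat, images, dist):
    """Average distance between t(x) and t_hat(x) over sampled images."""
    return sum(dist(t(x), t_hat(x)) for x in images) / len(images)

def mse(a, b):
    """One possible dist(., .): mean squared difference between two images."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Hypothetical transformations on flattened images (lists of pixel values).
def flip(x):
    return x[::-1]

def brighten(x):
    return [v + 0.1 for v in x]

rng = random.Random(0)
images = [[rng.random() for _ in range(64)] for _ in range(32)]
same = nonparametric_loss(flip, flip, images, mse)           # identical transformations
different = nonparametric_loss(flip, brighten, images, mse)  # mismatched transformations
```

The loss is zero when the estimate acts on images exactly as the true transformation does, and it grows as the two transformed images drift apart.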

2. Results

2.1. CIFAR-10

An illustration of the network architectures for training and evaluating AET on the CIFAR-10 dataset
  • The Network-In-Network (NIN) architecture is used, which consists of 4 convolutional blocks, each containing 3 convolutional layers.
  • AET has two NIN branches, sharing the same network weights.
  • The output features of the fourth block of the two branches are concatenated and average-pooled to form a 384-d feature vector. Then an output layer follows to predict the parameters of the input transformation.
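The head described above can be sketched in plain Python to show the shapes involved. The random feature maps stand in for the outputs of the shared-weight encoder, which is not implemented here. The 192 channels per branch (so that two branches give 384), the 8×8 spatial size, and the 8 output parameters are assumptions of this sketch. Pooling each branch before concatenating gives the same vector as concatenating channel-wise and then pooling.

```python
import random

def global_average_pool(feature_map):
    """Average each channel of a C x H x W feature map down to one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def linear(features, weights, bias):
    """Output layer: one weighted sum per predicted transformation parameter."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def aet_head(fmap_original, fmap_transformed, weights, bias):
    """Concatenate pooled features of both branches and predict parameters."""
    features = (global_average_pool(fmap_original)
                + global_average_pool(fmap_transformed))
    return features, linear(features, weights, bias)

rng = random.Random(0)
CHANNELS, SIZE, N_PARAMS = 192, 8, 8   # assumed sizes for this sketch

def random_map():
    """Stand-in for the encoder output of one branch."""
    return [[[rng.random() for _ in range(SIZE)] for _ in range(SIZE)]
            for _ in range(CHANNELS)]

weights = [[rng.gauss(0.0, 0.01) for _ in range(2 * CHANNELS)]
           for _ in range(N_PARAMS)]
bias = [0.0] * N_PARAMS
features, prediction = aet_head(random_map(), random_map(), weights, bias)
print(len(features), len(prediction))  # 384 8
```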
Comparison between unsupervised feature learning methods on CIFAR-10
  • A classifier is built on top of the second convolutional block. The first two blocks are frozen.
  • FC: A non-linear classifier with 3 Fully-Connected (FC) layers is trained.
  • Conv: A convolutional classifier is built upon the unsupervised features by adding a third NIN block whose output feature map is average-pooled and connected to a linear softmax classifier.

The unsupervised AET-project with the convolutional classifier achieves almost the same error rate as its fully supervised NIN counterpart with four convolutional blocks (7.82% vs. 7.2%), which is remarkable.

The comparison of the KNN error rates by different models with varying numbers K of nearest neighbors on CIFAR-10
  • KNN: Model-free KNN classifier based on the average-pooled output features.
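"Model-free" here means the classifier has no trainable parameters: a test feature is labeled by a vote among its K nearest training features. A minimal sketch follows, assuming Euclidean distance and majority voting, with made-up 2-d features and labels used purely for illustration (the paper's exact distance and voting scheme are not stated in this review).

```python
from collections import Counter

def knn_predict(query, train_features, train_labels, k):
    """Label a query by majority vote among its k nearest training features."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    order = sorted(range(len(train_features)),
                   key=lambda i: sq_dist(query, train_features[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Illustrative features only; real ones would be the pooled encoder outputs.
train_features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
                  [1.0, 0.9], [0.9, 1.0], [1.1, 1.0]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]
print(knn_predict([0.05, 0.05], train_features, train_labels, k=3))  # cat
print(knn_predict([0.95, 0.95], train_features, train_labels, k=3))  # dog
```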

The model-free KNN results suggest that AET has an advantage when no labels are available for training classifiers on top of the unsupervised features.

Comparison of RotNet vs. AETs on CIFAR-10 with different classifiers on top of learned representations for evaluation

Compared with RotNet, the results show that AET-project can consistently achieve the smallest errors no matter which classifiers are used.

2.2. ImageNet

  • Two AlexNet branches with shared parameters are used.
  • The 4,096-d output features from the second-to-last fully connected layer of the two branches are concatenated and fed into the output layer, which produces eight projective transformation parameters.
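One way to see why a projective transformation has eight parameters: a 3×3 projective matrix is defined only up to scale, so fixing its bottom-right entry to 1 leaves eight free entries, and four corner correspondences determine them. The sketch below recovers those eight numbers from randomly displaced corners of a unit square. The normalization, the unit-square coordinates, and the function names are assumptions, and the paper's exact parameterization may differ.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_params(src, dst):
    """Eight entries h0..h7 of [[h0, h1, h2], [h3, h4, h5], [h6, h7, 1]]."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    return solve_linear(rows, rhs)

def project(h, point):
    """Apply the projective transformation with parameters h to a 2-D point."""
    x, y = point
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)

rng = random.Random(0)
corners = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
moved = [(x + rng.uniform(-0.125, 0.125), y + rng.uniform(-0.125, 0.125))
         for x, y in corners]
theta = homography_params(corners, moved)   # eight regression targets
```

Applying `project(theta, c)` to each original corner reproduces the displaced corner, which is a quick sanity check on the recovered parameters.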
Top-1 accuracy with non-linear layers on ImageNet

The results show that AET outperforms BiGAN by a significant margin, suggesting its advantage over the GAN and AED paradigms, at least in this experimental setting.

Top-1 accuracy with linear layers on ImageNet

A 1000-way linear classifier is trained on top of different numbers of convolutional layers. Again, AET obtains the best accuracy among all the compared unsupervised models.

2.3. Places

Top-1 accuracy on the Places dataset with linear layers

AET models outperform the other unsupervised models in most cases. The exceptions are Conv1 and Conv2, where Counting performs slightly better.

2.4. Analysis

Error rate (top-1 accuracy) vs. AET loss over epochs on the CIFAR-10 and ImageNet datasets
  • The trend of transformation prediction loss (i.e. the AET loss being minimized to train the model) is well aligned with that of classification error and Top-1 accuracy on CIFAR-10 and ImageNet.

This suggests that better prediction of transformations is a good surrogate for better classification results with the learned features.

Some examples of original images (top), along with the counterparts of input (middle) and predicted (bottom) transformations by the AET model

The above examples show how well the model can decode the transformations from the encoded image features.


