# Brief Review — AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data

## AET, Auto-Encoding Transformations, Predicts Transformation Instead of Pixels, for Self-Supervised Learning


AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data, AET, by Laboratory for MAchine Perception and LEarning (MAPLE), Huawei Cloud, University of Central Florida, and University of Rochester. 2019 CVPR, Over 170 Citations (Sik-Ho Tsang @ Medium)

Self-Supervised Learning, Image Classification

- Given a randomly sampled transformation, **Auto-Encoding Transformations (AET)** seeks to **predict the transformation merely from the encoded features** as accurately as possible at the output end.
- The idea is that, as long as **the unsupervised features successfully encode the essential information about the visual structures** of original and transformed images, the transformation can be well predicted.

# Outline

1. **Auto-Encoding Transformations (AET)**
2. **Results**

# 1. Auto-Encoding Transformations (AET)

## 1.1. Overall Framework

**A transformation** *t* is sampled from a distribution *T*. It is applied to **an image** *x* drawn from a data distribution *X*, resulting in the **transformed version** *t*(*x*) of *x*.

The goal is to learn an encoder *E*: *x* → *E*(*x*), which aims to extract the representation *E*(*x*) for a sample *x*. Meanwhile, we wish to learn a decoder *D*, which gives an estimate *t̂* of the input transformation by decoding from the encoded representations of the original and transformed images.

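- AET now boils down to **jointly training the feature encoder** *E* **and the transformation decoder** *D*. A **loss function** *ℓ*(*t*, *t̂*) is used, that quantifies the **difference between a transformation** *t* **and its estimate** *t̂*:

$$\min_{E,D} \; \mathbb{E}_{t \sim \mathcal{T},\, x \sim \mathcal{X}} \; \ell(t, \hat{t})$$

- where the **transformation estimate** *t̂* is:

$$\hat{t} = D\big[E(x),\, E(t(x))\big]$$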
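The framework above can be sketched in a few lines of NumPy. This is only a toy illustration under stated assumptions, not the paper's implementation: the encoder and decoder are single linear layers with made-up sizes, the transformation is a circular shift parameterized by its two offsets, and the loss is a plain squared error. All names (`encode`, `decode`, `aet_loss`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
BATCH, SIDE, FEAT_DIM, PARAM_DIM = 8, 32, 64, 2

# Toy encoder E: one linear layer + ReLU (the paper uses CNNs instead).
W_enc = rng.normal(scale=0.01, size=(SIDE * SIDE, FEAT_DIM))
# Toy decoder D: one linear layer over the concatenated features.
W_dec = rng.normal(scale=0.01, size=(2 * FEAT_DIM, PARAM_DIM))


def encode(images):
    """E(x): the same weights are used for both the original and transformed branch."""
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W_enc, 0.0)


def decode(feat_x, feat_tx):
    """D[E(x), E(t(x))]: estimate the transformation parameters."""
    return np.concatenate([feat_x, feat_tx], axis=1) @ W_dec


def aet_loss(t_true, t_est):
    """Batch mean of 0.5 * squared error, one possible choice of the loss."""
    return 0.5 * np.mean(np.sum((t_true - t_est) ** 2, axis=1))


# Sample images x and a transformation t per image (a circular shift by (dy, dx)).
x = rng.random((BATCH, SIDE, SIDE))
shifts = rng.integers(-4, 5, size=(BATCH, 2))
t_x = np.stack([
    np.roll(img, shift=(int(s[0]), int(s[1])), axis=(0, 1))
    for img, s in zip(x, shifts)
])

# Forward pass: t_hat = D[E(x), E(t(x))], then the loss to be minimized over E and D.
t_est = decode(encode(x), encode(t_x))
loss = aet_loss(shifts.astype(float), t_est)
```

Training would then update `W_enc` and `W_dec` by gradient descent on `loss`; that step is omitted here to keep the sketch short.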

## 1.2. AET Family

- There are **three genres** to instantiate the AET models: **parameterized**, **GAN-induced** and **non-parametric** transformations.

## 1.2.1. Parameterized Transformations

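- A family of transformations *T* with parameters *θ* sampled from a distribution *Θ* is considered, such as **affine and projective transformations**, each represented by a **transformation matrix** *M*(*θ*). The loss function becomes:

$$\ell(t_\theta, t_{\hat{\theta}}) = \frac{1}{2} \left\| M(\theta) - M(\hat{\theta}) \right\|_2^2$$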
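As a small worked example of a matrix-based loss, the sketch below compares two rotation matrices. It assumes the loss is half the squared entry-wise difference between the true matrix *M*(*θ*) and the estimated one; the helper names `rotation_matrix` and `matrix_loss` are hypothetical.

```python
import numpy as np


def rotation_matrix(theta_deg):
    """3x3 homogeneous matrix M(theta) of a 2-D rotation by theta_deg degrees."""
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def matrix_loss(m_true, m_est):
    """0.5 * squared Frobenius distance between two transformation matrices."""
    return 0.5 * np.sum((m_true - m_est) ** 2)


# A perfect estimate gives zero loss; a 15-degree error gives a small positive loss.
loss_same = matrix_loss(rotation_matrix(30), rotation_matrix(30))
loss_diff = matrix_loss(rotation_matrix(30), rotation_matrix(45))
```

For two rotations the loss reduces to 2 − 2·cos(Δ), where Δ is the angular error, so it grows smoothly with how far off the predicted angle is.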

Only parameterized transformations are used in the experiments, in two versions: AET-affine and AET-project.

- For **AET-affine**, the affine transformation is a composition of a random rotation within [−180°, 180°], a random translation by ±0.2 of image height and width in both vertical and horizontal directions, and a random scaling factor within [0.7, 1.3], along with a random shearing within [−30°, 30°].
- For **AET-project**, the projective transformation is formed by randomly translating four corners of an image in both horizontal and vertical directions by ±0.125 of its height and width, after it is randomly scaled by [0.8, 1.2] and rotated by 0°, 90°, 180°, or 270°.
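The AET-affine sampling described above can be sketched as follows. The parameter ranges come from the text; the composition order, the shear convention (a shear along x only), uniform sampling, and the function name `sample_affine` are assumptions made for illustration.

```python
import numpy as np


def sample_affine(rng, height=32, width=32):
    """Sample a 3x3 affine matrix using the ranges quoted for AET-affine."""
    rot = np.deg2rad(rng.uniform(-180.0, 180.0))
    shear = np.deg2rad(rng.uniform(-30.0, 30.0))
    scale = rng.uniform(0.7, 1.3)
    tx = rng.uniform(-0.2, 0.2) * width
    ty = rng.uniform(-0.2, 0.2) * height

    rotation = np.array([[np.cos(rot), -np.sin(rot), 0.0],
                         [np.sin(rot), np.cos(rot), 0.0],
                         [0.0, 0.0, 1.0]])
    # Shear along x only; the exact shear convention is an assumption.
    shearing = np.array([[1.0, np.tan(shear), 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
    scaling = np.diag([scale, scale, 1.0])
    translation = np.array([[1.0, 0.0, tx],
                            [0.0, 1.0, ty],
                            [0.0, 0.0, 1.0]])
    # The order of composition is an assumption, not taken from the paper.
    return translation @ rotation @ shearing @ scaling


rng = np.random.default_rng(0)
M = sample_affine(rng)
```

The six entries in the top two rows of `M` are what a decoder would be asked to regress in this setup.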

## 1.2.2. GAN-Induced Transformations

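- (It is not used in experiments. Please skip for fast read.)
- **A GAN generator** *G* defines the transformation of an image *x*:

$$t_z(x) = G(x, z)$$

- where *z* is **sampled noise**. The loss function becomes:

$$\ell(t_z, t_{\hat{z}}) = \frac{1}{2} \left\| z - \hat{z} \right\|_2^2$$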

## 1.2.3. Non-Parametric Transformations

- (It is not used in experiments. Please skip for fast read.)
- A non-parametric transformation is difficult to parameterize. Thus, the **loss** is estimated by measuring the **average difference between the transformations of randomly sampled images**:

$$\ell(t, \hat{t}) = \mathbb{E}_{x \sim \mathcal{X}} \; \mathrm{dist}\big(t(x), \hat{t}(x)\big)$$

- where dist(·, ·) is a distance between two transformed images.
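A Monte Carlo version of this loss can be sketched as below: apply the true and the estimated transformation to a batch of sampled images and average the distances. The Euclidean distance, the toy transformations (a horizontal flip and the identity), and the names `nonparametric_loss` and `l2_dist` are assumptions for illustration.

```python
import numpy as np


def l2_dist(a, b):
    """Euclidean distance between two images (one possible choice of dist)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))


def nonparametric_loss(t, t_hat, images, dist):
    """Average dist(t(x), t_hat(x)) over a batch of sampled images x."""
    return float(np.mean([dist(t(x), t_hat(x)) for x in images]))


def flip(image):
    """Toy transformation: horizontal flip."""
    return image[:, ::-1]


def identity(image):
    """Toy transformation: leave the image unchanged."""
    return image


rng = np.random.default_rng(0)
images = rng.random((16, 8, 8))

loss_match = nonparametric_loss(flip, flip, images, l2_dist)         # estimate equals truth
loss_mismatch = nonparametric_loss(flip, identity, images, l2_dist)  # estimate is wrong
```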

# 2. Results

## 2.1. CIFAR-10

- The network is **NIN**.
- AET has two NIN branches, sharing the same network weights.
- The output features of the fourth block of the two branches are concatenated and average-pooled to form a 384-d feature vector. An output layer then follows to predict the parameters of the input transformation.
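The concatenate-then-pool step can be illustrated with array shapes alone. In this sketch the per-branch channel count (192), the 8×8 spatial size, and the output size of eight parameters are assumptions chosen so that the concatenated vector is 384-d as stated above; random arrays stand in for real NIN features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: 192 channels per branch on an 8x8 map, eight output parameters.
BATCH, CHANNELS, H, W, N_PARAMS = 4, 192, 8, 8, 8

# Stand-ins for the last-block feature maps of the two weight-sharing branches.
feat_x = rng.random((BATCH, CHANNELS, H, W))   # original image branch
feat_tx = rng.random((BATCH, CHANNELS, H, W))  # transformed image branch

# Concatenate along channels, then global-average-pool over the spatial dims.
joint = np.concatenate([feat_x, feat_tx], axis=1)  # (BATCH, 384, H, W)
pooled = joint.mean(axis=(2, 3))                   # (BATCH, 384)

# Linear output layer predicting the transformation parameters.
W_out = rng.normal(scale=0.01, size=(2 * CHANNELS, N_PARAMS))
b_out = np.zeros(N_PARAMS)
t_est = pooled @ W_out + b_out                     # (BATCH, N_PARAMS)
```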

- A classifier is built on top of the second convolutional block. The first two blocks are frozen.
- **FC**: A non-linear classifier with **3 Fully-Connected (FC) layers** is trained.
- **Conv**: A convolutional classifier is built upon the unsupervised features by **adding a third NIN block** whose output feature map is **average-pooled** and connected to a linear **softmax** classifier.

The unsupervised AET-project with the convolutional classifier almost achieves the same error rate as its fully supervised NIN counterpart with four convolutional blocks (7.82% vs. 7.2%), which is remarkable.

**KNN**: A model-free KNN classifier based on the average-pooled output features.

The model-free KNN results suggest that the AET model has an advantage when no labels are available for training classifiers upon the unsupervised features.
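To make "model-free" concrete, here is a minimal KNN classifier over feature vectors: nothing is trained, and each test feature simply takes a majority vote among its nearest labeled neighbors. The Euclidean metric, `k=5`, and the synthetic two-cluster features are assumptions; the review does not specify the distance or the value of K.

```python
import numpy as np


def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Majority vote among the k nearest training features (Euclidean distance)."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_labels[nearest]
    n_classes = int(train_labels.max()) + 1
    return np.array([np.bincount(v, minlength=n_classes).argmax() for v in votes])


# Synthetic stand-ins for average-pooled features: two well-separated clusters.
rng = np.random.default_rng(0)
train_feats = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                         rng.normal(1.0, 0.1, (20, 4))])
train_labels = np.array([0] * 20 + [1] * 20)
test_feats = np.array([[0.0, 0.0, 0.0, 0.0],
                       [1.0, 1.0, 1.0, 1.0]])

pred = knn_predict(train_feats, train_labels, test_feats)
```

Because no classifier weights are fitted, the accuracy of such a vote reflects the quality of the feature space itself.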

Compared with RotNet, the results show that AET-project consistently achieves the smallest errors no matter which classifier is used.

## 2.2. ImageNet

- Two AlexNet branches with shared parameters are used.
- The **4,096-d output features** from the second-to-last fully connected layer in the two branches are **concatenated** and fed into the output layer, **producing eight projective transformation parameters**.
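Why eight parameters? A projective transformation is a 3×3 matrix defined up to scale, so fixing the bottom-right entry to 1 leaves eight free values, and four corner correspondences determine them. The sketch below solves for those eight values from randomly perturbed corners of a unit square. It covers only the corner-translation step (the random scaling and rotation are omitted), and the function name `homography_from_corners` is hypothetical.

```python
import numpy as np


def homography_from_corners(src, dst):
    """Solve for the 3x3 homography H (with H[2, 2] = 1) that maps src to dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = np.linalg.solve(np.array(rows), np.array(rhs))
    return np.append(h, 1.0).reshape(3, 3)


rng = np.random.default_rng(0)

# Corners of a unit-size image, each shifted by up to ±0.125 of its side length.
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dst = src + rng.uniform(-0.125, 0.125, size=(4, 2))

H = homography_from_corners(src, dst)
params = H.flatten()[:8]  # the eight values a decoder would regress
```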

The results show that AET outperforms BiGAN by a significant margin, suggesting its advantage over the GAN and AED paradigms, at least in this experimental setting.

A 1000-way linear classifier is trained on top of different numbers of convolutional layers. Again, AET obtains the best accuracy among all the compared unsupervised models.

## 2.3. Places

AET models outperform the other unsupervised models in most cases, except on Conv1 and Conv2, where Counting performs slightly better.

## 2.4. Analysis

**The trend of transformation prediction loss** (i.e. the AET loss being minimized to train the model) **is well aligned with that of classification error and Top-1 accuracy** on CIFAR-10 and ImageNet.

This suggests that better prediction of transformations is a good surrogate for better classification results using the learned features.

The above examples show how well the model can decode the transformations from the encoded image features.

## Reference

[2019 CVPR] [AET]

AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data
