Review — t-SNE: Visualizing Data using t-SNE (Data Visualization)

Visualizing High-Dimensional Data in Low-Dimensional Space

t-SNE on MNIST Data set (From Google TechTalk by the First Author, Laurens van der Maaten, https://www.youtube.com/watch?v=RJVL80Gg3lA)
  • t-SNE reduces the crowding problem compared to SNE.
  • t-SNE has been used in various fields for data visualization.

Outline

  1. Dimensionality Reduction for Data Visualization
  2. Brief Review of SNE
  3. t-Distributed Stochastic Neighbor Embedding (t-SNE)
  4. Experimental Results
  5. More Results from the Google TechTalk

1. Dimensionality Reduction for Data Visualization

  • Suppose we have a high-dimensional data set X = {x1, x2, …, xn}, and we want to reduce it to a two- or three-dimensional data set Y = {y1, y2, …, yn} that can be displayed in a scatterplot.
  • In the paper, the low-dimensional data representation Y is referred to as a map, and the low-dimensional representations yi of individual datapoints are referred to as map points.
  • t-SNE is proposed for mapping the high-dimensional data set X to low-dimensional data representation Y, while preserving the high-dimensional data structure.
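As a quick illustration, below is a minimal sketch of this mapping using scikit-learn's TSNE (a library implementation, not the paper's original code); the random data and parameter values are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Toy high-dimensional data set X = {x1, ..., xn}; purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))  # n = 500 points in 50 dimensions

# Map X to a two-dimensional representation Y (the "map").
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)

# Display the map points y_i in a scatterplot.
plt.scatter(Y[:, 0], Y[:, 1], s=5)
plt.show()
```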

2. Brief Review of SNE

2.1. SNE

  • Stochastic Neighbor Embedding (SNE) starts by converting the high-dimensional Euclidean distances between datapoints into conditional probabilities that represent similarities.
  • Mathematically, the conditional probability p_{j|i} is:

p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}

  • The similarity of map point yj to map point yi is modelled by:

q_{j|i} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq i} \exp(-\lVert y_i - y_k \rVert^2)}

  • SNE minimizes the sum of Kullback-Leibler divergences over all datapoints using a gradient descent method. The cost function C is given by:

C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}
  • In particular, there is a large cost for using widely separated map points to represent nearby datapoints (i.e., for using a small qj|i to model a large pj|i), but there is only a small cost for using nearby map points to represent widely separated datapoints.
  • In other words, the SNE cost function focuses on retaining the local structure of the data in the map.
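To make the definitions concrete, here is a small NumPy sketch (an illustrative helper, not code from the paper) that computes the conditional probabilities p_{j|i} for a single datapoint, given a bandwidth σi:

```python
import numpy as np

def conditional_probs(X, i, sigma_i):
    """Return p_{j|i} for all j, given datapoints X of shape (n, d)."""
    # Squared Euclidean distances from x_i to every point.
    d2 = np.sum((X - X[i]) ** 2, axis=1)
    # Unnormalized Gaussian similarities centered on x_i.
    p = np.exp(-d2 / (2.0 * sigma_i ** 2))
    p[i] = 0.0  # a point is never its own neighbor: p_{i|i} = 0
    return p / p.sum()  # normalize so that sum_j p_{j|i} = 1
```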

2.2. Selection of σi

  • In dense regions, a smaller value of σi is usually more appropriate than in sparser regions. SNE performs a binary search for the value of σi that produces a Pi with a fixed perplexity that is specified by the user. The perplexity is defined as:

\mathrm{Perp}(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}

where H(Pi) is the Shannon entropy of Pi measured in bits.
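The binary search itself is straightforward. A minimal sketch, reusing the conditional_probs helper above (the search bounds and tolerance are illustrative assumptions):

```python
import numpy as np

def find_sigma(X, i, target_perplexity, tol=1e-4, max_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^H(P_i) matches the target."""
    lo, hi = 1e-3, 1e3  # illustrative search bounds
    for _ in range(max_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probs(X, i, sigma)
        h = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # entropy H(P_i) in bits
        perp = 2.0 ** h
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat: shrink the bandwidth
        else:
            lo = sigma  # distribution too peaked: widen the bandwidth
    return sigma
```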

2.3. Gradient Descent

  • The minimization of the cost function C is performed using a gradient descent method:

\frac{\partial C}{\partial y_i} = 2 \sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)

  • The gradient can be interpreted physically: a spring between yi and yj repels or attracts the map points depending on whether the distance between them in the map is too small or too large to represent the similarity of the corresponding datapoints.
  • The force exerted by the spring between yi and yj is proportional to its length, and also proportional to its stiffness, which is the mismatch (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) between the pairwise similarities of the data points and the map points.
  • The gradient descent is initialized by sampling map points randomly from an isotropic Gaussian with small variance that is centered around the origin.
  • To avoid poor local minima, a relatively large momentum term is added to the gradient. The gradient update with a momentum term is given by:

\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} + \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t) \left( \mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)} \right)

where \mathcal{Y}^{(t)} indicates the solution at iteration t, η the learning rate, and α(t) the momentum at iteration t.
  • (More details can be found in the paper.)
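A compact sketch of this update rule (the learning rate and momentum schedule follow the values reported in the paper; grad_fn is a placeholder for a function computing ∂C/∂Y):

```python
import numpy as np

def gradient_descent(grad_fn, Y0, eta=100.0, n_iter=1000):
    """Momentum gradient descent on the map points Y."""
    Y_prev2 = Y0.copy()  # Y^(t-2)
    Y_prev = Y0.copy()   # Y^(t-1)
    for t in range(n_iter):
        alpha = 0.5 if t < 250 else 0.8  # momentum schedule from the paper
        # Descent step plus momentum term alpha(t) * (Y^(t-1) - Y^(t-2)).
        Y = Y_prev - eta * grad_fn(Y_prev) + alpha * (Y_prev - Y_prev2)
        Y_prev2, Y_prev = Y_prev, Y
    return Y_prev
```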

3. t-Distributed Stochastic Neighbor Embedding (t-SNE)

  • So, what’s new in t-SNE? t-SNE alleviates the “crowding problem” that arises in the original SNE.
  • t-SNE is capable of capturing much of the local structure of the high-dimensional data very well, while also revealing global structure such as the presence of clusters at several scales.
  • The cost function used by t-SNE differs from the one used by SNE in two ways:
  1. It uses a symmetrized version of the SNE cost function with simpler gradients (see Section 3.1).
  2. It uses a Student-t distribution rather than a Gaussian to compute the similarity between two points in the low-dimensional space. This heavy-tailed distribution in the low-dimensional space alleviates both the crowding problem and the optimization problems of SNE (see Section 3.2).

3.1. Symmetric SNE

  • Symmetric SNE minimizes a single Kullback-Leibler divergence between a joint probability distribution, P, in the high-dimensional space and a joint probability distribution, Q, in the low-dimensional space:

C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

  • The pairwise similarities in the low-dimensional map are given by q_{ij} = \frac{\exp(-\lVert y_i - y_j \rVert^2)}{\sum_{k \neq l} \exp(-\lVert y_k - y_l \rVert^2)}.
  • The pairwise similarities in the high-dimensional space are obtained by symmetrizing the conditional probabilities, p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, which ensures that every datapoint, even an outlier, makes a significant contribution to the cost function.
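A small sketch of how the joint distribution P is assembled from the conditional probabilities (reusing the illustrative helpers defined above):

```python
import numpy as np

def joint_probs(X, target_perplexity=30.0):
    """Build the symmetric joint distribution p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = X.shape[0]
    P_cond = np.zeros((n, n))
    for i in range(n):
        sigma_i = find_sigma(X, i, target_perplexity)
        P_cond[i] = conditional_probs(X, i, sigma_i)
    return (P_cond + P_cond.T) / (2.0 * n)  # symmetrize; entries sum to 1
```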

3.2. The Crowding Problem

  • For instance, in ten dimensions it is possible to have 11 datapoints that are mutually equidistant, and there is no way to model this faithfully in a two-dimensional map.
  • So if the datapoints are approximately uniformly distributed in the region around point i on the ten-dimensional manifold, and we try to model the distances from i to the other datapoints in the two-dimensional map, we run into the following “crowding problem”:
  • The area of the two-dimensional map that is available to accommodate moderately distant datapoints is not nearly large enough compared with the area available to accommodate nearby datapoints.
  • To solve this, in the low-dimensional map, we can use a probability distribution that has much heavier tails than a Gaussian to convert distances into probabilities.
  • This allows a moderate distance in the high-dimensional space to be faithfully modeled by a much larger distance in the map and, as a result, it eliminates the unwanted attractive forces between map points that represent moderately dissimilar datapoints.
  • In t-SNE, a Student t-distribution with one degree of freedom (which is the same as a Cauchy distribution) is employed as the heavy-tailed distribution in the low-dimensional map.
  • Using this distribution, the joint probabilities qij are defined as:

q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}

  • The gradient of the Kullback-Leibler divergence between P and Q is then:

\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + \lVert y_i - y_j \rVert^2)^{-1}

  • Each term of this sum acts as a repelling or attracting force between yi and yj, depending on the sign of the mismatch p_{ij} - q_{ij}; the total force on a map point is the sum of these terms over all other points.
  • (There are still many details not yet mentioned here. If interested, please feel free to read the paper.)
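Putting the pieces together, here is a minimal NumPy sketch of the Student-t similarities q_{ij} and the resulting gradient (an illustrative implementation, not the authors' code). Combined with joint_probs and gradient_descent above, this is enough to run a toy t-SNE end to end.

```python
import numpy as np

def tsne_grad(P, Y):
    """Gradient of KL(P || Q) w.r.t. the map points Y of shape (n, 2)."""
    # Pairwise squared distances between map points.
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + d2)  # heavy-tailed Student-t (Cauchy) kernel
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()  # joint probabilities q_ij
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) * (1 + ||y_i - y_j||^2)^(-1)
    PQ = (P - Q) * inv
    return 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
```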

4. Experimental Results

  • In all experiments, PCA is used to reduce the dimensionality of the data to 30 first. This speeds up the computation of pairwise distances between the datapoints and suppresses some noise without severely distorting the interpoint distances.
  • Then, the 30-dimensional representation is reduced to a two-dimensional map using t-SNE, and the resulting map is shown as a scatterplot.
Visualizations of the MNIST data set using t-SNE
Visualizations of the Olivetti faces data set
Visualizations of the COIL-20 data set
  • (Again, large portions of the experimental discussion are not shown here; please feel free to read the paper if interested.)
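A sketch of this PCA-then-t-SNE pipeline with scikit-learn (the digits data set is a small stand-in for MNIST; all parameters are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small stand-in for MNIST: 8x8 digit images, 64 dimensions.
X, labels = load_digits(return_X_y=True)

# Step 1: PCA down to 30 dimensions, as done in the paper's experiments.
X30 = PCA(n_components=30).fit_transform(X)

# Step 2: t-SNE from 30 dimensions to a two-dimensional map.
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X30)

plt.scatter(Y[:, 0], Y[:, 1], c=labels, s=5, cmap="tab10")
plt.show()
```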

5. More Results from the Google TechTalk

  • In 2013, the first author, Laurens van der Maaten, presented t-SNE in a Google TechTalk.
  • (It is amazing that t-SNE was published in 2008 and presented at a Google TechTalk in 2013, yet it is still widely used today!)
t-SNE on CIFAR10
t-SNE on SVHN
t-SNE on Other Image Dataset

References

[2008 JMLR] [t-SNE]
Visualizing Data using t-SNE
