Tutorial: Sinkhorn Distance

Sinkhorn Distance for Optimal Transport

Sik-Ho Tsang
5 min read · Jun 26, 2023

References:
1. Sinkhorn Distances: Lightspeed Computation of Optimal Transport, by Kyoto University, 2013 NIPS, Over 3100 Citations.
2. POT: Python Optimal Transport, by Various Researchers, 2021 JMLR, Over 480 Citations.
3. Optimal Transport and the Sinkhorn Transformer, https://www.pragmatic.ml/sparse-sinkhorn-attention/.
4. Hands-on guide to Python Optimal Transport toolbox: Part 1, https://towardsdatascience.com/hands-on-guide-to-python-optimal-transport-toolbox-part-1-922a2e82e621.

  • In this story, I would like to give a worked example instead of a mathematical explanation.

Outline

  1. Optimal Transport Distances
  2. Illustrative Example
  3. Earth Mover's Distance (EMD)
  4. Sinkhorn Distance With Entropic Regularization

1. Optimal Transport Distances

Transport polytope U(r, c) (Please read the paper directly for more mathematics.)

1.1. Transport Polytope and Interpretation as a Set of Joint Probabilities

Optimal Transport Distances are a fundamental family of distances for probability measures and histograms of features.

  • They perform very well in retrieval tasks and have an intuitive formulation, but computing them requires solving a linear program whose cost can quickly become prohibitive.
  • Let U(r, c) denote the transport polytope of r and c, namely the polyhedral set of d×d matrices:

U(r, c) contains all non-negative d×d matrices with row and column sums r and c respectively.
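
For reference, the definition that the sentence above paraphrases can be written out as follows. The notation is taken from the Sinkhorn Distances paper as far as I can tell, with 1_d standing for the d-dimensional vector of ones (an assumption here, since the original equation image is not reproduced):

```latex
U(r, c) := \left\{ P \in \mathbb{R}_{+}^{d \times d} \;\middle|\; P \mathbf{1}_d = r, \; P^{\top} \mathbf{1}_d = c \right\}
```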

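  • The set U(r, c) contains all possible joint probabilities of (X, Y), where X and Y are random variables taking values in {1, ..., d} with distributions r and c. Indeed, any matrix P ∈ U(r, c) can be identified with a joint probability for (X, Y) such that p(X = i, Y = j) = pij.
  • The entropy h and the Kullback-Leibler (KL) divergence are defined for P, Q ∈ U(r, c) and for a marginal r ∈ Σd, where Σd is the probability simplex in d dimensions.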
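
The formulas behind this bullet are the standard definitions of entropy and KL divergence. They are written here in the pij / qij notation used above as a reconstruction, so please check the paper for the exact form:

```latex
h(r) = -\sum_{i=1}^{d} r_i \log r_i, \qquad
h(P) = -\sum_{i,j=1}^{d} p_{ij} \log p_{ij}, \qquad
\mathrm{KL}(P \,\|\, Q) = \sum_{i,j=1}^{d} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```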

1.2. Optimal Transport Distance Between r and c

As mentioned, any matrix P ∈ U(r, c) is a feasible transport plan. But of course, we want the P that achieves the lowest cost.

  • Given a d×d cost matrix M, the cost of mapping r to c using a transport matrix (or joint probability) P can be quantified as the Frobenius dot product <P, M> = Σij pij mij.

Minimizing <P, M> over all P ∈ U(r, c) is called the optimal transport (OT) problem between r and c given cost M. Solving it yields an optimal table P*.
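
To make the idea of "many feasible tables, different costs" concrete, here is a minimal sketch. The 3-bin vectors r and c, the cost |i - j|, and the helper names in_polytope and transport_cost are all hypothetical, chosen only for illustration and not taken from the paper or from POT.

```python
import numpy as np

# Hypothetical 3-bin example (not from the article)
r = np.array([0.5, 0.3, 0.2])
c = np.array([0.2, 0.3, 0.5])
M = np.abs(np.subtract.outer(np.arange(3), np.arange(3)))  # cost |i - j|


def in_polytope(P, r, c, tol=1e-12):
    """Check that P is non-negative with row sums r and column sums c."""
    return bool(
        (P >= 0).all()
        and np.allclose(P.sum(axis=1), r, atol=tol)
        and np.allclose(P.sum(axis=0), c, atol=tol)
    )


def transport_cost(P, M):
    """Frobenius dot product <P, M>."""
    return float(np.sum(P * M))


# Two feasible tables: the independence table r c^T and a hand-built one
P_indep = np.outer(r, c)
P_better = np.array([[0.2, 0.3, 0.0],
                     [0.0, 0.0, 0.3],
                     [0.0, 0.0, 0.2]])

for name, P in [("independent", P_indep), ("hand-built", P_better)]:
    print(name, in_polytope(P, r, c), round(transport_cost(P, M), 6))
```

Both tables lie in U(r, c), yet the independence table costs 1.0 while the hand-built one costs 0.6, which is why the choice of P matters.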

2. Illustrative Example

Illustrative Example

Let's say each factory has a different production rate for a product, and each region has a different demand. Each factory / distribution center pair has its own transport cost.

This problem is a prime example of a transport problem!

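  • Thus, based on the array on the right-hand side, we can obtain:
import numpy as np

# Production share of each factory and demand share of each distribution center
factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
# M[i, j] is the cost of shipping from factory i to distribution center j
M = np.array([[0, 1, 2, 3, 4],
              [1, 0, 1, 2, 3],
              [2, 1, 0, 1, 2],
              [3, 2, 1, 0, 1],
              [4, 3, 2, 1, 0]])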
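
A small check on these inputs may help before solving anything. The sketch below assumes, as the numbers above suggest, that the cost between factory i and center j is simply |i - j|; the names positions and M_built are introduced here for illustration only.

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
M_literal = np.array([[0, 1, 2, 3, 4],
                      [1, 0, 1, 2, 3],
                      [2, 1, 0, 1, 2],
                      [3, 2, 1, 0, 1],
                      [4, 3, 2, 1, 0]])

# Rebuild the cost matrix as the distance between positions on a line
positions = np.arange(5)
M_built = np.abs(np.subtract.outer(positions, positions))

print(np.array_equal(M_built, M_literal))  # same matrix?
print(factory.sum(), center.sum())         # both should be (numerically) 1
```

Both vectors need to carry the same total mass, otherwise the transport polytope is empty.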

3. Earth Mover's Distance (EMD)

Earth Mover's Distance (EMD) (From https://pythonot.github.io/ by POT: Python Optimal Transport)
  • Earth Mover's Distance (EMD) can be used to solve the problem exactly.
  • POT: Python Optimal Transport provides a library function, ot.emd, to do that:
import ot

gamma_emd = ot.emd(factory, center, M)
Results
  • A sparse matrix with exact values is obtained.
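
As a side check on the exact solution, one can use a property of one-dimensional transport: when the cost is |i - j| on a line (which holds for the M above), the optimal cost equals the sum of absolute differences between the two cumulative distributions. This shortcut is not part of the article's sources, so treat the sketch below as an independent sanity check rather than as how ot.emd works.

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])

# Gap between the two cumulative distributions at each position
cdf_gap = np.cumsum(factory) - np.cumsum(center)

# For cost |i - j| on a line, the optimal transport cost is the summed gap
emd_cost_1d = float(np.abs(cdf_gap).sum())
print(emd_cost_1d)
```

If POT is installed, np.sum(gamma_emd * M) should agree with this value (about 0.95) up to floating-point error.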

One problem of EMD is its computational complexity, which is O(n³log(n)). For a small-scale problem like the one above, this is still fine. For large-scale problems, EMD becomes computationally expensive.

Sinkhorn Distance With Entropic Regularization can be used here for fast computation.

4. Sinkhorn Distance With Entropic Regularization

Sinkhorn Distance With Entropic Regularization (From https://pythonot.github.io/ by POT: Python Optimal Transport)
  • When Sinkhorn Distance is used, we can see that an entropic regularization term is added to the objective.
  • POT: Python Optimal Transport provides a library function, ot.sinkhorn:
gamma_sinkhorn = ot.sinkhorn(factory, center, reg=0.1, M=M/M.max())
Results
  • We can change reg to other values, say 0.5:
Results
  • Thus, the matrix is no longer sparse: every factory ships the product to several distribution centers in different proportions.
  • If we sum all entries, they sum to 1:
gamma_sinkhorn.sum()
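
Summing to 1 is necessary but not sufficient: a valid plan should also have row sums equal to factory and column sums equal to center. The helper below, marginal_errors, is a hypothetical name introduced for this sketch. It is demonstrated on stand-in matrices so that it runs without POT, and the same call can be applied to gamma_sinkhorn.

```python
import numpy as np


def marginal_errors(P, a, b):
    """Largest absolute deviation of row sums from a and of column sums from b."""
    row_err = float(np.abs(P.sum(axis=1) - a).max())
    col_err = float(np.abs(P.sum(axis=0) - b).max())
    return row_err, col_err


factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])

# Stand-in plan with the right marginals (independence table)
P_ok = np.outer(factory, center)
# Stand-in matrix that sums to 1 but has the wrong marginals
P_bad = np.full((5, 5), 1.0 / 25)

print(P_ok.sum(), marginal_errors(P_ok, factory, center))
print(P_bad.sum(), marginal_errors(P_bad, factory, center))
```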
  • We can also implement it on our own:
def own_sinkhorn(a, b, reg, M):
    # Kernel built from the cost matrix, normalized by its largest entry
    K = np.exp(-M / M.max() / reg)
    nit = 100
    u = np.ones((len(a), ))

    for _ in range(nit):
        # Rescale so the column sums match b, then so the row sums match a
        v = b / np.dot(K.T, u)
        u = a / np.dot(K, v)

    # Equivalent to np.dot(np.diag(u), np.dot(K, np.diag(v)))
    return np.atleast_2d(u).T * (K * v)

own = own_sinkhorn(factory, center, reg=0.1, M=M)
print(own)
Results
  • The same result is obtained as above.
  • Indeed, what the for loop does is alternately rescale the columns and the rows of K, so that the column sums match b and the row sums match a.
  • Sinkhorn distance has many other uses in different areas, such as computer vision or NLP.
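
To see the role of reg without relying on a fixed number of iterations, here is a sketch of the same scaling loop with a stopping rule, run for reg = 0.1 and reg = 0.5. The function names sinkhorn_plan and entropy, the tolerance, and the iteration cap are my own choices for illustration. This is not POT's implementation.

```python
import numpy as np


def sinkhorn_plan(a, b, M, reg, tol=1e-9, max_iter=10_000):
    """Sinkhorn scaling that stops once the column marginals are within tol."""
    K = np.exp(-M / M.max() / reg)
    u = np.ones_like(a)
    for _ in range(max_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
        P = u[:, None] * K * v[None, :]
        # Row sums match a right after the u-update, so only columns are checked
        if np.abs(P.sum(axis=0) - b).max() < tol:
            break
    return P


def entropy(P):
    """Entropy h(P) of a strictly positive table."""
    return float(-np.sum(P * np.log(P)))


factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
M = np.abs(np.subtract.outer(np.arange(5), np.arange(5)))  # same M as above

summary = {}
for reg in (0.1, 0.5):
    P = sinkhorn_plan(factory, center, M, reg)
    col_err = float(np.abs(P.sum(axis=0) - center).max())
    summary[reg] = {"cost": float(np.sum(P * M)),
                    "entropy": entropy(P),
                    "col_err": col_err}
    print(reg, summary[reg])
```

A larger reg should give a more spread-out plan (higher entropy) at a higher transport cost, while a smaller reg moves the plan closer to the exact EMD solution.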
