# Tutorial: Sinkhorn Distance

## Sinkhorn Distance for Optimal Transport

References:

1. Sinkhorn Distances: Lightspeed Computation of Optimal Transport, by Kyoto University, 2013 NIPS, Over 3100 Citations.
2. POT: Python Optimal Transport, by Various Researchers, 2021 JMLR, Over 480 Citations. (Sik-Ho Tsang @ Medium)
3. Optimal Transport and the Sinkhorn Transformer: https://www.pragmatic.ml/sparse-sinkhorn-attention/
4. Hands-on guide to Python Optimal Transport toolbox: Part 1: https://towardsdatascience.com/hands-on-guide-to-python-optimal-transport-toolbox-part-1-922a2e82e621

- In this story, I would like to give **an example** instead of a mathematical explanation.

## Outline

- **Optimal Transport Distances**
- **Illustrative Example**
- **Earth Mover's Distance (EMD)**
- **Sinkhorn Distance With Entropic Regularization**

# 1. Optimal Transport Distances

- This section is mainly from **2013 NIPS: Sinkhorn Distances: Lightspeed Computation of Optimal Transport**.

## 1.1. Transport Polytope and Interpretation as a Set of Joint Probabilities

**Optimal Transport Distances** are a fundamental family of distances for probability measures and histograms of features.

- They have excellent performance in **retrieval tasks** and an **intuitive formulation**, but their **computation** involves the resolution of a linear program whose cost can **quickly become prohibitive**.
- For two marginals *r* and *c*, let *U*(*r*, *c*) be the **transport polytope of** *r* and *c*, namely the **polyhedral set of** *d*×*d* matrices:

U(r, c) = { P ∈ ℝ₊^(d×d) | P 1_d = r, Pᵀ 1_d = c }

i.e., *U*(*r*, *c*) contains all non-negative *d*×*d* matrices with row and column sums *r* and *c* respectively.

- The set *U*(*r*, *c*) contains all possible joint probabilities of (*X*, *Y*). Indeed, **any matrix** *P* ∈ *U*(*r*, *c*) can be identified with a joint probability for (*X*, *Y*) such that *p*(*X* = *i*, *Y* = *j*) = *p_ij*.
- The **entropy** *h* of *P* and of **a marginal** *r* ∈ Σ_d, and the **Kullback-Leibler (KL)** divergence of *P*, *Q* ∈ *U*(*r*, *c*), are defined as:

h(r) = −Σᵢ rᵢ log rᵢ,  h(P) = −Σᵢⱼ p_ij log p_ij,  KL(P‖Q) = Σᵢⱼ p_ij log(p_ij / q_ij)
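To make the definitions concrete, here is a minimal sketch computing the entropy and the KL divergence directly (the matrices `P` and `Q` below are hypothetical values chosen only for illustration):

```python
import numpy as np

# Hypothetical joint probabilities in U(r, c) (illustration only)
P = np.array([[0.2, 0.1],
              [0.1, 0.6]])
Q = np.array([[0.25, 0.05],
              [0.05, 0.65]])
r = P.sum(axis=1)  # row marginal of P

def entropy(T):
    # h(T) = -sum t_ij log t_ij, skipping zero entries
    T = T[T > 0]
    return -np.sum(T * np.log(T))

kl = np.sum(P * np.log(P / Q))  # KL(P || Q)
print(entropy(P), entropy(r), kl)
```

KL(P‖Q) is non-negative and equals zero only when P = Q.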

## 1.2. Optimal Transport Distance Between *r* and *c*

As mentioned, any matrix *P* ∈ *U*(*r*, *c*) is a feasible solution. But of course, we want a *P* such that the lowest/minimum cost is obtained.

- Given a *d*×*d* cost matrix *M*, **the cost of mapping** *r* to *c* using a transport matrix (or joint probability) *P* is the Frobenius dot-product <*P*, *M*> = Σᵢⱼ p_ij m_ij.

The problem

d_M(r, c) = min over P ∈ U(r, c) of <P, M>

is called an optimal transport (OT) problem between *r* and *c* given cost *M*. An optimal table *P\** can be obtained for this problem.
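The OT problem above is just a linear program, so tiny instances can be solved directly. A minimal sketch with SciPy's `linprog` (the marginals `r`, `c` and cost `M` here are hypothetical numbers for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2x2 instance (hypothetical numbers)
r = np.array([0.5, 0.5])          # row marginal
c = np.array([0.25, 0.75])        # column marginal
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # cost of moving mass i -> j
d = len(r)

# Flatten P row-major; equality constraints encode P @ 1 = r and P.T @ 1 = c
A_eq = np.zeros((2 * d, d * d))
for i in range(d):
    A_eq[i, i * d:(i + 1) * d] = 1.0      # i-th row sum
for j in range(d):
    A_eq[d + j, j::d] = 1.0               # j-th column sum
b_eq = np.concatenate([r, c])

res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
P_star = res.x.reshape(d, d)   # optimal transport table P*
print(P_star)
print(res.fun)                 # optimal cost <P*, M>
```

For this instance the optimal plan keeps mass in place where possible and ships the leftover 0.25 from row 1 to column 2, for a total cost of 0.25.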

# 2. Illustrative Example

- This section is from the tutorial websites **Optimal Transport and the Sinkhorn Transformer** and **Hands-on guide to Python Optimal Transport toolbox: Part 1**, and from **2021 JMLR: POT: Python Optimal Transport**.

Let's say each factory is capable of different rates of production for a product, and each region has different demand. Each factory / distribution-center pair has a unique transport cost.

This problem is a prime example of a transport problem!

- Thus, we can encode the production rates, the demands, and the transport costs as:

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])

M = np.array([[0, 1, 2, 3, 4],
              [1, 0, 1, 2, 3],
              [2, 1, 0, 1, 2],
              [3, 2, 1, 0, 1],
              [4, 3, 2, 1, 0]])
```

# 3. **Earth Mover's Distance (EMD)**

- **Earth Mover's Distance (EMD)** can be used to solve the problem.
- In **POT: Python Optimal Transport**, they provide the Python library function *ot.emd* to do that:

```python
import ot

gamma_emd = ot.emd(factory, center, M)
```

- A **sparse matrix with exact values** is obtained.
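Since the cost here is |i − j| on a line, there is also a handy closed form to cross-check the EMD cost without solving the linear program: for 1-D histograms, the EMD equals the L1 distance between the cumulative distributions. A small sketch:

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])

# For 1-D histograms with cost |i - j|, EMD = sum of |CDF differences|
emd_cost = np.abs(np.cumsum(factory) - np.cumsum(center)).sum()
print(emd_cost)  # ≈ 0.95
```

This should agree with `np.sum(gamma_emd * M)` computed from the plan above.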

One problem of EMD is its computational complexity, which is O(n³ log n). When the problem is small-scale as above, it is still fine; when the problem is large-scale, EMD is computationally expensive.

**Sinkhorn Distance With Entropic Regularization** can be used here for fast computation.

# 4. **Sinkhorn Distance With Entropic Regularization**

- When the Sinkhorn Distance is used, an **entropic regularization term** is added: the objective becomes min over P ∈ U(r, c) of <P, M> + reg · Σᵢⱼ p_ij log p_ij (i.e., <P, M> − reg · h(P)).
- In **POT: Python Optimal Transport**, they provide the Python library function *ot.sinkhorn*:

```python
gamma_sinkhorn = ot.sinkhorn(factory, center, reg=0.1, M=M/M.max())
```

- We can change the value of *reg* to other values, say **0.5**:

- Thus, the matrix is not sparse; every factory sends the product to different distribution centers in different portions.
- If we sum all items, they are **summed to 1**:

`gamma_sinkhorn.sum()`
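To see the effect of `reg` without calling POT, here is a standalone sketch of the Sinkhorn iterations in plain NumPy (the function names `sinkhorn_plan` and `plan_entropy` are mine, not from the library):

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
M = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]).astype(float)

def sinkhorn_plan(a, b, reg, M, nit=500):
    K = np.exp(-M / M.max() / reg)   # Gibbs kernel; larger reg -> more uniform
    u = np.ones(len(a))
    for _ in range(nit):
        v = b / (K.T @ u)            # rescale columns toward b
        u = a / (K @ v)              # rescale rows toward a
    return u[:, None] * K * v[None, :]

def plan_entropy(P):
    return -np.sum(P * np.log(P + 1e-300))

P_small = sinkhorn_plan(factory, center, reg=0.1, M=M)
P_large = sinkhorn_plan(factory, center, reg=0.5, M=M)
print(P_small.sum())                                  # total mass stays 1
print(plan_entropy(P_small), plan_entropy(P_large))   # larger reg -> higher entropy
```

The higher-entropy plan for `reg=0.5` is exactly the denser, less sparse matrix described above.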

- We can also estimate it **on our own**:

```python
def own_sinkhorn(a, b, reg, M):
    # Gibbs kernel from the normalized cost matrix
    K = np.exp(-M / M.max() / reg)
    nit = 100
    u = np.ones((len(a),))
    for i in range(1, nit):
        v = b / np.dot(K.T, u)   # rescale toward the column marginals b
        u = a / np.dot(K, v)     # rescale toward the row marginals a
    # Equivalent to np.dot(np.diag(u), np.dot(K, np.diag(v)))
    return np.atleast_2d(u).T * (K * v.T)

own = own_sinkhorn(factory, center, reg=0.1, M=M)
print(own)
```

- The **same result** is obtained as above.
- Indeed, what the loop is doing is **alternately rescaling the rows and the columns** of *K* each time, so that they match the target marginals.
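Since the loop alternately rescales rows and columns, a natural sanity check is that the resulting plan's marginals match `factory` and `center`. A standalone sketch repeating the same updates (with `reg=0.5`, for which 100 iterations are comfortably enough):

```python
import numpy as np

factory = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
center = np.array([0.05, 0.05, 0.2, 0.3, 0.4])
M = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]).astype(float)

K = np.exp(-M / M.max() / 0.5)   # Gibbs kernel, reg = 0.5
u = np.ones(5)
for _ in range(100):
    v = center / (K.T @ u)
    u = factory / (K @ v)
P = u[:, None] * K * v[None, :]  # diag(u) K diag(v)

row_err = np.abs(P.sum(axis=1) - factory).max()  # exact after the final u-update
col_err = np.abs(P.sum(axis=0) - center).max()   # shrinks as the iterations converge
print(row_err, col_err)
```

Because the last update rescales the rows, the row marginals match to machine precision; the column marginals converge as the iteration count grows.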

- There are still many usages of it in different areas, such as computer vision or NLP.