Brief Review — Visualizing the Loss Landscape of Neural Nets

Loss Landscape Visualization, Looking for Sharpness

The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.

Visualizing the Loss Landscape of Neural Nets,
Loss Landscape, by University of Maryland, United States Naval Academy, and Cornell University,
2018 NIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Sharpness, Visualization

  • Filter-Wise Normalization is proposed to calculate the loss landscape for visualization.
  • Based on the visualization, we can see the effects of network depth, network width, skip connections, and so on.

Outline

  1. Filter-Wise Normalization for Loss Landscape
  2. Results

1. Filter-Wise Normalization for Loss Landscape

  • The authors point out that some prior approaches are misleading or not good enough for visualization. (Please kindly read the paper directly if interested.) Thus, they propose a new one here.
  • Assume we have parameters θ* for the network. This θ* is used as the center point in the graph.
  • Two random direction vectors, δ and η, are chosen.
  • Each direction is sampled from a Gaussian distribution and then filter-normalized to match the scale of θ*: d_{i,j} ← (d_{i,j} / ||d_{i,j}||) · ||θ*_{i,j}||, where d_{i,j} represents the jth filter (not the jth weight) of the ith layer of d, and ||·|| denotes the Frobenius norm.
  • It is noted that neural nets can be scale invariant: if a small-parameter network and a large-parameter network are equivalent (because one is simply a rescaling of the other), then any apparent difference in the sharpness of their loss functions is merely an artifact of scale invariance.
  • Also, it is noted that filter normalization is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers: an FC layer is equivalent to a Conv layer with a 1×1 output feature map, and each filter corresponds to the weights that generate one neuron.
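The filter-wise rescaling above can be sketched in a few lines. This is a minimal NumPy sketch, assuming one layer's direction and parameters are stored as arrays with one row per filter; the helper name `filter_normalize` is my own, not from the paper:

```python
import numpy as np

def filter_normalize(direction, theta):
    """Rescale each filter of a random direction so its Frobenius norm
    matches that of the corresponding filter in the trained parameters.

    direction, theta: arrays of shape (num_filters, filter_size) for one layer.
    """
    d = direction.copy()
    for j in range(d.shape[0]):
        norm_d = np.linalg.norm(d[j])          # ||d_{i,j}||
        norm_theta = np.linalg.norm(theta[j])  # ||theta*_{i,j}||
        d[j] = d[j] / (norm_d + 1e-10) * norm_theta
    return d

# Example: one Conv layer with 4 filters of 3*3*3 = 27 weights each
rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 27))   # trained parameters theta* (toy values)
delta = rng.normal(size=(4, 27))   # random Gaussian direction
delta = filter_normalize(delta, theta)
```

After this step, each filter of the direction has the same norm as the matching filter of θ*, so moving a unit distance along δ perturbs every filter by a comparable relative amount regardless of its scale.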

2. Results

2.1. Sharpness Correlates with Generalization Error

The 1D and 2D visualizations of solutions obtained using SGD with different weight decay and batch sizes. The title of each subfigure contains the weight decay, batch size, and test error.

Sharpness describes whether a minimum has a flat (e and g) or sharp (f and h) shape. Although all runs reach minima, the sharpness tells us how well the neural network generalizes, as described below.

  • The above figure shows the differences in sharpness between small batch and large batch minima.

By using the proposed method, we can make side-by-side comparisons between minimizers, and we see that now sharpness correlates well with generalization error. Large batches produced visually sharper minima with higher test error.
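The surface underlying these plots is f(α, β) = L(θ* + α·δ + β·η), evaluated on a grid of (α, β) values. A minimal sketch with a toy quadratic loss; the function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def loss_surface(loss_fn, theta, delta, eta, alphas, betas):
    """Evaluate L(theta + a*delta + b*eta) on a 2D grid of (a, b)."""
    surface = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            surface[i, j] = loss_fn(theta + a * delta + b * eta)
    return surface

# Toy example: a quadratic "loss" whose minimizer is theta_star
theta_star = np.array([1.0, -2.0, 0.5])
loss_fn = lambda w: np.sum((w - theta_star) ** 2)

rng = np.random.default_rng(1)
delta, eta = rng.normal(size=3), rng.normal(size=3)
coords = np.linspace(-1.0, 1.0, 21)   # grid resolution used for plotting
S = loss_surface(loss_fn, theta_star, delta, eta, coords, coords)
```

The grid center (α = β = 0) is the trained minimizer θ*; in practice `S` would be passed to a contour or surface plotter, and δ, η would first be filter-normalized so that sharpness is comparable across models.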

2.2. What Makes Neural Networks Trainable?

2D visualization of the loss surfaces of ResNet and ResNet-noshort with different depths.
  • Several ResNets with/without shortcut connections are tested.
  • The network ResNet-20-noshort has a fairly benign landscape dominated by a region with convex contours in the center, and no dramatic non-convexity.

2.2.1. Network Depth

However, as network depth increases, the loss surfaces of the VGG-like nets (i.e., ResNets without shortcuts) spontaneously transition from (nearly) convex to chaotic.

  • ResNet-56-noshort has dramatic non-convexities and large regions where the gradient directions (which are normal to the contours depicted in the plots) do not point towards the minimizer at the center.
  • Also, the loss function becomes extremely large as we move in some directions. ResNet-110-noshort displays even more dramatic non-convexities, and becomes extremely steep.

2.2.2. Shortcut Connections

As shown in the top row of the above figure, residual connections prevent the transition to chaotic behavior as depth increases.

The loss surfaces of ResNet-110-noshort and DenseNet for CIFAR-10.

Different types of skip connections, as in DenseNet, also help prevent the transition to chaotic behavior.

2.2.3. Network Width

Wide-ResNet-56 on CIFAR-10 both with shortcut connections (top) and without (bottom). The label k = 2 means twice as many filters per layer. Test error is reported below each figure.

Wider models have less chaotic loss landscapes: increased network width results in flat minima and wide regions of apparent convexity.

  • We see that increased width prevents chaotic behavior, and skip connections dramatically widen minimizers.
  • Finally, note that sharpness correlates extremely well with test error.

Later works use the above visualization method to help train better models.

Reference

[2018 NIPS] [Loss Landscape] Visualizing the Loss Landscape of Neural Nets

