Brief Review — Visualizing the Loss Landscape of Neural Nets

Loss Landscape Visualization, Looking for Sharpness

Sik-Ho Tsang
4 min read · Jan 29, 2023
The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.

Visualizing the Loss Landscape of Neural Nets,
Loss Landscape, by University of Maryland, United States Naval Academy, and Cornell University,
2018 NIPS, Over 1200 Citations (Sik-Ho Tsang @ Medium)
Sharpness, Visualization

  • Filter-Wise Normalization is proposed for computing loss landscapes for visualization.
  • Based on the visualization, we can understand the effects of network depth, width, skip connections, and so on.

Outline

  1. Filter-Wise Normalization for Loss Landscape
  2. Results

1. Filter-Wise Normalization for Loss Landscape

  • The authors point out that some prior visualization approaches are misleading or insufficient. (Please read the paper directly if interested.) Thus, they propose a new scheme here.
  • Assume we have parameters θ* for the network. This θ* is used as the center point in the graph.
  • Two random direction vectors, δ and η, are chosen.
  • Each direction d is sampled from a Gaussian distribution and then filter-normalized: d_{i,j} ← (d_{i,j} / ||d_{i,j}||) · ||θ*_{i,j}||, where d_{i,j} represents the jth filter (not the jth weight) of the ith layer of d, and ||·|| denotes the Frobenius norm.
  • The 1D and 2D plots then show f(α) = L(θ* + αδ) and f(α, β) = L(θ* + αδ + βη), respectively, as sketched in the code after this list.
  • It is noted that neural nets can be scale invariant (e.g., when batch normalization is used); if a small-parameter network and a large-parameter network are equivalent because one is simply a rescaling of the other, then any apparent difference in loss-surface sharpness between them is merely an artifact of scale invariance. Filter normalization removes this artifact.
  • Also, it is noted that filter normalization is not limited to convolutional (Conv) layers but also applies to fully connected (FC) layers: an FC layer is equivalent to a Conv layer with a 1×1 output feature map, and the filter corresponds to the weights that generate one neuron.
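
To make the scheme concrete, below is a minimal PyTorch-style sketch (not the authors' released code): filter_normalized_direction rescales each filter of a random Gaussian direction to match the norm of the corresponding filter of θ*, and loss_surface_2d evaluates f(α, β) = L(θ* + αδ + βη) on a grid. The evaluate_loss helper (which runs the model over the training set and returns the loss) and the zeroing of bias/BatchNorm entries of the direction are assumptions made for illustration.

```python
import torch

def filter_normalized_direction(model):
    """Sample a random Gaussian direction d and rescale each filter of d so that
    it has the same Frobenius norm as the corresponding filter of theta*."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() <= 1:
            # Bias / BatchNorm vectors: set their direction entries to zero
            # (an assumption for this sketch, not stated above).
            d.zero_()
        else:
            # Dim 0 indexes the filters (output channels / neurons):
            # d_{i,j} <- d_{i,j} / ||d_{i,j}|| * ||theta*_{i,j}||
            for d_f, p_f in zip(d, p):
                d_f.mul_(p_f.norm() / (d_f.norm() + 1e-10))
        direction.append(d)
    return direction

def loss_surface_2d(model, theta_star, delta, eta, alphas, betas, evaluate_loss):
    """Evaluate f(alpha, beta) = L(theta* + alpha*delta + beta*eta) on a grid.
    `evaluate_loss(model)` is a hypothetical helper returning the training loss."""
    surface = torch.zeros(len(alphas), len(betas))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                for p, p0, d1, d2 in zip(model.parameters(), theta_star, delta, eta):
                    p.copy_(p0 + a * d1 + b * d2)
                surface[i, j] = float(evaluate_loss(model))
        # Restore the trained parameters afterwards.
        for p, p0 in zip(model.parameters(), theta_star):
            p.copy_(p0)
    return surface
```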

2. Results

2.1. Sharpness Correlates with Generalization Error

The 1D and 2D visualization of solutions obtained using SGD with different weight decay and batch size. The title of each subfigure contains the weight decay, batch size, and test error.

Sharpness refers to whether a minimum has a flat (e and g) or a sharp (f and h) shape. Although all runs reach minima, the sharpness of a minimum is informative about how well the trained network generalizes, as described below.

  • The above figure shows the differences in sharpness between small batch and large batch minima.

By using the proposed method, we can make side-by-side comparisons between minimizers, and we see that sharpness now correlates well with generalization error: large batches produce visually sharper minima with higher test error.
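
For illustration, the hypothetical sketch below reuses filter_normalized_direction and the assumed evaluate_loss helper from the earlier sketch to draw 1D filter-normalized curves f(α) = L(θ* + αδ) for two minimizers; model_sb and model_lb stand in for networks trained with a small and a large batch size.

```python
import matplotlib.pyplot as plt
import torch

def loss_curve_1d(model, theta_star, delta, alphas, evaluate_loss):
    """Evaluate f(alpha) = L(theta* + alpha*delta) along one filter-normalized direction."""
    losses = []
    with torch.no_grad():
        for a in alphas:
            for p, p0, d in zip(model.parameters(), theta_star, delta):
                p.copy_(p0 + a * d)
            losses.append(float(evaluate_loss(model)))
        # Restore the trained parameters afterwards.
        for p, p0 in zip(model.parameters(), theta_star):
            p.copy_(p0)
    return losses

# Hypothetical usage: model_sb, model_lb, and evaluate_loss are assumed to exist.
alphas = torch.linspace(-1.0, 1.0, 51).tolist()
for name, model in [("small batch", model_sb), ("large batch", model_lb)]:
    theta_star = [p.detach().clone() for p in model.parameters()]
    delta = filter_normalized_direction(model)
    plt.plot(alphas, loss_curve_1d(model, theta_star, delta, alphas, evaluate_loss), label=name)
plt.xlabel("alpha"); plt.ylabel("training loss"); plt.legend(); plt.show()
```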

2.2. What Makes Neural Networks Trainable?

2D visualization of the loss surface of ResNet and ResNet-noshort with different depth.
  • Several ResNets without/with shortcut connections are tested.
  • The network ResNet-20-noshort has a fairly benign landscape dominated by a region with convex contours in the center, and no dramatic non-convexity.

2.2.1. Network Depth

However, as network depth increases, the loss surfaces of the VGG-like nets (i.e., ResNet without shortcut connections) spontaneously transition from (nearly) convex to chaotic.

  • ResNet-56-noshort has dramatic non-convexities and large regions where the gradient directions (which are normal to the contours depicted in the plots) do not point towards the minimizer at the center.
  • Also, the loss function becomes extremely large as we move in some directions. ResNet-110-noshort displays even more dramatic non-convexities, and becomes extremely steep.

2.2.2. Shortcut Connections

As shown in the top row of the above figure, residual connections prevent the transition to chaotic behavior as depth increases.

The loss surfaces of ResNet-110-noshort and DenseNet for CIFAR-10.

Different types of skip connections, as in DenseNet, also help prevent the transition to chaotic behavior.

2.2.3. Network Width

Wide-ResNet-56 on CIFAR-10 both with shortcut connections (top) and without (bottom). The label k = 2 means twice as many filters per layer. Test error is reported below each figure.

Wider models have less chaotic loss landscapes. Increased network width resulted in flat minima and wide regions of apparent convexity.

  • We see that increased width prevents chaotic behavior, and skip connections dramatically widen minimizers.
  • Finally, note that sharpness correlates extremely well with test error.

Later, Sharpness-Aware Minimization (SAM) was proposed in 2021 ICLR; it explicitly seeks flat minima during training and uses the above visualization method to illustrate the resulting loss landscapes.

Reference

[2018 NIPS] [Loss Landscape]
Visualizing the Loss Landscape of Neural Nets

1.1. Image Classification

1989–2018 … [Loss Landscape] … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] 2023 [Vision Permutator (ViP)]

1.12. Visualization

2002 [SNE] 2006 [Autoencoder] [DrLIM] 2007 [UNI-SNE] 2008 [t-SNE] 2018 [Loss Landscape]

==== My Other Previous Paper Readings ====

