Appearance of Random Matrix Theory in deep learning

rangarirai mbizi

doi:10.1016/J.PHYSA.2021.126742

Outline

Appearance of Random Matrix Theory in deep learning

rangarirai mbizi

2022, Physica A: Statistical Mechanics and its Applications

https://doi.org/10.1016/J.PHYSA.2021.126742

visibility

…

description

33 pages

link

1 file

Abstract

We investigate the local spectral statistics of the loss surface Hessians of artificial neural networks, where we discover agreement with Gaussian Orthogonal Ensemble statistics across several network architectures and datasets. These results shed new light on the applicability of Random Matrix Theory to modelling neural networks and suggest a role for it in the study of loss surfaces in deep learning.

Figures (14)

Figure 1: Comparison of different global spectral statistics (spectral densities). (a) We show actual GOE data to demonstrate the form of the Wigner semicircle. (b) Hessian of cross entropy loss for MLP on MNIST. (c) Hessian of cross entropy loss for logistic regression on MNIST. Note the log-scale on the y-axis. A few outliers have been clipped from logistic regression to aid visualisation.

Figure 2: The density of the Wigner surmise. not just the GOE or Wigner random matrices.

Figure 3: Spacing distributions for the Hessian of a logistic regression trained Resnet-34 embeddings of CIFAR10. Hessians computed over the test set. We display results as histograms of data along with a plot of the Wigner (or the Wigner-like) surmise density. We make a few practical adjustments to the plots. Spacing ratios are truncated above some value, as the presence of a few extreme outliers makes visualisation difficult. We choose a cut-off at 10. Note that around 0.985 of the mass of the Wigner-like surmise is below 10, so this is a reasonable adjustment. The hessians have degenerate spectra. The Wigner surmise is not a good fit to the observed unfolded spectra if the zero eigenvalues are retained. Imposing a lower cut-off of 10~?° in magnitude is sufficient to obtain agreement with Wigner.'? This is below the machine precision, so these omitted eigenvalues are indistinguishable from 0. 6.1. MNIST and MLPs

Figure 4: Spacing distributions for the Hessian of a 3-hidden-layer MLP trained on MNIST. Hessians computed over the test set. points, while spacing ratios were computed in batches and over the entire dataset. We observe a striking level of agreement between the observed spectra and the GO] FH. There was no discernible difference between the train and test conditions, nor between batch and full dataset conditions, nor between trained and untrained models. Note that the presence of GOE statistics for the untrained models is not a foregone conclusion. Of course, the weights of the model are indeed random Gaussian, but the Hessian is still a function of the data set, so it is not the case the Hessian eigenvalue statistics are bound to be GOE a pr iori. Overall, the very close agreement between Random Matrix Theory predictions and our observations for several different architectures, model sizes and datasets demonstrates a clear presence of RMT statistics in neural networks.

Figure 5: Spectral densities of Gaussian process Hessians with various kernel choices. All use k’(0) = 1. The dimension is 300 in all cases except (d), in which the Hessian is padded to 400 dimensions with zeros. All histograms are produced with 100 independent Hessian samples. « = 100 degenerate directions. + = 20 outliers

Figure 6: Consecutive spacing ratios of Gaussian process Hessians with various kernel choices. All use k’(0) = 1. The dimension is 300 in all cases except (d), in which the Hessian is padded to 400 dimensions with zeros. * = 100 degenerate directions. | = 20 outliers.

Figure 7: Spectral statistics for the Hessian of a CNN trained on CIFAR10. Hessians computed over batches of size 64 on the test set.

Figure 8: Spectral statistics for the Hessian of an MLP trained on the Bike dataset. Hessians computed over batches of size 64 on the test set.

Figure 9: Spectral statistics for the Gauss-Newton matrix of an MLP trained on the Bike dataset. Matrices computed over batches of size 64 on the test set. 7. Conclusion and future work

Figure A.10: Unfolded spacings for the Hessian of a logistic regression trained on MNIST. Hessian computed batches of size 64 of the training and test datasets. Figure A.14 compares the effect of degeneracy on unfolded spacings in each of the 3 cases considered. We see that the logistic MNIST models (trained and untrained) have a much greater level of degeneracy, whereas the CIFAR10-Resnet34 spectra clearly have GO] F} spacings even without any cut-off. Figures A.11—A.13 show further unfolded spacing and spacing ratio results like those in the main text.

Figure A.11: Consecutive spacing ratios for the Hessian of a logistic regression trained on MNIST. Hessian computed batches of size 64 of the training and test sets, and over the whole train and test sets.

Figure A.12: Unfolded spacings for the Hessian of a randomly initialised logistic regression for MNIST. Hessian computed batches of size 64 of the training and test datasets.

Figure A.13: Consecutive spacing ratios for the Hessian of a randomly initialised logistic regression for MNIST. Hessian computed batches of size 64 of the training and test sets, and over the whole train and test sets.

References (86)

Sherif M Abuelenin and Adel Y Abul-Magd. Effect of unfolding on the spectral statistics of adjacency matrices of complex networks. Procedia Computer Science, 12:69-74, 2012.
Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74-84. PMLR, 2020.
Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.
Daniel J Amit, Hanoch Gutfreund, and Haim Sompolinsky. Spin-glass models of neural networks. Physical Review A, 32(2):1007, 1985.
YY Atas, E Bogomolny, O Giraud, and G Roux. Distribution of the ratio of consecutive level spacings in random matrix ensembles. Physical review letters, 110(8):084101, 2013.
Antonio Auffinger, Gérard Ben Arous, and Jiří Černỳ. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165-201, 2013.
Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Denny Wu, and Tianzong Zhang. Generalization of two-layer neural net-works: An asymptotic viewpoint. risk, 1(1.5):2-0, 2020.
Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S Schoen- holz, Jascha Sohl-Dickstein, and Surya Ganguli. Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 2020.
Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. The loss surfaces of neural networks with general activation functions. Journal of Statistical Mechanics: Theory and Experiment, 2021(6):064001, 2021.
Nicholas P Baskerville, Jonathan P Keating, Francesco Mezzadri, and Joseph Najnudel. A spin-glass model for the loss surfaces of generative adversarial networks. arXiv preprint arXiv:2101.02524, 2021.
Carlo WJ Beenakker. Random-matrix theory of quantum transport. Reviews of modern physics, 69(3):731, 1997.
Lucas Benigni and Sandrine Péché. Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090, 2019.
Michael V Berry and Marko Robnik. Semiclassical level spacings when regular and chaotic orbits coexist. Journal of Physics A: Mathematical and General, 17(12):2413, 1984.
Michael V Berry et al. Quantum chaology. Proc. Roy. Soc. London A, 413:183-198, 1987.
Michael Victor Berry and Michael Tabor. Level clustering in the regular spectrum. Proceedings of the Royal Society of London. A. Mathematical and Physical Sciences, 356(1686):375-394, 1977.
Oriol Bohigas. Random matrix theories and chaotic dynamics. Technical report, Paris-11 Univ., 1991.
Léon Bottou. Stochastic Gradient Descent Tricks, pages 421-436. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642- 35289-8. doi: 10.1007/978-3-642-35289-8 25. URL https://doi.org/ 10.1007/978-3-642-35289-8_25.
Joël Bun, Jean-Philippe Bouchaud, and Marc Potters. Cleaning large correlation matrices: tools from random matrix theory. Physics Reports, 666:1-109, 2017.
Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. Gram-gauss-newton method: Learning over- parameterized neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.
Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Laurent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborová. Machine learning and the physical sciences. Reviews of Modern Physics, 91(4):045002, 2019.
Pratik Chaudhari and Stefano Soatto. On the energy landscape of deep networks. arXiv preprint arXiv:1511.06485, 2015.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192-204, 2015.
Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756-1760, 2015.
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933-2941, 2014.
Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.
Oussama Dhifallah and Yue M Lu. A precise performance analysis of learning with random features. arXiv preprint arXiv:2008.11904, 2020.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121-2159, 2011.
Konstantin Efetov. Supersymmetry in disorder and chaos. Cambridge university press, 1999.
Yan V Fyodorov. Complexity of random energy landscapes, glass transi- tion, and absolute value of the spectral determinant of random matrices. Physical review letters, 92(24):240601, 2004.
Yan V Fyodorov and Pierre Le Doussal. Topology trivialization and large deviations for the minimum in the simplest random optimization. Journal of Statistical Physics, 154(1):466-490, 2014.
Yan V Fyodorov and Pierre Le Doussal. Hessian spectrum at the global minimum of high-dimensional random landscapes. Journal of Physics A: Mathematical and Theoretical, 51(47):474002, 2018.
Yan V Fyodorov and Ian Williams. Replica symmetry breaking condition exposed by random matrix calculation of landscape complexity. Journal of Statistical Physics, 129(5-6):1081-1116, 2007.
Marylou Gabrié. Mean-field inference methods for neural networks. Journal of Physics A: Mathematical and Theoretical, 53(22):223002, 2020.
Elizabeth Gardner and Bernard Derrida. Optimal storage properties of neural network models. Journal of Physics A: Mathematical and general, 21(1):271, 1988.
Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Generalisation error in learning with random features and the hidden manifold model. In International Conference on Machine Learning, pages 3452-3462. PMLR, 2020.
Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via Hessian eigenvalue density. arXiv preprint arXiv:1901.10159, 2019.
Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. The gaussian equivalence of genera- tive models for learning with shallow neural networks. arXiv preprint arXiv:2006.14709, 2020.
Diego Granziol. Beyond random matrix theory for deep networks. arXiv preprint arXiv:2006.07721, 2020.
Diego Granziol, Timur Garipov, Dmitry Vetrov, Stefan Zohren, Stephen Roberts, and Andrew Gordon Wilson. Towards understanding the true loss surface of deep neural networks using random matrix theory and iterative spectral methods. 2019.
Diego Granziol, Xingchen Wan, Timur Garipov, Dmitry Vetrov, and Stephen Roberts. MLRG deep curvature. arXiv preprint arXiv:1912.09656, 2019.
Diego Granziol, Timur Garipov, Dmitry Vetrov, Stefan Zohren, Stephen Roberts, and Andrew Gordon Wilson. Towards understanding the true loss surface of deep neural networks using random matrix the- ory and iterative spectral methods. https://openreview.net/forum? id=H1gza2NtwH, 2020.
Thomas Guhr, Axel Müller-Groeling, and Hans A Weidenmüller. Random-matrix theories in quantum physics: common concepts. Physics Reports, 299(4-6):189-425, 1998.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026-1034, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Yann LeCun. The MNIST database of handwritten digits. http://yann. lecun. com/exdb/mnist/, 1998.
Meng Heng Loke and Torleif Dahlin. A comparison of the gauss-newton and quasi-newton methods in resistivity imaging inversion. Journal of applied geophysics, 49(3):149-162, 2002.
Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, and Lenka Zdeborová. Capturing the learning curves of generic features maps for realistic data sets with a teacher- student model. arXiv preprint arXiv:2102.08127, 2021.
Antoine Maillard, Gérard Ben Arous, and Giulio Biroli. Landscape com- plexity for the empirical risk of generalized linear models. In Mathematical and Scientific Machine Learning, pages 287-327. PMLR, 2020.
James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning, pages 2408-2417, 2015.
James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural networks: Tricks of the trade, pages 479-535. Springer, 2012.
Madan Lal Mehta. Random matrices. Elsevier, 2004.
Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 2019.
Gérard Meurant and Zdeněk Strakoš. The Lanczos and conjugate gradient algorithms in finite precision arithmetic. Acta Numerica, 15:471-542, 2006.
Marc Mezard and Andrea Montanari. Information, physics, and compu- tation. Oxford University Press, 2009.
Vardan Papyan. The full spectrum of deepnet hessians at scale: Dynamics with sgd training and sample size. arXiv preprint arXiv:1811.07062, 2018.
Vardan Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet hessians. arXiv preprint arXiv:1901.08244, 2019.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in Pytorch. 2017.
Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural computation, 6(1):147-160, 1994.
Jeffrey Pennington and Yasaman Bahri. Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2798- 2806. JMLR. org, 2017.
Jeffrey Pennington and Pratik Worah. The spectrum of the fisher information matrix of a single-hidden-layer neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5410-5419. Curran Associates, Inc., 2018.
Jeffrey Pennington and Pratik Worah. Nonlinear random matrix the- ory for deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124005, 2019.
PyTorch. Resnet. https://pytorch.org/hub/pytorch_vision_ resnet, 2021. Accessed: 2021-05-03.
Daniel A Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory. arXiv preprint arXiv:2106.10165, 2021.
Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy mod- els: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.
Levent Sagun, V Ugur Guney, Gerard Ben Arous, and Yann Le- Cun. Explorations on high dimensional landscapes. arXiv preprint arXiv:1412.6615, 2014.
Levent Sagun, Léon Bottou, and Yann LeCun. Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.
Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Flo- rent Krzakala, and Lenka Zdeborová. Who is afraid of big bad minima? analysis of gradient-flow in spiked matrix-tensor mod- els.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché- Buc, E. Fox, and R. Garnett, editors, Advances in Neural In- formation Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/ fbad540b2f3b5638a9be9aa6a4d8e450-Paper.pdf.
Torsten Scholak. unfoldr. https://github.com/tscholak/unfoldr, 2015. Accessed: 2020-10-30.
Torsten Scholak, Thomas Wellens, and Andreas Buchleitner. Spectral backbone of excitation transport in ultracold rydberg gases. Physical Review A, 90(6):063415, 2014.
Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recog- nition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806-813, 2014.
Terence Tao. Topics in random matrix theory, volume 132. American Mathematical Soc., 2012.
Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q Wein- berger, and Andrew Gordon Wilson. Exact gaussian processes on a million data points. Advances in Neural Information Processing Systems, 32:14648-14659, 2019.
HA Weidenmuller and GE Mitchell. Random matrices and chaos in nuclear physics. arXiv preprint arXiv:0807.1070, 2008.
Lenka Zdeborová and Florent Krzakala. Statistical physics of inference: Thresholds and algorithms. Advances in Physics, 65(5):453-552, 2016.
Appendix B. Experimental details 3. 100 neurons to 100 neurons.
100 neurons to 10 output logits. Logistic regression on ResNet features (CIFAR10)
Fully connection layer from 400 to 120.
Fully connection layer from 120 to 84.
Fully connection layer from 84 to output 10 logits. MLP (CIFAR10)
50 neurons to 1 regression output.

Appearance of Random Matrix Theory in deep learning

Sign up for access to the world's latest research

Abstract

Related papers

References (86)

Related papers

Related topics