The Singular Values of Convolutional Layers
International Conference on Learning Representations (ICLR), 2019
Abstract
We characterize the singular values of the linear transformation associated with a standard 2D multi-channel convolutional layer, enabling their efficient computation. This characterization also leads to an algorithm for projecting a convolutional layer onto an operator-norm ball. We show that this is an effective regularizer; for example, it improves the test error of a deep residual network using batch normalization on CIFAR-10 from 6.2% to 5.3%.

1 Introduction

Exploding and vanishing gradients (Hochreiter, 1991; Hochreiter et al., 2001; Goodfellow et al., 2016) are fundamental obstacles to the effective training of deep neural networks. Many deep networks used in practice are layered. We can think of such networks as a composition of feature transformations, followed by a linear classifier on the final layer of features. The singular values of the Jacobian of a layer bound the factor by which the layer increases or decreases the norm of the backpropagated signal. If these singular values are all close to 1, then gradients neither explode nor vanish. The same singular values also bound these factors in the forward direction, which affects the numerical stability of the computation, including whether the network produces the dreaded "NaN". Moreover, Bartlett et al. (2017) proved that the generalization error of a network can be bounded in terms of its Lipschitz constant, which in turn can be bounded by the product of the operator norms of the Jacobians of its layers. Cisse et al. (2017) discussed robustness to adversarial examples as a consequence of bounding the operator norm.
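To make the preceding discussion concrete, the sketch below shows one way the singular values of a convolutional layer can be computed cheaply when the convolution uses circular (periodic) padding: the layer is then a block-doubly-circulant operator, which the 2-D DFT block-diagonalizes (cf. Jain, 1989; Gray, 2006), so its singular values are the union of the singular values of one small c_out × c_in matrix per spatial frequency. This is a minimal NumPy sketch under that assumption; the (c_out, c_in, k, k) kernel layout and the name conv_singular_values are illustrative choices, not the paper's implementation.

```python
import numpy as np

def conv_singular_values(kernel, n):
    """Singular values of the linear map defined by a 2-D multi-channel
    convolution with circular (periodic) padding on n x n feature maps.

    kernel: array of shape (c_out, c_in, k, k), with k <= n.
    Returns all singular values, sorted in decreasing order.
    """
    c_out, c_in, k, _ = kernel.shape
    # Zero-pad the kernel spatially to n x n; a circular shift of the
    # padded kernel is a unitary change of basis and leaves the
    # singular values unchanged.
    padded = np.zeros((c_out, c_in, n, n))
    padded[:, :, :k, :k] = kernel
    # The 2-D DFT over the spatial axes block-diagonalizes the operator.
    transformed = np.fft.fft2(padded, axes=(2, 3))      # (c_out, c_in, n, n)
    # One c_out x c_in matrix per spatial frequency; the layer's singular
    # values are the union of the singular values of these n^2 matrices.
    per_freq = np.transpose(transformed, (2, 3, 0, 1))  # (n, n, c_out, c_in)
    svals = np.linalg.svd(per_freq, compute_uv=False)   # (n, n, min(c_out, c_in))
    return np.sort(svals.ravel())[::-1]

# Example: the leading entry is the operator norm of the layer.
# kernel = np.random.randn(64, 3, 3, 3) / 9.0
# print(conv_singular_values(kernel, n=32)[0])
```

Clipping these per-frequency singular values at a target value and inverting the DFT is one natural route to the operator-norm projection mentioned in the abstract, although mapping the result back to a k × k kernel would require an additional projection step.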
References
- P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In NIPS, pages 6240-6249, 2017.
- A. Bibi, B. Ghanem, V. Koltun, and R. Ranftl. Deep layers as stochastic solvers. ICLR, 2019.
- S. Boyd and J. Dattorro. Alternating projections, 2003. https://web.stanford.edu/class/ee392o/alt_proj.pdf.
- J. P. Boyle and R. L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, pages 28-47. Springer, 1986.
- C. Chao. A note on block circulant matrices. Kyungpook Mathematical Journal, 14:97-100, 1974.
- W. Cheney and A. A. Goldstein. Proximity maps for convex sets. Proceedings of the American Mathematical Society, 10(3):448-450, 1959. ISSN 00029939, 10886826. URL http://www.jstor.org/stable/2032864.
- M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. ICML, 2017.
- H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991-997, 1992.
- I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
- H. Gouk, E. Frank, B. Pfahringer, and M. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. arXiv preprint arXiv:1804.04368, 2018a.
- H. Gouk, B. Pfahringer, E. Frank, and M. Cree. MaxGain: Regularisation of neural networks by constraining activation magnitudes. arXiv preprint arXiv:1804.05965, 2018b.
- R. M. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends® in Communications and Information Theory, 2(3):155-239, 2006.
- K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016. http://download.tensorflow.org/models/official/resnet_v2_imagenet_checkpoint.tar.gz
- M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NIPS, pages 2266-2276, 2017.
- S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, 1991.
- S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
- R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. ISBN 0521548233, 9780521548236.
- A. K. Jain. Fundamentals of Digital Image Processing. Prentice Hall, Englewood Cliffs, NJ, 1989.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
- S. Lefkimmiatis, J. P. Ward, and M. Unser. Hessian Schatten-norm regularization for linear inverse problems. IEEE Transactions on Image Processing, 22(5):1873-1888, 2013.
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. ICLR, 2018.
- J. Pennington, S. Schoenholz, and S. Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in neural information processing systems, pages 4788-4798, 2017.
- A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
- Y. Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.