
Convex Duality of Deep Neural Networks

2020, arXiv preprint

Abstract

We study regularized deep neural networks and introduce an analytic framework to characterize the structure of their hidden layers. We show that a set of optimal hidden-layer weight matrices for a norm-regularized deep neural network training problem can be explicitly found as the extreme points of a convex set. For two-layer linear networks, we first formulate a convex dual program and prove that strong duality holds. We then extend our derivations to prove that strong duality also holds for certain deep networks. In particular, for deep linear networks with scalar output, we show that each optimal layer weight matrix is rank-one and aligns with the previous layers. We also extend our analysis to vector outputs and other convex loss functions. More importantly, we show that the same characterization applies to deep ReLU networks with rank-one inputs, where we prove that strong duality still holds and that the optimal layer weight matrices are rank-one for scalar-output networks. As a corollary, we prove that norm-regularized deep ReLU networks yield spline interpolation for one-dimensional datasets, a result previously known only for two-layer networks. We then verify our theoretical results through several numerical experiments.
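
The abstract's two structural claims for regularized deep linear networks, rank-one optimal layer weights and alignment across consecutive layers, can be probed numerically. The sketch below is not the authors' experiment code: it trains a small three-layer linear network with squared loss and weight decay in PyTorch and inspects the singular values and singular vectors of the trained layers. The network widths, the regularization level beta, and the optimizer settings are illustrative assumptions; if training converges to a regularized optimum, each layer's second singular value should be negligible and consecutive layers should align.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data: n samples, d features, scalar targets.
n, d = 50, 10
X = torch.randn(n, d)
y = torch.randn(n, 1)

# Three-layer linear network (no biases, scalar output).
widths = [d, 20, 20, 1]
layers = [nn.Linear(widths[i], widths[i + 1], bias=False) for i in range(3)]
model = nn.Sequential(*layers)

# Squared loss with weight decay on every layer, i.e. the squared-norm
# regularization discussed in the abstract; beta is an illustrative choice.
beta = 1e-2
opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=beta)
loss_fn = nn.MSELoss()

for step in range(20000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Rank-one check: if the characterization holds and training has converged,
# every layer's second singular value should be negligible.
for i, layer in enumerate(layers):
    s = torch.linalg.svdvals(layer.weight.detach())
    print(f"layer {i} singular values (top 2): {[f'{v:.3e}' for v in s[:2].tolist()]}")

# Alignment check: the top right singular vector of layer l+1 should match
# the top left singular vector of layer l (up to sign).
for i in range(len(layers) - 1):
    U, _, _ = torch.linalg.svd(layers[i].weight.detach())
    _, _, Vh_next = torch.linalg.svd(layers[i + 1].weight.detach())
    print(f"|<u_{i}, v_{i + 1}>| = {abs(torch.dot(U[:, 0], Vh_next[0])).item():.4f}")
```

A similar check with ReLU activations and rank-one inputs would probe the corresponding claim for deep ReLU networks stated in the abstract.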
