Convex neural networks

2006, Advances in neural …

Abstract

Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. Here s(a) = 1/(1 + e^(-a)) denotes the sigmoid non-linearity of the hidden units, and a learning algorithm must specify how to select m (the number of hidden units), the w_i's, and the v_i's.
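
To make the incremental scheme concrete, here is a minimal sketch in Python, assuming a model of the form ŷ(x) = Σ_i v_i s(w_i · x) (one reading of the m, w_i and v_i mentioned above). It is not the paper's algorithm: the unit-selection step is replaced by a gradient-boosting-style fit of each new unit to the current residuals, and all names (grow_network, unit_steps, ...) are illustrative.

```python
import numpy as np

def sigmoid(a):
    # s(a) = 1 / (1 + e^{-a}), the non-linearity quoted in the abstract
    return 1.0 / (1.0 + np.exp(-a))

def grow_network(X, y, n_units=20, lr=0.1, unit_steps=200, seed=None):
    """Sketch: grow a one-hidden-layer network one hidden unit at a time.

    X: (n, d) inputs; y: (n,) real-valued or +/-1 targets.
    Each round fits a new unit w to the current residuals by gradient descent
    on a squared error (a stand-in for the paper's weighted-error linear
    subproblem), then sets its output weight v by least squares.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # append a bias input
    W, v = [], []                          # input weights w_i and output weights v_i
    pred = np.zeros(n)                     # current network output

    for _ in range(n_units):
        residual = y - pred                # what the current network fails to explain
        w = 0.01 * rng.standard_normal(d + 1)
        for _ in range(unit_steps):        # fit the new unit to the residuals
            h = sigmoid(Xb @ w)
            grad = Xb.T @ ((h - residual) * h * (1.0 - h)) / n
            w -= lr * grad
        h = sigmoid(Xb @ w)
        v_i = float(h @ residual) / float(h @ h)   # least-squares output weight
        W.append(w)
        v.append(v_i)
        pred += v_i * h                    # add the new unit's contribution

    return np.array(W), np.array(v)
```

In the paper's formulation the subproblem instead finds a linear classifier minimizing a weighted sum of errors; the least-squares output weight above merely plays the role of the new unit's coefficient in this simplified sketch.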

Key takeaways

  1. Training multi-layer neural networks can be framed as a convex optimization problem.
  2. The proposed algorithm incrementally adds hidden units, solving a linear classification problem at each step.
  3. Using L1 regularization yields a finite solution with a limited number of active hidden units.
  4. The global optimum can be verified using a stopping criterion based on weighted error minimization (see the sketch after this list).
  5. Experiments show that more hidden units reduce the likelihood of stalling in optimization, improving convergence.
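
Takeaways 3 and 4 can be illustrated with a small, hedged check: under an L1 penalty of strength lam on the output weights and a squared-error loss, a standard subgradient argument says no new hidden unit can improve the objective once every candidate unit's correlation with the current residuals is at most lam. The sketch below assumes a finite candidate set and invented names (can_stop, H_candidates, lam); in the paper's setting, maximizing this correlation over all hidden units is itself the weighted linear-classification subproblem of takeaway 2.

```python
import numpy as np

def can_stop(H_candidates, residuals, lam):
    """Illustrative stopping test for the incremental procedure.

    H_candidates: (k, n) array, outputs of k candidate hidden units on n examples.
    residuals:    (n,) current residuals (negative loss gradients at the output).
    lam:          strength of the L1 penalty on the output weights.

    If no candidate's correlation with the residuals exceeds lam, adding any of
    these units with an L1-penalized output weight cannot reduce the objective,
    so the procedure may stop.
    """
    scores = np.abs(H_candidates @ residuals)  # |sum_i residual_i * h(x_i)| per unit
    return bool(scores.max() <= lam)
```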
