Convex neural networks
2006, Advances in Neural Information Processing Systems
Abstract
Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. Here s(a) = 1/(1 + e^{-a}) is the sigmoid activation of the hidden units, and a learning algorithm must specify how to select m, the w_i's, and the v_i's.
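The last sentence of the abstract refers to the paper's single-hidden-layer parameterization. A plausible rendering of that notation is sketched below; the role assignment (w_i as output weights, v_i as hidden-unit weights acting on a bias-augmented input x̃) is an assumption, since the excerpt above does not spell it out.

```latex
% Single-hidden-layer model implied by the abstract (a sketch, not a quote
% from the paper; the w_i / v_i role assignment is an assumption).
\[
  \hat{y}(x) \;=\; \sum_{i=1}^{m} w_i \, h_i(x),
  \qquad
  h_i(x) \;=\; s\!\left(v_i \cdot \tilde{x}\right),
  \qquad
  s(a) \;=\; \frac{1}{1 + e^{-a}},
\]
% where \tilde{x} is the input x augmented with a constant component, so that
% v_i \cdot \tilde{x} includes a bias term.
```

Learning then means choosing m together with all of the w_i and v_i; allowing m to be unbounded and penalizing the output weights with an L1 norm is what turns this into the convex, but infinite-dimensional, problem the abstract describes.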
Key takeaways
- Training multi-layer neural networks in which the number of hidden units is learned can be framed as a convex optimization problem.
- The proposed algorithm incrementally adds hidden units, solving a linear classification problem at each step (a sketch of this loop follows the list below).
- An L1 penalty on the output weights yields a solution with only a finite number of active (non-zero) hidden units.
- The global optimum can be verified using a stopping criterion based on weighted error minimization.
- Experiments show that more hidden units reduce the likelihood of stalling in optimization, improving convergence.
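The sketch below illustrates the incremental loop described in these takeaways. It is a minimal illustration, not the authors' code, and it makes several simplifying assumptions: squared loss, sign (threshold) hidden units, a random candidate search standing in for the exact weighted-error minimization the paper relies on, and illustrative names such as `incremental_convex_nn` and `best_candidate_unit` that are not from the paper.

```python
# Minimal sketch of the incremental hidden-unit procedure (assumptions:
# squared loss, sign hidden units, L1 penalty `lam` on output weights,
# random search over candidate units instead of exact minimization).
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_output_weights(H, y, lam, n_iter=500):
    """L1-regularized least squares on the current hidden-unit outputs H,
    solved by proximal gradient descent (ISTA)."""
    n, m = H.shape
    w = np.zeros(m)
    step = 1.0 / (np.linalg.norm(H, 2) ** 2 / n + 1e-12)
    for _ in range(n_iter):
        grad = H.T @ (H @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w

def best_candidate_unit(X, residual, n_candidates=200, rng=None):
    """Look for a hidden unit h(x) = sign(v . x + b) maximizing
    |sum_i residual_i * h(x_i)|, i.e. minimizing a weighted sum of
    errors with weights given by the current residuals."""
    rng = np.random.default_rng() if rng is None else rng
    best_h, best_score = None, -np.inf
    for _ in range(n_candidates):
        v = rng.normal(size=X.shape[1])
        b = rng.normal()
        h = np.sign(X @ v + b)
        score = abs(residual @ h)
        if score > best_score:
            best_h, best_score = h, score
    return best_h, best_score

def incremental_convex_nn(X, y, lam=0.1, max_units=50, rng=None):
    """Grow the hidden layer one unit at a time."""
    n = X.shape[0]
    H = np.ones((n, 1))            # start with a constant (bias) unit
    for _ in range(max_units):
        w = fit_output_weights(H, y, lam)
        residual = (y - H @ w) / n
        h, score = best_candidate_unit(X, residual, rng=rng)
        # Stopping criterion: if no candidate unit correlates with the
        # residual by more than the L1 penalty, adding units cannot
        # improve the regularized objective, so the solution is optimal.
        if score <= lam:
            break
        H = np.column_stack([H, h])
    return H, fit_output_weights(H, y, lam)
```

With this normalization of the squared loss, the test `score <= lam` is the subgradient optimality condition for keeping all remaining output weights at zero, which is the sense in which the greedy loop can certify a global optimum; in the paper the candidate step is an exact weighted-error minimization rather than a random search.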