

Neural Networks for Time-Series Forecasting

2012, Handbook of Natural Computing

https://doi.org/10.1007/978-3-540-92910-9_14

Abstract

In this paper we study the generalization capabilities of fully-connected neural networks trained for time series forecasting. Time series do not satisfy the typical assumption in statistical learning theory that the data are i.i.d. samples from some data-generating distribution. We use the input and weight Hessians, that is, the smoothness of the learned function with respect to the input and the width of the minimum in weight space, to quantify a network's ability to generalize to unseen data. While such generalization metrics have been studied extensively in i.i.d. settings such as image recognition, here we empirically validate their use for time series forecasting. Furthermore, we discuss how one can control the generalization capability of the network through the training process, using the learning rate, batch size, and number of training iterations as controls. With these hyperparameters one can efficiently control the complexity of the output function without imposing explicit constraints.
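To make the two metrics concrete, the sketch below (in PyTorch, which the paper does not prescribe; the AR(1) data, window length, and network width are illustrative assumptions, not the authors' setup) trains a small fully-connected forecaster and then computes an input-Hessian norm and a Hutchinson estimate of the weight-Hessian trace. A smaller input-Hessian norm corresponds to a smoother learned function in input space, and a smaller weight-Hessian trace to a flatter minimum in weight space; in the spirit of the abstract, one would track both while varying the learning rate, batch size, and number of training iterations.

    # A minimal sketch, not the authors' code: it illustrates the two
    # Hessian-based generalization metrics from the abstract on a toy
    # fully-connected forecaster. The AR(1) data, window length and
    # network width are illustrative assumptions.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Synthetic AR(1) series, sliced into (input window -> next value) pairs.
    T, window = 500, 20
    series = torch.zeros(T)
    for t in range(1, T):
        series[t] = 0.8 * series[t - 1] + 0.1 * torch.randn(())
    X = torch.stack([series[i:i + window] for i in range(T - window)])
    y = series[window:].unsqueeze(1)

    model = nn.Sequential(nn.Linear(window, 32), nn.Tanh(), nn.Linear(32, 1))
    loss_fn = nn.MSELoss()

    # Short full-batch SGD run; the learning rate and number of iterations
    # here are exactly the kind of controls the abstract discusses.
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # Input Hessian: curvature of the loss w.r.t. a single input window,
    # i.e. the smoothness of the learned function around that input.
    x0 = X[:1].clone()
    input_hessian = torch.autograd.functional.hessian(
        lambda x: loss_fn(model(x), y[:1]), x0)
    print("input-Hessian Frobenius norm:", input_hessian.norm().item())

    # Weight Hessian: trace estimated with Hutchinson's method (Rademacher
    # probes and Hessian-vector products), a proxy for minimum width.
    params = list(model.parameters())
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    n_probes, trace_est = 10, 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs,
                                  retain_graph=True)
        trace_est += sum((h * v).sum() for h, v in zip(hvs, vs)).item() / n_probes
    print("weight-Hessian trace estimate:", trace_est)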
