

Neural Networks for Time-Series Forecasting

2012, Handbook of Natural Computing

https://doi.org/10.1007/978-3-540-92910-9_14

Abstract

In this paper we study the generalization capabilities of fully-connected neural networks trained for time series forecasting. Time series do not satisfy the typical assumption in statistical learning theory that the data are i.i.d. samples from some data-generating distribution. We use the input and weight Hessians, that is, the smoothness of the learned function with respect to the input and the width of the minimum in weight space, to quantify a network's ability to generalize to unseen data. While such generalization metrics have been studied extensively in i.i.d. settings such as image recognition, here we empirically validate their use for time series forecasting. Furthermore, we discuss how one can control the generalization capability of the network through the training process, using the learning rate, batch size, and number of training iterations as controls. With these hyperparameters one can efficiently control the complexity of the output function without imposing explicit constraints.
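To make the two metrics concrete, the sketch below (in PyTorch, which the paper does not prescribe; the AR(1) data, window length, and network width are illustrative assumptions, not the authors' setup) trains a small fully-connected forecaster and then computes an input-Hessian norm and a Hutchinson estimate of the weight-Hessian trace. A smaller input-Hessian norm corresponds to a smoother learned function in input space, and a smaller weight-Hessian trace to a flatter minimum in weight space; in the spirit of the abstract, one would track both while varying the learning rate, batch size, and number of training iterations.

    # A minimal sketch, not the authors' code: it illustrates the two
    # Hessian-based generalization metrics from the abstract on a toy
    # fully-connected forecaster. The AR(1) data, window length and
    # network width are illustrative assumptions.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Synthetic AR(1) series, sliced into (input window -> next value) pairs.
    T, window = 500, 20
    series = torch.zeros(T)
    for t in range(1, T):
        series[t] = 0.8 * series[t - 1] + 0.1 * torch.randn(())
    X = torch.stack([series[i:i + window] for i in range(T - window)])
    y = series[window:].unsqueeze(1)

    model = nn.Sequential(nn.Linear(window, 32), nn.Tanh(), nn.Linear(32, 1))
    loss_fn = nn.MSELoss()

    # Short full-batch SGD run; the learning rate and number of iterations
    # here are exactly the kind of controls the abstract discusses.
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

    # Input Hessian: curvature of the loss w.r.t. a single input window,
    # i.e. the smoothness of the learned function around that input.
    x0 = X[:1].clone()
    input_hessian = torch.autograd.functional.hessian(
        lambda x: loss_fn(model(x), y[:1]), x0)
    print("input-Hessian Frobenius norm:", input_hessian.norm().item())

    # Weight Hessian: trace estimated with Hutchinson's method (Rademacher
    # probes and Hessian-vector products), a proxy for minimum width.
    params = list(model.parameters())
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    n_probes, trace_est = 10, 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]
        hvs = torch.autograd.grad(grads, params, grad_outputs=vs,
                                  retain_graph=True)
        trace_est += sum((h * v).sum() for h, v in zip(hvs, vs)).item() / n_probes
    print("weight-Hessian trace estimate:", trace_est)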
