Symmetry of backpropagation and chain rule
2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290)
https://doi.org/10.1109/IJCNN.2002.1005528…
5 pages
Abstract
Gradient backpropagation, viewed as a method of computing derivatives of composite functions, is commonly presented as a version of the chain rule. We show that this is not true and that the two methods are, in a sense, opposites. While the chain rule requires derivatives with respect to all variables that influence a given intermediate variable, backpropagation requires derivatives of all variables that are influenced by the present variable. With this in mind, deriving the gradient of even complicated neural networks becomes almost trivial. In matrix form, the two methods differ only in the order of matrix multiplication. Use of the chain rule is almost automatic because we all learn it in mathematical analysis courses; use of backpropagation could be equally automatic if it were introduced in university mathematics education as an equivalent, alternative way of differentiating composite functions.
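To make the "order of matrix multiplication" point concrete, here is a minimal NumPy sketch (the shapes and the random Jacobians are illustrative, not from the paper): chain-rule accumulation multiplies the Jacobians from the input side outward, backpropagation multiplies them from the output side inward, and the two orderings give the same derivative.

```python
import numpy as np

# Jacobians of a toy composition f = f3 . f2 . f1 at some point
# (random matrices stand in for the derivatives; shapes are illustrative).
rng = np.random.default_rng(0)
J1 = rng.standard_normal((4, 3))   # d f1 / d x      (x in R^3)
J2 = rng.standard_normal((5, 4))   # d f2 / d f1
J3 = rng.standard_normal((1, 5))   # d f3 / d f2     (scalar output)

# Chain-rule (forward) accumulation: start at the input and work outwards,
# combining the derivatives of each intermediate variable with respect to
# the variables that influence it.
forward = J3 @ (J2 @ J1)

# Backpropagation (reverse) accumulation: start at the output and work inwards,
# combining the derivatives of the variables that the current variable influences.
reverse = (J3 @ J2) @ J1

# Same derivative, different order of matrix multiplication.
assert np.allclose(forward, reverse)
```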
Related papers
Nonlinear Analysis: Theory, Methods & Applications, 1997
Backprop is the primary learning algorithm used in many machine learning systems. In practice, however, Backprop in deep neural networks is a highly sensitive learning algorithm, and successful learning depends on numerous conditions and constraints. One set of constraints is to avoid weights that lead to saturated units. The motivation for avoiding unit saturation is that gradients vanish and, as a result, learning comes to a halt. Careful weight initialization and re-scaling schemes such as batch normalization ensure that the input activity to each neuron stays within the linear regime, where gradients do not vanish and can flow. Here we investigate backpropagating error terms only linearly. That is, we ignore saturation, thereby ensuring that gradients always flow. We refer to this learning rule as Linear Backprop, since in the backward pass the network appears to be linear. In addition to ensuring persistent gradient flow, Linear Backprop is also favorable when computation is expensive, since the derivatives of the activation functions are never computed. Our early results suggest that learning with Linear Backprop is competitive with Backprop while saving expensive gradient computations.
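A minimal NumPy sketch of the backward-pass modification described above, for a two-layer tanh network (the tanh activation, the shapes, and the function names are assumptions for illustration): standard Backprop scales the backpropagated error by the activation derivative, which vanishes for saturated units, whereas the linear variant omits that factor so the error signal always flows.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network: y = W2 tanh(W1 x)."""
    z1 = W1 @ x
    h = np.tanh(z1)            # nonlinear forward pass (unchanged)
    y = W2 @ h
    return z1, h, y

def backward_standard(x, z1, h, W2, dy):
    """Standard Backprop: the error is scaled by tanh'(z1),
    which vanishes when the unit saturates."""
    dz1 = (W2.T @ dy) * (1.0 - np.tanh(z1) ** 2)
    return np.outer(dz1, x), np.outer(dy, h)    # dW1, dW2

def backward_linear(x, z1, h, W2, dy):
    """Linear Backprop (as described in the abstract): the activation
    derivative is ignored, so the backward pass treats the network as
    linear and the error always flows."""
    dz1 = W2.T @ dy
    return np.outer(dz1, x), np.outer(dy, h)    # dW1, dW2
```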
IEEE Access
In this work, we propose an artificial neural network topology to estimate the derivative of a function. This topology is called a differential neural network because it allows the estimation of the derivative of any of the network outputs with respect to any of its inputs. The main advantage of a differential neural network is that it uses some of the weights of a multilayer neural network. Therefore, a differential neural network does not need to be trained. First, a multilayer neural network is trained to find the best set of weights that minimize an error function. Second, the weights of the trained network and its neuron activations are used to build a differential neural network. Consequently, a multilayer artificial neural network can produce a specific output and, simultaneously, estimate the derivative of any of its outputs with respect to any of its inputs. Several computer simulations were carried out to validate the performance of the proposed method. The computer simulation results showed that differential neural networks are capable of estimating the derivative of a function with good accuracy. The method was developed for an artificial neural network with two layers; however, the method can be extended to more than two layers. Similarly, the analysis in this study is presented for two common activation functions. Nonetheless, other activation functions can be used as long as the derivative of the activation function can be computed. Index terms: differential neural network, artificial intelligence, neural network structure, derivative estimation, multilayer network.
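A hedged sketch of the construction for a two-layer tanh network (tanh and the variable names are assumptions; the paper treats two common activation functions generically): once W1, b1, W2, b2 have been trained, the derivative of any output with respect to any input is assembled from the same weights and the stored activations, with no further training.

```python
import numpy as np

def mlp_and_jacobian(x, W1, b1, W2, b2):
    """Two-layer MLP y = W2 tanh(W1 x + b1) + b2 together with the full
    Jacobian dy/dx, both built from the already-trained weights."""
    z1 = W1 @ x + b1
    h = np.tanh(z1)
    y = W2 @ h + b2
    # Differential network: dy/dx = W2 . diag(tanh'(z1)) . W1,
    # where tanh'(z1) = 1 - h**2 reuses the stored activations.
    dydx = W2 @ (np.diag(1.0 - h ** 2) @ W1)
    return y, dydx
```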
Conference Record Southcon, 1994
This paper reviews a formalism that enables the dynamics of a broad class of neural networks to be understood. This formalism is then applied to a specific network and the predicted and simulated behavior of the system are compared. A number of previous works have analysed the Lyapunov stability of neural network models. This type of analysis shows that the excursion of the solutions from a stable point is bounded. The purpose of this work is to review and then utilize a model of the dynamics that also describes the phase space behavior and structural stability of the system. This is achieved by writing the general equations of the neural network dynamics as a gradient-like system. In this paper it is demonstrated that a network with additive activation dynamics and Hebbian weight update dynamics can be expressed as a gradient-like system. An example of a 3-layer network with feedback between adjacent layers is presented. It is shown that the process of weight learning is stable in this network when the learned weights are symmetric. Furthermore, the weight learning process is stable when the learned weights are asymmetric, provided that the activation is computed using only the symmetric part of the weights.
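As a minimal illustration of how additive activation dynamics with symmetric weights become a gradient-like system (a standard Hopfield-style energy construction, not necessarily the exact formalism reviewed in the paper, which also covers Hebbian weight dynamics):

```latex
% Additive activation dynamics with a monotone sigmoid g and outputs V_i = g(u_i):
\begin{align}
  \dot{u}_i &= -u_i + \sum_j w_{ij}\, g(u_j) + I_i .
\end{align}
% For symmetric weights w_{ij} = w_{ji}, the energy
\begin{align}
  E(V) &= -\tfrac{1}{2}\sum_{i,j} w_{ij} V_i V_j
          + \sum_i \int_0^{V_i} g^{-1}(v)\, dv
          - \sum_i I_i V_i
\end{align}
% satisfies \partial E / \partial V_i = -\dot{u}_i, so along trajectories
\begin{align}
  \dot{E} = \sum_i \frac{\partial E}{\partial V_i}\,\dot{V}_i
          = -\sum_i g'(u_i)\,\dot{u}_i^{\,2} \;\le\; 0 ,
\end{align}
% i.e. the dynamics descend E and are gradient-like when the weights are symmetric.
```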
2015
In this paper we explore different strategies to guide the backpropagation algorithm used for training artificial neural networks. Two different variants of the steepest descent-based backpropagation algorithm and four different variants of the conjugate gradient algorithm are tested. The variants differ in whether or not the time component is used, and whether or not additional gradient information is utilized during one-dimensional optimization. Testing is performed on randomly generated data as well as on some benchmark data regarding energy prediction. Based on our test results, it appears that the most promising backpropagation strategy is to initially use the steepest descent algorithm and then continue with the conjugate gradient algorithm. The backpropagation-through-time strategy combined with conjugate gradients appears to be promising as well.
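A hedged sketch of the hybrid strategy reported as most promising (the toy data, network size, learning rate, and the use of a slow numerical gradient are assumptions made only to keep the example self-contained): take a number of steepest-descent steps first, then hand the weight vector over to a conjugate-gradient optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy regression problem and a tiny one-hidden-layer network (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
t = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((200, 1))
n_in, n_hid = 3, 8

def unpack(w):
    W1 = w[: n_hid * n_in].reshape(n_hid, n_in)
    W2 = w[n_hid * n_in :].reshape(1, n_hid)
    return W1, W2

def loss(w):
    W1, W2 = unpack(w)
    return 0.5 * np.mean((np.tanh(X @ W1.T) @ W2.T - t) ** 2)

def grad(w, eps=1e-6):
    # Central-difference gradient keeps the sketch short; in practice the
    # analytic backpropagation gradient would be used instead.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = 0.1 * rng.standard_normal(n_hid * n_in + n_hid)

# Phase 1: steepest descent (plain backpropagation updates).
for _ in range(50):
    w = w - 0.1 * grad(w)

# Phase 2: continue with the conjugate gradient algorithm.
result = minimize(loss, w, jac=grad, method="CG")
```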
Proceedings of the IEEE, 1990
Backpropagation is now the most widely used tool in the field of artificial neural networks. At the core of backpropagation is a method for calculating derivatives exactly and efficiently in any large system made up of elementary subsystems or calculations which are represented by known, differentiable functions; thus, backpropagation has many applications which do not involve neural networks as such. This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis. Next, it presents the basic equations for backpropagation through time, and discusses applications to areas like pattern recognition involving dynamic systems, systems identification, and control. Finally, it describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed.
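The chain rule for ordered derivatives mentioned at the end can be stated compactly (in one common notation; the symbol $\partial^{+}$ denotes the ordered, i.e. total, derivative through the computation):

```latex
% Chain rule for ordered derivatives: for quantities x_1, ..., x_n computed in
% order, with TARGET depending on them,
\begin{equation}
  \frac{\partial^{+}\,\mathrm{TARGET}}{\partial x_i}
  \;=\;
  \frac{\partial\,\mathrm{TARGET}}{\partial x_i}
  \;+\;
  \sum_{j > i}
  \frac{\partial^{+}\,\mathrm{TARGET}}{\partial x_j}\,
  \frac{\partial x_j}{\partial x_i},
\end{equation}
% evaluated for i = n, n-1, ..., 1; sweeping backwards through the ordered
% system in this way is exactly the backward pass of backpropagation.
```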
2015
Robustness to particular transformations is a desired property in many classification tasks. For example, in image classification tasks the predictions should be invariant to variations in location, size, angle, brightness, etc. Standard neural networks do not have this property. We propose an extension of the backpropagation algorithm that trains a neural network to be robust to variations and noise in the feature vector. This extension consists of an additional forward pass performed on the derivatives that are obtained at the end of the backward pass. We perform a theoretical and experimental comparison with standard BP and with the two most similar existing approaches (Tangent BP and Adversarial Training), and show how both of them can be sped up by approximately 20%. We evaluate our algorithm on a collection of datasets for image classification, confirm its theoretically established properties, and demonstrate an improvement of the classification accuracy with respect to the competing algorithms in the majority of cases. * http://www.demyanov.net
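The extra forward pass on the derivatives is specific to the paper, but a closely related robustness penalty, an input-gradient ("double backpropagation") term, can be sketched in a few lines of PyTorch; the model, batch shapes, and the weight lam below are assumptions for illustration, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

# Toy classifier and batch (shapes and sizes are illustrative).
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5)
)
x = torch.randn(32, 20, requires_grad=True)
y = torch.randint(0, 5, (32,))
lam = 0.1   # assumed weight of the robustness term

loss = F.cross_entropy(model(x), y)
# Derivatives w.r.t. the feature vector, kept in the graph so the penalty
# itself can be differentiated w.r.t. the parameters.
(grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
total = loss + lam * grad_x.pow(2).sum(dim=1).mean()
total.backward()   # parameter gradients for a robustness-regularized update
```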
IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), 1999
Possible paradigms for concept learning by feedforward neural networks include discrimination and recognition. An interesting aspect of this dichotomy is that the recognition-based implementation can learn certain domains much more efficiently than the discrimination-based one, despite the close structural relationship between the two systems. The purpose of this paper is to explain this difference in efficiency. We suggest that it is caused by a difference in the generalization strategy adopted by the Backpropagation procedure in the two cases: while the autoassociator uses a (fast) bottom-up strategy, the MLP has recourse to a (slow) top-down one, despite the fact that the two systems are both optimized by the Backpropagation procedure. This result is important because it sheds some light on the nature of Backpropagation's adaptive capability. From a practical viewpoint, it suggests a deterministic way to increase the efficiency of Backpropagation-trained feedforward networks.
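A hedged sketch of the two paradigms on synthetic data (the data, layer sizes, and helper name are assumptions): discrimination trains a single backpropagation network to separate the classes, while recognition trains one backpropagation autoassociator per class and assigns a sample to the class whose autoassociator reconstructs it with the smallest error.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 10)),     # class 0
               rng.normal(2.0, 1.0, (200, 10))])    # class 1
y = np.array([0] * 200 + [1] * 200)

# Discrimination: one MLP trained by backpropagation to separate the classes.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)

# Recognition: one autoassociator per class, trained by backpropagation to
# reconstruct its own class only.
autoassociators = [
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(X[y == c], X[y == c])
    for c in (0, 1)
]

def recognise(sample):
    """Assign the class whose autoassociator reconstructs the sample best."""
    errors = [np.mean((a.predict(sample.reshape(1, -1)) - sample) ** 2)
              for a in autoassociators]
    return int(np.argmin(errors))
```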
IEEE Transactions on Neural Networks, 2007
ORiON, 2009
The feedforward neural network architecture uses backpropagation learning to determine optimal weights between the different interconnected layers. This learning procedure uses a gradient descent technique applied to a sum-of-squares error function for the given input-output pattern. It employs an iterative procedure to minimise the error function for a given set of patterns, by adjusting the weights of the network. The first derivatives of the error with respect to the weights identify the local error surface in the descent direction. Hence the network exhibits a different local error surface for every different pattern presented to it, and weights are iteratively modified in order to minimise the current local error. The determination of an optimal weight vector is possible only when the total minimum error (the mean of the minimum local errors) over all patterns from the training set is minimised. In this paper, we present a general mathematical formulation for the second derivative of the error function with respect to the weights (which represents a conjugate descent) for arbitrary feedforward neural network topologies, and we use this derivative information to obtain the optimal weight vector. The local error is backpropagated among the units of hidden layers via the second-order derivative of the error with respect to the weights of the hidden and output layers, independently and also in combination. The new total minimum error point may be evaluated with the help of the current total minimum error and the currently minimised local error. The weight modification process is performed twice: once with respect to the present local error and once more with respect to the current total or mean error. We present some numerical evidence that our proposed method yields better network weights than those determined via a conventional gradient descent approach.
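As a single-unit illustration of the second-derivative information involved (this is not the paper's general multi-layer formulation; f is the unit's activation function, a = w^T x its net input, t the target, and E the sum-of-squares error for one pattern):

```latex
% Single unit y = f(a), a = w^{\top} x, target t, error E = \tfrac{1}{2}(t - y)^2:
\begin{align}
  \nabla_{w} E     &= (y - t)\, f'(a)\, x, \\
  \nabla^{2}_{w} E &= \bigl[\, f'(a)^{2} + (y - t)\, f''(a) \,\bigr]\; x\, x^{\top}.
\end{align}
% The curvature combines the squared first derivative of the activation with the
% residual-weighted second derivative; propagating such second-order terms through
% the hidden layers is what the paper generalizes to arbitrary feedforward topologies.
```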
