Symmetry of backpropagation and chain rule
2002, Proceedings of the 2002 International Joint Conference on Neural Networks. IJCNN'02 (Cat. No.02CH37290)
https://doi.org/10.1109/IJCNN.2002.1005528…
5 pages
Abstract
Gradient backpropagation, viewed as a method of computing derivatives of composite functions, is commonly presented as a version of the chain rule. We show that this is not true and that the two methods are, in a sense, opposites. While the chain rule requires derivatives with respect to all variables that influence a given intermediate variable, backpropagation requires derivatives of all variables that are influenced by the present variable. With this in mind, deriving the gradient of even complicated neural networks becomes almost trivial. In matrix form, the two methods differ only in the order of matrix multiplication. Use of the chain rule is almost automatic because we all learn it in mathematical analysis courses; use of backpropagation could be equally automatic if it were introduced in university mathematics education as an equivalent, alternative way of differentiating composite functions.
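To make the "order of matrix multiplication" point concrete, here is a minimal NumPy sketch (the shapes and the random Jacobians are illustrative, not from the paper): chain-rule accumulation multiplies the Jacobians from the input side outward, backpropagation multiplies them from the output side inward, and the two orderings give the same derivative.

```python
import numpy as np

# Jacobians of a toy composition f = f3 . f2 . f1 at some point
# (random matrices stand in for the derivatives; shapes are illustrative).
rng = np.random.default_rng(0)
J1 = rng.standard_normal((4, 3))   # d f1 / d x      (x in R^3)
J2 = rng.standard_normal((5, 4))   # d f2 / d f1
J3 = rng.standard_normal((1, 5))   # d f3 / d f2     (scalar output)

# Chain-rule (forward) accumulation: start at the input and work outwards,
# combining the derivatives of each intermediate variable with respect to
# the variables that influence it.
forward = J3 @ (J2 @ J1)

# Backpropagation (reverse) accumulation: start at the output and work inwards,
# combining the derivatives of the variables that the current variable influences.
reverse = (J3 @ J2) @ J1

# Same derivative, different order of matrix multiplication.
assert np.allclose(forward, reverse)
```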
Related papers
Nonlinear Analysis: Theory, Methods & Applications, 1997
Backprop is the primary learning algorithm used in many machine learning systems. In practice, however, Backprop in deep neural networks is a highly sensitive learning algorithm, and successful learning depends on numerous conditions and constraints. One set of constraints is to avoid weights that lead to saturated units. The motivation for avoiding unit saturation is that gradients vanish and, as a result, learning comes to a halt. Careful weight initialization and re-scaling schemes such as batch normalization ensure that the input activity to each neuron stays within the linear regime, where gradients do not vanish and can flow. Here we investigate backpropagating error terms only linearly. That is, we ignore saturation, thereby ensuring that gradients always flow. We refer to this learning rule as Linear Backprop, since in the backward pass the network appears to be linear. In addition to ensuring persistent gradient flow, Linear Backprop is also favorable when computation is expensive, since the derivatives of the activation functions are never computed. Our early results suggest that learning with Linear Backprop is competitive with Backprop while saving expensive gradient computations.
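A minimal NumPy sketch of the backward-pass modification described above, for a two-layer tanh network (the tanh activation, the shapes, and the function names are assumptions for illustration): standard Backprop scales the backpropagated error by the activation derivative, which vanishes for saturated units, whereas the linear variant omits that factor so the error signal always flows.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer network: y = W2 tanh(W1 x)."""
    z1 = W1 @ x
    h = np.tanh(z1)            # nonlinear forward pass (unchanged)
    y = W2 @ h
    return z1, h, y

def backward_standard(x, z1, h, W2, dy):
    """Standard Backprop: the error is scaled by tanh'(z1),
    which vanishes when the unit saturates."""
    dz1 = (W2.T @ dy) * (1.0 - np.tanh(z1) ** 2)
    return np.outer(dz1, x), np.outer(dy, h)    # dW1, dW2

def backward_linear(x, z1, h, W2, dy):
    """Linear Backprop (as described in the abstract): the activation
    derivative is ignored, so the backward pass treats the network as
    linear and the error always flows."""
    dz1 = W2.T @ dy
    return np.outer(dz1, x), np.outer(dy, h)    # dW1, dW2
```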
IEEE Access
In this work, we propose an artificial neural network topology to estimate the derivative of a function. This topology is called a differential neural network because it allows the estimation of the derivative of any of the network outputs with respect to any of its inputs. The main advantage of a differential neural network is that it uses some of the weights of a multilayer neural network. Therefore, a differential neural network does not need to be trained. First, a multilayer neural network is trained to find the best set of weights that minimize an error function. Second, the weights of the trained network and its neuron activations are used to build a differential neural network. Consequently, a multilayer artificial neural network can produce a specific output and, simultaneously, estimate the derivative of any of its outputs with respect to any of its inputs. Several computer simulations were carried out to validate the performance of the proposed method. The computer simulation results showed that differential neural networks are capable of estimating the derivative of a function with good accuracy. The method was developed for an artificial neural network with two layers; however, the method can be extended to more than two layers. Similarly, the analysis in this study is presented for two common activation functions. Nonetheless, other activation functions can be used as long as the derivative of the activation function can be computed. Index terms: differential neural network, artificial intelligence, neural network structure, derivative estimation, multilayer network.
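A hedged sketch of the construction for a two-layer tanh network (tanh and the variable names are assumptions; the paper treats two common activation functions generically): once W1, b1, W2, b2 have been trained, the derivative of any output with respect to any input is assembled from the same weights and the stored activations, with no further training.

```python
import numpy as np

def mlp_and_jacobian(x, W1, b1, W2, b2):
    """Two-layer MLP y = W2 tanh(W1 x + b1) + b2 together with the full
    Jacobian dy/dx, both built from the already-trained weights."""
    z1 = W1 @ x + b1
    h = np.tanh(z1)
    y = W2 @ h + b2
    # Differential network: dy/dx = W2 . diag(tanh'(z1)) . W1,
    # where tanh'(z1) = 1 - h**2 reuses the stored activations.
    dydx = W2 @ (np.diag(1.0 - h ** 2) @ W1)
    return y, dydx
```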
Conference Record Southcon, 1994
This paper reviews a formalism that enables the dynamics of a broad class of neural networks to be understood. This formalism is then applied to a specific network and the predicted and simulated behavior of the system are compared. A number of previous works have analysed the Lyapunov stability of neural network models. This type of analysis shows that the excursion of the solutions from a stable point is bounded. The purpose of this work is to review and then utilize a model of the dynamics that also describes the phase space behavior and structural stability of the system. This is achieved by writing the general equations of the neural network dynamics as a gradient-like system. In this paper it is demonstrated that a network with additive activation dynamics and Hebbian weight update dynamics can be expressed as a gradient-like system. An example of a 3-layer network with feedback between adjacent layers is presented. It is shown that the process of weight learning is stable in this network when the learned weights are symmetric. Furthermore, the weight learning process is stable when the learned weights are asymmetric, provided that the activation is computed using only the symmetric part of the weights.
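As a minimal illustration of how additive activation dynamics with symmetric weights become a gradient-like system (a standard Hopfield-style energy construction, not necessarily the exact formalism reviewed in the paper, which also covers Hebbian weight dynamics):

```latex
% Additive activation dynamics with a monotone sigmoid g and outputs V_i = g(u_i):
\begin{align}
  \dot{u}_i &= -u_i + \sum_j w_{ij}\, g(u_j) + I_i .
\end{align}
% For symmetric weights w_{ij} = w_{ji}, the energy
\begin{align}
  E(V) &= -\tfrac{1}{2}\sum_{i,j} w_{ij} V_i V_j
          + \sum_i \int_0^{V_i} g^{-1}(v)\, dv
          - \sum_i I_i V_i
\end{align}
% satisfies \partial E / \partial V_i = -\dot{u}_i, so along trajectories
\begin{align}
  \dot{E} = \sum_i \frac{\partial E}{\partial V_i}\,\dot{V}_i
          = -\sum_i g'(u_i)\,\dot{u}_i^{\,2} \;\le\; 0 ,
\end{align}
% i.e. the dynamics descend E and are gradient-like when the weights are symmetric.
```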
2015
In this paper we explore different strategies to guide the backpropagation algorithm used for training artificial neural networks. Two different variants of the steepest descent-based backpropagation algorithm and four different variants of the conjugate gradient algorithm are tested. The variants differ in whether or not the time component is used, and whether or not additional gradient information is utilized during one-dimensional optimization. Testing is performed on randomly generated data as well as on some benchmark data regarding energy prediction. Based on our test results, it appears that the most promising backpropagation strategy is to initially use the steepest descent algorithm and then continue with the conjugate gradient algorithm. The backpropagation-through-time strategy combined with conjugate gradients appears to be promising as well.
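A hedged sketch of the hybrid strategy reported as most promising (the toy data, network size, learning rate, and the use of a slow numerical gradient are assumptions made only to keep the example self-contained): take a number of steepest-descent steps first, then hand the weight vector over to a conjugate-gradient optimizer.

```python
import numpy as np
from scipy.optimize import minimize

# Toy regression problem and a tiny one-hidden-layer network (illustrative only).
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
t = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((200, 1))
n_in, n_hid = 3, 8

def unpack(w):
    W1 = w[: n_hid * n_in].reshape(n_hid, n_in)
    W2 = w[n_hid * n_in :].reshape(1, n_hid)
    return W1, W2

def loss(w):
    W1, W2 = unpack(w)
    return 0.5 * np.mean((np.tanh(X @ W1.T) @ W2.T - t) ** 2)

def grad(w, eps=1e-6):
    # Central-difference gradient keeps the sketch short; in practice the
    # analytic backpropagation gradient would be used instead.
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (loss(w + e) - loss(w - e)) / (2 * eps)
    return g

w = 0.1 * rng.standard_normal(n_hid * n_in + n_hid)

# Phase 1: steepest descent (plain backpropagation updates).
for _ in range(50):
    w = w - 0.1 * grad(w)

# Phase 2: continue with the conjugate gradient algorithm.
result = minimize(loss, w, jac=grad, method="CG")
```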
Proceedings of the IEEE, 1990
Backpropagation is now the most widely used tool in the field of artificial neural networks. At the core of backpropagation is a method for calculating derivatives exactly and efficiently in any large system made up of elementary subsystems or calculations which are represented by known, differentiable functions; thus, backpropagation has many applications which do not involve neural networks as such. This paper first reviews basic backpropagation, a simple method which is now being widely used in areas like pattern recognition and fault diagnosis. Next, it presents the basic equations for backpropagation through time, and discusses applications to areas like pattern recognition involving dynamic systems, systems identification, and control. Finally, it describes further extensions of this method, to deal with systems other than neural networks, systems involving simultaneous equations or true recurrent networks, and other practical issues which arise with this method. Pseudocode is provided to clarify the algorithms. The chain rule for ordered derivatives, the theorem which underlies backpropagation, is briefly discussed.
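The chain rule for ordered derivatives mentioned at the end can be stated compactly (in one common notation; the symbol $\partial^{+}$ denotes the ordered, i.e. total, derivative through the computation):

```latex
% Chain rule for ordered derivatives: for quantities x_1, ..., x_n computed in
% order, with TARGET depending on them,
\begin{equation}
  \frac{\partial^{+}\,\mathrm{TARGET}}{\partial x_i}
  \;=\;
  \frac{\partial\,\mathrm{TARGET}}{\partial x_i}
  \;+\;
  \sum_{j > i}
  \frac{\partial^{+}\,\mathrm{TARGET}}{\partial x_j}\,
  \frac{\partial x_j}{\partial x_i},
\end{equation}
% evaluated for i = n, n-1, ..., 1; sweeping backwards through the ordered
% system in this way is exactly the backward pass of backpropagation.
```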
2015
Robustness to particular transformations is a desired property in many classification tasks. For example, in image classification tasks the predictions should be invariant to variations in location, size, angle, brightness, etc. Standard neural networks do not have this property. We propose an extension of the backpropagation algorithm that trains a neural network to be robust to variations and noise in the feature vector. This extension consists of an additional forward pass performed on the derivatives that are obtained at the end of the backward pass. We perform a theoretical and experimental comparison with standard BP and with the two most similar existing approaches (Tangent BP and Adversarial Training), and show how both of them can be sped up by approximately 20%. We evaluate our algorithm on a collection of datasets for image classification, confirm its theoretically established properties, and demonstrate an improvement of the classification accuracy with respect to the competing algorithms in the majority of cases. * http://www.demyanov.net
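The extra forward pass on the derivatives is specific to the paper, but a closely related robustness penalty, an input-gradient ("double backpropagation") term, can be sketched in a few lines of PyTorch; the model, batch shapes, and the weight lam below are assumptions for illustration, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

# Toy classifier and batch (shapes and sizes are illustrative).
model = torch.nn.Sequential(
    torch.nn.Linear(20, 64), torch.nn.ReLU(), torch.nn.Linear(64, 5)
)
x = torch.randn(32, 20, requires_grad=True)
y = torch.randint(0, 5, (32,))
lam = 0.1   # assumed weight of the robustness term

loss = F.cross_entropy(model(x), y)
# Derivatives w.r.t. the feature vector, kept in the graph so the penalty
# itself can be differentiated w.r.t. the parameters.
(grad_x,) = torch.autograd.grad(loss, x, create_graph=True)
total = loss + lam * grad_x.pow(2).sum(dim=1).mean()
total.backward()   # parameter gradients for a robustness-regularized update
```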
IJCNN'99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), 1999
Possible paradigms for concept learning by feedforward neural networks include discrimination and recognition. An interesting aspect of this dichotomy is that the recognition-based implementation can learn certain domains much more efficiently than the discrimination-based one, despite the close structural relationship between the two systems. The purpose of this paper is to explain this difference in efficiency. We suggest that it is caused by a difference in the generalization strategy adopted by the Backpropagation procedure in the two cases: while the autoassociator uses a (fast) bottom-up strategy, the MLP has recourse to a (slow) top-down one, despite the fact that the two systems are both optimized by the Backpropagation procedure. This result is important because it sheds some light on the nature of Backpropagation's adaptive capability. From a practical viewpoint, it suggests a deterministic way to increase the efficiency of Backpropagation-trained feedforward networks.
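A hedged sketch of the two paradigms on synthetic data (the data, layer sizes, and helper name are assumptions): discrimination trains a single backpropagation network to separate the classes, while recognition trains one backpropagation autoassociator per class and assigns a sample to the class whose autoassociator reconstructs it with the smallest error.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 10)),     # class 0
               rng.normal(2.0, 1.0, (200, 10))])    # class 1
y = np.array([0] * 200 + [1] * 200)

# Discrimination: one MLP trained by backpropagation to separate the classes.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000).fit(X, y)

# Recognition: one autoassociator per class, trained by backpropagation to
# reconstruct its own class only.
autoassociators = [
    MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(X[y == c], X[y == c])
    for c in (0, 1)
]

def recognise(sample):
    """Assign the class whose autoassociator reconstructs the sample best."""
    errors = [np.mean((a.predict(sample.reshape(1, -1)) - sample) ** 2)
              for a in autoassociators]
    return int(np.argmin(errors))
```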
IEEE Transactions on Neural Networks, 2007
ORiON, 2009
The feedforward neural network architecture uses backpropagation learning to determine optimal weights between the different interconnected layers. This learning procedure uses a gradient descent technique applied to a sum-of-squares error function for the given input-output pattern. It employs an iterative procedure to minimise the error function for a given set of patterns, by adjusting the weights of the network. The first derivatives of the error with respect to the weights identify the local error surface in the descent direction. Hence the network exhibits a different local error surface for every different pattern presented to it, and weights are iteratively modified in order to minimise the current local error. The determination of an optimal weight vector is possible only when the total minimum error (the mean of the minimum local errors) over all patterns from the training set is minimised. In this paper, we present a general mathematical formulation for the second derivative of the error function with respect to the weights (which represents a conjugate descent) for arbitrary feedforward neural network topologies, and we use this derivative information to obtain the optimal weight vector. The local error is backpropagated among the units of hidden layers via the second-order derivative of the error with respect to the weights of the hidden and output layers, independently and also in combination. The new total minimum error point may be evaluated with the help of the current total minimum error and the currently minimised local error. The weight modification process is performed twice: once with respect to the present local error and once more with respect to the current total or mean error. We present some numerical evidence that our proposed method yields better network weights than those determined via a conventional gradient descent approach.
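As a single-unit illustration of the second-derivative information involved (this is not the paper's general multi-layer formulation; f is the unit's activation function, a = w^T x its net input, t the target, and E the sum-of-squares error for one pattern):

```latex
% Single unit y = f(a), a = w^{\top} x, target t, error E = \tfrac{1}{2}(t - y)^2:
\begin{align}
  \nabla_{w} E     &= (y - t)\, f'(a)\, x, \\
  \nabla^{2}_{w} E &= \bigl[\, f'(a)^{2} + (y - t)\, f''(a) \,\bigr]\; x\, x^{\top}.
\end{align}
% The curvature combines the squared first derivative of the activation with the
% residual-weighted second derivative; propagating such second-order terms through
% the hidden layers is what the paper generalizes to arbitrary feedforward topologies.
```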
