
Wednesday, November 13, 2019

Derivation of Important Formulas in Backward Propagation

In Andrew Ng's first deep learning course one encounters a handful of backward-propagation formulas that we need to translate into Python; they are stated below as (1)–(5), together with the formula for $db^{[\ell]}$, and then proved.


The key idea of backward propagation is to compute $dW^{[\ell]}$ and $db^{[\ell]}$ for each layer $\ell$ and then update $W^{[\ell]} := W^{[\ell]} - \alpha\, dW^{[\ell]}$ and $b^{[\ell]} := b^{[\ell]} - \alpha\, db^{[\ell]}$. The formulas below show that it is enough to know $dA^{[\ell]}$ and $dZ^{[\ell]}$ for each layer $\ell$: thanks to (4) and (5) the process can propagate backwards from the top layer to the bottom and complete all of these computations, which is the key reason we need the following formulas.
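To make the update step concrete, here is a minimal Python sketch of gradient descent over the layers; the dictionary layout parameters["W1"], grads["dW1"], ... and the name update_parameters are assumptions made for this illustration, not something fixed by the formulas.

def update_parameters(parameters, grads, alpha, L):
    # Gradient descent: W[l] := W[l] - alpha * dW[l], b[l] := b[l] - alpha * db[l].
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters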

Consider the case of only one training example. Then the cost function $C$ can be expressed through the loss function $\mathcal{L}$ as $C = \mathcal{L}(a^{[L]}, y)$. Define $dz^{[\ell]}_i = \frac{\partial \mathcal{L}}{\partial z^{[\ell]}_i}$, $da^{[\ell]}_i = \frac{\partial \mathcal{L}}{\partial a^{[\ell]}_i}$, $z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}$ and $dW^{[\ell]} := \frac{\partial C}{\partial W^{[\ell]}}$ for each $\ell$-th layer. Note that we deliberately define $dz^{[\ell]}$ and $da^{[\ell]}$ with numerator $\mathcal{L}$, not $C$; this is not a typo. We are going to show that:
$$dW^{[\ell]}_{ij} = dz^{[\ell]}_i\, a^{[\ell-1]}_j, \tag{1}$$

and we then extend it to $m$ training examples with $C = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}(a^{[L](i)}, y^{(i)})$ and

$$dW^{[\ell]} = \frac{1}{m}\, dZ^{[\ell]} A^{[\ell-1]T}, \tag{2}$$

where $dZ^{[\ell]} = \big[\, dz^{[\ell](1)} \ \cdots \ dz^{[\ell](m)} \,\big]$ and $A^{[\ell]} = \big[\, a^{[\ell](1)} \ \cdots \ a^{[\ell](m)} \,\big]$. We also have $dz^{[\ell]} = \big(W^{[\ell+1]T} dz^{[\ell+1]}\big) * g^{[\ell]\prime}(z^{[\ell]})$ for the case of only one training example, where $*$ denotes the elementwise product, and from this, by stacking the columns horizontally, we get

$$dZ^{[\ell]} = W^{[\ell+1]T} dZ^{[\ell+1]} * g^{[\ell]\prime}(Z^{[\ell]}) \tag{3}$$

for general $m$ training examples. Next,

$$dA^{[\ell-1]} = W^{[\ell]T} dZ^{[\ell]}. \tag{4}$$

By combining the last two results we get

$$dZ^{[\ell]} = dA^{[\ell]} * g^{[\ell]\prime}(Z^{[\ell]}). \tag{5}$$

We also mention without proof that $db^{[\ell]} = \frac{1}{m}\sum_{i=1}^{m} dz^{[\ell](i)} = \frac{1}{m}\,\texttt{np.sum}(dZ^{[\ell]}, \texttt{axis=1}, \texttt{keepdims=True})$, as this simply follows from the chain rule and is a matter of notation.
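Translated into Python, formulas (2), (4), (5) and the $db^{[\ell]}$ formula give the backward step for one layer. The sketch below is only an illustration of that translation; the function name layer_backward and the helper g_prime (the elementwise derivative of the layer's activation) are my own names, not taken from the course.

import numpy as np

def layer_backward(dA, Z, A_prev, W, g_prime):
    # dA: dA[l], shape (n_l, m); Z: Z[l]; A_prev: A[l-1], shape (n_{l-1}, m); W: W[l].
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                           # formula (5)
    dW = dZ @ A_prev.T / m                         # formula (2)
    db = np.sum(dZ, axis=1, keepdims=True) / m     # the db[l] formula
    dA_prev = W.T @ dZ                             # formula (4): this is dA[l-1]
    return dA_prev, dW, db

For a ReLU layer, for example, one could pass g_prime = lambda Z: (Z > 0).astype(float).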

Proof of (1). Consider the case of only one training example. Then

$$dW^{[\ell]}_{ij} := \frac{\partial C}{\partial W^{[\ell]}_{ij}} = \frac{\partial C}{\partial z^{[\ell]}}\, \frac{\partial z^{[\ell]}}{\partial W^{[\ell]}_{ij}} = \sum_k \frac{\partial C}{\partial z^{[\ell]}_k}\, \frac{\partial z^{[\ell]}_k}{\partial W^{[\ell]}_{ij}} = \sum_k \sum_u dz^{[\ell]}_k\, \frac{\partial W^{[\ell]}_{ku}}{\partial W^{[\ell]}_{ij}}\, a^{[\ell-1]}_u,$$

and the summand survives only when $k=i$ and $u=j$; therefore $dW^{[\ell]}_{ij} = dz^{[\ell]}_i\, a^{[\ell-1]}_j$. This establishes (1).
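As a quick numerical sanity check of (1), with arbitrary made-up dimensions: for one training example, $dW^{[\ell]}$ is simply the outer product of $dz^{[\ell]}$ and $a^{[\ell-1]}$.

import numpy as np

rng = np.random.default_rng(0)
dz = rng.standard_normal(4)       # dz[l], with n_l = 4
a_prev = rng.standard_normal(3)   # a[l-1], with n_{l-1} = 3

dW = np.outer(dz, a_prev)         # (1): dW_ij = dz_i * a_j
assert np.allclose(dW[2, 1], dz[2] * a_prev[1])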

Proof of (2). Next, when $C = \frac{1}{m}\sum_{k=1}^{m} \mathcal{L}(a^{[L](k)}, y^{(k)})$, we use the previous formula termwise to obtain

$$dW^{[\ell]}_{ij} := \frac{\partial C}{\partial W^{[\ell]}_{ij}} = \frac{1}{m}\sum_{k=1}^{m} \frac{\partial \mathcal{L}}{\partial W^{[\ell]}_{ij}} = \frac{1}{m}\sum_{k=1}^{m} dz^{[\ell](k)}_i\, a^{[\ell-1](k)}_j,$$

where each summand is evaluated at the corresponding training example, i.e. at the corresponding column. Now for $Z = [Z_{ij}]$ and $A = [A_{ij}]$ we have, by the definition of the transpose of a matrix, $\sum_k Z_{ik} A_{jk} = \sum_k Z_{ik} (A^T)_{kj} = (Z A^T)_{ij}$, so the above yields $dW^{[\ell]}_{ij} = \frac{1}{m}\big(dZ^{[\ell]} A^{[\ell-1]T}\big)_{ij}$, as desired.
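The transpose identity $\sum_k Z_{ik}A_{jk} = (ZA^T)_{ij}$ is exactly what turns the average of per-example outer products into one matrix product; the following small check (with made-up shapes) illustrates this.

import numpy as np

rng = np.random.default_rng(1)
n_l, n_prev, m = 4, 3, 5
dZ = rng.standard_normal((n_l, m))          # columns are dz[l](k)
A_prev = rng.standard_normal((n_prev, m))   # columns are a[l-1](k)

# Average of the per-example outer products ...
dW_termwise = sum(np.outer(dZ[:, k], A_prev[:, k]) for k in range(m)) / m
# ... equals the matrix form (2).
assert np.allclose(dW_termwise, dZ @ A_prev.T / m)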

Proof of (3). Again consider only one training example first. Since

$$dz^{[\ell]}_i = \sum_k \frac{\partial C}{\partial z^{[\ell+1]}_k}\, \frac{\partial z^{[\ell+1]}_k}{\partial z^{[\ell]}_i} = \sum_k dz^{[\ell+1]}_k\, W^{[\ell+1]}_{ki}\, g^{[\ell]\prime}(z^{[\ell]}_i) = \big(W^{[\ell+1]T}\big)_{i:}\, dz^{[\ell+1]}\, g^{[\ell]\prime}(z^{[\ell]}_i),$$

stacking these entries vertically yields

$$dz^{[\ell]} = W^{[\ell+1]T} dz^{[\ell+1]} * g^{[\ell]\prime}(z^{[\ell]}).$$

Note that the matrix product and the elementwise product $*$ are associative here, so no parentheses are needed.
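The stacking step can also be checked numerically; the snippet below uses tanh as a concrete choice of $g^{[\ell]}$ (my choice, purely for illustration) and compares the column-by-column computation with the matrix form (3).

import numpy as np

def g_prime(z):
    # Derivative of tanh, used here as a concrete activation g[l].
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(2)
n_next, n_l, m = 4, 3, 5
W_next = rng.standard_normal((n_next, n_l))   # W[l+1]
dZ_next = rng.standard_normal((n_next, m))    # dZ[l+1]
Z = rng.standard_normal((n_l, m))             # Z[l]

# Column by column: dz[l](k) = W[l+1]^T dz[l+1](k) * g'(z[l](k)) ...
cols = [W_next.T @ dZ_next[:, k] * g_prime(Z[:, k]) for k in range(m)]
# ... stacked horizontally, this equals the matrix form (3).
assert np.allclose(np.column_stack(cols), W_next.T @ dZ_next * g_prime(Z))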

Proof of (4). Since

$$da^{[\ell]}_i = \frac{\partial C}{\partial z^{[\ell+1]}}\, \frac{\partial z^{[\ell+1]}}{\partial a^{[\ell]}_i} = \sum_j dz^{[\ell+1]}_j\, W^{[\ell+1]}_{ji} = \big(W^{[\ell+1]T} dz^{[\ell+1]}\big)_i,$$

the result follows from stacking the columns horizontally.
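The same kind of check works for (4): computing $W^{[\ell+1]T} dz^{[\ell+1]}$ example by example and stacking the columns reproduces $W^{[\ell+1]T} dZ^{[\ell+1]}$ (again with arbitrary made-up shapes).

import numpy as np

rng = np.random.default_rng(3)
n_next, n_l, m = 4, 3, 5
W_next = rng.standard_normal((n_next, n_l))   # W[l+1]
dZ_next = rng.standard_normal((n_next, m))    # dZ[l+1]

cols = [W_next.T @ dZ_next[:, k] for k in range(m)]            # da[l] for each example k
assert np.allclose(np.column_stack(cols), W_next.T @ dZ_next)  # formula (4)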
