
Tuesday, September 29, 2020

Derive the Formula of $\frac{\partial L}{\partial W^{[\ell]}}$

Wikipedia records the following formula without proof:

[Image from Wikipedia: the matrix form of the backpropagation gradient, rederived in the Theorem below.]
I accidentally found that by the formulas in the previous post, we can already derive the following

Theorem. For every $\ell < L-1$, let $\Phi^{[\ell]}\colon \mathbb{R}\to\mathbb{R}$ denote the activation function in the $\ell$-th hidden layer (applied entrywise); then we have
$$\frac{\partial L}{\partial W^{[\ell]}} = \underbrace{\frac{1}{m}\,\Phi^{[\ell]\prime}(U^{[\ell]})\odot\left[\prod_{i=\ell+1}^{L-1} W^{[i]T}\Big(\Phi^{[i]\prime}(U^{[i]})\odot\;\cdot\;\Big)\right]\frac{\partial L}{\partial Y^{[L-1]}}}_{:=\;\delta^{\ell}}\;Y^{[\ell-1]T} = \delta^{\ell}\,Y^{[\ell-1]T}.$$
Here $\odot$ denotes the entrywise multiplication. Since $\frac{\partial L}{\partial W^{[L]}} = \frac{1}{m}\frac{\partial L}{\partial U^{[L]}}\,Y^{[L-1]T}$, we also define $\delta^{L} = \frac{1}{m}\frac{\partial L}{\partial U^{[L]}}$, and since
$$\frac{\partial L}{\partial W^{[L-1]}} = \frac{1}{m}\left(\Phi^{[L-1]\prime}(U^{[L-1]})\odot\Big(W^{[L]T}\,\tfrac{\partial L}{\partial U^{[L]}}\Big)\right) Y^{[L-2]T} = \delta^{L-1}\,Y^{[L-2]T}$$
with $\delta^{L-1} := \frac{1}{m}\,\Phi^{[L-1]\prime}(U^{[L-1]})\odot\Big(W^{[L]T}\,\tfrac{\partial L}{\partial U^{[L]}}\Big)$, by the definition of $\delta^{\ell}$ for $\ell < L-1$ above we obtain, for every $\ell \le L-1$,
$$\delta^{\ell} = \Phi^{[\ell]\prime}(U^{[\ell]})\odot\Big[W^{[\ell+1]T}\,\delta^{\ell+1}\Big] \qquad\text{with}\qquad \frac{\partial L}{\partial W^{[\ell]}} = \delta^{\ell}\,Y^{[\ell-1]T}.$$
And as a side consequence of our computation, since $\frac{1}{m}\frac{\partial L}{\partial U^{[\ell]}} = \delta^{\ell}$,
$$\frac{\partial L}{\partial b^{[\ell]}} = \texttt{np.sum}(\delta^{\ell},\,\texttt{axis}{=}1).$$
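To make the last two formulas concrete, here is a minimal NumPy sketch of the resulting backward pass. It assumes sigmoid activations in the hidden layers, a $\delta^{L}$ supplied by the caller, and a forward pass that has cached the $U^{[\ell]}$ and $Y^{[\ell]}$; the function and variable names are illustrative, not from the post.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backward(delta_L, Ws, Us, Ys):
    """Backward pass via the delta recursion of the Theorem.

    delta_L : (n_L, m) array, delta^L = (1/m) dL/dU^[L].
    Ws      : dict, Ws[l] = W^[l] of shape (n_l, n_{l-1}).
    Us      : dict, Us[l] = U^[l] = W^[l] Y^[l-1] + b^[l], shape (n_l, m).
    Ys      : dict, Ys[l] = Y^[l] = Phi^[l](U^[l]); Ys[0] is the input X.
    Returns a dict of gradients dW^[l], db^[l].
    """
    L = max(Ws)
    grads = {}
    delta = delta_L
    grads[("dW", L)] = delta @ Ys[L - 1].T            # dL/dW^[L] = delta^L Y^[L-1]T
    grads[("db", L)] = np.sum(delta, axis=1, keepdims=True)
    for l in range(L - 1, 0, -1):
        phi_prime = sigmoid(Us[l]) * (1 - sigmoid(Us[l]))  # Phi'^[l](U^[l]), sigmoid assumed
        delta = phi_prime * (Ws[l + 1].T @ delta)          # delta^l = Phi' ⊙ [W^[l+1]T delta^(l+1)]
        grads[("dW", l)] = delta @ Ys[l - 1].T             # dL/dW^[l] = delta^l Y^[l-1]T
        # keepdims=True keeps a column vector; the post writes np.sum(delta, axis=1)
        grads[("db", l)] = np.sum(delta, axis=1, keepdims=True)
    return grads
```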

The last two formulas are computationally very useful (the sketch above implements exactly these two updates). Note that in the definition of $\delta^{\ell}$, the multiplications in the product notation only make sense if the factors act on the rightmost matrix $\frac{\partial L}{\partial Y^{[L-1]}}$ in the correct order (from the biggest index). To simplify notation we follow Andrew Ng's course and define $dW^{[\ell]} = \partial L/\partial W^{[\ell]}$, and similarly for the other matrices.
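To see the ordering remark in code, one can evaluate the bracketed product in the Theorem literally, folding the operators $X \mapsto W^{[i]T}\big(\Phi^{[i]\prime}(U^{[i]})\odot X\big)$ onto $\partial L/\partial Y^{[L-1]}$ starting from the biggest index; a hypothetical sketch (names are illustrative):

```python
import numpy as np

def delta_from_product(ell, L, Ws, Us, dY_last, phi_prime, m):
    """delta^ell evaluated directly from the product formula in the Theorem.

    The factors only compose correctly when applied to dY_last = dL/dY^[L-1]
    from the biggest index i = L-1 down to i = ell+1.
    phi_prime maps U^[i] to the entrywise derivative Phi'^[i](U^[i]).
    """
    acc = dY_last
    for i in range(L - 1, ell, -1):               # i = L-1, L-2, ..., ell+1
        acc = Ws[i].T @ (phi_prime(Us[i]) * acc)  # X -> W^[i]T (Phi' ⊙ X)
    return (1.0 / m) * phi_prime(Us[ell]) * acc   # equals the recursive delta^ell
```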

Proof. By repeated use of the formula $dY^{[\ell]} = W^{[\ell+1]T}\left[dY^{[\ell+1]}\odot \Phi^{[\ell+1]\prime}(U^{[\ell+1]})\right]$ we have
$$dW^{[\ell]} = \frac{1}{m}\,dU^{[\ell]}\,Y^{[\ell-1]T} = \frac{1}{m}\left(dY^{[\ell]}\odot \Phi^{[\ell]\prime}(U^{[\ell]})\right) Y^{[\ell-1]T} = \frac{1}{m}\left(\Phi^{[\ell]\prime}(U^{[\ell]})\odot\left[\prod_{i=\ell+1}^{L-1} W^{[i]T}\Big(\Phi^{[i]\prime}(U^{[i]})\odot\;\cdot\;\Big)\right] dY^{[L-1]}\right) Y^{[\ell-1]T}.$$
And recall that $dY^{[L]} = \frac{\partial L}{\partial Y^{[L]}}$, which is computed directly from the loss function.
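Since the proof is just the chain rule, the result is easy to validate numerically by comparing $dW^{[1]}$ from the $\delta$ recursion against a centered finite difference of the cost. A self-contained check on a tiny random network (sigmoid hidden layer, linear output, squared-error cost; every name and choice below is an illustrative assumption, not part of the post):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# Tiny network: layer sizes n0=3, n1=4, n2=2, batch size m=5.
m, sizes = 5, [3, 4, 2]
Ws = {l: rng.normal(size=(sizes[l], sizes[l - 1])) for l in (1, 2)}
bs = {l: rng.normal(size=(sizes[l], 1)) for l in (1, 2)}
X = rng.normal(size=(sizes[0], m))
T = rng.normal(size=(sizes[2], m))             # targets

def cost(W):
    Y1 = sigmoid(W[1] @ X + bs[1])             # hidden layer (sigmoid assumed)
    Y2 = W[2] @ Y1 + bs[2]                     # linear output layer
    return np.sum((Y2 - T) ** 2) / (2 * m)     # averaged squared-error cost

# Backward pass via the delta recursion.
U1 = Ws[1] @ X + bs[1]
Y1 = sigmoid(U1)
U2 = Ws[2] @ Y1 + bs[2]
delta2 = (U2 - T) / m                                            # delta^2 = (1/m) dL/dU^[2]
delta1 = (sigmoid(U1) * (1 - sigmoid(U1))) * (Ws[2].T @ delta2)  # delta^1
dW1 = delta1 @ X.T                                               # dL/dW^[1] = delta^1 Y^[0]T

# Centered finite difference on one entry of W^[1].
eps, i, j = 1e-6, 0, 0
Wp = {1: Ws[1].copy(), 2: Ws[2]}; Wp[1][i, j] += eps
Wm = {1: Ws[1].copy(), 2: Ws[2]}; Wm[1][i, j] -= eps
fd = (cost(Wp) - cost(Wm)) / (2 * eps)
print(abs(dW1[i, j] - fd))                     # should be ~1e-9 or smaller
```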
