Wikipedia records the following formula without proof:
I accidentally found that the formulas in the previous post already allow us to derive the following:
Theorem. For every $\ell<L-1$, let $\Phi^{[\ell]}:\mathbb{R}\to\mathbb{R}$ denote the activation function in the hidden layer. Then
$$\frac{\partial L}{\partial W^{[\ell]}}
=\underbrace{\frac{1}{m}\,\Phi^{[\ell]\prime}(U^{[\ell]})\ast\left[\prod_{i=\ell+1}^{L-1}W^{[i]T}\cdot\Phi^{[i]\prime}(U^{[i]})\ast\right]\frac{\partial L}{\partial Y^{[L-1]}}}_{=:\;\delta_\ell}\cdot\,Y^{[\ell-1]T}
=\delta_\ell\cdot Y^{[\ell-1]T}.$$
Here $\ast$ denotes the entrywise (Hadamard) multiplication. Since $\frac{\partial L}{\partial W^{[L]}}=\frac{1}{m}\cdot\frac{\partial L}{\partial U^{[L]}}\cdot Y^{[L-1]T}$, we also define
$$\delta_L=\frac{1}{m}\cdot\frac{\partial L}{\partial U^{[L]}},$$
and since
$$\frac{\partial L}{\partial W^{[L-1]}}=\frac{1}{m}\,\Phi^{[L-1]\prime}(U^{[L-1]})\ast\Big(W^{[L]T}\cdot\frac{\partial L}{\partial U^{[L]}}\Big)\,Y^{[L-2]T}=\delta_{L-1}\,Y^{[L-2]T}$$
with $\delta_{L-1}:=\frac{1}{m}\,\Phi^{[L-1]\prime}(U^{[L-1]})\ast\Big(W^{[L]T}\cdot\frac{\partial L}{\partial U^{[L]}}\Big)$, by the definition of $\delta_\ell$ for $\ell<L-1$ above, we obtain for every $\ell\le L-1$,
$$\delta_\ell=\Phi^{[\ell]\prime}(U^{[\ell]})\ast\big[W^{[\ell+1]T}\cdot\delta_{\ell+1}\big]\qquad\text{with}\qquad\frac{\partial L}{\partial W^{[\ell]}}=\delta_\ell\,Y^{[\ell-1]T}.$$
And as a side consequence of our computation, since $\frac{1}{m}\cdot\frac{\partial L}{\partial U^{[\ell]}}=\delta_\ell$,
$$\frac{\partial L}{\partial b^{[\ell]}}=\texttt{np.sum}(\delta_\ell,\ \texttt{axis=1}).$$
The last two formulas are computationally very useful. Note that in the definition of $\delta_\ell$, the multiplications inside the product notation only make sense when the factors act on the rightmost matrix $\frac{\partial L}{\partial Y^{[L-1]}}$ in the correct order (starting from the biggest index $i$). To simplify notation we follow Andrew Ng's course and define $dW=\partial L/\partial W$, and similarly for other matrices.
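As a sanity check of the recursion $\delta_\ell=\Phi^{[\ell]\prime}(U^{[\ell]})\ast[W^{[\ell+1]T}\cdot\delta_{\ell+1}]$ and the two formulas above, here is a minimal NumPy sketch. The layer sizes, sigmoid hidden activations, identity output layer, and squared-error loss $L=\frac{1}{2m}\lVert Y^{[3]}-T\rVert_F^2$ are my own illustrative assumptions, not taken from the post; with that loss, $\delta_L=\frac1m(Y^{[3]}-T)$.

```python
import numpy as np

# Toy L = 3 network: sigmoid hidden layers, identity output,
# squared-error loss L = ||Y3 - T||^2 / (2m).  All sizes are made up.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                  # n0, n1, n2, n3
m = 7                                 # batch size; columns are samples
W = [None] + [rng.standard_normal((sizes[l], sizes[l-1])) for l in range(1, 4)]
b = [None] + [rng.standard_normal((sizes[l], 1)) for l in range(1, 4)]
X = rng.standard_normal((sizes[0], m))
T = rng.standard_normal((sizes[3], m))

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))

def forward(W, b):
    Y, U = [X], [None]
    for l in (1, 2):                              # hidden layers: sigmoid
        U.append(W[l] @ Y[l-1] + b[l]); Y.append(sigma(U[l]))
    U.append(W[3] @ Y[2] + b[3]); Y.append(U[3])  # identity output layer
    loss = np.sum((Y[3] - T) ** 2) / (2 * m)
    return U, Y, loss

U, Y, _ = forward(W, b)

# Backward pass: delta_3 = (1/m) dL/dU_3, then
# delta_l = Phi_l'(U_l) * (W_{l+1}^T delta_{l+1}).
delta = {3: (Y[3] - T) / m}
for l in (2, 1):
    delta[l] = sigma(U[l]) * (1 - sigma(U[l])) * (W[l+1].T @ delta[l+1])

dW = {l: delta[l] @ Y[l-1].T for l in (1, 2, 3)}          # dL/dW_l
db = {l: np.sum(delta[l], axis=1, keepdims=True) for l in (1, 2, 3)}

# Finite-difference check of one entry of dW[1].
eps = 1e-6
Wp = [w.copy() if w is not None else None for w in W]
Wp[1][0, 0] += eps
num = (forward(Wp, b)[2] - forward(W, b)[2]) / eps
print(abs(num - dW[1][0, 0]) < 1e-5)   # True
```

Keeping the samples as columns is what makes $\delta_\ell Y^{[\ell-1]T}$ sum the per-sample contributions automatically; `keepdims=True` in the bias gradient is only a shape convenience on top of the `np.sum(δ_ℓ, axis=1)` formula.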
Proof. By repeated use of the formula $dY^{[\ell]}=W^{[\ell+1]T}\cdot\big[dY^{[\ell+1]}\ast\Phi^{[\ell+1]\prime}(U^{[\ell+1]})\big]$ we have
$$dW^{[\ell]}=\frac{1}{m}\,dU^{[\ell]}\,Y^{[\ell-1]T}
=\frac{1}{m}\big(dY^{[\ell]}\ast\Phi^{[\ell]\prime}(U^{[\ell]})\big)\,Y^{[\ell-1]T}
=\frac{1}{m}\left(\Phi^{[\ell]\prime}(U^{[\ell]})\ast\left[\prod_{i=\ell+1}^{L-1}W^{[i]T}\cdot\Phi^{[i]\prime}(U^{[i]})\ast\right]dY^{[L-1]}\right)Y^{[\ell-1]T}.$$
And recall that $dY^{[L-1]}=\frac{\partial L}{\partial Y^{[L-1]}}$. $\blacksquare$
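To see numerically that the closed-form product in the theorem agrees with the recursion, here is a small self-contained check (again with made-up sizes and sigmoid hidden activations; `G` stands in for an arbitrary $\partial L/\partial U^{[3]}$). It computes $\delta_1$ once by the recursion and once by the product formula, applying the factors from the biggest index onto the rightmost matrix:

```python
import numpy as np

# Toy L = 3 network; sizes and activations are illustrative assumptions.
rng = np.random.default_rng(1)
n0, n1, n2, n3, m = 4, 5, 3, 2, 6
W1, W2, W3 = (rng.standard_normal((a, b))
              for a, b in [(n1, n0), (n2, n1), (n3, n2)])
X = rng.standard_normal((n0, m))
G = rng.standard_normal((n3, m))        # stands in for dL/dU_3

sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
dsigma = lambda u: sigma(u) * (1.0 - sigma(u))

U1 = W1 @ X; Y1 = sigma(U1)
U2 = W2 @ Y1

# Recursion: delta_3 = G/m, delta_l = Phi_l'(U_l) * (W_{l+1}^T delta_{l+1}).
d3 = G / m
d2 = dsigma(U2) * (W3.T @ d3)
d1 = dsigma(U1) * (W2.T @ d2)

# Closed form for l = 1: the product runs over i = 2 only, so
# delta_1 = (1/m) Phi_1'(U1) * [ W2^T ( Phi_2'(U2) * dL/dY2 ) ],
# with dL/dY2 = W3^T G, factors applied from the biggest index.
dY2 = W3.T @ G
d1_closed = (1 / m) * dsigma(U1) * (W2.T @ (dsigma(U2) * dY2))

print(np.allclose(d1, d1_closed))   # True
```

The two expressions agree because the scalar $\frac1m$ and the alternating "Hadamard, then left-multiply" steps commute exactly as in the proof's unrolling.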