Wikipedia records the following formula without proof:
I accidentally found that the formulas in the previous post already suffice to derive the following.
Theorem. For every $\ell<L-1$, let $\Phi^{[\ell]}:\R\to \R$ denote the activation function of the $\ell$-th hidden layer. Then\[
\frac{\partial \mathcal L }{\partial W^{[\ell]}}=\underbrace{\frac{1}{m}\Phi^{[\ell]}{}'(U^{[\ell]}) * \left[\prod_{i=\ell +1}^{L-1} \Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right]\cdot \frac{\partial \mathcal L}{\partial Y^{[L-1]}} }_{:=\delta_\ell} \cdot Y^{[\ell-1]T} = \delta_{\ell}\cdot Y^{[\ell-1]T}.\]
Here $*$ denotes the entrywise multiplication. Since $\displaystyle \frac{\partial \mathcal L}{\partial W^{[L]}}=\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\cdot Y^{[L-1]T}$, we also define \[ \boxed{\delta_L = \frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}} \]and since
\[\frac{\partial \mathcal L}{\partial W^{[L-1]}} =\frac{1}{m}\Phi^{[L-1]}{}'(U^{[L-1]})* \left( W^{[L]T} \cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right) Y^{[L-2]T}=\delta_{L-1} Y^{[L-2]T} \]
with $\delta_{L-1} :=\frac{1}{m}\cdot \Phi^{[L-1]}{}'(U^{[L-1]}) * \left( W^{[L]T}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right) = \Phi^{[L-1]}{}'(U^{[L-1]}) * \left( W^{[L]T}\cdot \delta_L\right)$, by the definition of $\delta_\ell$ for $\ell<L-1$ above, we obtain for every $\ell\leq L-1$,
\[\boxed{ \delta_{\ell} = \Phi^{[\ell]}{}'(U^{[\ell]}) * \left[W^{[\ell+1]T} \cdot \delta_{\ell+1}\right]\quad \text{with}\quad \frac{\partial \mathcal L}{\partial W^{[\ell]}} = \delta_\ell Y^{[\ell-1]T}.} \]
And as a side consequence of our computation, since $\displaystyle\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}} = \delta_\ell$,
\[
\boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]}} = \text{np.sum}(\delta_\ell,\text{axis=1}).}
\]
The last two formulas are computationally very useful. Note that in the definition of $\delta_\ell$, the factors in the product notation only make sense when they act on the rightmost matrix $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}} $ in the correct order, starting from the largest index. To simplify notation we follow Andrew Ng's course and write $dW = \partial \mathcal L /\partial W$, and similarly for other matrices.
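To see how the boxed formulas might be used in practice, here is a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the post: tanh hidden layers, an identity output layer, the cost $\frac{1}{2m}\|Y^{[L]}-T\|^2$ with a made-up target matrix `T`, and the helper names `forward`, `backward` and `numeric_grad`. The factor $\frac{1}{m}$ enters only once, through $\delta_L$, and each earlier $\delta_\ell$ is obtained from $\delta_{\ell+1}$ by the recurrence. The result is compared with central finite differences.

```python
import numpy as np

def tanh_prime(u):
    return 1.0 - np.tanh(u) ** 2

def forward(Ws, bs, X):
    """Return (Us, Ys) with Ys[0] = X. Assumed: tanh hidden layers, identity output."""
    Ys, Us = [X], []
    last = len(Ws) - 1
    for k, (W, b) in enumerate(zip(Ws, bs)):
        U = W @ Ys[-1] + b                     # U^{[l]} = W^{[l]} Y^{[l-1]} + b^{[l]}
        Us.append(U)
        Ys.append(U if k == last else np.tanh(U))
    return Us, Ys

def cost(Ws, bs, X, T):
    """Assumed cost: squared error averaged over the m columns of X."""
    _, Ys = forward(Ws, bs, X)
    return np.sum((Ys[-1] - T) ** 2) / (2 * X.shape[1])

def backward(Ws, bs, X, T):
    """Gradients via delta_l = Phi'(U^{[l]}) * (W^{[l+1]T} delta_{l+1})."""
    m = X.shape[1]
    Us, Ys = forward(Ws, bs, X)
    delta = (Ys[-1] - T) / m                   # delta_L = (1/m) dU^{[L]}
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for k in reversed(range(len(Ws))):         # k is 0-based, layer number is k + 1
        dWs[k] = delta @ Ys[k].T               # dW^{[l]} = delta_l Y^{[l-1]T}
        dbs[k] = np.sum(delta, axis=1, keepdims=True)
        if k > 0:
            delta = tanh_prime(Us[k - 1]) * (Ws[k].T @ delta)
    return dWs, dbs

def numeric_grad(param, Ws, bs, X, T, eps=1e-5):
    """Central finite differences; param is an entry of Ws or bs, perturbed in place."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        plus = cost(Ws, bs, X, T)
        param[idx] = old - eps
        minus = cost(Ws, bs, X, T)
        param[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
sizes = [3, 4, 5, 2]                           # arbitrary layer widths, L = 3
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.standard_normal((sizes[i + 1], 1)) for i in range(3)]
X = rng.standard_normal((3, 6))                # m = 6 examples stored as columns
T = rng.standard_normal((2, 6))

dWs, dbs = backward(Ws, bs, X, T)
max_err = max(
    np.max(np.abs(g - numeric_grad(p, Ws, bs, X, T)))
    for g, p in zip(dWs + dbs, Ws + bs)
)
print("largest deviation from finite differences:", max_err)
```

The printed deviation should be tiny compared with the gradient entries themselves. The bias gradient uses `keepdims=True` only so that it keeps the column shape of $b^{[\ell]}$; its entries are those of $\text{np.sum}(\delta_\ell,\text{axis=1})$.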
Proof. By repeated use of the formula $dY^{[\ell]} = [W^{[\ell+1]T}dY^{[\ell+1]}] * \Phi^{[\ell+1]}{}'(U^{[\ell+1]})$ we have
\[\begin{align*}
dW^{[\ell]}& = \frac{1}{m} dU^{[\ell]} Y^{[\ell-1]T}\\
&=\frac{1}{m}\left(\left[dY^{[\ell]}\right] * \Phi^{[\ell]}{}'(U^{[\ell]})\right) Y^{[\ell-1]T}\\
&=\frac{1}{m}\left( \Phi^{[\ell]}{}'(U^{[\ell]})* \left[\prod_{i=\ell+1}^{L-1} \Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right]\cdot dY^{[L-1]}\right) \cdot Y^{[\ell-1]T}
\end{align*}
\] And recall that $dY^{[L-1]} =\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}}. \qed$