\( \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\P}{\mathcal P} \newcommand{\B}{\mathcal B} \newcommand{\F}{\mathbb{F}} \newcommand{\E}{\mathcal E} \newcommand{\brac}[1]{\left(#1\right)} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\matrixx}[1]{\begin{bmatrix}#1\end {bmatrix}} \newcommand{\vmatrixx}[1]{\begin{vmatrix} #1\end{vmatrix}} \newcommand{\lims}{\mathop{\overline{\lim}}} \newcommand{\limi}{\mathop{\underline{\lim}}} \newcommand{\limn}{\lim_{n\to\infty}} \newcommand{\limsn}{\lims_{n\to\infty}} \newcommand{\limin}{\limi_{n\to\infty}} \newcommand{\nul}{\mathop{\mathrm{Nul}}} \newcommand{\col}{\mathop{\mathrm{Col}}} \newcommand{\rank}{\mathop{\mathrm{Rank}}} \newcommand{\dis}{\displaystyle} \newcommand{\spann}{\mathop{\mathrm{span}}} \newcommand{\range}{\mathop{\mathrm{range}}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerr}[1]{\left\langle #1 \right \rangle} \newcommand{\ol}[1]{\overline{#1}} \newcommand{\toto}{\rightrightarrows} \newcommand{\upto}{\nearrow} \newcommand{\downto}{\searrow} \newcommand{\qed}{\quad \blacksquare} \newcommand{\tr}{\mathop{\mathrm{tr}}} \newcommand{\bm}{\boldsymbol} \newcommand{\cupp}{\bigcup} \newcommand{\capp}{\bigcap} \newcommand{\sqcupp}{\bigsqcup} \newcommand{\re}{\mathop{\mathrm{Re}}} \newcommand{\im}{\mathop{\mathrm{Im}}} \newcommand{\comma}{\text{,}} \newcommand{\foot}{\text{。}} \)

Tuesday, September 29, 2020

Deriving the Formula for $\displaystyle \frac{\partial \mathcal L}{\partial W^{[\ell]}}$

Wikipedia records the following formula without proof:


I happened to notice that the formulas in the previous post already suffice to derive the following.

Theorem. For every $\ell<L-1$, let $\Phi^{[\ell]}:\R\to \R$ denote the activation function of the $\ell$-th (hidden) layer. Then \[ \frac{\partial \mathcal L }{\partial W^{[\ell]}}=\underbrace{\frac{1}{m}\Phi^{[\ell]}{}'(U^{[\ell]}) * \left(\left[\prod_{i=\ell +1}^{L-1} W^{[i]T}\left(\Phi^{[i]}{}'(U^{[i]}) * \,\cdot\,\right)\right] \frac{\partial \mathcal L}{\partial Y^{[L-1]}}\right) }_{:=\delta_\ell} \cdot Y^{[\ell-1]T} = \delta_{\ell}\cdot Y^{[\ell-1]T}.\] Here $*$ denotes entrywise multiplication, and each factor of the product stands for the map $X\mapsto W^{[i]T}\left(\Phi^{[i]}{}'(U^{[i]}) * X\right)$. Since $\displaystyle \frac{\partial \mathcal L}{\partial W^{[L]}}=\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\cdot Y^{[L-1]T}$, we also define \[ \boxed{\delta_L = \frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}}    \]and since \[\frac{\partial \mathcal L}{\partial W^{[L-1]}} =\frac{1}{m}\left[\Phi^{[L-1]}{}'(U^{[L-1]})* \left( W^{[L]T} \cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right)\right] Y^{[L-2]T}=\delta_{L-1} Y^{[L-2]T} \] with $\delta_{L-1} :=\frac{1}{m}\cdot \Phi^{[L-1]}{}'(U^{[L-1]}) * \left( W^{[L]T}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right)$, by the definition of $\delta_\ell$ for $\ell<L-1$ above, we obtain for every $\ell\leq L-1$ (the factor $\frac{1}{m}$ is already carried by $\delta_{\ell+1}$), \[\boxed{ \delta_{\ell} = \Phi^{[\ell]}{}'(U^{[\ell]}) * \left[W^{[\ell+1]T} \cdot \delta_{\ell+1}\right]\quad \text{with}\quad \frac{\partial \mathcal L}{\partial W^{[\ell]}} = \delta_\ell Y^{[\ell-1]T}.} \] As a side consequence of our computation, since $\displaystyle\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}} = \delta_\ell$, \[ \boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]}} = \text{np.sum}(\delta_\ell,\text{axis=1}).} \]

The last two formulas are computationally very useful. Note that in the definition of $\delta_\ell$, the product notation only makes sense when its factors act on the rightmost matrix $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}} $ in the correct order, starting from the factor with the largest index. To simplify notation we follow Andrew Ng's course and write $dW = \partial \mathcal L /\partial W$, and similarly for the other matrices.
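To make the ordering concrete, here is the product written out for a hypothetical network with $L=4$ and $\ell=1$ (values chosen only for illustration), reading each factor of the product as the map $X\mapsto W^{[i]T}\left(\Phi^{[i]}{}'(U^{[i]}) * X\right)$ and using the $d$-notation just introduced:

```latex
\[
dW^{[1]} = \frac{1}{m}\left(\Phi^{[1]}{}'(U^{[1]}) * \left[ W^{[2]T}\left(\Phi^{[2]}{}'(U^{[2]}) * \left[ W^{[3]T}\left(\Phi^{[3]}{}'(U^{[3]}) * dY^{[3]}\right)\right]\right)\right]\right) Y^{[0]T}
\]
```

The innermost factor ($i=3$) acts on $dY^{[3]}$ first, so every entrywise product is taken between two matrices of the same shape.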

Proof. By repeated use of the formula $dY^{[\ell]} = W^{[\ell+1]T}\left[dY^{[\ell+1]} * \Phi^{[\ell+1]}{}'(U^{[\ell+1]})\right]$ we have \[\begin{aligned} dW^{[\ell]}& = \frac{1}{m} dU^{[\ell]} Y^{[\ell-1]T}\\ &=\frac{1}{m}\left(dY^{[\ell]} * \Phi^{[\ell]}{}'(U^{[\ell]})\right) Y^{[\ell-1]T}\\ &=\frac{1}{m}\left( \Phi^{[\ell]}{}'(U^{[\ell]})* \left(\left[\prod_{i=\ell+1}^{L-1} W^{[i]T}\left(\Phi^{[i]}{}'(U^{[i]}) * \,\cdot\,\right)\right] dY^{[L-1]}\right)\right) \cdot Y^{[\ell-1]T}, \end{aligned} \] where each factor of the product denotes the map $X\mapsto W^{[i]T}\left(\Phi^{[i]}{}'(U^{[i]}) * X\right)$. Finally, recall that $dY^{[L-1]} =\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}}. \qed$
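As a sanity check, the recursion can be compared against finite differences. The following is a minimal NumPy sketch under assumptions that are not in the post: hidden layers use $\tanh$, the last layer is linear, the cost is the mean over the $m$ samples of a squared error (so that $\delta_L = \frac{1}{m}(Y^{[L]}-T)$), and the recursion is used in the form $\delta_\ell = \Phi^{[\ell]}{}'(U^{[\ell]}) * (W^{[\ell+1]T}\delta_{\ell+1})$. The names `forward`, `cost`, `backward` and `numeric_grad` are hypothetical helpers written for this illustration.

```python
import numpy as np

def tanh_prime(u):
    # Derivative of tanh, applied entrywise.
    return 1.0 - np.tanh(u) ** 2

def forward(Ws, bs, X):
    # Us[l], Ys[l] hold U^[l], Y^[l]; Ys[0] = X. Ws[l-1] holds W^[l].
    Us, Ys = [None], [X]
    L = len(Ws)
    for l in range(1, L + 1):
        U = Ws[l - 1] @ Ys[-1] + bs[l - 1]
        Us.append(U)
        Ys.append(U if l == L else np.tanh(U))  # linear output layer (assumption)
    return Us, Ys

def cost(Ws, bs, X, T):
    # Mean over samples of 0.5 * squared error (assumed loss).
    _, Ys = forward(Ws, bs, X)
    return 0.5 * np.sum((Ys[-1] - T) ** 2) / X.shape[1]

def backward(Ws, bs, X, T):
    Us, Ys = forward(Ws, bs, X)
    L, m = len(Ws), X.shape[1]
    delta = (Ys[L] - T) / m  # delta_L = (1/m) dU^[L] for this loss
    dWs, dbs = [None] * L, [None] * L
    for l in range(L, 0, -1):
        dWs[l - 1] = delta @ Ys[l - 1].T                   # dW^[l] = delta_l Y^[l-1]T
        dbs[l - 1] = np.sum(delta, axis=1, keepdims=True)  # keepdims keeps the shape of b
        if l > 1:
            # delta_{l-1} = Phi'(U^[l-1]) * (W^[l]T delta_l)
            delta = tanh_prime(Us[l - 1]) * (Ws[l - 1].T @ delta)
    return dWs, dbs

def numeric_grad(f, A, eps=1e-6):
    # Central finite differences of the scalar f() with respect to the array A (modified in place).
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        fp = f()
        A[idx] = old - eps
        fm = f()
        A[idx] = old
        G[idx] = (fp - fm) / (2 * eps)
    return G

rng = np.random.default_rng(0)
sizes, m = [3, 4, 5, 2], 6  # three layers, six samples (arbitrary)
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
bs = [rng.standard_normal((sizes[i + 1], 1)) for i in range(3)]
X = rng.standard_normal((sizes[0], m))
T = rng.standard_normal((sizes[-1], m))

dWs, dbs = backward(Ws, bs, X, T)
f = lambda: cost(Ws, bs, X, T)
max_err = max(
    max(np.max(np.abs(dWs[i] - numeric_grad(f, Ws[i]))) for i in range(3)),
    max(np.max(np.abs(dbs[i] - numeric_grad(f, bs[i]))) for i in range(3)),
)
print(max_err)  # should be tiny if the recursion matches the numerical gradient
```

The backward loop stores only the current $\delta_\ell$, which is the practical benefit of the recursive form over the unrolled product.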
