\( \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\P}{\mathcal P} \newcommand{\B}{\mathcal B} \newcommand{\F}{\mathbb{F}} \newcommand{\E}{\mathcal E} \newcommand{\brac}[1]{\left(#1\right)} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\matrixx}[1]{\begin{bmatrix}#1\end {bmatrix}} \newcommand{\vmatrixx}[1]{\begin{vmatrix} #1\end{vmatrix}} \newcommand{\lims}{\mathop{\overline{\lim}}} \newcommand{\limi}{\mathop{\underline{\lim}}} \newcommand{\limn}{\lim_{n\to\infty}} \newcommand{\limsn}{\lims_{n\to\infty}} \newcommand{\limin}{\limi_{n\to\infty}} \newcommand{\nul}{\mathop{\mathrm{Nul}}} \newcommand{\col}{\mathop{\mathrm{Col}}} \newcommand{\rank}{\mathop{\mathrm{Rank}}} \newcommand{\dis}{\displaystyle} \newcommand{\spann}{\mathop{\mathrm{span}}} \newcommand{\range}{\mathop{\mathrm{range}}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerr}[1]{\left\langle #1 \right \rangle} \newcommand{\ol}[1]{\overline{#1}} \newcommand{\toto}{\rightrightarrows} \newcommand{\upto}{\nearrow} \newcommand{\downto}{\searrow} \newcommand{\qed}{\quad \blacksquare} \newcommand{\tr}{\mathop{\mathrm{tr}}} \newcommand{\bm}{\boldsymbol} \newcommand{\cupp}{\bigcup} \newcommand{\capp}{\bigcap} \newcommand{\sqcupp}{\bigsqcup} \newcommand{\re}{\mathop{\mathrm{Re}}} \newcommand{\im}{\mathop{\mathrm{Im}}} \newcommand{\comma}{\text{,}} \newcommand{\foot}{\text{。}} \)

Wednesday, November 13, 2019

Derivation of Important Formulas in Backward Propagation

In Andrew Ng's first deep learning course one will encounter the following formulas, which we need to translate into Python.


The key idea of backward propagation is to calculate $dW^{[\ell]}$ and $db^{[\ell]}$ for each layer and then update by $W^{[\ell]} = W^{[\ell]} -\alpha\, dW^{[\ell]}$ and $b^{[\ell]} = b^{[\ell]} -\alpha\, db^{[\ell]}$. To do this, the formulas below show that it is enough to know $dA^{[\ell]}$ and $dZ^{[\ell]}$ for each layer $\ell$. Thanks to (4) and (5), the process can propagate backwards from the top layer to the bottom to complete all of these computations; this is the key reason why we need the following formulas.
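
As a preview of where these quantities go, here is a minimal sketch of the update step in Python; the dictionary layout and the names `params`, `grads`, `update_parameters` are my own convention for illustration, not necessarily the course's code.

```python
# Minimal sketch of the gradient-descent update, assuming `params` holds
# W1, b1, ..., WL, bL as numpy arrays and `grads` holds the dW's and db's
# produced by backward propagation.
def update_parameters(params, grads, alpha, L):
    for l in range(1, L + 1):
        params["W" + str(l)] -= alpha * grads["dW" + str(l)]   # W = W - alpha * dW
        params["b" + str(l)] -= alpha * grads["db" + str(l)]   # b = b - alpha * db
    return params
```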

Consider the case of only one training example; then the cost function $C$ can be expressed via a loss function $\mathcal L$ by \[
C = \mathcal L(a^{[L]},y),
\] Define \[d z_i^{[\ell]} = \frac{\partial  \mathcal L}{\partial z_i^{[\ell]}},\quad d a_i^{[\ell]} = \frac{\partial  \mathcal L}{\partial a_i^{[\ell]}},\quad z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} +b^{[\ell]},\quad d W^{[\ell]} := \frac{\partial C}{\partial W^{[\ell]}}\] for each layer $\ell$. Note that we deliberately define $d z^{[\ell]}$ and $d a^{[\ell]}$ with $\mathcal L$, not $C$, in the numerator; this is not a typo. We are going to show that:
\[
\boxed{
d W_{ij}^{[\ell]} = d z_i^{[\ell]} a_j^{[\ell-1]}
}
\tag{1}
\] and we then extend it to $m$ training examples with $C=\frac{1}{m}\sum_{i=1}^{m} \mathcal L(a^{[L](i)},y^{(i)})$ and  \[
\boxed{
d W^{[\ell]} = \frac{1}{m} d Z^{[\ell]}  A^{[\ell -1]T}}
\tag{2}
\] with $d Z^{[\ell]} =
\begin{bmatrix}
d z^{[\ell](1)} &\cdots & d z^{[\ell](m)}
\end{bmatrix}$ and $A^{[\ell]} =
\begin{bmatrix}
 a^{[\ell](1)} &\cdots&  a^{[\ell](m)}
\end{bmatrix}
$. We also have \[
\boxed{
d z^{[\ell]} = (W^{[\ell+1]T}d z^{[\ell+1]}) * g^{[\ell]}{}'(z^{[\ell]})
}
 \tag{3}
\] for the case of only one training example, and from this by stacking the columns horizontally we get \[
\boxed{
d Z^{[\ell]}  = (W^{[\ell + 1]T} d Z^{[\ell+1]}) * g^{[\ell]}{}'(Z^{[\ell]})
}
\] for the general case of $m$ training examples. Next, \[
\boxed{
d A^{[\ell-1]} = W^{[\ell]T} d Z^{[\ell]}
}.\tag{4}
\] By combining the last two results we get \[
\boxed{
d Z^{[\ell]} = d A^{[\ell]} * g^{[\ell]}{}'(Z^{[\ell]})
}.\tag{5}
\] We also mention without proof that \[\boxed{
d b^{[\ell]} =\frac{1}{m} \sum_{i=1}^m d z^{[\ell](i)} =\frac{1}{m}\mathrm{np.sum}(dZ^{[\ell]}, \text{axis = 1}, \text{keepdims = True})}
\] as this simply follows from the chain rule and is a matter of notation.
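
Putting formulas (2), (4), (5) and the $db^{[\ell]}$ rule together, one backward step through a single layer can be sketched in numpy as follows. The function name and signature are my own illustration; in particular `g_prime` is assumed to be a callable returning $g^{[\ell]}{}'(Z)$ elementwise.

```python
import numpy as np

def backward_step(dA, Z, A_prev, W, g_prime):
    """One backward step through layer l (a sketch, not the course's exact code).

    dA      -- dA^{[l]},   shape (n_l, m)
    Z       -- Z^{[l]},    shape (n_l, m), cached from the forward pass
    A_prev  -- A^{[l-1]},  shape (n_{l-1}, m)
    W       -- W^{[l]},    shape (n_l, n_{l-1})
    g_prime -- callable computing g^{[l]}'(Z) elementwise
    """
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                            # formula (5)
    dW = dZ @ A_prev.T / m                          # formula (2)
    db = np.sum(dZ, axis=1, keepdims=True) / m      # the db rule
    dA_prev = W.T @ dZ                              # formula (4): feeds layer l-1
    return dA_prev, dW, db
```

Iterating this from layer $L$ down to layer $1$ and then applying the update $W^{[\ell]} = W^{[\ell]} - \alpha\, dW^{[\ell]}$, $b^{[\ell]} = b^{[\ell]} - \alpha\, db^{[\ell]}$ completes one iteration of gradient descent.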

Proof of (1).  Consider the case of only one training example; then
\[
\begin{align*}
    d W_{ij}^{[\ell]} &:= \frac{\partial C}{\partial W_{ij}^{[\ell]}} \\
    &= \frac{\partial C}{\partial z^{[\ell]} } \frac{\partial z^{[\ell]}}{\partial W_{ij}^{[\ell]}}\\
    &= \sum_{k} \frac{\partial C }{\partial z_k^{[\ell]} }  \frac{\partial z_k^{[\ell]}}{\partial W_{ij}^{[\ell]}} \\
    &= \sum_k \sum_{u}d z_k^{[\ell]}  \frac{\partial W_{ku}^{[\ell]}}{\partial W_{ij}^{[\ell]}} a_u^{[\ell-1]},
    \end{align*}
\] and only when $k=i$ and $u=j$ does the summand survive, therefore \[
  d W_{ij}^{[\ell]}  = d z_i^{[\ell]} a_j^{[\ell-1]}.
\] This establishes the first equality.$\qed$
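
Formula (1) says that for one example $dW^{[\ell]}$ is the outer product $dz^{[\ell]} a^{[\ell-1]T}$. As a sanity check (not part of the derivation), here is a small numerical experiment assuming a sigmoid activation and a squared-error loss, comparing the formula against centered finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr = 4, 3
W = rng.normal(size=(n_curr, n_prev))
b = rng.normal(size=(n_curr, 1))
a_prev = rng.normal(size=(n_prev, 1))
y = rng.normal(size=(n_curr, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(W):
    # L = 0.5 * ||sigmoid(W a_prev + b) - y||^2  (an assumed toy loss)
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

# analytic gradient via formula (1): dW_ij = dz_i * a_prev_j, i.e. dW = dz a_prev^T
a = sigmoid(W @ a_prev + b)
dz = (a - y) * a * (1 - a)          # dL/dz for sigmoid + squared error
dW = dz @ a_prev.T

# centered finite differences over every entry of W
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(n_curr):
    for j in range(n_prev):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(dW - dW_num)))  # should be tiny, around 1e-9 or smaller
```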

Proof of (2). Next, when $C=\frac{1}{m}\sum_{k=1}^{m} \mathcal L(a^{[L](k)},y^{(k)})$, we apply the previous formula term by term to obtain \[
d W_{ij}^{[\ell]} : = \frac{\partial C}{\partial W_{ij}^{[\ell]}} =\frac{1}{m}\sum_{k=1}^m \frac{\partial \mathcal  L(a^{[L](k)},y^{(k)})}{\partial W_{ij}^{[\ell]}}  = \frac{1}{m} \sum_{k=1}^{m} d z_i^{[\ell](k)}   a_j^{[\ell-1](k)},
\] where each summand is evaluated at the corresponding training example. Now for $Z=[Z_{ij}]$ and $A=[A_{ij}]$, by the definition of the transpose of a matrix we have \[
\sum_k Z_{ik} \cdot  A_{jk} = \sum_k Z_{ik} (A^T)_{kj} = (ZA^T)_{ij},
\] the above yields \[
d W_{ij}^{[\ell]} = \frac{1}{m} (d Z^{[\ell]} A^{[\ell-1]T})_{ij},
\] as desired.$\qed$
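
The vectorized form (2) is just this per-example average written as a single matrix product; a quick numerical check on random arrays (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_curr, m = 4, 3, 5
dZ = rng.normal(size=(n_curr, m))        # columns are dz^{[l](k)}
A_prev = rng.normal(size=(n_prev, m))    # columns are a^{[l-1](k)}

# formula (2): one matrix product
dW_vec = dZ @ A_prev.T / m

# formula (1) applied example by example, then averaged
dW_loop = sum(np.outer(dZ[:, k], A_prev[:, k]) for k in range(m)) / m

print(np.allclose(dW_vec, dW_loop))      # True
```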

Proof of (3).  Again consider only one training example first. Since $z_k^{[\ell+1]} = \sum_j W_{kj}^{[\ell+1]} g^{[\ell]}(z_j^{[\ell]}) + b_k^{[\ell+1]}$, we have \[
d z_i^{[\ell]} = \sum_k \frac{\partial C}{\partial z_k^{[\ell +1]}} \frac{\partial z_k^{[\ell+1]}}{\partial z_i^{[\ell]}} = \sum_k d z_k^{[\ell+1]} W_{ki}^{[\ell+1]}\cdot g^{[\ell]}{}'(z_i^{[\ell]}) = (W^{[\ell+1]T})_{i:}\, d z^{[\ell+1]} \cdot g^{[\ell]}{}'(z_i^{[\ell]}),
\] and by stacking the results vertically we obtain \[
d z^{[\ell]} = W^{[\ell+1]T}\cdot  d z^{[\ell+1]} * g^{[\ell]}{}'(z^{[\ell]}).
\] Note that the matrix product $W^{[\ell+1]T} d z^{[\ell+1]}$ is formed first, and the elementwise product $*$ with $g^{[\ell]}{}'(z^{[\ell]})$ is applied afterwards.$\qed$
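
The matrix form of (3) is obtained by placing these single-example columns side by side. The following illustrates this in numpy with an assumed ReLU activation, for which $g^{[\ell]}{}'(Z) = \mathbf 1_{Z>0}$:

```python
import numpy as np

rng = np.random.default_rng(2)
n_l, n_next, m = 4, 3, 5
W_next = rng.normal(size=(n_next, n_l))     # W^{[l+1]}
dZ_next = rng.normal(size=(n_next, m))      # dZ^{[l+1]}
Z = rng.normal(size=(n_l, m))               # Z^{[l]}
g_prime = (Z > 0).astype(float)             # g^{[l]}'(Z) for ReLU (assumed activation)

# matrix form of (3): matrix product first, elementwise * afterwards
dZ = (W_next.T @ dZ_next) * g_prime

# single-example form of (3), one column per training example
dZ_cols = np.column_stack(
    [(W_next.T @ dZ_next[:, [k]]) * g_prime[:, [k]] for k in range(m)]
)
print(np.allclose(dZ, dZ_cols))             # True
```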

Proof of (4). Since \[
d a_i^{[\ell]} =\frac{\partial C}{\partial z^{[\ell+1]}} \frac{\partial z^{[\ell+1]}}{\partial a_i^{[\ell]}}= \sum_{j}d z_j^{[\ell+1]} W_{ji}^{[\ell+1]} = (W^{[\ell+1]T} d z^{[\ell+1]})_i,
\] the result follows from stacking the entries vertically to get $d a^{[\ell]} = W^{[\ell+1]T} d z^{[\ell+1]}$, stacking the training-example columns horizontally, and relabelling the layer index $\ell+1 \to \ell$.$\qed$
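
As with (1), one can spot-check (4) numerically in its single-example form $d a^{[\ell]} = W^{[\ell+1]T} d z^{[\ell+1]}$, again under the assumed sigmoid plus squared-error setup used in the earlier check:

```python
import numpy as np

rng = np.random.default_rng(3)
n_prev, n_curr = 4, 3
W = rng.normal(size=(n_curr, n_prev))       # W^{[l+1]}
b = rng.normal(size=(n_curr, 1))
a = rng.normal(size=(n_prev, 1))            # a^{[l]}, the layer input
y = rng.normal(size=(n_curr, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(a):
    out = sigmoid(W @ a + b)
    return 0.5 * np.sum((out - y) ** 2)     # assumed toy loss

out = sigmoid(W @ a + b)
dz = (out - y) * out * (1 - out)            # dz^{[l+1]}
da = W.T @ dz                               # formula (4), single-example form

# centered finite differences in each entry of a
eps = 1e-6
da_num = np.zeros_like(a)
for i in range(n_prev):
    ap, am = a.copy(), a.copy()
    ap[i, 0] += eps
    am[i, 0] -= eps
    da_num[i, 0] = (loss(ap) - loss(am)) / (2 * eps)

print(np.max(np.abs(da - da_num)))          # should be tiny, around 1e-9 or smaller
```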
