\( \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\P}{\mathcal P} \newcommand{\B}{\mathcal B} \newcommand{\F}{\mathbb{F}} \newcommand{\E}{\mathcal E} \newcommand{\brac}[1]{\left(#1\right)} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\matrixx}[1]{\begin{bmatrix}#1\end {bmatrix}} \newcommand{\vmatrixx}[1]{\begin{vmatrix} #1\end{vmatrix}} \newcommand{\lims}{\mathop{\overline{\lim}}} \newcommand{\limi}{\mathop{\underline{\lim}}} \newcommand{\limn}{\lim_{n\to\infty}} \newcommand{\limsn}{\lims_{n\to\infty}} \newcommand{\limin}{\limi_{n\to\infty}} \newcommand{\nul}{\mathop{\mathrm{Nul}}} \newcommand{\col}{\mathop{\mathrm{Col}}} \newcommand{\rank}{\mathop{\mathrm{Rank}}} \newcommand{\dis}{\displaystyle} \newcommand{\spann}{\mathop{\mathrm{span}}} \newcommand{\range}{\mathop{\mathrm{range}}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerr}[1]{\left\langle #1 \right \rangle} \newcommand{\ol}[1]{\overline{#1}} \newcommand{\toto}{\rightrightarrows} \newcommand{\upto}{\nearrow} \newcommand{\downto}{\searrow} \newcommand{\qed}{\quad \blacksquare} \newcommand{\tr}{\mathop{\mathrm{tr}}} \newcommand{\bm}{\boldsymbol} \newcommand{\cupp}{\bigcup} \newcommand{\capp}{\bigcap} \newcommand{\sqcupp}{\bigsqcup} \newcommand{\re}{\mathop{\mathrm{Re}}} \newcommand{\im}{\mathop{\mathrm{Im}}} \newcommand{\comma}{\text{,}} \newcommand{\foot}{\text{。}} \)

Tuesday, September 29, 2020

Derive the Formula of $\displaystyle \frac{\partial \mathcal L}{\partial W^{[\ell]}}$

Wikipedia records the following formula without proof:


I happened to find that, using the formulas in the previous post, we can already derive the following.

Theorem. For every $\ell<L-1$, let $\Phi^{[\ell]}:\R\to \R$ denote the activation function of the $\ell$-th (hidden) layer; then we have\[ \frac{\partial \mathcal L }{\partial W^{[\ell]}}=\underbrace{\frac{1}{m}\Phi^{[\ell]}{}'(U^{[\ell]}) * \left[\prod_{i=\ell +1}^{L-1} \left(\Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right)\right]\cdot \frac{\partial \mathcal L}{\partial Y^{[L-1]}} }_{:=\delta_\ell} \cdot Y^{[\ell-1]T} = \delta_{\ell}\cdot Y^{[\ell-1]T}.\] Here $*$ denotes the entrywise multiplication. Since $\displaystyle \frac{\partial \mathcal L}{\partial W^{[L]}}=\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\cdot Y^{[L-1]T}$, we also define \[ \boxed{\delta_L = \frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}}, \]and since \[\frac{\partial \mathcal L}{\partial W^{[L-1]}} =\frac{1}{m}\Phi^{[L-1]}{}'(U^{[L-1]})* \left( W^{[L]T} \cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right) Y^{[L-2]T}=\delta_{L-1} Y^{[L-2]T} \] with $\delta_{L-1} :=\frac{1}{m}\cdot \left( \Phi^{[L-1]}{}'(U^{[L-1]}) * W^{[L]T}\right)\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}$, the definition of $\delta_\ell$ for $\ell<L-1$ above gives, for every $\ell\leq L-1$, \[\boxed{ \delta_{\ell} = \Phi^{[\ell]}{}'(U^{[\ell]}) * \left[W^{[\ell+1]T} \cdot \delta_{\ell+1}\right]\quad \text{with}\quad \frac{\partial \mathcal L}{\partial W^{[\ell]}} = \delta_\ell Y^{[\ell-1]T}} \] (note that the factor $\frac{1}{m}$ is already contained in $\delta_{\ell+1}$). And as a side consequence of our computation, since $\displaystyle\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}} = \delta_\ell$, \[ \boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]}} = \text{np.sum}(\delta_\ell,\text{axis=1}).} \]

The last two formulas are computationally very useful. Note that in the definition of $\delta_\ell$, the multiplications in the product notation only make sense when they act on the rightmost matrix $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}} $ in the correct order (starting from the largest index). To simplify notation we follow Andrew Ng's course and define $dW = \partial \mathcal L /\partial W$, and similarly for the other matrices.

Proof. By repeated use of the formula $dY^{[\ell]} = W^{[\ell+1]T}\cdot\left[dY^{[\ell+1]} * \Phi^{[\ell+1]}{}'(U^{[\ell+1]})\right]$ we have \[\begin{align*} dW^{[\ell]}& = \frac{1}{m} dU^{[\ell]} Y^{[\ell-1]T}\\ &=\frac{1}{m}\left(\left[dY^{[\ell]}\right] * \Phi^{[\ell]}{}'(U^{[\ell]})\right) Y^{[\ell-1]T}\\ &=\frac{1}{m}\left( \Phi^{[\ell]}{}'(U^{[\ell]})* \left[\prod_{i=\ell+1}^{L-1} \left(\Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right)\right]\cdot dY^{[L-1]}\right) \cdot Y^{[\ell-1]T}. \end{align*} \] Finally, $dY^{[L-1]}$ is exactly the factor $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}}$ appearing in the statement of the theorem. $\qed$
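To make the recursion concrete, here is a minimal NumPy sketch of a full backward pass built from the two boxed formulas. All the names here (the lists W, U, Y of cached forward-pass matrices, the list dPhi of activation derivatives, and dU_L for $\partial \mathcal L/\partial U^{[L]}$) are my own choices for illustration, not notation from this post, and the output-layer gradient dU_L is assumed to be computed separately.

import numpy as np

def backward_pass(W, U, Y, dPhi, dU_L, m):
    # W[l], U[l], Y[l]: weights, pre-activations and activations of layer l (cached
    # during the forward pass), with Y[0] the input matrix X; dPhi[l]: entrywise
    # derivative of the l-th activation; dU_L: dL/dU^[L]; m: number of samples.
    L = len(W) - 1                      # layers are indexed 1..L (index 0 unused)
    dW, db = {}, {}
    delta = dU_L / m                    # delta_L = (1/m) dL/dU^[L]
    dW[L] = delta @ Y[L - 1].T          # dL/dW^[L] = delta_L Y^[L-1]T
    db[L] = np.sum(delta, axis=1, keepdims=True)
    for l in range(L - 1, 0, -1):
        # delta_l = Phi^[l]'(U^[l]) * (W^[l+1]T . delta_{l+1})
        delta = dPhi[l](U[l]) * (W[l + 1].T @ delta)
        dW[l] = delta @ Y[l - 1].T      # dL/dW^[l] = delta_l Y^[l-1]T
        db[l] = np.sum(delta, axis=1, keepdims=True)
    return dW, db

Each dW[l] and db[l] can then be fed directly into a gradient-descent update.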

Sunday, September 27, 2020

Formulas Revisited

When I studied Andrew Ng's course I got used to notations like $dZ^{[\ell]}$, $dA$, etc. Recently I revisited the topic and I am now used to the notation \[u_i^{[\ell]} = W^{[\ell]}_{i:}\cdot y^{[\ell-1]} + b^{[\ell]}\quad\text{and}\quad y^{[\ell]} = \Phi^{[\ell]}(u^{[\ell]}),\] so I want to record the corresponding formulas for computation. Since the notation $dW$ doesn't look any cleaner than $\displaystyle\frac{\partial \mathcal L}{\partial W}$, in the sequel I write everything out explicitly. \[ \boxed{\frac{\partial \mathcal L}{\partial W^{[\ell]}} = \frac{1}{m}\cdot \frac{\partial \mathcal L }{\partial U^{[\ell]}}\cdot Y^{[\ell-1]T}} \] \[ \boxed{\frac{\partial \mathcal L}{\partial U^{[\ell]}} = \brac{ W^{[\ell +1]T} \cdot \frac{\partial \mathcal L}{\partial U^{[\ell+1]}} }* \Phi^{[\ell]}{}'(U^{[\ell]})} \] where $*$ denotes the entrywise product of matrices. \[ \boxed{\frac{\partial \mathcal L}{ \partial Y^{[\ell - 1]}} = W^{[\ell]T}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}}} \] The last two yield the following, for $\ell<L$ a hidden layer and $\Phi^{[\ell]}:\R\to \R$ the activation function at the $\ell$-th layer: \[ \boxed{\frac{\partial \mathcal L}{\partial U^{[\ell]}} = \frac{\partial \mathcal L}{\partial Y^{[\ell]}} * \Phi^{[\ell]}{}'(U^{[\ell]})} \] and finally \[ \boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]} } =\frac{1}{m}\cdot \sum_{i=1}^m \frac{\partial \mathcal L}{\partial u^{[\ell](i)}} =\frac{1}{m}\cdot \text{np.sum}\brac{\frac{\partial \mathcal L}{\partial U^{[\ell]}},\text{axis = 1}}} \] For the derivation of these formulas one can visit my other post: https://checkerlee.blogspot.com/2019/11/important-formulas-in-backward.html#more
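As a sanity check, the boxed formulas for a single hidden layer translate into the following minimal NumPy sketch; the argument names (dU_next for $\partial \mathcal L/\partial U^{[\ell+1]}$, Y_prev for $Y^{[\ell-1]}$, dphi for $\Phi^{[\ell]}{}'$, and so on) are my own and not from the post.

import numpy as np

def hidden_layer_grads(W_next, dU_next, U, Y_prev, dphi, m):
    # dL/dY^[l] = W^[l+1]T . dL/dU^[l+1]
    dY = W_next.T @ dU_next
    # dL/dU^[l] = dL/dY^[l] * Phi^[l]'(U^[l])   (entrywise product)
    dU = dY * dphi(U)
    # dL/dW^[l] = (1/m) dL/dU^[l] . Y^[l-1]T
    dW = (dU @ Y_prev.T) / m
    # dL/db^[l] = (1/m) np.sum(dL/dU^[l], axis=1)
    db = np.sum(dU, axis=1, keepdims=True) / m
    return dU, dW, db

Applying this function layer by layer, from $\ell = L-1$ down to $\ell = 1$, gives the whole backward pass.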

Saturday, September 26, 2020

Intuitive derivation of Cross Entropy as a "loss" function

In defining "loss" function for classification problems given $p_i=\mathbb P\{\text{$i$ occurs}\}$, $i=1,2,\dots,n$,  from emperical data, we measure the accuracy of estimated data (from our output layer in neuron network) $[q_1,q_2,\dots,q_n]$ by the cross-entropy: \[L=\sum_{i=1}^n p_i\ln q_i.\] Recently I revisit this topic, and understand that this comes very naturally from solving maximum-likelihood estimation problem!

Let's take an example: consider flipping a coin that shows heads with probability $p$ and tails with probability $1-p$. The probability of getting 2 heads out of 6 flips is \[ L(p) = \binom{6}{2} p^2 (1-p)^4 = 15 p^2(1-p)^4. \] Maximum-likelihood estimation asks the following question:

Under what value of $p$ is the phenomenon of getting 2 heads most likely to happen?

In other words, the question asks: at what value of $p$ is the probability $L(p)$ maximized? By simply solving $L'(p)=0$ we get the answer $p=\frac{1}{3}$.
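A quick symbolic check of this computation, using SymPy (the variable names are mine):

import sympy as sp

p = sp.symbols('p')
L = 15 * p**2 * (1 - p)**4        # likelihood of 2 heads in 6 flips
critical_points = sp.solve(sp.diff(L, p), p)
print(critical_points)            # [0, 1/3, 1]; L is maximized at p = 1/3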

But in more complex problems we cannot write down the probability of a phenomenon in terms of an underlying parameter by an explicit formula. Instead of computing the probability $p$ directly, we try to estimate it so that our observation (the phenomenon seen in the empirical data) is most likely to occur, and such an estimated value of $p$ is considered a good estimate.

Now the derivation of cross-entropy becomes very intuitive. Assume that the mutually disjoint events \[ E_i=\{\text{$i$ occurs}\},\quad \mathbb P(E_i) = p_i, \quad i=1,2,3,\dots,n, \] exhaust all outcomes of a single trial. Suppose we perform $N$ independent trials $A_1,\dots,A_N$, each of which results in exactly one of the outcomes $1,\dots,n$ (for example, flipping a coin $N$ times); then $p_i = N_i/N$, where $N_i$ is the number of trials among $A_1,\dots,A_N$ in which $i$ occurs.

Now we obtain another estimate $q_i$ of the probability of the same event $E_i$ from whatever experiment we can imagine. How good is $[q_1,\dots,q_n]$ as an estimate against the past empirical data $[p_1,\dots,p_n]$? The standard distance in $\R^n$ is certainly not a good choice, since a discrepancy of $\epsilon$ between $q_i$ and $p_i$ can mean a huge difference for another pair $q_{i'}$ and $p_{i'}$. Instead, $[q_1,\dots,q_n]$ is considered a good estimate if the observed phenomenon \[ \{\text{1 appears $N_1$ times}\}, \quad \{\text{2 appears $N_2$ times}\},\quad \dots ,\quad \{\text{$n$ appears $N_n$ times} \} \] is very likely to happen under the estimates $[q_1,\dots,q_n]$, i.e., when the likelihood \[ L = \prod_{i=1}^n q_i^{N_i},\quad\text{equivalently}\quad \frac{\ln L}{N}= \sum_{i=1}^n \frac{N_i}{N}\ln q_i = \sum_{i=1}^n p_i\ln q_i, \] is large. We have derived the cross-entropy at this point.
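A tiny numerical illustration of the identity $\frac{\ln L}{N} = \sum_i p_i \ln q_i$ (NumPy; all the counts and estimates below are made up for illustration):

import numpy as np

counts = np.array([30, 50, 20])          # N_i: how often each outcome occurred
N = counts.sum()                          # N = 100 trials
p = counts / N                            # empirical probabilities p_i = N_i / N
q = np.array([0.25, 0.55, 0.20])          # some estimated probabilities q_i

log_L_over_N = np.sum(counts * np.log(q)) / N   # (ln L)/N with L = prod_i q_i^{N_i}
cross_entropy = np.sum(p * np.log(q))            # sum_i p_i ln q_i
print(np.isclose(log_L_over_N, cross_entropy))   # True: the two quantities agree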

Sunday, September 20, 2020

A useful command from a git review course:

git rm --cached -r build/
If we type git rm -h we get the explanation:
--cached only remove from the index
By "index" git means the tracked files inside the staging area (files are always tracked once they have been committed). A .gitignore entry has no effect on files that are already tracked, so we use git rm --cached to untrack them.
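For example, if build/ was committed by mistake and should be ignored from now on, the whole workflow looks something like this (run from the repository root):

echo "build/" >> .gitignore
git rm --cached -r build/
git commit -m "Stop tracking build/"

After the commit, the files stay on disk but git no longer tracks them, and the .gitignore entry keeps them out of future commits.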