\( \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\P}{\mathcal P} \newcommand{\B}{\mathcal B} \newcommand{\F}{\mathbb{F}} \newcommand{\E}{\mathcal E} \newcommand{\brac}[1]{\left(#1\right)} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\matrixx}[1]{\begin{bmatrix}#1\end {bmatrix}} \newcommand{\vmatrixx}[1]{\begin{vmatrix} #1\end{vmatrix}} \newcommand{\lims}{\mathop{\overline{\lim}}} \newcommand{\limi}{\mathop{\underline{\lim}}} \newcommand{\limn}{\lim_{n\to\infty}} \newcommand{\limsn}{\lims_{n\to\infty}} \newcommand{\limin}{\limi_{n\to\infty}} \newcommand{\nul}{\mathop{\mathrm{Nul}}} \newcommand{\col}{\mathop{\mathrm{Col}}} \newcommand{\rank}{\mathop{\mathrm{Rank}}} \newcommand{\dis}{\displaystyle} \newcommand{\spann}{\mathop{\mathrm{span}}} \newcommand{\range}{\mathop{\mathrm{range}}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerr}[1]{\left\langle #1 \right \rangle} \newcommand{\ol}[1]{\overline{#1}} \newcommand{\toto}{\rightrightarrows} \newcommand{\upto}{\nearrow} \newcommand{\downto}{\searrow} \newcommand{\qed}{\quad \blacksquare} \newcommand{\tr}{\mathop{\mathrm{tr}}} \newcommand{\bm}{\boldsymbol} \newcommand{\cupp}{\bigcup} \newcommand{\capp}{\bigcap} \newcommand{\sqcupp}{\bigsqcup} \newcommand{\re}{\mathop{\mathrm{Re}}} \newcommand{\im}{\mathop{\mathrm{Im}}} \newcommand{\comma}{\text{,}} \newcommand{\foot}{\text{。}} \)

Tuesday, September 29, 2020

Derive the Formula of $\displaystyle \frac{\partial \mathcal L}{\partial W^{[\ell]}}$

Wikipedia records the following formula without proof:


I happened to find that, using the formulas in the previous post, we can already derive the following.

Theorem. For every $\ell<L-1$, let $\Phi^{[\ell]}:\R\to \R$ denote the activation function of the $\ell$-th (hidden) layer; then we have\[ \frac{\partial \mathcal L }{\partial W^{[\ell]}}=\underbrace{\frac{1}{m}\Phi^{[\ell]}{}'(U^{[\ell]}) * \left[\prod_{i=\ell +1}^{L-1} \left(\Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right)\right]\cdot \frac{\partial \mathcal L}{\partial Y^{[L-1]}} }_{:=\delta_\ell} \cdot Y^{[\ell-1]T} = \delta_{\ell}\cdot Y^{[\ell-1]T}.\] Here $*$ denotes the entrywise multiplication. Since $\displaystyle \frac{\partial \mathcal L}{\partial W^{[L]}}=\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\cdot Y^{[L-1]T}$, we also define \[ \boxed{\delta_L = \frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}}, \]and since \[\frac{\partial \mathcal L}{\partial W^{[L-1]}} =\frac{1}{m}\Phi^{[L-1]}{}'(U^{[L-1]})* \left( W^{[L]T} \cdot \frac{\partial \mathcal L}{\partial U^{[L]}}\right) Y^{[L-2]T}=\delta_{L-1} Y^{[L-2]T} \] with $\delta_{L-1} :=\frac{1}{m}\cdot \left( \Phi^{[L-1]}{}'(U^{[L-1]}) * W^{[L]T}\right)\cdot \frac{\partial \mathcal L}{\partial U^{[L]}}$, the definition of $\delta_\ell$ for $\ell<L-1$ above gives, for every $\ell\leq L-1$, \[\boxed{ \delta_{\ell} = \Phi^{[\ell]}{}'(U^{[\ell]}) * \left[W^{[\ell+1]T} \cdot \delta_{\ell+1}\right]\quad \text{with}\quad \frac{\partial \mathcal L}{\partial W^{[\ell]}} = \delta_\ell Y^{[\ell-1]T}} \] (note that the factor $\frac{1}{m}$ is already contained in $\delta_{\ell+1}$). And as a side consequence of our computation, since $\displaystyle\frac{1}{m}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}} = \delta_\ell$, \[ \boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]}} = \text{np.sum}(\delta_\ell,\text{axis=1}).} \]

The last two formulas are computationally very useful. Note that in the definition of $\delta_\ell$, the multiplications in the product notation only make sense when they act on the rightmost matrix $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}} $ in the correct order (starting from the largest index). To simplify notation we follow Andrew Ng's course and define $dW = \partial \mathcal L /\partial W$, and similarly for the other matrices.

Proof. By repeated use of the formula $dY^{[\ell]} = W^{[\ell+1]T}\cdot\left[dY^{[\ell+1]} * \Phi^{[\ell+1]}{}'(U^{[\ell+1]})\right]$ we have \[\begin{align*} dW^{[\ell]}& = \frac{1}{m} dU^{[\ell]} Y^{[\ell-1]T}\\ &=\frac{1}{m}\left(\left[dY^{[\ell]}\right] * \Phi^{[\ell]}{}'(U^{[\ell]})\right) Y^{[\ell-1]T}\\ &=\frac{1}{m}\left( \Phi^{[\ell]}{}'(U^{[\ell]})* \left[\prod_{i=\ell+1}^{L-1} \left(\Phi^{[i]}{}'(U^{[i]}) * W^{[i]T}\right)\right]\cdot dY^{[L-1]}\right) \cdot Y^{[\ell-1]T}. \end{align*} \] Finally, $dY^{[L-1]}$ is exactly the factor $\displaystyle \frac{\partial \mathcal L}{\partial Y^{[L-1]}}$ appearing in the statement of the theorem. $\qed$
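To make the recursion concrete, here is a minimal NumPy sketch of a full backward pass built from the two boxed formulas. All the names here (the lists W, U, Y of cached forward-pass matrices, the list dPhi of activation derivatives, and dU_L for $\partial \mathcal L/\partial U^{[L]}$) are my own choices for illustration, not notation from this post, and the output-layer gradient dU_L is assumed to be computed separately.

import numpy as np

def backward_pass(W, U, Y, dPhi, dU_L, m):
    # W[l], U[l], Y[l]: weights, pre-activations and activations of layer l (cached
    # during the forward pass), with Y[0] the input matrix X; dPhi[l]: entrywise
    # derivative of the l-th activation; dU_L: dL/dU^[L]; m: number of samples.
    L = len(W) - 1                      # layers are indexed 1..L (index 0 unused)
    dW, db = {}, {}
    delta = dU_L / m                    # delta_L = (1/m) dL/dU^[L]
    dW[L] = delta @ Y[L - 1].T          # dL/dW^[L] = delta_L Y^[L-1]T
    db[L] = np.sum(delta, axis=1, keepdims=True)
    for l in range(L - 1, 0, -1):
        # delta_l = Phi^[l]'(U^[l]) * (W^[l+1]T . delta_{l+1})
        delta = dPhi[l](U[l]) * (W[l + 1].T @ delta)
        dW[l] = delta @ Y[l - 1].T      # dL/dW^[l] = delta_l Y^[l-1]T
        db[l] = np.sum(delta, axis=1, keepdims=True)
    return dW, db

Each dW[l] and db[l] can then be fed directly into a gradient-descent update.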

Sunday, September 27, 2020

Formulas Revisited

When I studied Andrew Ng's course I got used to notations like $dZ^{[\ell]}$, $dA$, etc. Recently I revisited the topic and I am now used to the notation \[u_i^{[\ell]} = W^{[\ell]}_{i:}\cdot y^{[\ell-1]} + b^{[\ell]}\quad\text{and}\quad y^{[\ell]} = \Phi^{[\ell]}(u^{[\ell]}),\] so I want to record the corresponding formulas for computation. Since the notation $dW$ doesn't look any cleaner than $\displaystyle\frac{\partial \mathcal L}{\partial W}$, in the sequel I write everything out explicitly. \[ \boxed{\frac{\partial \mathcal L}{\partial W^{[\ell]}} = \frac{1}{m}\cdot \frac{\partial \mathcal L }{\partial U^{[\ell]}}\cdot Y^{[\ell-1]T}} \] \[ \boxed{\frac{\partial \mathcal L}{\partial U^{[\ell]}} = \brac{ W^{[\ell +1]T} \cdot \frac{\partial \mathcal L}{\partial U^{[\ell+1]}} }* \Phi^{[\ell]}{}'(U^{[\ell]})} \] where $*$ denotes the entrywise product of matrices. \[ \boxed{\frac{\partial \mathcal L}{ \partial Y^{[\ell - 1]}} = W^{[\ell]T}\cdot \frac{\partial \mathcal L}{\partial U^{[\ell]}}} \] The last two yield the following, for $\ell<L$ a hidden layer and $\Phi^{[\ell]}:\R\to \R$ the activation function at the $\ell$-th layer: \[ \boxed{\frac{\partial \mathcal L}{\partial U^{[\ell]}} = \frac{\partial \mathcal L}{\partial Y^{[\ell]}} * \Phi^{[\ell]}{}'(U^{[\ell]})} \] and finally \[ \boxed{\frac{\partial \mathcal L}{\partial b^{[\ell]} } =\frac{1}{m}\cdot \sum_{i=1}^m \frac{\partial \mathcal L}{\partial u^{[\ell](i)}} =\frac{1}{m}\cdot \text{np.sum}\brac{\frac{\partial \mathcal L}{\partial U^{[\ell]}},\text{axis = 1}}} \] For the derivation of these formulas one can visit my other post: https://checkerlee.blogspot.com/2019/11/important-formulas-in-backward.html#more
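As a sanity check, the boxed formulas for a single hidden layer translate into the following minimal NumPy sketch; the argument names (dU_next for $\partial \mathcal L/\partial U^{[\ell+1]}$, Y_prev for $Y^{[\ell-1]}$, dphi for $\Phi^{[\ell]}{}'$, and so on) are my own and not from the post.

import numpy as np

def hidden_layer_grads(W_next, dU_next, U, Y_prev, dphi, m):
    # dL/dY^[l] = W^[l+1]T . dL/dU^[l+1]
    dY = W_next.T @ dU_next
    # dL/dU^[l] = dL/dY^[l] * Phi^[l]'(U^[l])   (entrywise product)
    dU = dY * dphi(U)
    # dL/dW^[l] = (1/m) dL/dU^[l] . Y^[l-1]T
    dW = (dU @ Y_prev.T) / m
    # dL/db^[l] = (1/m) np.sum(dL/dU^[l], axis=1)
    db = np.sum(dU, axis=1, keepdims=True) / m
    return dU, dW, db

Applying this function layer by layer, from $\ell = L-1$ down to $\ell = 1$, gives the whole backward pass.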

Saturday, September 26, 2020

Intuitive derivation of Cross Entropy as a "loss" function

In defining "loss" function for classification problems given $p_i=\mathbb P\{\text{$i$ occurs}\}$, $i=1,2,\dots,n$,  from emperical data, we measure the accuracy of estimated data (from our output layer in neuron network) $[q_1,q_2,\dots,q_n]$ by the cross-entropy: \[L=\sum_{i=1}^n p_i\ln q_i.\] Recently I revisit this topic, and understand that this comes very naturally from solving maximum-likelihood estimation problem!

Let's take an example: consider flipping a coin that shows heads with probability $p$ and tails with probability $1-p$. The probability of getting 2 heads out of 6 flips is \[ L(p) = \binom{6}{2} p^2 (1-p)^4 = 15 p^2(1-p)^4. \] Maximum-likelihood estimation asks the following question:

Under what value of $p$ is the phenomenon of getting 2 heads most likely to happen?

In other words, the question asks: at what value of $p$ is the probability $L(p)$ maximized? By simply solving $L'(p)=0$ we get the answer $p=\frac{1}{3}$.
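A quick symbolic check of this computation, using SymPy (the variable names are mine):

import sympy as sp

p = sp.symbols('p')
L = 15 * p**2 * (1 - p)**4        # likelihood of 2 heads in 6 flips
critical_points = sp.solve(sp.diff(L, p), p)
print(critical_points)            # [0, 1/3, 1]; L is maximized at p = 1/3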

But in more complex problems we cannot write down the probability of a phenomenon in terms of an underlying parameter by an explicit formula. Instead of computing the probability $p$ directly, we try to estimate it so that our observation (the phenomenon seen in the empirical data) is most likely to occur, and such an estimated value of $p$ is considered a good estimate.

Now the derivation of cross-entropy becomes very intuitive. Assume that the mutually disjoint events \[ E_i=\{\text{$i$ occurs}\},\quad \mathbb P(E_i) = p_i, \quad i=1,2,3,\dots,n, \] exhaust all outcomes of a single trial. Suppose we perform $N$ independent trials $A_1,\dots,A_N$, each of which results in exactly one of the outcomes $1,\dots,n$ (for example, flipping a coin $N$ times); then $p_i = N_i/N$, where $N_i$ is the number of trials among $A_1,\dots,A_N$ in which $i$ occurs.

Now we obtain another estimate $q_i$ of the probability of the same event $E_i$ from whatever experiment we can imagine. How good is $[q_1,\dots,q_n]$ as an estimate against the past empirical data $[p_1,\dots,p_n]$? The standard distance in $\R^n$ is certainly not a good choice, since a discrepancy of $\epsilon$ between $q_i$ and $p_i$ can mean a huge difference for another pair $q_{i'}$ and $p_{i'}$. Instead, $[q_1,\dots,q_n]$ is considered a good estimate if the observed phenomenon \[ \{\text{1 appears $N_1$ times}\}, \quad \{\text{2 appears $N_2$ times}\},\quad \dots ,\quad \{\text{$n$ appears $N_n$ times} \} \] is very likely to happen under the estimates $[q_1,\dots,q_n]$, i.e., when the likelihood \[ L = \prod_{i=1}^n q_i^{N_i},\quad\text{equivalently}\quad \frac{\ln L}{N}= \sum_{i=1}^n \frac{N_i}{N}\ln q_i = \sum_{i=1}^n p_i\ln q_i, \] is large. We have derived the cross-entropy at this point.
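A tiny numerical illustration of the identity $\frac{\ln L}{N} = \sum_i p_i \ln q_i$ (NumPy; all the counts and estimates below are made up for illustration):

import numpy as np

counts = np.array([30, 50, 20])          # N_i: how often each outcome occurred
N = counts.sum()                          # N = 100 trials
p = counts / N                            # empirical probabilities p_i = N_i / N
q = np.array([0.25, 0.55, 0.20])          # some estimated probabilities q_i

log_L_over_N = np.sum(counts * np.log(q)) / N   # (ln L)/N with L = prod_i q_i^{N_i}
cross_entropy = np.sum(p * np.log(q))            # sum_i p_i ln q_i
print(np.isclose(log_L_over_N, cross_entropy))   # True: the two quantities agree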

Sunday, September 20, 2020

A useful command from a git review course:

git rm --cached -r build/
If we type git rm -h we get the explanation:
--cached only remove from the index
By "index" git means the tracked files inside the staging area (files are always tracked once they have been committed). A .gitignore entry has no effect on files that are already tracked, so we use git rm --cached to untrack them.
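For example, if build/ was committed by mistake and should be ignored from now on, the whole workflow looks something like this (run from the repository root):

echo "build/" >> .gitignore
git rm --cached -r build/
git commit -m "Stop tracking build/"

After the commit, the files stay on disk but git no longer tracks them, and the .gitignore entry keeps them out of future commits.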