\( \newcommand{\N}{\mathbb{N}} \newcommand{\R}{\mathbb{R}} \newcommand{\C}{\mathbb{C}} \newcommand{\Q}{\mathbb{Q}} \newcommand{\Z}{\mathbb{Z}} \newcommand{\P}{\mathcal P} \newcommand{\B}{\mathcal B} \newcommand{\F}{\mathbb{F}} \newcommand{\E}{\mathcal E} \newcommand{\brac}[1]{\left(#1\right)} \newcommand{\abs}[1]{\left|#1\right|} \newcommand{\matrixx}[1]{\begin{bmatrix}#1\end {bmatrix}} \newcommand{\vmatrixx}[1]{\begin{vmatrix} #1\end{vmatrix}} \newcommand{\lims}{\mathop{\overline{\lim}}} \newcommand{\limi}{\mathop{\underline{\lim}}} \newcommand{\limn}{\lim_{n\to\infty}} \newcommand{\limsn}{\lims_{n\to\infty}} \newcommand{\limin}{\limi_{n\to\infty}} \newcommand{\nul}{\mathop{\mathrm{Nul}}} \newcommand{\col}{\mathop{\mathrm{Col}}} \newcommand{\rank}{\mathop{\mathrm{Rank}}} \newcommand{\dis}{\displaystyle} \newcommand{\spann}{\mathop{\mathrm{span}}} \newcommand{\range}{\mathop{\mathrm{range}}} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\innerr}[1]{\left\langle #1 \right \rangle} \newcommand{\ol}[1]{\overline{#1}} \newcommand{\toto}{\rightrightarrows} \newcommand{\upto}{\nearrow} \newcommand{\downto}{\searrow} \newcommand{\qed}{\quad \blacksquare} \newcommand{\tr}{\mathop{\mathrm{tr}}} \newcommand{\bm}{\boldsymbol} \newcommand{\cupp}{\bigcup} \newcommand{\capp}{\bigcap} \newcommand{\sqcupp}{\bigsqcup} \newcommand{\re}{\mathop{\mathrm{Re}}} \newcommand{\im}{\mathop{\mathrm{Im}}} \newcommand{\comma}{\text{,}} \newcommand{\foot}{\text{。}} \)

Sunday, November 10, 2019

On Logistic Regression

Denote by $\sigma$ the sigmoid function defined by $\sigma(z)= 1/(1+e^{-z})$. For a given feature $X\in \R^n$ the estimator of $\mathbb{P}(y = 1 \mid X)$ is given by \[\hat y= a =
\sigma(w^TX+b),
\] where $w\in \R^n$ and $b\in \R$ are the parameters to be learned.
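
As a quick illustration, the prediction step looks as follows in numpy; this is only a minimal sketch, and the names sigmoid and a_hat as well as the particular numbers are hypothetical, chosen just to make it self-contained.

import numpy as np

def sigmoid(z):
    # elementwise 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical feature and parameters with n = 3
X = np.array([[0.5], [-1.2], [3.0]])   # column vector in R^3
w = np.array([[0.1], [0.4], [-0.3]])   # column vector in R^3
b = 0.2
a_hat = sigmoid(w.T @ X + b)           # estimated P(y = 1 | X), shape (1, 1)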

Given training examples $\{(X^{(i)}, y^{(i)}):i=1,2,\dots,m\}$ with $y^{(i)}\in \{0,1\}$, we define \[
\R^m \ni A = \begin{bmatrix} a^{(1)}&\cdots & a^{(m)} \end{bmatrix}^T=\begin{bmatrix}\sigma(w^TX^{(1)}+b)&\cdots &\sigma(w^TX^{(m)}+b) \end{bmatrix}^T \tag*{($*$)}
\] the vector (stacked results) of estimated probabilities of a "positive result" for each training example, and \[
Y = \begin{bmatrix}y^{(1)}&\cdots &y^{(m)} \end{bmatrix}^T
the vector (stacked results) of the true labels from the training examples.
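
These stacked quantities are convenient to compute in one vectorized step. Below is a minimal numpy sketch, assuming (as the code later in this post does) that the training features are stacked column-wise into a data matrix X of shape (n, m); the helper name forward is an illustrative choice, not fixed by the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, w, b):
    # X: (n, m) matrix whose columns are the X^{(i)}'s (assumed layout)
    # w: (n, 1) column vector, b: scalar
    Z = w.T @ X + b                  # the z^{(i)} = w^T X^{(i)} + b, shape (1, m)
    A = sigmoid(Z).reshape(-1, 1)    # stacked a^{(i)}'s as a column, shape (m, 1)
    return A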

The cost function in logistic regression (given $m$ training examples) is given by \[
J = J(w,b)= \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)}),
\] where \[
\mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \ln a^{(i)} - (1-y^{(i)} )  \ln(1-a^{(i)}),
\] and the $a^{(i)}$'s are defined above in ($*$). By minimizing this cost function we solve for $w$ and $b$, which gives us a simple "trained machine". To carry out the minimization we apply gradient descent (a complete numpy sketch of the resulting training loop is given at the end of this post), iterating the updates \[
\begin{align*}
w_i &:= w_i -\alpha \frac{\partial J}{\partial w_i}(w,b), \qquad i=1,\dots,n,\\
b &:=   b-\alpha \frac{\partial J}{\partial b}(w,b),
\end{align*}
\] where $\alpha>0$ is the learning rate. Writing $z^{(i)} = w^TX^{(i)}+b$, the chain rule together with $\sigma' = \sigma(1-\sigma)$ gives $\partial \mathcal{L}/\partial z^{(i)} = a^{(i)}-y^{(i)}$, so a direct computation lets us stack the partial derivatives into \[
\brac{\frac{\partial J}{\partial w} (w,b)}^T := \begin{bmatrix}
\dis\frac{\partial J}{\partial w_1}&\cdots &\dis \frac{\partial J}{\partial w_n}
\end{bmatrix}^T(w,b) = \frac{1}{m} \sum_{i=1}^n e_i \underbrace{(e_i^T X)(A-Y)}_{=\,m\,\partial J /\partial w_i (w,b)} =  \frac{1}{m}X(A-Y),
\] where the $e_i$'s denote the standard basis vectors of $\R^n$ (recall that $n=n_x$ is the dimension of each feature vector) and $X = \begin{bmatrix}X^{(1)}&\cdots &X^{(m)}\end{bmatrix}\in\R^{n\times m}$ now denotes the data matrix whose columns are the training features. In numpy this can be written as
import numpy as np
m = X.shape[1]  # number of training examples
dJ_dw = 1/m * np.dot(X.reshape(-1, m), A.reshape(-1, 1) - Y.reshape(-1, 1))
Note that the right-hand side above is an $n\times 1$ vector; throughout this article we strictly follow the linear-algebra convention that every element of $\R^n$ is represented as a column vector. Similarly, for $\frac{\partial J}{\partial b}$ we get \[
\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^m (a^{(i)} - y^{(i)})
\] which can be computed in Python by
dJ_db = 1/m * np.sum(A - Y)
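
Putting everything together, here is a minimal, self-contained numpy sketch of the gradient-descent loop described above. The function name train_logistic_regression, the default learning rate and iteration count, and the small clipping constant inside the cost (to avoid $\ln 0$) are illustrative assumptions, not something prescribed by the derivation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, Y, alpha=0.1, num_iters=1000):
    # X: (n, m) data matrix with training features as columns
    # Y: (m, 1) column vector of labels in {0, 1}
    n, m = X.shape
    w = np.zeros((n, 1))
    b = 0.0
    for _ in range(num_iters):
        # forward pass: stacked estimated probabilities A, shape (m, 1)
        A = sigmoid(w.T @ X + b).reshape(-1, 1)
        # cost J (clipped to avoid log(0)); used only for monitoring
        eps = 1e-12
        J = -1/m * np.sum(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps))
        # gradients, exactly the stacked formulas derived above
        dJ_dw = 1/m * np.dot(X, A - Y)   # shape (n, 1)
        dJ_db = 1/m * np.sum(A - Y)      # scalar
        # gradient-descent updates
        w = w - alpha * dJ_dw
        b = b - alpha * dJ_db
    return w, b, J

For instance, with X of shape (n, m) and Y of shape (m, 1) built from the training set, calling w, b, J = train_logistic_regression(X, Y) returns the learned parameters together with the final cost.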
