In defining a loss function for classification problems, given class probabilities p_i=\mathbb P\{\text{$i$ occurs}\}, i=1,2,\dots,n, estimated from empirical data, we measure the accuracy of the predicted distribution [q_1,q_2,\dots,q_n] (from the output layer of a neural network) by the cross-entropy: L=-\sum_{i=1}^n p_i\ln q_i. Recently I revisited this topic, and realized that this formula arises very naturally from solving a maximum-likelihood estimation problem!
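As a concrete anchor before the derivation, here is a minimal numerical sketch of this quantity; the particular values of p and q are made up for illustration:

```python
import numpy as np

# Hypothetical empirical class frequencies p and predicted probabilities q.
p = np.array([0.7, 0.2, 0.1])   # "true" distribution from data
q = np.array([0.6, 0.3, 0.1])   # predictions from the output layer

cross_entropy = -np.sum(p * np.log(q))   # L = -sum_i p_i ln q_i
print(cross_entropy)                      # ~ 0.8286
```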
Let's take an example: consider flipping a coin that lands heads with probability p and tails with probability 1-p. Then the probability of getting 2 heads out of 6 flips is
L(p) = \binom{6}{2} p^2 (1-p)^4 = 15 p^2(1-p)^4.
Maximum-likelihood estimation asks the following question:
Under what value of p is the phenomenon of getting 2 heads most likely to happen?
In other words, at what value of p is the probability L(p) maximized? Solving L'(p) = 15\,p(1-p)^3(2-6p) = 0 gives the answer p=\frac{1}{3}.
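A quick numerical sanity check of this answer, as a sketch (maximizing L(p) over a fine grid of candidate values rather than solving the derivative condition):

```python
import numpy as np

# Evaluate L(p) = 15 p^2 (1-p)^4 on a grid and locate its maximizer.
p = np.linspace(0.001, 0.999, 100_000)
L = 15 * p**2 * (1 - p)**4
print(p[np.argmax(L)])   # ~ 0.3333, matching the analytic answer p = 1/3
```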
But in more complex problems we have no explicit formula expressing the probability of a phenomenon in terms of the underlying parameter. Instead of computing the probability p directly, we estimate it so that our observation (the phenomenon seen in the empirical data) is as likely as possible, and such an estimated value of p is considered a good estimate.
Now the derivation of cross-entropy will be very intuitive: Assume that
\text{mutually disjoint }E_i=\{\text{$i$ occurs}\},\quad \mathbb P(E_i) = p_i, \quad i=1,2,\dots,n.
Assume further that we run N independent trials A_1,\dots,A_N (for example, flipping a coin N times), each of which results in exactly one of the outcomes 1,\dots,n, i.e., each trial lands in the disjoint union \bigcup_{i=1}^n E_i. If N_i denotes the number of times outcome i occurs among A_1,\dots,A_N, then for large N we may take p_i = N_i/N as the empirical estimate.
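A small simulation illustrating this setup, with hypothetical probabilities; the empirical frequencies N_i/N approach the true p_i as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# N iid trials over n = 3 outcomes with hypothetical true probabilities p_i.
p = np.array([0.5, 0.3, 0.2])
N = 100_000
trials = rng.choice(len(p), size=N, p=p)        # each trial A_j yields one outcome i
N_i = np.bincount(trials, minlength=len(p))     # counts of each outcome
print(N_i / N)                                  # close to [0.5, 0.3, 0.2]
```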
Now suppose we obtain another estimate q_i of the probability of the same event E_i from whatever experiment we can imagine. How good is [q_1,\dots,q_n] as an estimate against the past empirical data [p_1,\dots,p_n]? The standard Euclidean distance in \mathbb R^n is certainly not a good choice, since a discrepancy of \epsilon between q_i and p_i can be far more significant than the same discrepancy between q_{i'} and p_{i'} (for instance, when p_i is much smaller than p_{i'}). Instead, [q_1,\dots,q_n] is considered a good estimate if the observed phenomenon
\{\text{1 appears $N_1$ times}\}, \quad \{\text{2 appears $N_2$ times}\},\quad \dots ,\quad \{\text{n appears $N_n$ times} \}
is very likely to happen under the estimates [q_1,\dots,q_n], i.e., when
L = \prod_{i=1}^n q_i^{N_i}, \qquad\text{equivalently}\qquad \frac{\ln L}{N}= \sum_{i=1}^n \frac{N_i}{N}\ln q_i = \sum_{i=1}^n p_i\ln q_i,
is large. (We have dropped the multinomial coefficient in L, since it does not depend on the q_i.) Maximizing \sum_{i=1}^n p_i\ln q_i over [q_1,\dots,q_n] is exactly minimizing the cross-entropy -\sum_{i=1}^n p_i\ln q_i, and we have derived the cross-entropy at this point.
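To close the loop, here is a sketch (again with made-up numbers) showing that among several candidate estimates q, the average log-likelihood \sum_i \frac{N_i}{N}\ln q_i is largest when q matches the empirical frequencies, i.e., the best estimate is exactly the one minimizing the cross-entropy:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate N trials under hypothetical true probabilities p, then score
# several candidate distributions q by sum_i (N_i/N) ln q_i.
p = np.array([0.5, 0.3, 0.2])
N = 50_000
N_i = np.bincount(rng.choice(3, size=N, p=p), minlength=3)
freq = N_i / N   # empirical frequencies N_i / N

for q in [np.array([1/3, 1/3, 1/3]), np.array([0.6, 0.2, 0.2]), freq]:
    print(q, np.sum(freq * np.log(q)))   # freq itself scores highest
```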