In defining a loss function for classification problems, given the empirical class probabilities $p_i=\mathbb P\{\text{$i$ occurs}\}$, $i=1,2,\dots,n$, we measure the accuracy of the estimated distribution $[q_1,q_2,\dots,q_n]$ (the output layer of a neural network) by the cross-entropy: \[L=-\sum_{i=1}^n p_i\ln q_i.\] Recently I revisited this topic and realized that this comes very naturally from solving a maximum-likelihood estimation problem!
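For concreteness, here is a minimal numerical sketch of this quantity, assuming NumPy (the function name `cross_entropy` and the sample vectors `p` and `q` are made up for illustration):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy -sum_i p_i * ln(q_i) between two discrete distributions.

    p : empirical probabilities (e.g. a one-hot label or class frequencies)
    q : estimated probabilities (e.g. the softmax output of a network)
    eps guards against ln(0) when some q_i is exactly zero.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

# Example: the true label is class 3 (one-hot) and the model assigns it probability 0.7.
p = [0.0, 0.0, 1.0]
q = [0.2, 0.1, 0.7]
print(cross_entropy(p, q))   # = -ln(0.7) ≈ 0.357
```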
Let's take an example: consider flipping a coin that shows heads with probability $p$ and tails with probability $1-p$. The probability of getting 2 heads out of 6 flips is \[
L(p) = \binom{6}{2} p^2 (1-p)^4 = 15 p^2(1-p)^4.
\] Maximum-likelihood estimation asks the following question:
Under what value of $p$ is the observed phenomenon of getting 2 heads most likely to happen?
In other words, at what value of $p$ is the probability $L(p)$ maximized? Solving $L'(p)=0$ gives the answer $p=\frac{1}{3}$.
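Spelling out the calculus step: \[
L'(p) = 15\bigl(2p(1-p)^4 - 4p^2(1-p)^3\bigr) = 30\,p\,(1-p)^3\,(1-3p),
\] which vanishes on $(0,1)$ exactly at $p=\tfrac{1}{3}$, the observed frequency $2/6$.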
But in more complex problems we cannot write down an explicit formula for the probability of the observed phenomenon in terms of the unknown parameter and maximize it by hand. Instead of computing the parameter $p$ directly, we search for a value of $p$ under which our observation (the phenomenon seen in the empirical data) is most likely to occur, and such a value of $p$ is considered a good estimate.
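Even for the simple coin example, the maximizer can be found numerically rather than in closed form; a minimal sketch, assuming NumPy (the binomial coefficient is dropped since it does not depend on $p$):

```python
import numpy as np

# Log-likelihood of observing 2 heads in 6 flips, up to an additive constant
# (the binomial coefficient does not depend on p, so it does not affect the argmax).
def log_likelihood(p, heads=2, flips=6):
    return heads * np.log(p) + (flips - heads) * np.log(1 - p)

# Crude grid search over (0, 1); gradient-based optimizers play the same role
# in more complex models.
grid = np.linspace(1e-6, 1 - 1e-6, 100_000)
p_hat = grid[np.argmax(log_likelihood(grid))]
print(p_hat)   # ≈ 0.3333, matching the closed-form answer p = 1/3
```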
Now the derivation of cross-entropy is very intuitive. Assume that the outcomes \[
E_i=\{\text{$i$ occurs}\},\quad \mathbb P(E_i) = p_i, \quad i=1,2,3,\dots,n
\]
are mutually disjoint, and consider $N$ independent, identically distributed trials $A_1,\dots,A_N$, each of which results in exactly one of the outcomes $E_1,\dots,E_n$ (for example, flipping a coin $N$ times). We then take $p_i = N_i/N$, where $N_i$ is the number of trials among $A_1,\dots,A_N$ in which outcome $i$ occurs.
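A minimal sketch of this counting step, assuming NumPy and a made-up record of trial outcomes:

```python
import numpy as np

# Hypothetical record of N = 10 trials; each entry says which outcome i occurred.
outcomes = np.array([1, 3, 2, 1, 1, 2, 3, 1, 2, 1])   # outcomes labelled 1..n
n = 3
N = len(outcomes)

# N_i = number of trials in which outcome i occurred;  p_i = N_i / N.
counts = np.array([(outcomes == i).sum() for i in range(1, n + 1)])
p = counts / N
print(counts)   # [5 3 2]
print(p)        # [0.5 0.3 0.2]
```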
Now suppose we obtain another estimate $q_i$ of the probability of the same event $E_i$, from whatever experiment or model we like. How good is $[q_1,\dots,q_n]$ as an estimate of the past empirical data $[p_1,\dots,p_n]$? The standard distance in $\mathbb R^n$ is certainly not a good choice, since a discrepancy of $\epsilon$ between $q_i$ and $p_i$ can be far more significant than the same discrepancy between $q_{i'}$ and $p_{i'}$ (for instance when $p_i$ is much smaller than $p_{i'}$). Instead, $[q_1,\dots,q_n]$ is considered a good estimate if the observed phenomenon \[
\{\text{$1$ appears $N_1$ times}\}, \quad \{\text{$2$ appears $N_2$ times}\},\quad \dots ,\quad \{\text{$n$ appears $N_n$ times} \}
\] is very likely to happen under the estimate $[q_1,\dots,q_n]$, i.e., when (dropping the multinomial coefficient, which does not depend on the $q_i$) \[
L = \prod_{i=1}^n q_i^{N_i},\qquad\text{or equivalently}\qquad \frac{\ln L}{N}= \sum_{i=1}^n \frac{N_i}{N}\ln q_i = \sum_{i=1}^n p_i\ln q_i,
\] is large. Maximizing this quantity over $[q_1,\dots,q_n]$ is the same as minimizing its negative, $-\sum_{i=1}^n p_i\ln q_i$, and we have derived the cross-entropy at this point.
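A quick numerical check of the identity $\frac{\ln L}{N}=\sum_{i=1}^n p_i\ln q_i$, assuming NumPy and reusing the made-up counts from the earlier sketch (the candidate estimate $q$ is arbitrary):

```python
import numpy as np

counts = np.array([5, 3, 2])          # N_i from the earlier sketch
N = counts.sum()
p = counts / N                        # empirical frequencies p_i = N_i / N
q = np.array([0.4, 0.4, 0.2])         # some candidate estimate [q_1, ..., q_n]

log_L = np.sum(counts * np.log(q))    # ln L = sum_i N_i ln q_i
print(log_L / N)                      # equals sum_i p_i ln q_i ...
print(np.sum(p * np.log(q)))          # ... i.e. minus the cross-entropy of q w.r.t. p
```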