
Saturday, September 26, 2020

Intuitive derivation of cross-entropy as a "loss" function

In defining a "loss" function for classification problems, given $p_i = P\{i \text{ occurs}\}$, $i = 1, 2, \dots, n$, from empirical data, we measure the accuracy of the estimated distribution $[q_1, q_2, \dots, q_n]$ (from the output layer of our neural network) by the cross-entropy: $L = -\sum_{i=1}^n p_i \ln q_i$. Recently I revisited this topic and came to understand that this arises very naturally from solving a maximum-likelihood estimation problem!
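As a concrete illustration, here is a minimal numerical sketch (the two distributions below are made up for illustration, not taken from any real data):

    import numpy as np

    # Hypothetical target distribution p (e.g., empirical class frequencies)
    # and model estimate q (e.g., the softmax output of a network).
    p = np.array([0.7, 0.2, 0.1])
    q = np.array([0.6, 0.3, 0.1])

    # Cross-entropy L = -sum_i p_i * ln(q_i)
    cross_entropy = -np.sum(p * np.log(q))
    print(cross_entropy)  # ~0.83; it equals the entropy of p only when q == p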

        Let's take an example: consider flipping a coin that comes up heads with probability $p$ and tails with probability $1-p$. Then the probability of getting 2 heads out of 6 flips is $L(p) = \binom{6}{2} p^2 (1-p)^4 = 15\,p^2(1-p)^4$. Maximum-likelihood estimation asks the following question:

The phenomenon of getting 2 heads is most likely to happen under what value of $p$?

In other words, the above question is the same as asking: at what value of $p$ is the probability $L(p)$ maximized? By simply solving $L'(p) = 0$ we get the answer $p = \frac{1}{3}$.
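This answer is easy to check numerically; the following sketch (my own verification, not part of the original derivation) scans $L(p)$ over a grid:

    import numpy as np

    # L(p) = 15 * p^2 * (1-p)^4: likelihood of seeing 2 heads in 6 flips
    p = np.linspace(0.001, 0.999, 100_000)
    L = 15 * p**2 * (1 - p)**4

    print(p[np.argmax(L)])  # ~0.3333, matching the analytic answer p = 1/3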

        But in more complex problems we have no explicit formula expressing the probability of the observed phenomenon in terms of the quantity we want. Instead of computing the probability $p$ directly, we try to estimate it so that our observation (the phenomenon recorded in the empirical data) is most likely to occur; such an estimated value of $p$ is considered a good estimate.

        Now the derivation of cross-entropy becomes very intuitive. Assume the outcomes $E_i = \{i \text{ occurs}\}$, $i = 1, 2, \dots, n$, are mutually disjoint with $P(E_i) = p_i$. Consider $N$ independent, identically distributed trials $A_1, \dots, A_N$, each of which results in exactly one of the outcomes $E_1, \dots, E_n$ (for example, flipping a coin $N$ times); then $p_i = N_i/N$, where $N_i$ is the number of times outcome $i$ occurs among $A_1, \dots, A_N$.
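A small simulation makes this concrete (a sketch assuming a hypothetical three-outcome experiment with made-up probabilities):

    import numpy as np

    rng = np.random.default_rng(0)
    p_true = np.array([0.5, 0.3, 0.2])  # hypothetical outcome probabilities
    N = 100_000

    # N independent trials, each yielding exactly one of the outcomes 0..n-1
    outcomes = rng.choice(len(p_true), size=N, p=p_true)
    N_i = np.bincount(outcomes, minlength=len(p_true))

    print(N_i / N)  # empirical frequencies, close to p_true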

        Now suppose we obtain another estimate $q_i$ of the same event $E_i$ from whatever experiment we can imagine. How good is $[q_1, \dots, q_n]$ as an estimate of the past empirical data $[p_1, \dots, p_n]$? The standard distance in $\mathbb{R}^n$ is certainly not a good choice, since the same gap $\epsilon$ between $q_i$ and $p_i$ can mean very different things (a gap of $0.01$ is negligible when $p_i = 0.5$ but enormous when $p_i = 0.01$). Rather, $[q_1, \dots, q_n]$ is considered a good estimate if the observed phenomenon $\{1 \text{ appears } N_1 \text{ times}\}, \{2 \text{ appears } N_2 \text{ times}\}, \dots, \{n \text{ appears } N_n \text{ times}\}$ is very likely to happen under the estimates $[q_1, \dots, q_n]$, i.e., when $$L = \prod_{i=1}^{n} q_i^{N_i}, \qquad \frac{\ln L}{N} = \sum_{i=1}^{n} \frac{N_i}{N} \ln q_i = \sum_{i=1}^{n} p_i \ln q_i$$ is large. Maximizing this quantity is the same as minimizing $-\sum_{i=1}^n p_i \ln q_i$, and at this point we have derived the cross-entropy.
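To see the equivalence numerically, here is one more sketch (continuing the hypothetical three-outcome setup above): among candidate estimates $q$, the per-trial log-likelihood $\frac{1}{N}\ln L = \sum_i p_i \ln q_i$ is largest, and hence the cross-entropy $-\sum_i p_i \ln q_i$ smallest, exactly when $q$ matches the empirical $p$:

    import numpy as np

    p = np.array([0.5, 0.3, 0.2])  # empirical frequencies N_i / N

    def avg_log_likelihood(q):
        # (1/N) ln prod_i q_i^{N_i} = sum_i (N_i/N) ln q_i = sum_i p_i ln q_i
        return np.sum(p * np.log(q))

    for q in ([0.5, 0.3, 0.2],   # q == p
              [0.4, 0.4, 0.2],
              [1/3, 1/3, 1/3]):
        q = np.array(q)
        print(q, avg_log_likelihood(q), -avg_log_likelihood(q))
    # q == p attains the largest log-likelihood, i.e., the smallest
    # cross-entropy, in line with Gibbs' inequality.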
