In defining "loss" function for classification problems given pi=P{i occurs}, i=1,2,…,n, from emperical data, we measure the accuracy of estimated data (from our output layer in neuron network) [q1,q2,…,qn] by the cross-entropy: L=n∑i=1pilnqi. Recently I revisit this topic, and understand that this comes very naturally from solving maximum-likelihood estimation problem!
Let's take an example: consider flipping a coin that comes up heads with probability $p$ and tails with probability $1-p$. Then the probability of getting 2 heads out of 6 flips is $$L(p) = \binom{6}{2} p^2 (1-p)^4 = 15\,p^2(1-p)^4.$$ Maximum-likelihood estimation asks the following question:
Under what value of $p$ is the phenomenon of getting 2 heads most likely to happen?
In other words, the question above is the same as: at what value of $p$ is the probability $L(p)$ maximized? Solving $L'(p) = 15\,p(1-p)^3(2-6p) = 0$ on $(0,1)$ gives the answer $p = \tfrac{1}{3}$.
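As a quick sanity check, here is a minimal numerical sketch (plain Python; the grid resolution is an arbitrary choice) that evaluates $L(p) = 15\,p^2(1-p)^4$ on a fine grid and confirms the maximizer is $p = 1/3$, i.e. the observed frequency $2/6$:

```python
# Numerically locate the maximizer of L(p) = 15 p^2 (1 - p)^4 on (0, 1).
def likelihood(p):
    return 15 * p**2 * (1 - p)**4

# Evaluate L on a fine grid and take the argmax (grid size is arbitrary).
grid = [i / 100_000 for i in range(1, 100_000)]
p_hat = max(grid, key=likelihood)

print(p_hat)  # ~0.33333, matching 2 heads / 6 flips
```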
But in more complex problems we may not have an explicit formula for the probability of a phenomenon in terms of the underlying parameter. Instead of computing the probability $p$ directly, we try to estimate it so that our observation (the phenomenon recorded in the empirical data) is most likely to occur, and such an estimated value of $p$ is considered a good estimate.
Now the derivation of cross-entropy becomes very intuitive. Assume the events $E_i = \{i \text{ occurs}\}$, $i=1,2,\dots,n$, are mutually disjoint with $P(E_i) = p_i$.
Assume further that we run $N$ independent, identically distributed trials $A_1,\dots,A_N$, each of which results in exactly one of the outcomes $1,\dots,n$ (for example, flipping a coin $N$ times). Then $p_i = N_i/N$, where $N_i$ is the number of times outcome $i$ occurs among $A_1,\dots,A_N$.
Now suppose we obtain another estimate $q_i$ of $P(E_i)$ from whatever experiment or model we can imagine. How good is $[q_1,\dots,q_n]$ as an estimate of the past empirical data $[p_1,\dots,p_n]$? The standard distance in $\mathbb{R}^n$ is certainly not a good choice, since a gap of $\epsilon$ between $q_i$ and $p_i$ can mean something very different from the same gap between $q_{i'}$ and $p_{i'}$. Rather, $[q_1,\dots,q_n]$ is considered a good estimate if the observed phenomenon $\{1 \text{ appears } N_1 \text{ times}\}, \{2 \text{ appears } N_2 \text{ times}\}, \dots, \{n \text{ appears } N_n \text{ times}\}$ is very likely to happen under the estimates $[q_1,\dots,q_n]$, i.e., when the likelihood $$L = \prod_{i=1}^{n} q_i^{N_i}, \qquad\text{or equivalently}\qquad \frac{\ln L}{N} = \sum_{i=1}^{n} \frac{N_i}{N}\ln q_i = \sum_{i=1}^{n} p_i \ln q_i,$$ is large. Maximizing this quantity is the same as minimizing its negative $-\sum_{i=1}^{n} p_i \ln q_i$, and we have derived the cross-entropy at this point.
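To make this concrete, here is a minimal sketch (plain Python; the counts and the candidate estimates are made-up numbers) that evaluates $\sum_i p_i \ln q_i$ for a few candidates $[q_1,\dots,q_n]$ and shows it is largest, i.e. the cross-entropy is smallest, when $q$ matches the empirical frequencies $p$:

```python
import math

# Hypothetical empirical counts: outcome i observed N_i times, N = 100 trials in total.
counts = [20, 50, 30]
N = sum(counts)
p = [n_i / N for n_i in counts]  # empirical frequencies p_i = N_i / N

def avg_log_likelihood(q):
    """(ln L) / N = sum_i p_i * ln(q_i) for the candidate estimate q."""
    return sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q))

# A few made-up candidate estimates, including the empirical frequencies themselves.
candidates = {
    "uniform":    [1/3, 1/3, 1/3],
    "off target": [0.10, 0.60, 0.30],
    "empirical":  [0.20, 0.50, 0.30],
}

for name, q in candidates.items():
    print(f"{name:>11}: sum p_i ln q_i = {avg_log_likelihood(q):.4f}")
# "empirical" gives the largest value (smallest cross-entropy),
# in line with the maximum-likelihood argument above.
```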