When we do regression with a decision tree such as CART, part of the learning bias is the fact that we choose to approximate the values in a region with a single constant. What a decision tree does for regression, essentially, is to recursively split the p-dimensional input space into regions. Then, for each region, it selects the corresponding training instances and uses their target values to compute a constant that is later used for prediction.
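To make this concrete, here is a toy sketch in plain numpy (the single split point and all names are invented for illustration; a real CART would learn the splits): one split of a one-dimensional input space into two regions, with the constant for each region computed from the training targets that fall in it.

```python
import numpy as np

# Toy illustration: split a 1-dimensional input space into two regions
# and predict a single constant (here, the mean of the training targets)
# inside each region. The split point is made up for the example.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)
y = np.where(x < 0.5, 1.0, 3.0) + rng.normal(scale=0.2, size=200)

split = 0.5  # pretend this split was found by the tree-growing procedure
left_constant = y[x < split].mean()
right_constant = y[x >= split].mean()

def predict(x_new):
    # piecewise-constant prediction, one value per region
    return np.where(x_new < split, left_constant, right_constant)

print(predict(np.array([0.2, 0.8])))  # roughly [1.0, 3.0]
```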
How CART and its relatives perform the splitting is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms, I am interested in how to calculate a constant usable for prediction, given a set of target values $y=(y_1,y_2,\dots,y_n)$. In order to obtain it we use a loss function: a function which establishes the criterion by which the proper constant value is chosen.
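As a sketch of this idea (the grid search and the helper name `best_constant` are mine, just for illustration), we can approximate the minimizing constant for any loss by brute force:

```python
import numpy as np

# Generic sketch: given targets y and a per-instance loss, pick the
# constant c that makes the total loss smallest, by scanning a fine grid.
def best_constant(y, loss, num=10001):
    candidates = np.linspace(min(y), max(y), num)
    totals = [sum(loss(yi, c) for yi in y) for c in candidates]
    return candidates[int(np.argmin(totals))]

y = [1.0, 2.0, 2.5, 3.0, 10.0]
print(best_constant(y, lambda yi, c: (yi - c) ** 2))  # ~3.7, the mean of y
print(best_constant(y, lambda yi, c: abs(yi - c)))    # ~2.5, the median of y
```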
I was interested in two very often used loss functions: square loss and absolute loss.
Square loss
The square loss function is defined as
$$f(y,c)=\sum_{i=1}^{n}(y_i-c)^2$$
and what we are searching for is the constant $c$ for which the value of the loss function is minimal:
$$c=\underset{c}{\operatorname{arg\,min}}\sum_{i=1}^{n}(y_i-c)^2$$
Here there is a simple way to calculate it, using the fact that the loss is a quadratic function of $c$ with positive leading coefficient. Based on that we know that the function has a unique minimum, and that minimum is where the derivative equals 0:
$$\frac{\partial}{\partial c}\left[\sum_{i=1}^{n}(y_i-c)^2\right]=0$$
We calculate
$$\frac{\partial}{\partial c}\left[\sum_{i=1}^{n}(y_i-c)^2\right]=\sum_{i=1}^{n}\frac{\partial}{\partial c}\left[(y_i-c)^2\right]=\sum_{i=1}^{n}2(y_i-c)\frac{\partial}{\partial c}(y_i-c)=\sum_{i=1}^{n}-2(y_i-c)=0$$
$$\sum_{i=1}^{n}y_i=\sum_{i=1}^{n}c=nc$$
$$c=\frac{\sum_{i=1}^{n}y_i}{n}=\bar{y}$$
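The same derivation can be checked symbolically. Here is a minimal sketch using sympy (assuming it is available; the fixed n = 5 is arbitrary):

```python
import sympy as sp

# Differentiate the squared loss w.r.t. c and solve for the stationary
# point, mirroring the derivation above for a fixed n = 5.
n = 5
ys = sp.symbols(f"y1:{n + 1}")  # the symbols y1, ..., y5
c = sp.symbols("c")
loss = sum((yi - c) ** 2 for yi in ys)
stationary = sp.solve(sp.diff(loss, c), c)
print(stationary)  # prints the mean: y1/5 + y2/5 + y3/5 + y4/5 + y5/5
```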
Absolute loss
The absolute loss function is defined as
$$f(y,c)=\sum_{i=1}^{n}|y_i-c|.$$
Again, what we are searching for is the constant $c$ for which the value of the loss function is minimal:
$$c=\underset{c}{\operatorname{arg\,min}}\sum_{i=1}^{n}|y_i-c|$$
The function is nonnegative and piecewise linear in $c$, so it does attain a minimum somewhere, but this time we can't use differentiation: the absolute value is not differentiable at the points $y_i$. I read somewhere that $c$ is the median, and that this is statistically correct. I do not know what "statistically correct" means, and it sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn about a shorter proof.
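Before the proof, a quick numerical sanity check (a sketch assuming numpy; the skewed exponential sample is invented so that the mean and the median differ visibly):

```python
import numpy as np

# Check numerically that the median minimizes sum(|y_i - c|).
rng = np.random.default_rng(7)
y = rng.exponential(scale=3.0, size=101)  # skewed, so mean != median

def absolute_loss(c, y):
    return np.sum(np.abs(y - c))

candidates = np.linspace(y.min(), y.max(), 10001)
losses = [absolute_loss(c, y) for c in candidates]
best = candidates[np.argmin(losses)]

print(f"grid minimizer: {best:.4f}")
print(f"median of y:    {np.median(y):.4f}")  # close to the grid minimizer
print(f"mean of y:      {np.mean(y):.4f}")    # noticeably different
```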
Now for the proof. We consider, without loss of generality, that the $y$ values are sorted ($y_1\le y_2\le\dots\le y_n$). We are interested first in what happens when $c<y_1$. Since $c<y_1\le y_i$, both $y_1-c$ and $y_i-y_1$ are nonnegative, so each absolute value splits into a sum:
$$f(y,c)\Big|_{c<y_1}=\sum_{i=1}^{n}|y_i-c|=\sum_{i=1}^{n}|y_i-y_1+y_1-c|=\sum_{i=1}^{n}\big(|y_i-y_1|+|y_1-c|\big)\ge\sum_{i=1}^{n}|y_i-y_1|=f(y,c)\Big|_{c=y_1}$$
So
$$f(y,c)\Big|_{c<y_1}\ge f(y,c)\Big|_{c=y_1}$$
In a similar manner we show also that
$$f(y,c)\Big|_{c>y_n}\ge f(y,c)\Big|_{c=y_n}$$
What we found is that to minimize the absolute loss, $c$ must lie between $y_1$ and $y_n$ inclusive. Next we note that, when $c$ is in this interval, the sum of the distances to the two endpoints is constant:
$$|y_1-c|+|y_n-c|=y_n-y_1$$
So the endpoints contribute the same amount no matter where $c$ sits between them. We can simply drop the two points $y_1$ and $y_n$ from consideration, restrict our attention to the interval $[y_2,y_{n-1}]$, and apply the same process recursively.
Finally we will end up with either a single point in the middle, or an interval containing no other $y$ values. If we found a point, then that point is our solution. If we found an interval, any point from that interval is a possible solution. We note that the definition of the median fits these conditions, so the median is a solution which minimizes the absolute loss function.
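The recursive argument translates almost directly into code. Here is a small sketch (the function name is mine) that repeatedly peels off the smallest and largest remaining values and returns either the single minimizing point or the whole interval of minimizers:

```python
# Sketch of the recursive peeling argument: repeatedly drop the smallest
# and largest remaining values; whatever survives (one point, or an
# interval) contains exactly the minimizers of the absolute loss.
def median_by_peeling(y):
    y = sorted(y)
    while len(y) > 2:
        y = y[1:-1]  # drop y_min and y_max; their summed distance to any
                     # c between them is constant, so they don't affect argmin
    if len(y) == 1:
        return (y[0], y[0])  # unique minimizer
    return (y[0], y[1])      # any c in this closed interval is a minimizer

print(median_by_peeling([7, 1, 4, 9, 3]))  # (4, 4): the median
print(median_by_peeling([7, 1, 4, 9]))     # (4, 7): interval of minimizers
```

For an odd number of values the loop ends with a single point, the median; for an even number it ends with the interval between the two middle values, any point of which (including the conventional midpoint median) minimizes the absolute loss.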