When we do regression with a decision tree like CART, part of the learning bias is the fact that we choose to approximate the values from a region with a single constant. Basically, what a decision tree does in the case of regression is to recursively split the p-dimensional space into regions. Then, for each region, it selects the corresponding instances and, using the target values from that subset, computes a constant used later for prediction.
How CART and its friends work, regarding the splitting process, is not the subject here. What I want to talk about is how to find a proper constant value for a region. To reformulate in simpler terms, I am interested in how to compute a constant usable for prediction, given a set of values \(y=(y_1,y_2,..,y_n)\). In order to obtain that we use a loss function. A loss function establishes the criterion by which we choose the proper constant.
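To make the problem concrete, here is a minimal sketch in Python (my own illustration, not taken from CART or any library; the names `best_constant` and `candidates` are hypothetical): given the target values of one region, we can always approximate the minimizing constant by brute force over a grid of candidates. The closed-form answers derived below agree with what this brute force finds.

```python
import numpy as np

# A brute-force sketch (illustrative names, not from any library):
# given the target values of one region, try a grid of candidate
# constants and keep the one with the smallest loss.
def best_constant(y, loss):
    candidates = np.linspace(y.min(), y.max(), 1001)
    losses = [loss(y, c) for c in candidates]
    return candidates[np.argmin(losses)]

def square_loss(y, c):
    return np.sum((y - c) ** 2)

def absolute_loss(y, c):
    return np.sum(np.abs(y - c))

y = np.array([1.0, 2.0, 2.5, 7.0])
print(best_constant(y, square_loss))    # ~3.125, the mean of y
print(best_constant(y, absolute_loss))  # a point in [2.0, 2.5], the median interval
```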
I was interested in two very often used loss functions: square loss and absolute loss.
Square loss
The square loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}(y_i-c)^2 $$ and what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{(y_i-c)^2} $$ Here there is a simple way to calculate it: the function is a sum of squares, hence convex in \( c \), so it has a unique minimum, attained where the derivative equals 0. $$ \frac{\partial}{\partial{c}}\left[\sum_{i=1}^{n}{(y_i-c)^2}\right]=0 $$ We calculate $$ \frac{\partial}{\partial{c}}\left[\sum_{i=1}^{n}{(y_i-c)^2}\right] = \sum_{i=1}^{n}{\frac{\partial}{\partial{c}}{\left[(y_i-c)^2\right]}} = \sum_{i=1}^{n}{2(y_i-c)\frac{\partial}{\partial{c}}(y_i-c)} = \sum_{i=1}^{n}{-2(y_i-c)} = 0 $$ $$ \sum_{i=1}^{n}{y_i} = \sum_{i=1}^{n}{c} = nc $$ $$ c = \frac{\sum_{i=1}^{n}{y_i}}{n} = \overline{y} $$ So the constant which minimizes the square loss is the mean of the \(y\) values.
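A quick numerical check of this result (again a sketch of mine, not part of the derivation): no candidate \( c \) from a grid should beat the mean under square loss.

```python
import numpy as np

# Verify numerically that the mean minimizes the square loss:
# no grid candidate should achieve a smaller loss.
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
mean = y.mean()

for c in np.linspace(y.min(), y.max(), 101):
    assert np.sum((y - mean) ** 2) <= np.sum((y - c) ** 2)

print("mean =", mean)  # 2.8, the minimizer
```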
Absolute loss
The absolute loss function is defined as $$ f(y,c)=\sum_{i=1}^{n}|y_i-c| $$ Again, what we are searching for is the constant \( c \) for which the value of the loss function is minimal: $$ c = \underset{c}{\arg\min} \sum_{i=1}^n{|y_i-c|} $$ The function is positive and piecewise linear, so we know it has a minimum somewhere, but this time we can't use differentiation: the absolute value is not differentiable at the points \(y_i\). I read somewhere that \(c\) is the median, and that it is statistically correct. I do not know what "statistically correct" means, and that sounds fuzzy to me. After a few tries I found a way to prove it. I hope some day I will learn a shorter proof.
We consider, without loss of generality, that the \(y\) values are sorted (\(y_1 \leq y_2 \leq ... \leq y_n\)). We are interested first in what happens when \(c < y_1\). In that case \(y_i - y_1 \geq 0\) and \(y_1 - c > 0\) for every \(i\), so the triangle inequality becomes an equality and we calculate $$ f(y,c)=\sum_{i=1}^{n}{|y_i-c|}=\sum_{i=1}^{n}{|y_i-y_1+y_1-c|}=\sum_{i=1}^{n}{|y_i-y_1|}+\sum_{i=1}^{n}{|y_1-c|} \geq \sum_{i=1}^{n}{|y_i-y_1|}=f(y,y_1) $$ So $$ f(y,c) \geq f(y,y_1) \quad \text{for } c<y_1 $$ In a similar manner we show that $$ f(y,c) \geq f(y,y_n) \quad \text{for } c>y_n $$ What we found is that, to minimize the absolute loss, \(c\) must lie between \(y_1\) and \(y_n\) inclusive. Next we note that, when \(c\) is in this interval, the summed distance to both ends is constant: $$ |y_1-c|+|y_n-c|=y_n-y_1 $$ So the two points \(y_1\) and \(y_n\) contribute the same amount for any such \(c\); we can drop them, restrict our attention to the interval \([y_2,y_{n-1}]\), and apply the same process recursively.
Finally we find either a single point in the middle, or an interval containing no other \(y\) values. If we found a point, then that is our solution. If we found an interval, any point from that interval is a possible solution. We note that the definition of the median fits these conditions, so the median is a solution which minimizes the absolute loss function. The sketch below illustrates this peeling process.
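Here is a small illustration of the peeling argument (my own sketch; `minimizing_interval` is a hypothetical name): repeatedly discard the smallest and largest remaining values until a single point or a final interval is left, then check that the median falls inside it.

```python
import numpy as np

# Peel off the two extremes at each step; their summed distance to any c
# between them is constant, so they do not affect where the minimum is.
def minimizing_interval(y):
    ys = sorted(y)
    while len(ys) > 2:
        ys = ys[1:-1]  # drop the smallest and largest remaining values
    return ys[0] if len(ys) == 1 else (ys[0], ys[1])

print(minimizing_interval([7.0, 1.0, 2.5, 2.0]))  # (2.0, 2.5): any c here minimizes
print(np.median([7.0, 1.0, 2.5, 2.0]))            # 2.25, inside that interval

print(minimizing_interval([5.0, 1.0, 3.0]))       # 3.0, exactly the median
```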