Examples of failure of the regression-tree variance-minimization criterion
Simulation experiment. We take \(x = 1, 1.2, 1.4, \ldots, 99.8, 100\) and generate the response observations \(y\) according to
$$y \mid x \leqslant 25 \sim N\left( \mu = 1, \sigma = 0.1 \right), \qquad y \mid x > 25 \sim N\left( \mu = 1.1, \sigma = 0.1 \right) \tag{1}$$
where \(N\left( \mu, \sigma \right)\) denotes a normal distribution with mean \(\mu\) and standard deviation \(\sigma\). We apply both the quantile criterion designed in this paper and the variance-minimization criterion to recover the split point (quantile) of the simulated data. When the conditional distributions \(y \mid x > \theta\) and \(y \mid x \leqslant \theta\) are both normal, with means that differ only slightly and standard deviations that differ by a factor of ten, the variance-minimization criterion performs even worse than locating the split by eye. We therefore argue that the criterion needs improvement, and we analyze why the minimization criterion selects the wrong quantile.
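To make the setup concrete, the following Python sketch (our own illustration, not the paper's code) generates data of the form of Eqs. (1)–(2) with close means and standard deviations that differ tenfold, and scans candidate split points with the variance-minimization criterion; the random seed, grid construction, and helper names are assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, not from the paper

# Covariate grid x = 1, 1.2, ..., 100; the true change point is x = 25
x = np.linspace(1.0, 100.0, 496)

# Conditional distributions: means close, standard deviations differing tenfold,
# the scenario discussed in the text (example parameter values)
mu0, sigma0 = 1.0, 0.1
mu1, sigma1 = 1.1, 1.0
y = np.where(x <= 25,
             rng.normal(mu0, sigma0, size=x.size),
             rng.normal(mu1, sigma1, size=x.size))

def variance_index(y_left, y_right):
    """Weighted within-node variance, i.e. the variance-minimization criterion."""
    n = y_left.size + y_right.size
    return (y_left.size / n) * y_left.var() + (y_right.size / n) * y_right.var()

# Scan every interior grid point as a candidate split and keep the minimizer
candidates = x[1:-1]
scores = [variance_index(y[x <= c], y[x > c]) for c in candidates]
best = candidates[int(np.argmin(scores))]
print("split chosen by variance minimization:", best)  # compare with the true change point x = 25
```

The printed split can then be compared with the true change point \(x = 25\) under different choices of the means and standard deviations.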
We further compare simulation experiments under different means \(\mu\) and standard deviations \(\sigma\), where
$$y \mid x \leqslant 25 \sim N\left( \mu_0, \sigma_0 \right), \qquad y \mid x > 25 \sim N\left( \mu_1, \sigma_1 \right) \tag{2}$$
Normal information gain criterion
First, we give the formula for the information entropy of a continuous variable. Assume \(y \sim p\left( y \right)\), where \(p\left( y \right)\) is the density function of \(y\). When \(y\) is normally distributed, the density is:
$$p\left( y; \mu, \sigma \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left( y - \mu \right)^2 \right\} \tag{3}$$
The information entropy of a continuous variable is defined as:
$$Ent\left( y \right) = -\int p\left( y \right) \log p\left( y \right)\, dy \tag{4}$$
Lemma 1
Assume \(y \sim N\left( \mu, \sigma \right)\); then the information entropy of \(y\) is:
$$Ent\left( y \right) = \frac{1}{2}\log 2\pi + \frac{1}{2}\log \sigma^2 + \frac{1}{2} \tag{5}$$
Since the first and last terms are constants, \(\log \sigma^2\) can be used to represent the relative magnitude of the entropy.
Proof:
$$\begin{gathered} Ent\left( y \right) = -\int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left( y - \mu \right)^2 \right\} \log\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left( y - \mu \right)^2 \right\} \right] dy \hfill \\ = -\log \frac{1}{\sqrt{2\pi\sigma^2}} + \frac{1}{2\sigma^2} E\left( y - \mu \right)^2 = \frac{1}{2}\log 2\pi + \frac{1}{2}\log \sigma^2 + \frac{1}{2} \hfill \\ \end{gathered} \tag{6}$$
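As a quick numerical check of Lemma 1 (our own addition, assuming natural logarithms throughout), the closed form (5) can be compared against the differential entropy reported by SciPy:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 0.3  # arbitrary example values
closed_form = 0.5 * np.log(2 * np.pi) + 0.5 * np.log(sigma ** 2) + 0.5
print(closed_form)                           # ~0.2150 for sigma = 0.3
print(norm(loc=mu, scale=sigma).entropy())   # the same value, computed by SciPy
```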
Thus, the information entropy of a data set \(D\) under the normal assumption can be defined as:
$$Ent\left( D \right) = \log \hat{\sigma}^2 \tag{7}$$
where \(\hat{\sigma}^2\) is the estimated variance of the response variable, usually taken as:
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\left( y_i - \bar{y} \right)^2 \tag{8}$$
If the sample is weighted, with weights
$$W = diag\left( w_1, \cdots, w_N \right), \qquad \sum_{i=1}^{N} w_i = 1 \tag{9}$$
then we use:
$$\hat{\sigma}^2 = \sum_{i=1}^{N} w_i \left( y_i - \bar{y} \right)^2 \tag{10}$$
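A minimal sketch of the normal entropy impurity in Eqs. (7)–(10); the function name and the use of the weighted mean for \(\bar{y}\) in the weighted case are our own assumptions:

```python
import numpy as np

def normal_entropy(y, weights=None):
    """Ent(D) = log(sigma_hat^2), Eqs. (7)-(10)."""
    y = np.asarray(y, dtype=float)
    if weights is None:
        var_hat = np.mean((y - y.mean()) ** 2)        # Eq. (8): plain 1/N estimator
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()                               # enforce sum(w) = 1, Eq. (9)
        y_bar = np.sum(w * y)                         # weighted mean (assumed)
        var_hat = np.sum(w * (y - y_bar) ** 2)        # Eq. (10): weighted estimator
    return np.log(var_hat)                            # Eq. (7)
```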
Since the total information entropy of the marginal distribution is the same for every candidate split, the information gain brought by splitting on covariate key at value value (written key:value) can be computed as:
$$Gain\left( D, key{:}value \right) = -\frac{\left| D_{\leqslant value} \right|}{\left| D \right|} Ent\left( D_{\leqslant value} \right) - \frac{\left| D_{>value} \right|}{\left| D \right|} Ent\left( D_{>value} \right) \tag{11}$$
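Continuing the simulation sketch above and reusing the hypothetical normal_entropy helper, Eq. (11) can be turned into a split-selection rule; the minimum-leaf-size guard is our own addition to keep the variance estimates strictly positive:

```python
import numpy as np

def normal_gain(y, x, value):
    """Gain(D, key:value) as in Eq. (11); larger is better.

    The constant total entropy of D is omitted, as in the text.
    """
    left, right = y[x <= value], y[x > value]
    n = y.size
    return (-(left.size / n) * normal_entropy(left)
            - (right.size / n) * normal_entropy(right))

# x, y as generated in the simulation sketch; keep a few points on each side
candidates = x[5:-5]
best = candidates[int(np.argmax([normal_gain(y, x, c) for c in candidates]))]
print("split chosen by normal information gain:", best)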
Bias analysis
Assume the split on covariate key at value value (key:value) divides the sample set into \(D_{\leqslant value}\) and \(D_{>value}\), whose proportions of the total sample are \(p_0\) and \(p_1\) respectively, satisfying:
$$p_0 + p_1 = 1 \tag{12}$$
We assume that the variances of the true conditional distributions \(y \mid key \leqslant value\) and \(y \mid key > value\) satisfy:
$$\sigma_0^2 = 2, \qquad \sigma_1^2 = \sigma_0^6 = 8 \tag{13}$$
Consider the variance-minimization criterion:
$$Var_{index}\left( D, key{:}value \right) = p_0 \sigma_0^2 + p_1 \sigma_1^2 \tag{14}$$
When the means \(\mu_0, \mu_1\) of the true conditional distributions differ only slightly, minimizing (14) drives \(p_1\) toward 0, so the proportion \(p_1\) of \(D_{>value}\) becomes very small. Since \(\sigma_1^2\) is 4 times \(\sigma_0^2\), the force pushing the quantile upward can be regarded as approximately 4. The corresponding quantity under the normal information gain criterion is:
$$-Gain\left( D, key{:}value \right) = p_0 \log \sigma_0^2 + p_1 \log \sigma_1^2 \tag{15}$$
Here the coefficient of \(p_1\) is only 3 times that of \(p_0\) (\(\log 8 / \log 2 = 3\)), so the force pushing the quantile upward is approximately 3. The difference becomes more pronounced as the variance gap grows. For example, with \(\sigma_0^2 = 2, \sigma_1^2 = \sigma_0^{10} = 32\), the upward force under the variance-minimization criterion is 16, while under the normal information gain criterion it is only 5. This is also why, in the simulation experiment with a large variance difference, the quantile 67.586 obtained by variance minimization lies far above the true value. When the mean difference is large, the correcting effect of the mean dominates, so both quantile-selection strategies approximate the true value.
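The coefficient ratios quoted above can be verified directly; this small snippet (our own check) reproduces the upward forces 4 versus 3 and 16 versus 5:

```python
import numpy as np

for var0, var1 in [(2.0, 8.0), (2.0, 32.0)]:
    pull_variance = var1 / var0                 # variance-minimization criterion
    pull_gain = np.log(var1) / np.log(var0)     # normal information gain criterion
    print(var0, var1, pull_variance, pull_gain)
# (2, 8):  4.0 versus 3.0
# (2, 32): 16.0 versus 5.0
```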