Improving boosting methods with a stable loss function handling outliers

Abnormal observations are frequently encountered in classification problems, and how to obtain a model that remains stable in the presence of outliers has long been a subject of widespread concern. In this article, we draw on the ideas of the AdaBoost algorithm and propose an asymptotically linear loss function that makes the output function more stable on contaminated samples, and we design two boosting algorithms, based on two different ways of updating, to handle outliers. In addition, a technique for overcoming the instability of Newton's method in the presence of weak convexity is introduced. Several examples, in which outliers were artificially added, show that the Discrete L-AdaBoost and Real L-AdaBoost algorithms find the boundary of each category consistently when the data are contaminated. Extensive real-world dataset experiments are used to test the robustness of the proposed algorithms to noise.


Introduction
For classification problems, we know from Bayes' theorem that all we need is P(y = j|x), the posterior or conditional class probability. Noting that E[1_{y=j} | x] = P(y = j|x), where 1_{y=j} is the 0/1 indicator variable for class j, many regression methods can be transferred to the classification domain. A well-known example is estimating E[1_{y=j} | x] by linear regression while ignoring the fact that the response is limited to [0, 1]. While this works in simple cases, several issues have been noticed with the constrained regression problem [1], as the response variable is often outside the range produced by regression methods. A popular method used in statistics to overcome these problems is the logit transformation. For a binary classification problem with y ∈ {−1, 1}, the logistic model has the form

log[ P(y = 1|x) / P(y = −1|x) ] = F(x).    (1)

The monotone logit transformation on the left guarantees that, for any value of F(x) ∈ ℝ, the probability estimate lies in [0, 1]. Many loss functions for classification problems have been proposed along these lines. The exponential criterion E[e^{−yF(x)}], which appears in [2-4], is often used to solve classification problems. In this paper, a robust criterion is suggested for dealing with long-tailed error outliers. Boosting has long been criticized for its sensitivity to noise, and various loss functions (e.g., LogitBoost [2], NDAdaBoost [5], CBAdaBoost [6], SML-Boost [7], ARABUTWSVM [8] and RAEOCSVMs [9]) have been proposed to deal with this issue. In general, classification tasks try to map feature variables to response labels. In practice, such a mapping is not deterministic; that is, instances with the same features may belong to completely different categories.
Due to the limitations of measurement methods and the complexity of the factors that affect classification results in practice, there are always various uncertainties in the response labels. In tasks such as credit risk assessment and profit forecasting, the factors affecting the outcome are uncertain, and experimenters can only make measurements based on general experience. This results in many anomalies in the observed sample, while the true regularity is often hidden. In such classification tasks, boosting algorithms perform quite well. AdaBoost [10], the most prominent member of the boosting family, established the general strategy of boosting: training a new member predictor on reweighted samples at each step. In general, the weights of misclassified samples are increased and the weights of correctly classified samples are decreased [11, 12]. However, when the misclassified samples are themselves noise, increasing their weights is very problematic. To redistribute weights more wisely, several AdaBoost variants based on noise detection have been proposed, such as NDAdaBoost and SML-Boost. NDAdaBoost uses a k-NN predictor to identify noise points; its assumption is that the prediction error rate near noise points is always large, so a target point can be detected from the prediction performance of the existing predictors at nearby points. SML-Boost treats mispredicted points far from the classification boundary as noise points.
Only mispredicted points close to the boundary should be boosted, while those far from the boundary are most likely noise. We will see that such noise-detection based AdaBoost algorithms perform well only if the dataset satisfies their noise-detection assumptions; when the assumptions do not hold, they often perform unsatisfactorily. Another clue for improving robustness comes from the Huber loss [13, 14], which gives mispredicted samples larger but bounded weights, regardless of whether they are noise or not. Kanamori et al. introduced a conditional probability P(y|x) for contaminated samples in their discussion of robust loss functions [15]. Later, Dang et al. proposed a label-confidence based boosting algorithm (CBAdaBoost) [6]. This is essentially a trade-off between increasing the weight of a mispredicted sample and discarding it: when the cost of adding weight is higher than the cost of discarding, the algorithm discards the sample. The loss function proposed in this paper is also based on weight balancing. Unlike the original AdaBoost, which blindly increases the weights of wrongly predicted samples, we set an upper bound on the weight while still giving more weight to mispredicted samples. In the model this manifests itself as an asymptotically linear loss function. When its gradient is applied to weight redistribution, it assigns reasonable new weights to wrongly and correctly predicted samples in a balanced manner. This mechanism of restraint towards mispredicted samples enables the model to remain calm in the face of noise. We found that the negative binomial log-likelihood not only handles the classification problem effectively, with a population solution that coincides with model (1), but also retains the robustness of M-estimation [16]. Our main contributions are as follows:

• We propose a loss function for classification problems in which abnormal observations occur, and analyze its robustness by comparison with existing losses.
• Under the assumption that the model is additive, we derive two iterative procedures, which yield two boosting algorithms for classification tasks.
• Simulation experiments and real-data experiments are designed to verify the effectiveness of the algorithms. Through simulations, we find that the proposed algorithms behave more calmly in the face of outliers.

Model construction
We consider the following stochastic optimization problem:

min_F J(F) := E[ log(1 + e^{−yF(x)}) ],    (2)

where F : ℝ^p → ℝ is a prediction function, J(F) stands for the loss of F, y ∈ {−1, 1} denotes the response variable, x ∈ X ⊂ ℝ^p denotes the predictors (or features), and the expectation E[·] is over the joint distribution of (y, x). The above minimization is commonly encountered in statistical learning. When F(x) is assumed to be a linear combination of x, say F(x) = β^T x, the problem becomes logistic regression. In this paper, however, we make no assumption about the form of F(x) except additivity, that is, F(x) = Σ_m f_m(x). Generalized additive models (GAMs) of this kind are a very popular tool for function estimation [17-19].
With the assumption of additivity, estimating F(x) is equivalent to estimating the components f_m(x). Before giving the estimation of F(x), we explain what F(x) stands for, that is, the significance of solving the optimization problem (2). We have Lemma 1.

Lemma 1
Let F_pop(x) denote the true minimizer of the population risk in (2). Then

P(y = 1|x) = e^{F_pop(x)} / (1 + e^{F_pop(x)}).    (3)

From Lemma 1, if we have an estimate of F_pop(x), we can calculate P(y = 1|x) by (3). This gives the significance of solving the minimization (2): it provides a method for classification problems. (i) First, obtain an estimate of F_pop(x), denoted F̂(x); (ii) then calculate P(y = 1|x) by (3). In two-class problems, only sign(F(x)) is needed, since P(y = 1|x) > 0.5 is equivalent to F_pop(x) > 0. This will be used for the prediction of y|x in the following algorithms.
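To make the use of Lemma 1 concrete, the following minimal sketch (our own illustration, not part of the original algorithms) converts a fitted score F̂(x) into the probability (3) and into a two-class prediction via sign(F(x)).

```python
import numpy as np

def prob_from_score(F):
    """Convert a score F(x) into P(y = 1 | x) via the inverse logit of (3)."""
    # Numerically stable sigmoid: e^F / (1 + e^F) = 1 / (1 + e^{-F})
    return 1.0 / (1.0 + np.exp(-np.asarray(F, dtype=float)))

def predict_label(F):
    """Two-class prediction: P(y = 1 | x) > 0.5 is equivalent to F(x) > 0."""
    return np.where(np.asarray(F) > 0, 1, -1)

# Example: scores from some fitted additive model
F_hat = np.array([-2.0, -0.1, 0.3, 4.0])
print(prob_from_score(F_hat))   # [0.119 0.475 0.574 0.982]
print(predict_label(F_hat))     # [-1 -1  1  1]
```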
The exact minimization of the stochastic optimization problem (2) requires knowledge of the underlying distribution of (y, x). In practice, however, the joint distribution is not available; therefore, after observing n independent data points {y_i, x_i}, one standard approach is to minimize the following surrogate of (2), often referred to as the empirical risk,

Ĵ(F) = (1/n) Σ_{i=1}^{n} log(1 + e^{−y_i F(x_i)}).

In generalized linear models, F(x) is assumed to be a linear combination of x. In practice, the true minimizer F_pop(x) is almost never linear, and any parameterized form of F(x) greatly reduces the search space for F(x). Therefore, in this paper, a boosting method is chosen to fit the additive model. Boosting, originally proposed by [20] and [21], aims to generate a strong composite learner from a given class of weak learners. Many boosting algorithms can be explained in terms of gradient descent on different loss functions [22]. Since then, more and more boosting algorithms have been devised for different losses, including LS_Boost for least squares, LAD_TreeBoost for least absolute deviation, M_TreeBoost for the Huber loss, and L_K_TreeBoost for the binomial log-likelihood. Recently, Lev V. designed a software reliability boosting method to avoid overfitting [23]. By comparison, most boosting algorithms are designed for regression rather than classification, since their optimization objectives are not suitable for classification tasks. Therefore, the goal of this paper is to design a robust function estimation algorithm for classification tasks in which abnormal observations occur.

In this paper, only additivity of F(x) is required; we make no assumption about the distribution of (y, x). Additive models have a long history in statistics, so we give only the two forms we focus on. The discrete form is

F(x) = Σ_{m=1}^{M} c_m f_m(x),  with f_m(x) ∈ {−1, 1} and c_m ∈ ℝ,

and the real form is

F(x) = Σ_{m=1}^{M} f_m(x),  with f_m(x) ∈ ℝ.

The difference between the discrete and real additive models lies in the codomain of the components f_m. The backfitting algorithm [24, 25] is a convenient modular "Gauss-Seidel" algorithm for fitting additive models. For our minimization problem, the backfitting update has the form

f_m(x) ← argmin_f Ĵ( Σ_{k≠m} f_k(x) + f(x) ),  m = 1, ..., M,    (8)

where, on the right-hand side, the latest versions of all the other components f_k are used in forming the minimization problem. Any method or algorithm for estimating a function of x can be used to obtain an estimate of the minimizer in (8), including nonparametric methods such as local regression, smoothing splines or tree regression. The backfitting cycles are repeated until convergence. Alternatively, one can use a "greedy" forward stepwise approach, in which

f_m(x) ← argmin_f Ĵ( Σ_{k<m} f_k(x) + f(x) ),  m = 1, ..., M,

and previously fitted components are never revisited. This approach is used by [26] in matching pursuit. In this paper, we estimate F(x) by this forward stepwise approach.
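The following is a minimal sketch of the greedy forward stepwise idea applied to the empirical risk above, using small regression trees fitted to the negative gradient of the logistic loss. The function names and the use of sklearn's DecisionTreeRegressor are our own illustration, not the paper's Algorithm 1 or 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logistic_loss(y, F):
    """Empirical version of (2): mean of log(1 + exp(-y * F(x)))."""
    return np.mean(np.logaddexp(0.0, -y * F))

def forward_stagewise(X, y, n_stages=100, max_depth=1):
    """Greedy forward stepwise fit of an additive model F(x) = sum_m f_m(x).

    Each stage fits a small regression tree to the negative gradient of the
    logistic loss and keeps all previously fitted components fixed.
    """
    F = np.zeros(len(y), dtype=float)
    learners, losses = [], []
    for _ in range(n_stages):
        # negative gradient of log(1 + e^{-yF}) with respect to F is y / (1 + e^{yF})
        residual = y / (1.0 + np.exp(y * F))
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += tree.predict(X)              # earlier components f_k are never revisited
        learners.append(tree)
        losses.append(logistic_loss(y, F))
    return learners, F, losses
```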

Comparison and robust discussion
In this subsection, we give the reason for our suggestion, namely that the negative log-likelihood criterion (2) is suitable for classification problems in which long-tailed error outliers occur. Many generalizations of boosting algorithms have been proposed. Friedman (2000) showed that AdaBoost can be derived as a method for fitting an additive model ∑_m f_m(x) in a forward stagewise manner [2]. Friedman (2001) derived the Gradient Boosting Machine, a general framework for function estimation [22]. As applications of gradient boosting, least squares regression (10), M-estimation (11) and the negative binomial log-likelihood (12) were discussed there. For some δ > 0, these losses are

L(y, F) = (y − F)^2,    (10)
L(y, F) = (y − F)^2 / 2 if |y − F| ≤ δ, and δ(|y − F| − δ/2) otherwise,    (11)
L(y, F) = log(1 + e^{−2yF}).    (12)

In Fig. 1, the four kinds of loss are plotted, where y ∈ {−1, 1} and yF(x) measures the agreement between y and F(x). We can see that (12) is equivalent to (2) up to a factor of 2. In [22], only Newton's method was used for solving (12). Although Newton's method achieves fast convergence in theory, it easily causes instability in the computation, especially when a decision tree is used as the base learner, and the robustness of these losses was not discussed. For binary classification, y takes values in {−1, 1} and we take the prediction rule to be sign(F(x)). For a given F(x), yF(x) > 0 indicates a correct prediction; the larger yF(x), the better the estimate F(x), and vice versa. We now discuss each loss in turn.

(a) The least squares loss is often used for regression problems, but it is no longer suitable for classification, because it imposes a growing penalty on samples that are classified "too correctly": as yF(x) → ∞, the least squares loss grows. M-regression techniques attempt resistance to long-tailed error distributions and outliers while maintaining high efficiency for normally distributed errors. M-regression also suffers from the problem of penalizing overly correct samples, but its advantage is that the loss of misclassified samples increases only linearly, which makes it robust when dealing with long-tailed error outliers.
(b) M-regression uses the absolute loss instead of the squared loss, giving the same weight to observations with |y − F| > δ while taking the squared loss when |y − F| ≤ δ. This is the main reason for the success of the Huber loss. For classification problems, however, less penalty is needed when yF > 0, since yF > 0 stands for a correct prediction. With y taking values in {1, −1}, there is nothing special about the case y = 1 and F > 0: yF > 0 always stands for a correct prediction. But for large F we have |y − F| > δ, so the Huber loss still penalizes those observations, which is not reasonable; observations with yF > 0 should be penalized less. In our suggestion, the penalty log(1 + e^{−yF(x)}) tends to 0 as yF(x) → ∞. Although the Huber loss is not suitable for classification problems, its success is instructive, in particular the idea of giving the same penalty growth to all observations with |y − F| > δ. In our suggestion, the penalty grows only linearly as yF(x) → −∞, so the weight given to a wrongly predicted sample, 1/(1 + e^{yF(x)}), is never larger than 1. Wrong predictions, which include noise, therefore never receive a weight larger than 1. This mechanism ensures that we avoid putting too much emphasis on noise, and this property inherits the success of the Huber loss.
(c) The exponential criterion appeared in [27, 28]. Friedman (2000) showed that the AdaBoost algorithms of [29] can be interpreted as stagewise estimation procedures for fitting an additive logistic regression model [2]. The exponential criterion avoids imposing a greater penalty on the too-correct samples, but its punishment of errors increases exponentially, which makes it non-robust when dealing with long-tailed error outliers. In this paper, the log-likelihood loss (2) is suggested and discussed in detail. This loss also avoids imposing a greater penalty on the too-correct samples, and it inherits the advantage of the Huber loss: when yF(x) → −∞, which stands for a wrong prediction, the loss gradually becomes linear, log(1 + e^{yF(x)}) − yF(x) → −yF(x), which makes it somewhat robust when dealing with long-tailed error outliers.

(Fig. 1: Least squares loss (10), Huber loss (11) with δ = 1, exponential loss e^{−yF}, and the log-likelihood loss (2), plotted against the margin yF(x); shifted vertically, all curves pass through the point (0, 1).)
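A quick numerical comparison (our own illustration) of the least squares, exponential and log-likelihood criteria as functions of the margin yF(x) makes the robustness argument above concrete: on badly mispredicted points the exponential loss explodes, while the log-likelihood loss grows only linearly.

```python
import numpy as np

margins = np.array([-10.0, -5.0, -2.0, 0.0, 2.0, 5.0, 10.0])   # yF(x)

least_squares = (1.0 - margins) ** 2          # (y - F)^2 written via the margin, since y^2 = 1
exponential   = np.exp(-margins)              # e^{-yF}
log_like      = np.logaddexp(0.0, -margins)   # log(1 + e^{-yF}), criterion (2)

for m, ls, ex, ll in zip(margins, least_squares, exponential, log_like):
    print(f"yF={m:6.1f}  LS={ls:9.1f}  Exp={ex:12.3f}  LogLik={ll:8.3f}")
# For yF = -10 the exponential loss is ~22026 while log(1 + e^{10}) is ~10.00005,
# i.e. the log-likelihood loss grows only linearly on badly predicted points.
```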
In the following sections, we derive the estimation algorithms for F(x). As F(x) is additive, we update it in one of the following two ways:

F(x) ← F(x) + c f(x), with f(x) ∈ {−1, 1} and c ∈ ℝ,   or   F(x) ← F(x) + f(x), with f(x) ∈ ℝ.
The two update rules correspond to two different algorithms, namely Discrete L-AdaBoost and Real L-AdaBoost, where "L" stands for the logit transformation.

Discrete L-AdaBoost
Suppose we have a current estimate F(x) and seek an improved estimate F(x) + cf(x), with f(x) ∈ {−1, 1}, by minimizing (2).

Proposition 2 The Discrete L-AdaBoost algorithm (population version) builds an additive logistic regression model via gradient-descent updates that minimize the negative log-likelihood criterion (2).
Proof Expanding J(F + cf) to second order in c about c = 0 gives

J(F + cf) ≈ E[ log(1 + e^{−yF(x)}) ] − c E[ yf(x) / (1 + e^{yF(x)}) ] + (c²/2) E[ e^{yF(x)} / (1 + e^{yF(x)})² ],

since y² = 1 and f²(x) = 1. For fixed c > 0, seeking f(x) to minimize J(F + cf) is equivalent to maximizing the weighted expectation E[w(x, y) yf(x)] with w(x, y) = 1/(1 + e^{yF(x)}). As yf(x) = −(1/2)(y − f(x))² + 1, the above minimization problem turns into the weighted least-squares problem of minimizing E_w[(y − f(x))²], whose minimizer over f(x) ∈ {−1, 1} is f(x) = sign(E_w[y | x]). Given f, taking the derivative of J(F + cf) with respect to c, the minimizer c satisfies

E[ yf(x) / (1 + e^{y(F(x) + c f(x))}) ] = 0.

After observing n data points, c can be estimated as the root of the equation

Σ_{i=1}^{n} y_i f(x_i) / (1 + e^{y_i (F(x_i) + c f(x_i))}) = 0,

and Newton's root-finding method can be used for seeking the root. ◻

The gradient descent method is a very popular function estimation method [30, 31]. In our derivation, the second-order Taylor expansion of the loss function is used; this is essentially gradient descent, but with the step size determined by optimizing the objective function again. Summarizing the estimation process of F(x) from Proposition 2, we obtain Algorithm 1 for classification problems.
It should be noted how to estimate E_w[y | x], the expectation of y under the reweighted (refined) distribution. In practice, any method or algorithm for estimating a function of x can be used to obtain an estimate of the weighted conditional expectation in Eq. (14). Here we choose a tree-based classifier as the estimate of f(x), since trees are widely used as base functions [27, 29, 32].
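The following Python sketch follows our reconstruction of Proposition 2: reweight with w_i ∝ 1/(1 + e^{y_i F(x_i)}), fit a {−1, 1}-valued tree, and find c_m by safeguarded Newton root-finding. It is an illustrative reading of Algorithm 1, not the authors' reference implementation; the clipping bound and the sklearn calls are our assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_discrete_ladaboost(X, y, n_rounds=100, max_depth=1):
    """Minimal sketch of Discrete L-AdaBoost as reconstructed from Proposition 2.

    y must take values in {-1, +1}.  Each round: (i) reweight the sample with
    w_i proportional to 1 / (1 + exp(y_i F(x_i))), (ii) fit a {-1,+1}-valued
    tree to the weighted sample, (iii) find the multiplier c_m by a few
    safeguarded Newton steps on the one-dimensional score equation.
    """
    n = len(y)
    F = np.zeros(n)
    ensemble = []
    for _ in range(n_rounds):
        w = 1.0 / (1.0 + np.exp(y * F))        # -L'(yF) for L(u) = log(1 + e^{-u})
        w /= w.sum()
        stump = DecisionTreeClassifier(max_depth=max_depth)
        stump.fit(X, y, sample_weight=w)
        f = stump.predict(X).astype(float)      # f(x) in {-1, +1}

        c = 0.0
        for _ in range(10):                     # Newton root-finding for c
            u = y * (F + c * f)
            g = np.sum(y * f / (1.0 + np.exp(u)))            # score in c
            h = -np.sum(np.exp(u) / (1.0 + np.exp(u)) ** 2)  # its derivative (f^2 = 1)
            if abs(h) < 1e-12:
                break
            c = float(np.clip(c - g / h, -4.0, 4.0))         # safeguard the step
        F += c * f
        ensemble.append((c, stump))
    return ensemble

def predict(ensemble, X):
    F = sum(c * stump.predict(X) for c, stump in ensemble)
    return np.where(F > 0, 1, -1)
```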

Given a loss function J(F) := E[L(yF)], Newton's method updates F(x) by

F(x) ← F(x) − E[ y L′(yF(x)) | x ] / E[ L″(yF(x)) | x ],

where L′ and L″ represent the first and second derivatives of L, respectively. In our case, a decision tree is used to estimate the ratio of weighted conditional expectations on the right-hand side. For strictly convex optimization problems, this weighted expectation causes little trouble. But for the loss function in this paper, the second derivative tends to 0 as |F(x)| increases; in M-regression, in particular, the second derivative is exactly 0 when |y − F| > δ. This greatly limits the scope of application of Newton's method. In fact, we find that although the second derivative tends to zero at some observations, its expectation E[L″(yF)] remains well above zero. From the point of view of optimization, the expectation of the second derivative serves as a step size controlling the magnitude of the gradient. Therefore, the update can be seen as

F(x) ← F(x) − (1 / E[L″(yF)]) · E[ y L′(yF(x)) | x ],

so the decision tree trained on the negative gradient is given the multiplier 1 / E[L″(yF)]. In this way, we not only take advantage of the fast convergence of Newton's method, but also avoid the instability caused by a second derivative that is too small.
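A tiny numerical illustration (ours, with arbitrary margins) of why dividing by the averaged second derivative is preferable to the pointwise Newton weights:

```python
import numpy as np

# Pointwise Newton steps explode when L''(yF) -> 0 for large |F| ...
u = np.array([0.1, 3.0, 8.0, 15.0])            # y_i * F(x_i)
hess = np.exp(u) / (1.0 + np.exp(u)) ** 2      # L''(yF): 0.2494, 0.0452, 3.35e-4, 3.1e-7
print(1.0 / hess)                              # pointwise steps reach ~3.3e6, unstable
print(1.0 / hess.mean())                       # averaged step ~13.6, the stabilised multiplier
```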

Real L-AdaBoost
Suppose we have a current estimate F(x) and seek an improved estimate of the form F(x) + f(x), with f(x) ∈ ℝ, by minimizing (2).

Proposition 3
The Real L-AdaBoost algorithm fits an additive logistic regression model via Newton-like updates and approximate optimization of (2).
Hence the Newton update is

f(x) = E[ y / (1 + e^{yF(x)}) | x ] / E[ e^{yF(x)} / (1 + e^{yF(x)})² | x ],

a weighted conditional expectation of y under a refined distribution. ◻

In Algorithm 2, we take adaptive Newton steps to update F(x). The main difference from the Discrete L-AdaBoost algorithm lies in how the estimates of the weighted expectations are used to update the functions. A more detailed analysis shows that the multipliers are updated differently: in Discrete L-AdaBoost the multiplier c_m is obtained by a separate one-dimensional root search at each step, whereas in Real L-AdaBoost the magnitude of the update is fixed by the Newton step itself.
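A minimal sketch of how we read Algorithm 2: each round fits a real-valued regression tree to the negative gradient of the loss and scales it by the reciprocal of the averaged second derivative, with no separate line search for c_m. The helper names and sklearn calls are our assumptions, not the paper's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_real_ladaboost(X, y, n_rounds=100, max_depth=1):
    """Minimal sketch of Real L-AdaBoost as we read it (y in {-1, +1}).

    Each round performs an adaptive Newton-like step: a real-valued regression
    tree is fitted to the negative gradient of the log-likelihood loss and is
    scaled by the reciprocal of the averaged second derivative, then added to F.
    """
    n = len(y)
    F = np.zeros(n)
    trees, steps = [], []
    for _ in range(n_rounds):
        u = y * F
        neg_grad = y / (1.0 + np.exp(u))                     # -d/dF log(1 + e^{-yF})
        hess_bar = np.mean(np.exp(u) / (1.0 + np.exp(u)) ** 2)
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, neg_grad)
        step = 1.0 / max(hess_bar, 1e-8)                     # averaged-Hessian multiplier
        F += step * tree.predict(X)
        trees.append(tree)
        steps.append(step)
    return trees, steps
```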

Multiclass case
For the multiclass case, several generalizations of AdaBoost are presented in [27], among which AdaBoost.MH seems to dominate the others in empirical studies. Therefore, we use the AdaBoost.MH algorithm (Algorithm 3) here to handle multiclass problems. The key to the algorithm is converting the J-class problem into fitting a two-class classifier J times. Its justification is that the algorithm minimizes the loss function E[ Σ_{j=1}^{J} l(y_j, F_j(x)) ], which is equivalent to minimizing each loss E[ l(y_j, F_j(x)) ] separately, where y_j ∈ {−1, 1} indicates membership in class j. A minimal sketch of this reduction is given below.
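The following sketch (ours) illustrates the AdaBoost.MH style reduction; `fit_and_score` stands for any of the two-class fitters (Algorithm 1 or 2) and is hypothetical here.

```python
import numpy as np

def to_mh_labels(y, classes):
    """AdaBoost.MH style reduction: a J-class problem becomes J two-class problems.

    For each class j, the binary response is +1 if the instance belongs to j and
    -1 otherwise; a separate score F_j(x) is then fitted for every column.
    """
    return np.stack([np.where(y == c, 1, -1) for c in classes], axis=1)

def mh_predict(F_columns):
    """Predict the class whose column score F_j(x) is largest."""
    return np.argmax(F_columns, axis=1)

# toy usage with a hypothetical binary fitter `fit_and_score` (e.g. Algorithm 1 or 2):
# Y = to_mh_labels(y, classes=np.unique(y))                        # shape (n, J)
# F = np.column_stack([fit_and_score(X, Y[:, j]) for j in range(Y.shape[1])])
# y_hat = mh_predict(F)
```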

Simulation studies
We call a sample separable if there exists a criterion that can separate the categories, for example the rule Y = 2·1_{x_1>0} − 1 used in Example 1 below. In a separable sample, outliers may come from measuring tools or manual recording errors. A stable classification algorithm should find the true criterion, or one close to the truth, under the interference of outliers. Another source of outliers is inseparable samples. For example, let P(y = 1|x) = x and P(y = −1|x) = 1 − x with x ∈ (0, 1). Here, no criterion can separate y = 1 from y = −1. By maximizing the likelihood, we can still find a proper classification boundary, namely y = 2·1_{x>0.5} − 1. In this situation, outliers come from the indistinguishability of the categories. Under this second scenario, the distribution of the sample is critical for finding a stable boundary, which should be close to the maximum-likelihood boundary.
In this section, we give several examples, in which outliers occur in the class labels, to show the stability of Discrete L-AdaBoost and Real L-AdaBoost.
Example 1 (Binary classification) Select 2000 points from two Gaussian distributions, 1000 from each. Assign the points with x_1 > 0 to the first category, denoted Y = 1, and the points with x_1 ≤ 0 to the second category, denoted Y = −1. Then randomly select a certain proportion (the noise level) of instances from the second class and change their class labels to 1. By setting the noise level to 0.02, 0.1 and 0.2, we obtain three samples. See Fig. 2.
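A sketch of how such a contaminated sample can be generated; the Gaussian centres below are our assumption, since Example 1 does not state them.

```python
import numpy as np

def make_example1(noise=0.1, seed=0):
    """Generate a contaminated sample in the spirit of Example 1.

    The two Gaussian centres (+-1.5, 0) are assumptions of ours; labels are
    Y = 1 if x1 > 0 and Y = -1 otherwise, and a `noise` proportion of the
    second class has its label flipped to 1.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack([rng.normal([ 1.5, 0.0], 1.0, size=(1000, 2)),
                   rng.normal([-1.5, 0.0], 1.0, size=(1000, 2))])
    y = np.where(X[:, 0] > 0, 1, -1)
    neg = np.flatnonzero(y == -1)
    flip = rng.choice(neg, size=int(noise * len(neg)), replace=False)
    y[flip] = 1                                   # artificially added outliers
    return X, y

X, y = make_example1(noise=0.1)
```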
Example 2 (Three-class classification) Let (x_1, x_2) be scattered evenly in the unit circle, that is, x_1 = r cos(θ), x_2 = r sin(θ), with r ∈ (0, 1) and θ ∈ (0, 2π]. Suppose we have 15000 observations. Separate all points into three categories by r ∈ (0, 1/3], r ∈ (1/3, 2/3] and r ∈ (2/3, 1], and let G = 0, 1, 2 denote the categories. Select a proportion of points from the groups G = 0, 2 and assign them to group G = 1. We then obtain a contaminated dataset, shown in the first column of Fig. 3.

In Examples 1 and 3, the max_depth of the base learner, a CART tree, is set to max_depth = 1; in Example 2, max_depth is 2. For all classifiers, the number of iterations is 1000. To test immunity to outliers, we gradually increase the proportion of abnormalities. As the noise level increases, we find that most boosting algorithms are affected. We choose the XGBoost [33] algorithm as a typical comparison because of its popularity. Boosting algorithms are greedy, as they assume the residuals remain learnable [34]. This can be seen in Examples 1 and 2, where XGBoost tries to learn the noise; our recommended algorithms behave more calmly in the face of this kind of outlier. Example 3 shows the performance of each algorithm in multiclass classification. DLAB and RLAB show strong resistance to perturbation while accurately learning the classification boundaries, whereas XGBoost fails to balance learning ability against anti-interference: when faced with one-sided intruding samples, XGBoost inevitably moves its classification boundary closer to the noise.

Real-world dataset experiments

35 real-world datasets from UCI [35] are used to test the effectiveness of the algorithms. Table 1 details the names and characteristics of the datasets, which include 23 binary classification tasks and 12 multi-class classification tasks. For samples with a very imbalanced number of labels, we remove labels that account for less than 10% of the instances, e.g., in arrhythmia, band, contraceptive, heart, and lymphography. In the pre-processing, 5-fold cross-validation is used to generate training and testing datasets, and the average of the testing scores is used as the final score. When improving boosting's robustness to noise, two types of noise exist: abnormal feature values and abnormal class labels. In our experiments, abnormal class labels are artificially added to the training dataset. Specifically, random perturbations are used to contaminate the training data before it is fed to the algorithms: the instances in the training dataset are grouped by label, a fixed proportion of instances is drawn from each group, and their class labels are randomly switched (as sketched below). This fixed ratio is called the noise level; by changing it, the anti-interference ability of the algorithms is tested. For factor variables, e.g., color or shape, one-hot encoding is used so that the data are presented in numerical form, enabling most algorithms to be applied. All of our code is available at https://github.com/WChao1988/DLAB.
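The label-noise injection just described can be sketched as follows (our reading of the procedure; the helper name is ours).

```python
import numpy as np

def add_label_noise(y, noise=0.1, seed=0):
    """Contaminate training labels: within every label group, a fixed proportion
    of instances is drawn and their class label is switched to a randomly chosen
    different class.  Sketch of our understanding of the experimental setup.
    """
    rng = np.random.default_rng(seed)
    y_noisy = np.array(y, copy=True)
    labels = np.unique(y)
    for lab in labels:
        idx = np.flatnonzero(y == lab)
        picked = rng.choice(idx, size=int(noise * len(idx)), replace=False)
        others = labels[labels != lab]
        y_noisy[picked] = rng.choice(others, size=len(picked))
    return y_noisy
```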

Methods for comparisons
• LogitBoost was first proposed by Friedman et al. [2]. By direct optimization of the binomial log-likelihood from a statistical point of view, they calculate working responses and weights

z_i = (y_i* − p(x_i)) / (p(x_i)(1 − p(x_i))),  w_i = p(x_i)(1 − p(x_i)),

where y_i* ∈ {0, 1} is the 0/1 coding of y_i and p(x_i) = P(y = 1|x_i). A function f_m(x) is then fitted by a weighted least-squares regression of z_i on x_i with weights w_i.
• XGBoost is a highly effective and widely used boosting algorithm proposed by Chen [33]. It largely avoids overfitting through extensive row and column sampling.
• CBAdaBoost was first proposed by Dang et al. [6]. They argue that the labels may not be true when they are calibrated, and therefore propose a conditional-risk loss R_π(f), where π_i stands for the probability that instance i is not noise, and treat the observed label y_i as a fuzzy label with correctness confidence π_i.
• NDAdaBoost is a noise-detection based AdaBoost algorithm proposed in [15]. Noise detection is performed before the weights are updated, and only points detected as non-noise are given larger weights. The k-NN model [36] is used for noise detection; the confusion of a point z_i is calculated by

knn_t(z_i) = (1/k) Σ_{z_j ∈ K_i^t} 1(h_t(x_j) ≠ y_j),

where K_i^t is the set of the k nearest neighbors z_j = (x_j, y_j) of z_i at the t-th iteration, 1(h_t(x_j) ≠ y_j) = 1 indicates that the predictor h_t(·) makes an incorrect judgement at point j (and 0 otherwise), and knn_t(z_i) estimates the probability that the instance z_i is a noise point. A sketch of this score is given after Table 1.
• SML-Boost introduces a soft-margin-like strategy in boosting [7]. This method selectively updates the sample weights by setting a threshold to find the points that are closer to the boundary.
• DLAB / RLAB denote Discrete L-AdaBoost (Algorithm 1) and Real L-AdaBoost (Algorithm 2) suggested in this paper.

Table 1 (excerpt) Characteristics of the UCI datasets (name, instances, features, classes)

Adult 32561 13 2
Tic-tac-toe 958 9 2
Advertisements 3279 1558 2
ThoraricSurgery 470 17 2
Arrhythmia 452 279 2
Qualitative_Bankruptcy 250 7 2
Bands 499 29 2
Diabetic 1151 20 2
Breast-cancer 569 32 2
Chronic_kidney 400 25 2
Contraceptive 1473 9 2
Abalone 4177 8 3
Credit 690 15 2
Accelerometer 153000 4 3
Ecoli 336 8 2
Connect-4 67557 42 3
Glass 214 10 2
Hayes-Roth 160 5 3
Haberman 306 3 2
Iris 150 4 …
…
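As an illustration of the k-NN confusion score used by NDAdaBoost above, a minimal sketch (our own, using sklearn's NearestNeighbors) is:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_confusion(X, y, h_pred, k=5):
    """Confusion score in the spirit of the equation above.

    knn_t(z_i) is the fraction of the k nearest neighbours of z_i that the current
    predictor h_t misclassifies; a large value flags z_i as a likely noise point.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: the query point itself comes back first
    _, idx = nn.kneighbors(X)
    wrong = (h_pred != y).astype(float)
    return wrong[idx[:, 1:]].mean(axis=1)
```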
All the algorithms use AdaBoost.MH when dealing with multiclass problems. We train all learners until the training error converges or the number of iterations exceeds 5000.

Evaluation metrics
• Accuracy stands for the average accuracy over 5-fold cross-validation, Accuracy = (1/N) #{ i : ŷ_i = y_i }.
• F1-rating stands for the harmonic mean of precision and recall, averaged over the categories,

F1 = (1/|label|) Σ_{j ∈ label} 2 P_j R_j / (P_j + R_j),  with P_j = #{ i : ŷ_i = y_i = j } / #{ i : ŷ_i = j } and R_j = #{ i : ŷ_i = y_i = j } / #{ i : y_i = j }.

Here, #{·} stands for the cardinality of a set, N for the sample size, label for the set of categories, ŷ_i for the prediction, and y_i for the true label.
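A small sketch of the two metrics as we read them (macro-averaged F1 over the label set); the helper names are ours.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correct predictions; the 5-fold averages of this are reported."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: harmonic mean of precision and recall per label, then averaged."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for lab in np.unique(y_true):
        tp = np.sum((y_pred == lab) & (y_true == lab))
        precision = tp / max(np.sum(y_pred == lab), 1)
        recall = tp / max(np.sum(y_true == lab), 1)
        scores.append(0.0 if precision + recall == 0 else
                      2 * precision * recall / (precision + recall))
    return float(np.mean(scores))
```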

Performance comparison
Tables 3, 4 and 5 show the prediction accuracy and F1-rating at noise levels 0%, 10% and 25% (bold indicates the highest rating and underline the lowest). NDAdaBoost and SML-Boost, the two noise-detection based algorithms, underperformed on most datasets but performed extremely well on a few; they perform well only when the noise fits their assumptions, which greatly limits the versatility of this noise-detection approach. We are not arguing that our model is absolutely dominant. In fact, XGBoost performs quite well when the noise level is low. However, as the noise increases, XGBoost greedily tries to learn from the noise, which makes the predictor no longer robust. Figure 5 presents the average rankings of the baseline algorithms on the 35 datasets. From the average ranking, it can be seen that LogitBoost and XGBoost perform excellently on noise-free datasets. As the noise increases, the accuracy of XGBoost drops sharply, while DLAB/RLAB show their robustness. CBAdaBoost and LogitBoost show some restraint in the face of noise, but they are inevitably affected, and their accuracy decreases with increasing noise. Despite good performance on some datasets, NDAdaBoost and SML-Boost rank low overall, which may stem from their strong assumptions about noise.
We conduct pairwise Wilcoxon tests between DLAB/RLAB and each of the other five baseline algorithms on the average F1-ratings over the 35 datasets. The test results in Table 2 show no significant difference between DLAB and RLAB. For noise-free datasets (0%), DLAB significantly outperforms the other baseline algorithms except XGBoost, while RLAB outperforms NDAdaBoost and SML-Boost but is inferior to XGBoost. For noise levels of 10% and 25%, RLAB/DLAB show strong resistance to disturbance, and the resistance of RLAB is better than that of DLAB. When the noise is 25%, DLAB is inferior to RLAB, probably because DLAB tries to find the optimal c_m and thereby learns some of the noise.
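The pairwise test can be reproduced with scipy's `wilcoxon`; the arrays below are placeholders, not the paper's actual F1-ratings.

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired Wilcoxon signed-rank test between DLAB and one baseline over the 35
# datasets; `f1_dlab` and `f1_baseline` are hypothetical per-dataset average
# F1-ratings at a fixed noise level.
f1_dlab = np.random.default_rng(0).uniform(0.6, 0.9, size=35)                 # placeholder values
f1_baseline = f1_dlab - np.random.default_rng(1).normal(0.02, 0.03, size=35)  # placeholder values

stat, p_value = wilcoxon(f1_dlab, f1_baseline)
print(f"Wilcoxon statistic = {stat:.1f}, p-value = {p_value:.4f}")
# A small p-value indicates a significant difference between the paired F1-ratings.
```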

Conclusion
This paper has presented two boosting algorithms, Discrete L-AdaBoost and Real L-AdaBoost, based on additive logistic models. The two algorithms generate a strong composite learner from weak learners, and decision trees were used as the weak learners in the experimental studies. By comparing different loss functions, we showed the suitability of the negative log-likelihood criterion for classification problems in which abnormal observations occur. Based on this loss function, we derived two stable boosting algorithms that fit the additive logistic model for classification tasks. Through simulations and real-world dataset experiments, we verified the effectiveness of the proposed algorithms against noise.