How to Apply Boosting When the Data Labels Are Noisy and Uncertain?
LocalBoost - Local Boosting for Weakly-Supervised Learning
I blog about the latest machine learning research that has an immediate impact on the everyday work of data scientists and machine learning engineers. Share the newsletter with your friends so that we all grow together.
Introduction
Boosting uses an ensemble of weak learners to enhance model performance. A large number of boosting algorithms exist, from AdaBoost to the ever-popular XGBoost. But these algorithms share a handicap: boosting is typically designed for settings where the data labels are accurate [1].
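To ground the discussion, here is what that classical, fully supervised setting looks like in practice: a minimal scikit-learn AdaBoost example on synthetic data with clean labels, the regime these algorithms were designed for.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification task with *clean* labels -- the
# setting classical boosting assumes.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner focuses on the examples the previous ones got wrong.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```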
In real-world applications, it is increasingly clear that the processes used to label data often do not produce accurate, clean labels. Designing ensembles on noisy, weak labels then becomes a non-trivial task. This is where weakly supervised learning (WSL) has started gaining attention: "WSL leverages weak supervision signals to generate a large amount of weakly labeled data, which is easier to obtain than complete annotations" [1]. Even so, state-of-the-art WSL methods still underperform fully supervised methods by an average of 18.84%, measured by accuracy or F1 score [1].
Challenges with Applying Boosting to Weak and Noisy Labels
As shown in Figure 1, if we apply supervised boosting methods to a weakly labeled dataset, the weight assigned to the initial base model is so large that it dominates the ensemble prediction. The authors of LocalBoost [1] call this phenomenon "weight domination".
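To see the mechanics behind weight domination, recall the standard AdaBoost model weight α = 0.5 · ln((1 − ε)/ε): a first base model that (over)fits the weak labels almost perfectly gets a near-zero estimated error ε on them, and hence an outsized α. A quick numeric illustration (this is the generic AdaBoost formula, not LocalBoost-specific code):

```python
import numpy as np

def adaboost_alpha(err):
    """Standard AdaBoost model weight: alpha = 0.5 * ln((1 - err) / err)."""
    return 0.5 * np.log((1.0 - err) / err)

# A base learner that fits the (possibly wrong) weak labels almost
# perfectly gets a tiny training error and therefore a dominant weight.
for err in [0.01, 0.05, 0.20, 0.45]:
    print(f"error = {err:.2f} -> alpha = {adaboost_alpha(err):.2f}")

# error = 0.01 -> alpha = 2.30
# error = 0.05 -> alpha = 1.47
# error = 0.20 -> alpha = 0.69
# error = 0.45 -> alpha = 0.10
```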
A key challenge in adapting boosting to the weakly supervised learning (WSL) setting is accurately computing the importance of each example in the weakly labeled training data for each base learner. When the labels are noisy and weak, however, error instances cannot be identified reliably, so the authors [1] argue that we need to shift our focus from the label space to the training data itself.
A Deeper Look at LocalBoost
In WSL, most of the data are labeled by weak sources, and only a limited number of data points have accurate labels. For this setting, the authors [1] introduce LocalBoost, a novel iterative and adaptive boosting framework for WSL.
In Figure 2, we observe that the clean dataset D_c is used to identify anchor instances s_1, s_2, ..., s_k. These instances are then used to identify clusters in the weakly labeled, noisy dataset D_l, and a base learner is trained on each cluster. Importantly, the clean dataset influences the clustering but plays no direct role in training the model.
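The text above describes the data flow but not the exact clustering procedure, so the following is a minimal sketch under assumptions of mine: the anchors s_1, ..., s_k are sampled from D_c, and each weakly labeled example is assigned to its nearest anchor. The helper `cluster_weak_data` is hypothetical, not the paper's implementation.

```python
import numpy as np

def cluster_weak_data(X_clean, X_weak, k, seed=0):
    """Sketch of the locality step: pick k anchor instances from the
    clean set D_c, then assign every weakly labeled example in D_l to
    its nearest anchor. Each resulting cluster is a local region on
    which one base learner is trained.

    Anchor selection here is random sampling; the paper's actual
    selection strategy may differ -- this only illustrates the data flow.
    """
    rng = np.random.default_rng(seed)
    anchors = X_clean[rng.choice(len(X_clean), size=k, replace=False)]

    # Distance from every weak example to every anchor: shape (n_weak, k).
    dists = np.linalg.norm(X_weak[:, None, :] - anchors[None, :, :], axis=-1)
    assignment = dists.argmin(axis=1)

    # Note: D_c only shapes the clusters; base learners train on D_l alone.
    return [X_weak[assignment == j] for j in range(k)]
```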
Next, as illustrated in Figure 3, the authors [1] propose an estimate-then-modify paradigm for computing the model weights: the large number of weak labels is leveraged to estimate each weight, and the limited clean labels are then used to rectify that estimate.
In particular, the authors [1] first follow the AdaBoost procedure to estimate the weight α_{t,l} on the weakly labeled dataset D_l. Using the noisy weak labels alone can hardly guarantee boosting progress, because the error-rate computation involves the weak labels, which are unreliable and can skew the weight estimate. To address this, the estimated weight is further calibrated on the small clean dataset D_c with a perturbation-based approach: Gaussian perturbations are added to the weight, and the perturbation that yields the lowest validation error on D_c is chosen.
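Putting the two steps together, here is a minimal sketch of estimate-then-modify, assuming labels in {−1, +1} and a simple Gaussian candidate search; the function name, the number of trials, and the perturbation scale sigma are illustrative choices of mine, not values from the paper.

```python
import numpy as np

def estimate_then_modify_alpha(pred_weak, y_weak, w_weak,
                               ens_score_clean, pred_clean, y_clean,
                               n_trials=50, sigma=0.5, seed=0):
    """Sketch of the estimate-then-modify weight computation from [1].

    1) Estimate: standard AdaBoost weight from the weighted error of the
       current base learner on the weak labels (labels in {-1, +1}).
    2) Modify: add Gaussian perturbations to the estimate and keep the
       candidate whose updated ensemble has the lowest error on the
       small clean set D_c.
    """
    rng = np.random.default_rng(seed)

    # Step 1: AdaBoost-style estimate on the weakly labeled data D_l.
    err = np.sum(w_weak * (pred_weak != y_weak)) / np.sum(w_weak)
    err = np.clip(err, 1e-6, 1 - 1e-6)  # avoid division by zero
    alpha_est = 0.5 * np.log((1 - err) / err)

    # Step 2: perturb and pick the candidate minimizing clean-set error.
    candidates = alpha_est + rng.normal(0.0, sigma, size=n_trials)
    best_alpha, best_err = alpha_est, np.inf
    for a in np.concatenate(([alpha_est], candidates)):
        score = ens_score_clean + a * pred_clean  # updated ensemble margin
        clean_err = np.mean(np.sign(score) != y_clean)
        if clean_err < best_err:
            best_alpha, best_err = a, clean_err
    return best_alpha
```

Including the unperturbed estimate in the candidate set ensures the calibration never does worse on D_c than the raw AdaBoost estimate.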
Results
Table 1 below summarizes the results reported for LocalBoost [1].