Prof. Carolina Ruiz, Advisor, WPI – Computer Science
Prof. Sergio A. Alvarez, Boston College, External committee member
Prof. Gabor N. Sarkozy, WPI – Computer Science
Prof. Xiangnan Kong, WPI – Computer Science
In this dissertation, we introduce a novel bootstrap aggregating, or bagging, framework for supervised classification. Our proposed framework, called ‘mixed bagging’, is a form of bagging in which the resampling process takes into account the classification hardness of the training instances. The classification hardness, or simply hardness, of an instance is defined as the probability that the instance will be misclassified by a classification model built from the remaining instances in the training set. We incorporate instance hardness into the bagging process by varying the resampling probability of each instance based on its estimated hardness. Bootstraps of differing hardness can be created in this way by over-representing, under-representing and equally representing hard instances. A diverse committee of classifiers is then induced from these bootstraps, and their individual classification outputs are aggregated to achieve a final output. We propose two versions of mixed bagging: grouped – where the bootstraps are grouped as easy, regular or hard, with all bootstraps in one group having the same hardness; and incremental – where the hardness of bootstraps changes gradually from one bootstrap to the next.
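As a rough illustration of the resampling idea described above, the following minimal Python sketch draws bootstraps whose composition depends on per-instance hardness estimates. It is not the dissertation's implementation. The exponential weighting, the `tilt` and `strength` parameters, and all function names are assumptions introduced here for illustration. The hardness values are taken as given, for example from a cross-validated estimate of each instance's misclassification probability.

```python
import math
import random


def resampling_weights(hardness, tilt):
    """Turn hardness estimates in [0, 1] into sampling weights.

    tilt > 0 over-represents hard instances, tilt < 0 under-represents
    them, and tilt == 0 gives the uniform weights of ordinary bagging.
    The exponential form is an illustrative assumption, not the
    weighting scheme used in the dissertation.
    """
    return [math.exp(tilt * h) for h in hardness]


def draw_bootstrap(hardness, tilt, rng):
    """Sample len(hardness) instance indices with replacement."""
    weights = resampling_weights(hardness, tilt)
    n = len(hardness)
    return rng.choices(range(n), weights=weights, k=n)


def grouped_tilts(n_easy, n_regular, n_hard, strength):
    """Grouped variant: every bootstrap in a group shares one tilt."""
    return [-strength] * n_easy + [0.0] * n_regular + [strength] * n_hard


def incremental_tilts(n_bootstraps, strength):
    """Incremental variant: tilt moves in equal steps from easy to hard."""
    if n_bootstraps == 1:
        return [0.0]
    step = 2.0 * strength / (n_bootstraps - 1)
    return [-strength + i * step for i in range(n_bootstraps)]


def majority_vote(predictions):
    """Aggregate binary (0/1) committee outputs; ties are resolved as 1."""
    return 1 if 2 * sum(predictions) >= len(predictions) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    hardness = [0.05, 0.10, 0.50, 0.90]  # hypothetical estimates
    for tilt in incremental_tilts(5, strength=3.0):
        print(tilt, draw_bootstrap(hardness, tilt, rng))
```

In a full pipeline, one base classifier would be trained on each bootstrap, and `majority_vote` would combine their predictions for a test instance. A single tilt parameter is used here so that the grouped and incremental variants differ only in how the sequence of tilts is generated.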
We test our framework on 47 publicly available binary classification problems using three different classification algorithms as base learners. Our results show that the proposed mixed bagging methods outperform traditional bagging and weighted bagging (wagging) on a large number of these datasets. We conduct an in-depth analysis of the results for different hyperparameter settings of the mixed bagging framework, and present our findings on the conditions under which mixed bagging is expected to perform well. We further examine mixed bagging through the lens of some standard techniques used to characterize ensemble learning methods, including bias-variance decomposition, margin distribution and formation of decision boundaries. The results on the theoretical properties of mixed bagging together with the experimental results on its classification performance demonstrate that mixed bagging is a valuable addition to the field of ensemble learning.