Abstract:
Class imbalance occurs frequently in classification tasks. Learning from imbalanced
data is challenging and has motivated a substantial body of research. Various
techniques have been developed over the years to tackle this problem.
These approaches fall broadly into two categories: data-level modification
and algorithm-level modification. In data-level modification, the original class
distribution of the data is altered through resampling techniques. In algorithm-level
modification, traditional classification algorithms are adapted to imbalanced
scenarios by changing the cost function to make them cost-sensitive (CS).
Many data resampling and CS techniques have been proposed by researchers
in the past decade. To understand their strengths and weaknesses, a comprehensive
experimental analysis of these techniques is first conducted. The analysis identifies
several limitations that restrict the performance of these approaches.
Most of these techniques do not take into account the intrinsic characteristics of
the data that complicate the learning process. Previous studies have identified
several data difficulty factors, but these factors are rarely addressed. Moreover,
many of these techniques overfit the data, which reduces generalization
and leads to poor performance at test time. They are also unable to
provide well-generalized performance across a wide range of imbalanced scenarios.
In this study, novel strategies have been developed to address these issues. Solutions
have been proposed to limit the effects of different data difficulty factors and
enhance prediction performance. Attempts have also been made to overcome
the shortcomings of the established approaches and obtain better generalization. Three
methods have been proposed in this study. The first is a novel data resampling technique
that takes intrinsic data characteristics into account to balance the dataset
effectively. The second is an instance complexity-based CS technique, an advanced
modification of the original CS approach. The third is a hybrid framework that combines
resampling and cost-sensitive learning (CSL).
Rigorous experiments have been conducted on a wide range of imbalanced datasets
to validate the performance of the proposed approaches. The results have been evaluated
using eight performance measures and compared with other state-of-the-art
techniques used in imbalanced learning. Superior results have been obtained from the
proposed techniques on different imbalanced scenarios. The results demonstrate the
efficacy of the proposed models in learning from imbalanced data.
To conclude, this research opens new directions in imbalanced learning. The proposed
approaches introduce fresh perspectives on the problem, and the proposed strategies prove
successful, ensuring well-generalized performance when addressing imbalanced data.
Description:
This thesis was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering of East West University, Dhaka, Bangladesh.