Date of Submission
Fall 2025
Supervisor
Dr. Syed Ali Raza, Assistant Professor, Department of Computer Science
Committee Member 1
Dr. Tariq Mahmood, Examiner – I, Institute of Business Administration (IBA), Karachi
Committee Member 2
Dr. Syed Farrukh Hasan, Examiner – II, FAST National University
Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/ School
School of Mathematics and Computer Science (SMCS)
Keywords
Ensemble Classifiers, Clustering, Multiple Classifier Systems, Cluster Selection, Classification, Evolutionary Algorithm
Abstract
In ensemble learning, a promising strategy for improving base learners is to train them on different subspaces (subsets) of the dataset. However, generating meaningful and diverse subspaces that yield strong individual classifiers remains a significant challenge, particularly in the presence of class imbalance within subspaces. Poorly constructed subspaces can produce weak learners that ultimately degrade ensemble performance. This thesis proposes a modular approach to address this challenge. It first employs a clustering technique to generate candidate subspaces that capture the intrinsic structure of the data. Next, recognizing that not all clusters are equally informative, it applies an evolutionary optimization process to filter out low-quality subspaces and retain only the most promising ones. Finally, a second optimization step explores whether further gains in ensemble performance can be achieved by selecting an optimal subset of base classifiers from a diverse pool and aggregating complementary models more effectively.
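To make the pipeline concrete, the following is a minimal Python sketch, not the thesis's implementation: it assumes KMeans for subspace generation, a decision tree as the base learner, and simplifies the evolutionary filter to a (1+1) bit-flip search over a binary cluster-selection mask (the thesis itself uses PSO-based optimizers). The synthetic data, cluster count, and mutation rate are illustrative choices. Note that clusters containing only one class (homogeneous subspaces) are skipped, mirroring the exclusion step evaluated below.

# Sketch of the modular pipeline: cluster -> filter subspaces -> vote.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: clustering generates candidate subspaces (row subsets of the training data).
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_tr)
subspaces = [np.flatnonzero(labels == k) for k in range(8)]

def ensemble_score(mask):
    """Train one base learner per retained cluster; majority-vote on the validation set."""
    # Skip homogeneous (single-class) clusters, which cannot train a useful learner.
    kept = [s for keep, s in zip(mask, subspaces)
            if keep and len(np.unique(y_tr[s])) > 1]
    if not kept:
        return 0.0
    votes = [DecisionTreeClassifier(random_state=0).fit(X_tr[s], y_tr[s]).predict(X_val)
             for s in kept]
    majority = (np.mean(votes, axis=0) >= 0.5).astype(int)
    return accuracy_score(y_val, majority)

# Step 2: (1+1) evolutionary filter over the binary cluster-selection mask
# (a stand-in here for the thesis's PSO-based subspace optimization).
rng = np.random.default_rng(0)
mask = np.ones(len(subspaces), dtype=bool)
best = ensemble_score(mask)
for _ in range(50):
    child = mask ^ (rng.random(len(mask)) < 0.2)   # flip each bit with prob 0.2
    score = ensemble_score(child)
    if score >= best:
        mask, best = child, score
print("retained clusters:", np.flatnonzero(mask), "val accuracy:", round(best, 3))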
Experiments were conducted on a variety of small and large benchmark datasets. The results show that excluding highly imbalanced or homogeneous subspaces from the set of candidate subspaces improves ensemble performance across most datasets. Furthermore, removing Support Vector Machine (SVM)-based weak learners from the classifier pool enhances computational efficiency without compromising accuracy.
For smaller datasets, Particle Swarm Optimization (PSO) boosted performance when applied to both cluster filtering and classifier selection. In contrast, for larger datasets, Binary PSO for subspace optimization and SHAP-based methods for classifier selection yielded superior results. Notably, for large datasets, the second optimization step—focused on base-classifier selection—did not offer further performance gains; optimal subspace selection alone was sufficient to match or surpass state-of-the-art ensemble methods such as Random Forest and XGBoost.
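The Binary PSO mentioned above for subspace optimization can be sketched as follows. This is an assumption-laden illustration, not the thesis's code: it reuses the ensemble_score fitness and subspaces from the previous sketch, and maps real-valued velocities to bit-flip probabilities via a sigmoid transfer function. Swarm size, inertia w, and acceleration coefficients c1/c2 are illustrative defaults.

# Binary PSO over cluster-selection masks (depends on the previous sketch).
import numpy as np

def binary_pso(fitness, dim, n_particles=12, iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = (rng.random((n_particles, dim)) < 0.5).astype(float)  # binary positions as 0/1
    vel = rng.uniform(-1, 1, (n_particles, dim))                # real-valued velocities
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Sigmoid transfer: each velocity becomes the probability of a 1-bit.
        pos = (rng.random((n_particles, dim)) < 1.0 / (1.0 + np.exp(-vel))).astype(float)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return gbest, pbest_fit.max()

best_mask, best_score = binary_pso(ensemble_score, dim=len(subspaces))
print("Binary PSO mask:", best_mask.astype(int), "val accuracy:", round(best_score, 3))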
These findings highlight that while optimization techniques for subspace generation and classifier aggregation are effective, their efficacy varies with dataset size and complexity, and strategies effective on smaller datasets may not generalize well to larger ones.
Document Type
Restricted Access
Submission Type
Thesis
Recommended Citation
Parker, W. (2025). A Modular Approach to Cluster-based Ensemble Learning: Optimizing Subspace Design and Classifier Aggregation (Unpublished graduate thesis). Retrieved from https://ir.iba.edu.pk/etd-ms-ds/9
