Student Name

Wajeeha Parker

Date of Submission

Fall 2025

Supervisor

Dr. Syed Ali Raza, Assistant Professor, Department of Computer Science

Committee Member 1

Dr. Tariq Mahmood, Examiner – I, Institute of Business Administration (IBA), Karachi

Committee Member 2

Dr. Syed Farrukh Hasan, Examiner – II, FAST National University

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Keywords

Ensemble Classifiers, Clustering, Multiple Classifier Systems, Cluster Selection, Classification, Evolutionary Algorithm

Abstract

In ensemble learning, a promising strategy for improving base learners is to train them on different subspaces (subsets) of the dataset. However, generating meaningful and diverse subspaces that lead to strong individual classifiers remains a significant challenge, particularly in the presence of class imbalance within subspaces. Poorly constructed subspaces can produce weak learners that ultimately degrade ensemble performance. This thesis proposes a modular approach to address this challenge. It begins by employing a clustering technique to generate candidate subspaces that capture the intrinsic structure of the data. Next, recognizing that not all clusters are equally informative, it incorporates an evolutionary optimization process to filter out low-quality subspaces and retain only the most promising ones. Furthermore, a second optimization step is applied to explore whether further improvements in ensemble performance can be achieved by selecting an optimal subset of base classifiers from a diverse pool and aggregating complementary models more effectively.
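To make the pipeline concrete, the sketch below illustrates its first two stages under stated assumptions: k-means generates candidate subspaces, and a fixed size/imbalance heuristic stands in for the evolutionary filter described above. All function names, thresholds, and the choice of base learner are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch of clustering-based subspace generation plus filtering.
# The fixed min_size / max_majority rule below is a stand-in for the
# evolutionary cluster-selection step described in the abstract.
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def build_cluster_ensemble(X, y, n_clusters=8, min_size=30, max_majority=0.9):
    """Cluster the data into candidate subspaces, drop clusters that are
    too small or too class-imbalanced, and fit one base learner per
    surviving subspace."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    learners = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) < min_size:
            continue  # subspace too small to yield a useful learner
        counts = Counter(y[idx])
        if max(counts.values()) / len(idx) > max_majority:
            continue  # nearly homogeneous subspace; exclude it
        learners.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return learners

def predict_majority(learners, X):
    """Aggregate base-learner predictions by simple majority vote."""
    votes = np.array([clf.predict(X) for clf in learners])
    return np.apply_along_axis(
        lambda col: Counter(col).most_common(1)[0][0], 0, votes)
```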

Experiments were conducted on a variety of small and large benchmark datasets. The results show that excluding highly imbalanced or homogeneous subspaces from the set of candidate subspaces improves ensemble performance across most datasets. Furthermore, removing Support Vector Machine (SVM)-based weak learners from the classifier pool enhances computational efficiency without compromising accuracy.

For smaller datasets, Particle Swarm Optimization (PSO) boosted performance when applied to both cluster filtering and classifier selection. In contrast, for larger datasets, Binary PSO for subspace optimization and SHAP-based methods for classifier selection yielded superior results. Notably, for large datasets, the second optimization step (base-classifier selection) did not offer further performance gains; optimal subspace selection alone was sufficient to match or surpass state-of-the-art ensemble methods such as Random Forest and XGBoost.
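As a hedged illustration of the Binary PSO step for subspace selection, each particle can be encoded as a bit mask over the candidate clusters, with fitness given by a caller-supplied validation score for the ensemble built from the selected clusters. The swarm hyperparameters and the `fitness` interface below are assumed values for the sketch, not those tuned in the thesis.

```python
# Illustrative Binary PSO over bit masks: fitness(mask) should return the
# validation accuracy of the ensemble built from clusters where mask == 1.
import numpy as np

def binary_pso(n_bits, fitness, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, 2, size=(n_particles, n_bits))   # bit masks
    vel = rng.normal(0.0, 1.0, size=(n_particles, n_bits))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        # Sigmoid transfer function maps velocities to bit probabilities.
        pos = (rng.random(pos.shape) < 1.0 / (1.0 + np.exp(-vel))).astype(int)
        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return gbest  # best subspace mask found
```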

These findings highlight that while optimization techniques for subspace generation and classifier aggregation are effective, their efficacy varies with dataset size and complexity, and strategies that work well on smaller datasets may not generalize to larger ones.

Document Type

Restricted Access

Submission Type

Thesis
