Date of Submission
Spring 2025
Supervisor
Dr. Tariq Mahmood, Professor and Program Coordinator MS(CS) and MS(DS) Programs, School of Mathematics and Computer Science (SMCS)
Co-Supervisor
Dr. Munira Moosajee
Committee Member 1
Dr. Tariq Mahmood
Committee Member 2
Dr. Tahir Syed, Examiner-I, Institute of Business Administration
Committee Member 3
Dr. Behroz Mirza, Examiner-II, Habib University
Degree
Master of Science in Data Science
Department
Department of Computer Science
Faculty/School
School of Mathematics and Computer Science (SMCS)
Keywords
Multi-objective optimization, Genetic algorithms, Feature selection, Breast cancer prediction
Abstract
This thesis presents a novel multi-objective genetic algorithm framework for clinical feature
selection in breast cancer risk prediction, addressing the critical gap between predictive
performance and clinical interpretability in automated feature selection methods. Traditional
feature selection approaches optimize for statistical performance alone, often selecting
algorithmically convenient but clinically meaningless variables, limiting their real-world
applicability in medical decision support systems.
The proposed framework systematically integrates expert oncological knowledge into
evolutionary optimization through three innovative variants: Clinical Expert-Guided Genetic
Algorithm (GA), Adaptive GA, and Multi-Population GA. These methods simultaneously
optimize predictive performance, clinical relevance, and model parsimony using the
comprehensive PLCO (Prostate, Lung, Colorectal, and Ovarian) cancer screening trial data
containing 78,209 participants and 176 related features.
Experimental results demonstrate distinct performance-interpretability trade-offs across
the three variants. The Clinical Expert-Guided GA achieves excellent clinical interpretability
(3.79/5 clinical relevance score, 79% clinically relevant features) while maintaining competitive
predictive performance (AUC = 0.757) using only 29 features. The Adaptive GA achieves
superior predictive performance (AUC = 0.948, F1 = 0.444), representing substantial
improvements over baseline methods, but with reduced clinical interpretability (2.09/5 clinical
score, 25.9% clinically relevant features). The Multi-Population GA provides specialized
solutions optimized for different clinical scenarios: an efficiency variant achieving 0.934 AUC
with 32 features, a clinical-focused variant achieving 54.3% clinical relevance, and a diversity
variant achieving 0.464 F1 score with balanced precision-recall performance. Cross-dataset
validation on the METABRIC dataset confirms framework generalizability, with the Adaptive
GA maintaining 99.3% of baseline performance while the Clinical GA achieves the highest
clinical interpretability across both datasets.
The framework's primary contributions include: (1) the first systematic integration of expert
medical knowledge into multi-objective genetic algorithms for breast cancer prediction, (2)
novel clinical-aware genetic operators and adaptive optimization strategies, (3) comprehensive
multi-objective optimization enabling deployment flexibility based on clinical requirements, and
(4) clinical validation demonstrating that domain knowledge integration enhances rather than
compromises model performance. This work advances the field of clinically applicable
evolutionary computation and provides multiple deployment-ready solutions addressing
different clinical priorities in breast cancer risk assessment systems.
Document Type
Restricted Access
Submission Type
Thesis
Recommended Citation
Humayun, S. (2025). Novel Multi-objective Feature Selection Framework for 5-year Breast Cancer Risk Prediction (Unpublished graduate thesis). Retrieved from https://ir.iba.edu.pk/etd-ms-ds/10
