Student Name

Saadia Humayun

Date of Submission

Spring 2025

Supervisor

Dr. Tariq Mahmood, Professor and Program Coordinator MS(CS) and MS(DS) Programs, School of Mathematics and Computer Science (SMCS)

Co-Supervisor

Dr. Munira Moosajee

Committee Member 1

Dr. Tariq Mahmood

Committee Member 2

Dr. Tahir Syed, Examiner-I, Institute of Business Administration

Committee Member 3

Dr. Behroz Mirza, Examiner-II, Habib University

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/ School

School of Mathematics and Computer Science (SMCS)

Keywords

Multi-objective optimization, Genetic algorithms, Feature selection, Breast cancer prediction

Abstract

This thesis presents a novel multi-objective genetic algorithm framework for clinical feature selection in breast cancer risk prediction, addressing the critical gap between predictive performance and clinical interpretability in automated feature selection methods. Traditional feature selection approaches optimize for statistical performance alone, often selecting algorithmically convenient but clinically meaningless variables, limiting their real-world applicability in medical decision support systems.

The proposed framework systematically integrates expert oncological knowledge into evolutionary optimization through three innovative variants: Clinical Expert-Guided Genetic Algorithm (GA), Adaptive GA, and Multi-Population GA. These methods simultaneously optimize predictive performance, clinical relevance, and model parsimony using the comprehensive PLCO (Prostate, Lung, Colorectal, and Ovarian) cancer screening trial data containing 78,209 participants and 176 related features.
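
The three objectives named above (predictive performance, clinical relevance, and model parsimony) can be illustrated with a minimal Python sketch of Pareto-based comparison between candidate feature subsets. This is not the thesis's implementation: the function names, the 1-to-5 relevance scale per feature, the toy relevance values, and the placeholder `toy_performance` function (standing in for something like a cross-validated AUC) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# All three objectives are expressed so that larger is better:
# (performance, mean clinical relevance, negative feature count).
Objectives = Tuple[float, float, float]


def evaluate(mask: Sequence[int],
             relevance: Sequence[float],
             performance_fn: Callable[[Sequence[int]], float]) -> Objectives:
    """Score one candidate feature subset (a 0/1 mask) on three objectives."""
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        # An empty subset is treated as worse than any real candidate.
        worst = float("-inf")
        return (worst, worst, worst)
    performance = performance_fn(mask)  # hypothetical stand-in for model AUC
    mean_relevance = sum(relevance[i] for i in selected) / len(selected)
    parsimony = -float(len(selected))   # fewer features gives a higher value
    return (performance, mean_relevance, parsimony)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is at least as good as b everywhere and strictly better once."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(masks: List[Tuple[int, ...]],
                 relevance: Sequence[float],
                 performance_fn: Callable[[Sequence[int]], float]
                 ) -> List[Tuple[int, ...]]:
    """Return the masks that no other candidate dominates."""
    scored = [(m, evaluate(m, relevance, performance_fn)) for m in masks]
    return [m for m, s in scored
            if not any(dominates(t, s) for n, t in scored if n != m)]


# Toy data (assumed): expert relevance scores for four features.
toy_relevance = [5.0, 1.0, 4.0, 2.0]


def toy_performance(mask: Sequence[int]) -> float:
    """Made-up performance surface; a real run would train and score a model."""
    return 0.6 + 0.1 * mask[0] + 0.1 * mask[1] + 0.05 * mask[2]


candidates = [(1, 0, 1, 0), (1, 1, 1, 1), (0, 1, 0, 1), (1, 1, 0, 0)]
front = pareto_front(candidates, toy_relevance, toy_performance)
print(front)
```

In this toy run the subset `(0, 1, 0, 1)` drops out because another two-feature subset is better on both performance and relevance, while the remaining three candidates each represent a different trade-off. A genetic algorithm would apply such a comparison when selecting survivors in each generation.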

Experimental results demonstrate distinct performance-interpretability trade-offs across the three variants. The Clinical Expert-Guided GA achieves excellent clinical interpretability (3.79/5 clinical relevance score, 79% clinically relevant features) while maintaining competitive predictive performance (AUC = 0.757) using only 29 features. The Adaptive GA achieves superior predictive performance (AUC = 0.948, F1 = 0.444), representing substantial improvements over baseline methods, but with reduced clinical interpretability (2.09/5 clinical relevance score, 25.9% clinically relevant features). The Multi-Population GA provides specialized solutions optimized for different clinical scenarios: an efficiency variant achieving an AUC of 0.934 with 32 features, a clinical-focused variant achieving 54.3% clinical relevance, and a diversity variant achieving an F1 score of 0.464 with balanced precision-recall performance. Cross-dataset validation on the METABRIC dataset confirms framework generalizability, with the Adaptive GA maintaining 99.3% of baseline performance while the Clinical GA achieves the highest clinical interpretability across both datasets.

The framework's primary contributions include: (1) the first systematic integration of expert medical knowledge into multi-objective genetic algorithms for breast cancer prediction, (2) novel clinical-aware genetic operators and adaptive optimization strategies, (3) comprehensive multi-objective optimization enabling deployment flexibility based on clinical requirements, and (4) clinical validation demonstrating that domain knowledge integration enhances rather than compromises model performance. This work advances the field of clinically applicable evolutionary computation and provides multiple deployment-ready solutions addressing different clinical priorities in breast cancer risk assessment systems.

Document Type

Restricted Access

Submission Type

Thesis
