Degree
Bachelor of Science (Computer Science)
Department
Department of Computer Science
School
School of Mathematics and Computer Science (SMCS)
Advisor
Dr. Taslim Murad, Assistant Professor, Department of Computer Science
Keywords
BRCA1/BRCA2 Variant Classification, Explainable AI, 3D Protein Analysis, Structure, Machine Learning, Structural Features, Sequential Features
Abstract
Accurate classification of BRCA1 and BRCA2 gene mutations plays a crucial role in estimating the risk of hereditary breast and ovarian cancers. However, existing computational approaches to variant pathogenicity prediction remain limited because they do not use non-sequence features and their results are difficult to interpret clinically. This research develops a machine learning approach that integrates sequence-based and structural properties to improve classification accuracy and support clinical interpretation. Variant information was downloaded from ClinVar, and 74 sequence features were obtained via Ensembl VEP. In addition, 5 three-dimensional structural features were calculated for each mutation from PDB crystal structures using FoldX, UCSF ChimeraX, and Biopython. Three feature sets (sequence-only, structure-only, and combined) were then evaluated with the following machine learning models: Random Forest, XGBoost, Decision Tree, KNN, and Logistic Regression. Combining sequence and structure information yielded higher classification performance than either feature set alone. Random Forest performed best, with a classification accuracy of 0.892 and ROC-AUC of 0.931. SHAP explainability analysis showed that both feature types (sequence-based features such as ClinPred and SIFT, and structural features such as SASA and DDG) ranked among the ten most impactful features, supporting the value of combining the two data sources. These findings suggest that using both kinds of data yields more robust mutation predictors with clear biological interpretation, which could inform clinical counseling and cancer risk prediction.
Tools and Technologies Used
Python, ClinVar, FoldX, ChimeraX, UniProt, PDB, Biopython, scikit-learn
Methodology
Data Collection:
ClinVar served as the primary data source for both sequence and structural variant data. Human variants associated with the BRCA1 and BRCA2 genes were isolated. The target label for this classification task is the IMPACT attribute, which has four levels: High, Moderate, Low, and Modifier. This target variable was isolated before preprocessing to prevent data leakage during downstream preprocessing and modeling.
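The gene filter and label isolation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the records, the `SYMBOL` and `SIFT_score` field names, and the uppercase label spelling are hypothetical; only the IMPACT attribute and the two gene names come from the project description.

```python
# Hypothetical variant records; only the IMPACT field is named in the source.
records = [
    {"SYMBOL": "BRCA1", "IMPACT": "HIGH", "SIFT_score": 0.01},
    {"SYMBOL": "BRCA2", "IMPACT": "MODERATE", "SIFT_score": 0.20},
    {"SYMBOL": "TP53", "IMPACT": "LOW", "SIFT_score": 0.90},
]

TARGET = "IMPACT"
GENES = {"BRCA1", "BRCA2"}


def split_features_and_target(rows, target=TARGET, genes=GENES):
    """Keep BRCA1/BRCA2 rows and separate the label from the feature fields."""
    features, labels = [], []
    for row in rows:
        if row["SYMBOL"] not in genes:
            continue  # discard variants from other genes
        labels.append(row[target])
        # The label never appears among the features, so it cannot leak.
        features.append({k: v for k, v in row.items() if k != target})
    return features, labels


X, y = split_features_and_target(records)
```

Separating the label before any scaling, encoding, or imputation means those steps cannot accidentally see it.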
Feature Extraction from Sequence Pipeline:
The sequence-based feature extraction pipeline began by processing raw ClinVar variant data through Ensembl’s Variant Effect Predictor (VEP). This step yielded an initial genomic dataset of 85 feature columns across 33,745 rows.
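As a rough illustration of how VEP output might be turned into feature columns, the sketch below parses a tiny tab-delimited sample. It assumes VEP's default tab-delimited layout (`##` metadata lines, a header line starting with `#`, and an `Extra` column of `key=value` pairs); the project's actual output format is not stated, the column set is abbreviated, and the sample values are made up.

```python
import csv
import io

# Tiny stand-in for a VEP tab-delimited file; columns abbreviated, values invented.
VEP_TEXT = (
    "## ENSEMBL VARIANT EFFECT PREDICTOR\n"
    "#Uploaded_variation\tGene\tProtein_position\tAmino_acids\tExtra\n"
    "var1\tENSG00000012048\t1775\tM/R\tIMPACT=MODERATE;SIFT=deleterious(0.01)\n"
    "var2\tENSG00000139618\t-\t-\tIMPACT=MODIFIER\n"
)


def parse_vep(text):
    """Return one dict per variant, with the Extra field expanded into columns."""
    # Drop '##' metadata lines, then strip the leading '#' from the header line.
    lines = [ln for ln in text.splitlines() if not ln.startswith("##")]
    lines[0] = lines[0].lstrip("#")
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    rows = []
    for row in reader:
        extra = row.pop("Extra", "") or ""
        for pair in extra.split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                row[key] = value
        rows.append(row)
    return rows


vep_rows = parse_vep(VEP_TEXT)
```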
Structural Feature Extraction:
For the structural pipeline, three-dimensional macromolecular data for BRCA1 and BRCA2 were obtained. The raw Protein Data Bank (PDB) files were cleaned with UCSF ChimeraX to remove all water molecules and non-protein ligands. The FoldX empirical force field was then used to compute the thermodynamic changes induced by each point mutation.
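The project performed the cleaning step in UCSF ChimeraX. As a simplified stand-in, the sketch below shows the same idea on raw PDB text by dropping `HETATM` records, which is where waters (`HOH`) and ligands are stored. The sample coordinates are invented, and the approach is an assumption for illustration: it would also drop modified residues stored as `HETATM` (for example selenomethionine), which a dedicated tool handles more carefully.

```python
# Invented PDB lines: two protein atoms, one water, one zinc ion.
RAW_PDB = "\n".join([
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "HETATM  100  O   HOH A 201      10.000  10.000  10.000  1.00 20.00           O",
    "HETATM  101 ZN    ZN A 301      12.000  12.000  12.000  1.00 15.00          ZN",
    "END",
]) + "\n"


def strip_hetero_records(pdb_text):
    """Remove HETATM records (waters and ligands) from PDB-format text."""
    kept = []
    for line in pdb_text.splitlines():
        # The record name occupies the first six columns of a PDB line.
        if line[:6].strip() == "HETATM":
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


cleaned = strip_hetero_records(RAW_PDB)
```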
Integration of Datasets:
To combine the two modalities, a mutation mapping key was derived from the raw VEP output. A custom Python standardization script removed chain identifiers from the structural output files to produce a uniform mutation key column. An inner merge in Pandas then integrated the sequence and structural datasets into a single master dataset.
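The key standardization and inner merge can be sketched as follows. This is a minimal illustration, not the project's script. It assumes FoldX-style mutation codes of the form wild-type residue, chain ID, position, mutant residue (for example `MA1775R`), and VEP-style `Amino_acids` and `Protein_position` values. The function names, the `mutation_key` column name, and the sample values are hypothetical. The join is written with plain dictionaries in place of the Pandas inner merge used in the project, and it assumes each key appears at most once on the structural side.

```python
import re

# Wild-type residue, chain ID, residue position, mutant residue.
FOLDX_CODE = re.compile(r"^([A-Z])([A-Za-z])(\d+)([A-Z])$")


def key_from_foldx(code):
    """Drop the chain identifier: 'MA1775R;' -> 'M1775R'."""
    match = FOLDX_CODE.match(code.strip().rstrip(";"))
    if match is None:
        raise ValueError(f"unrecognized mutation code: {code!r}")
    wild, _chain, position, mutant = match.groups()
    return f"{wild}{position}{mutant}"


def key_from_vep(amino_acids, protein_position):
    """Build the same key from VEP fields: ('M/R', '1775') -> 'M1775R'."""
    parts = amino_acids.split("/")
    if len(parts) != 2 or not protein_position.isdigit():
        return None  # not a single-residue substitution
    wild, mutant = parts
    if len(wild) != 1 or len(mutant) != 1:
        return None
    return f"{wild}{protein_position}{mutant}"


def inner_join(sequence_rows, structure_rows, key="mutation_key"):
    """Keep only rows whose key occurs in both inputs, merging their fields."""
    by_key = {row[key]: row for row in structure_rows}
    return [
        {**seq, **by_key[seq[key]]}
        for seq in sequence_rows
        if seq[key] in by_key
    ]


sequence_rows = [
    {"mutation_key": key_from_vep("M/R", "1775"), "SIFT": 0.01},
    {"mutation_key": key_from_vep("C/G", "61"), "SIFT": 0.00},
]
structure_rows = [{"mutation_key": key_from_foldx("MA1775R;"), "DDG": 2.5}]
merged = inner_join(sequence_rows, structure_rows)
```

Because the join is an inner one, variants with no structural counterpart are dropped, so the merged dataset can be much smaller than the sequence dataset.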
Machine Learning and Evaluation:
Three data configurations were derived from the master integrated dataset for comparative analysis: a sequence-only subset, a structure-only subset, and the combined multimodal dataset. For each configuration, five machine learning classification algorithms were implemented: Decision Tree, Random Forest, XGBoost, KNN, and Logistic Regression. Each model configuration was evaluated using accuracy, Receiver Operating Characteristic Area Under the Curve (ROC-AUC), precision, recall, and F1-score. SHAP was used as an explainable AI tool to interpret the model results.
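To make the evaluation metrics concrete, the sketch below computes accuracy and macro-averaged precision, recall, and F1 for a multi-class problem in plain Python. Macro averaging is an assumption here, since the averaging scheme used in the project is not stated, and the labels are toy values. ROC-AUC is omitted because it needs predicted probabilities.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels, not project data.
scores = macro_scores(
    ["HIGH", "HIGH", "LOW", "LOW"],
    ["HIGH", "LOW", "LOW", "LOW"],
)
```

Macro averaging weights each class equally, which matters when some IMPACT levels are much rarer than others.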
Document Type
Restricted Access
Submission Type
BSCS Final Year Project
Recommended Citation
Luhana, D., Irfan, A., & Detho, A. H. (2026). Machine Learning Based Cancer-Risk Prediction Using Structural and Sequential Data. Retrieved from https://ir.iba.edu.pk/fyp-bscs/39
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
