BSCS Final Year Projects

Machine Learning Based Cancer-Risk Prediction Using Structural and Sequential Data

Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Taslim Murad, Assistant Professor, Department of Computer Science

Keywords

BRCA1/BRCA2 Variant Classification, Explainable AI, 3D Protein Analysis, Structure, Machine Learning, Structural Features, Sequential Features

Abstract

Accurate classification of BRCA1 and BRCA2 gene mutations plays a crucial role in estimating the risk of hereditary breast and ovarian cancers. However, existing computational approaches to variant pathogenicity prediction remain limited because they do not use non-sequence-based features and their results are difficult to interpret clinically. This research aims to develop an efficient machine learning approach that integrates sequence-based and structural properties to improve classification accuracy and provide clinical interpretation. Variant information was downloaded from ClinVar, and 74 sequence features were obtained via Ensembl VEP. In addition, 5 three-dimensional structural features were calculated for each mutation using PDB crystal structures, FoldX, UCSF ChimeraX, and Biopython. Three feature sets were then examined with the following machine learning models: Random Forest, XGBoost, Decision Tree, KNN, and Logistic Regression. Combining sequence and structure information yielded higher classification performance than using either feature set alone. Overall, Random Forest provided the best results, with a classification accuracy of 0.892 and a ROC-AUC of 0.931. SHAP explainability analysis showed that both types of features (sequence-based features such as ClinPred and SIFT, and structural features such as SASA and DDG) ranked among the ten most impactful features, supporting the value of combining the two data sources. These findings suggest that using both kinds of data yields more robust mutation predictors with clear biological interpretation, which could inform clinical counseling and cancer prediction.

Tools and Technologies Used

Python, ClinVar, FoldX, ChimeraX, UniProt, PDB, Biopython, scikit-learn

Methodology

Data Collection:

ClinVar served as the primary source of variants for both the sequence and structural pipelines. Human variants associated with the BRCA1 and BRCA2 genes were isolated. The target label for this classification task is the IMPACT attribute, which has four levels: High, Moderate, Low, and Modifier. This target variable was isolated from the input features to prevent data leakage during downstream preprocessing and modeling.
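As a minimal sketch of how the target can be kept apart from the features, the snippet below removes the IMPACT column from the feature table before any preprocessing. The toy rows, the `mutation_key` and `sift_score` column names, and the ordinal ordering of the four levels are illustrative assumptions, not details taken from the project.

```python
import pandas as pd

# Hypothetical toy rows; only the IMPACT column name comes from the text.
variants = pd.DataFrame({
    "mutation_key": ["A10G", "R25W", "L40P", "T55M"],
    "sift_score": [0.00, 0.01, 0.35, 0.60],
    "IMPACT": ["High", "Moderate", "Low", "Modifier"],
})

# Assumed ordering of the four levels, from least to most severe.
IMPACT_LEVELS = ["MODIFIER", "LOW", "MODERATE", "HIGH"]


def split_features_and_target(df, target="IMPACT"):
    """Return (X, y) where X no longer contains the target column."""
    labels = df[target].str.upper()
    codes = pd.Categorical(labels, categories=IMPACT_LEVELS).codes
    y = pd.Series(codes, index=df.index, name=target)
    X = df.drop(columns=[target])
    return X, y


X, y = split_features_and_target(variants)
```

Splitting once, up front, means later steps such as scaling or imputation only ever see `X`, so the label cannot leak into the inputs through them.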

Feature Extraction from Sequence Pipeline:

The sequence-based feature extraction pipeline began by processing the raw ClinVar variants through Ensembl's Variant Effect Predictor (VEP). This step yielded an initial genomic dataset of 85 feature columns across 33,745 rows.

Structural Feature Extraction:

For the structural pipeline, three-dimensional structures of BRCA1 and BRCA2 were retrieved. The raw Protein Data Bank (PDB) files were cleaned with UCSF ChimeraX to remove all water molecules and non-protein ligands. The FoldX empirical force field was then used to compute the thermodynamic changes induced by the identified point mutations.

Integration of Datasets:

To fuse the two modalities, a mutation mapping key was derived from the raw VEP output. A custom Python standardization script then removed chain identifiers from the structural output files, producing a uniform mutation key column. Finally, an inner merge via the Pandas library integrated the sequence and structural datasets into a single master dataset.
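The key standardization and merge described above can be sketched as follows. This is an illustration under stated assumptions, not the project's script: it assumes structural labels of the form wild-type residue, chain ID, position, mutant residue (for example `AA10G`), and the column names `foldx_label`, `ddg`, `sift_score`, and `mutation_key` as well as the toy values are hypothetical.

```python
import re

import pandas as pd

# Assumed label layout: wild-type residue, chain ID, position, mutant residue.
_LABEL = re.compile(r"^([A-Z])([A-Z])(\d+)([A-Z])$")


def strip_chain(label):
    """Drop the chain identifier, e.g. 'AA10G' -> 'A10G'."""
    match = _LABEL.match(label.strip().rstrip(";"))
    if match is None:
        raise ValueError(f"unexpected mutation label: {label!r}")
    wild_type, _chain, position, mutant = match.groups()
    return f"{wild_type}{position}{mutant}"


# Hypothetical sequence-side table, already keyed by mutation.
sequence_df = pd.DataFrame({
    "mutation_key": ["A10G", "R25W", "L40P"],
    "sift_score": [0.02, 0.40, 0.00],
})

# Hypothetical structure-side table, still carrying chain identifiers.
structure_df = pd.DataFrame({
    "foldx_label": ["AA10G", "RA25W", "TB55M"],
    "ddg": [1.8, 0.3, 2.4],
})
structure_df["mutation_key"] = structure_df["foldx_label"].map(strip_chain)

# The inner merge keeps only mutations present in both tables.
master = sequence_df.merge(
    structure_df.drop(columns="foldx_label"),
    on="mutation_key",
    how="inner",
    validate="one_to_one",
)
```

An inner merge silently drops variants that lack a structural counterpart, so comparing `len(master)` against both inputs is a quick check on how much data the join retains.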

Machine Learning and Evaluation:

Three data configurations were derived from the master dataset for comparison: a sequence-only subset, a structure-only subset, and the combined multimodal dataset. For each configuration, five machine learning classifiers were implemented: Decision Tree, Random Forest, XGBoost, KNN, and Logistic Regression. Each model was evaluated using accuracy, area under the Receiver Operating Characteristic curve (ROC-AUC), precision, recall, and F1-score. SHAP was used as an explainable AI tool to interpret the model predictions.
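A comparison of this kind can be sketched with scikit-learn as shown below. The data here is synthetic, the split of columns into "sequence" and "structure" groups is arbitrary, and the hyperparameters are placeholders, so the scores say nothing about the project's results. XGBoost is left out only to keep the sketch dependent on scikit-learn alone, and 5-fold cross-validation is an assumed evaluation protocol.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for the master dataset.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=6, n_redundant=0,
    n_classes=4, random_state=0,
)

# Hypothetical column groups for the three configurations.
feature_sets = {
    "sequence_only": [0, 1, 2, 3, 4],
    "structure_only": [5, 6, 7],
    "combined": list(range(8)),
}

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

SCORING = ["accuracy", "f1_macro", "roc_auc_ovr"]


def compare(X, y, feature_sets, models, cv=5):
    """Cross-validate every model on every feature set; return mean scores."""
    results = {}
    for set_name, columns in feature_sets.items():
        for model_name, model in models.items():
            scores = cross_validate(
                model, X[:, columns], y, cv=cv, scoring=SCORING
            )
            results[(set_name, model_name)] = {
                metric: float(scores[f"test_{metric}"].mean())
                for metric in SCORING
            }
    return results


results = compare(X, y, feature_sets, models)
```

Macro-averaged precision and recall can be added by appending `"precision_macro"` and `"recall_macro"` to `SCORING`. Running every model on every column group under the same folds keeps the three configurations directly comparable.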

Document Type

Restricted Access

Submission Type

BSCS Final Year Project

Recommended Citation

Luhana, D., Irfan, A., & Detho, A. H. (2026). Machine Learning Based Cancer-Risk Prediction Using Structural and Sequential Data. Retrieved from https://ir.iba.edu.pk/fyp-bscs/39

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
