Date of Submission

Fall 2024

Supervisor

Dr. Tariq Mahmood, Professor and Program Coordinator MS(CS) and MS(DS) Programs, School of Mathematics and Computer Science (SMCS)

Co-Supervisor

Dr. Aysha Almas, The Aga Khan University

Committee Member 1

Dr. Gerald S. Bloomfield, Co-Supervisor, Duke University

Committee Member 2

Dr. Zainab Samad, Co-Supervisor, Aga Khan University

Committee Member 3

Dr. Muhammad Rafi, Examiner – I, FAST; Dr. Aisha Shaikh, Examiner – II, Aga Khan University

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/ School

School of Mathematics and Computer Science (SMCS)

Keywords

Diabetes, Machine Learning, Cluster analysis, ICD codes, K-Means clustering, K-Modes clustering, NLP, TF-IDF, LDA

Abstract

Background: Diabetes is a prevalent health condition whose burden is rising rapidly in low- and middle-income countries. Pakistan ranks third in the world in diabetes prevalence. Type 2 diabetes is a serious health concern in Pakistan, driven mainly by obesity in the middle-aged population, suboptimal physical activity, unhealthy food practices, a low literacy rate, lack of awareness, unwillingness to seek treatment or change lifestyle habits, and poverty. In 2021, more than 32 million people aged 20-79 years in Pakistan were living with diabetes. Diabetes complications and mortality can be prevented with consistent treatment. It is therefore important to determine which features and disease patterns in patient data increase the risk of mortality in diabetic patients admitted to the hospital. Identifying these factors and disease patterns will help in designing preventive and therapeutic strategies to reduce mortality in diabetic patients. Unlike regression, machine learning clustering can take large datasets and identify patterns of homogeneity or heterogeneity among data points without human supervision, which reduces the analytical bias a human analyst might introduce. Cluster analysis is particularly useful when common co-occurrences of comorbidities are ambiguous, making it difficult to categorize patients and develop effective treatment plans. By grouping patients with similar disease patterns, this method can help uncover hidden relationships between comorbidities, enabling more tailored interventions and improving the overall management of complex health conditions.

Methods: This study uses unsupervised machine learning (cluster analysis) to identify disease patterns associated with mortality in diabetic patients in Karachi, Pakistan. The patient cohort was selected from the inpatient database of the Aga Khan University, Karachi, covering 2008 to 2021. All adults above 18 years of age who had either type 1 or type 2 diabetes were included. ICD-9 and ICD-10 codes were used to extract diagnosis data. The data were integrated from the hospital information management system (HIMS), laboratory records, and pharmacy records. The most recent lab test results from the first 48 hours were used. For medications, the first drug order administered to the patient on the first day and on the last day of admission was included in the data. The cluster analysis algorithm, K-Modes (an extension of K-Means for categorical data), divided the patient data into 4 subgroups. The optimal number of clusters was chosen using the elbow curve method and expert feedback. The study's findings have important implications for diabetes management in Pakistan and beyond, providing insights into mortality patterns and informing targeted interventions to improve patient outcomes.
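
The clustering step described above can be illustrated with a minimal sketch. It is not the thesis implementation: the records, attribute names, seed, and iteration limit are hypothetical toy values, and the function is a simplified pure-Python K-Modes (Hamming distance for assignment, per-attribute mode for the update). The total within-cluster mismatch count (cost) is computed for several values of k, which is the quantity an elbow curve would plot.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def assign(records, modes):
    """Label each record with the index of its nearest mode."""
    return [min(range(len(modes)), key=lambda j: hamming(r, modes[j]))
            for r in records]

def k_modes(records, k, n_iter=20, seed=0):
    """Simplified K-Modes; returns (modes, labels, cost)."""
    rng = random.Random(seed)
    # Start from k distinct records so no two initial modes coincide.
    modes = rng.sample(sorted(set(records)), k)
    for _ in range(n_iter):
        labels = assign(records, modes)
        new_modes = []
        for j in range(k):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if not members:
                new_modes.append(modes[j])  # keep the mode of an empty cluster
                continue
            # Most frequent category per attribute within the cluster.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:
            break
        modes = new_modes
    labels = assign(records, modes)
    cost = sum(hamming(r, modes[lab]) for r, lab in zip(records, labels))
    return modes, labels, cost

# Hypothetical toy records: (age band, sex, hypertension, kidney disease).
records = [
    ("18-40", "F", "no_htn", "no_ckd"),
    ("18-40", "F", "no_htn", "no_ckd"),
    ("18-40", "M", "no_htn", "no_ckd"),
    ("60+", "M", "htn", "ckd"),
    ("60+", "F", "htn", "ckd"),
    ("60+", "M", "htn", "ckd"),
    ("41-60", "M", "htn", "no_ckd"),
    ("41-60", "M", "htn", "no_ckd"),
    ("41-60", "F", "htn", "no_ckd"),
]

# Cost for each candidate k; the "elbow" is where the decrease flattens.
n_distinct = len(set(records))
costs = {k: k_modes(records, k)[2] for k in range(1, n_distinct + 1)}
for k, cost in costs.items():
    print(k, cost)
```

In practice the elbow is read from a plot of cost against k, and, as the abstract notes, the final choice also drew on expert feedback rather than on the curve alone.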

Results: Based on K-Modes clustering and clinical relevance, four subgroups of patients were identified. Cluster 3 had the highest proportion of males (63%), while Clusters 1 and 4 had the highest proportions of females (53.7% and 53.5%, respectively). Cluster 1 consisted primarily of younger individuals (18-40 years), Cluster 2 of older individuals (60+ years), and Clusters 3 and 4 of middle-aged individuals (41-60 years). Patients with uncontrolled diabetes formed a significant part of each cluster, while hypertension was prominent in Clusters 2, 3, and 4. Cluster 1 was primarily characterized by patients with high levels of HbA1c (prolonged uncontrolled diabetes), among whom pneumonia, tobacco use, and substance abuse disorders were highly prevalent. Cluster 2 was defined by kidney-related diseases, including nephropathy, acute kidney injury, and renal and metabolic disorders. Sepsis was most frequently found in Clusters 1 and 2. Cluster 3 was characterized by cardiovascular-related diabetic complications and included mostly patients with tumors, acute myocardial infarction, and coronary artery disease. Cluster 4 was dominated by high blood glucose and featured a mix of diseases from Clusters 1 and 2, along with patients suffering from stroke and its effects. Mortality was highest in Cluster 2, followed by Cluster 3. The most frequent disease combinations associated with mortality in Cluster 1 were heart disease, sepsis, renal disorders, and metabolic disorders (n=67). In Cluster 2, the predominant combination was myopathy, sepsis, and pneumonia (n=9). In Cluster 3, it was sepsis, heart disease, acute kidney injury, and hypertension (n=60). In Cluster 4, the leading combinations were sepsis, hypertension, uncontrolled diabetes, acute kidney injury, pneumonia, and neuropathy (n=88).
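
One way to tabulate the most frequent disease combination among deceased patients in each cluster is sketched below. This is an illustration only: the rows are invented toy data, and it assumes that each reported n counts deceased patients who share the same set of diagnoses, which the abstract does not state explicitly.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows: (cluster label, died in hospital, set of diagnoses).
patients = [
    (1, True, {"sepsis", "heart disease"}),
    (1, True, {"heart disease", "sepsis"}),
    (1, False, {"pneumonia"}),
    (2, True, {"sepsis", "pneumonia"}),
    (2, True, {"acute kidney injury"}),
    (2, True, {"pneumonia", "sepsis"}),
]

def top_fatal_combinations(rows):
    """Return {cluster: (most frequent diagnosis set, count)} for deaths only."""
    counts = defaultdict(Counter)
    for cluster, died, diagnoses in rows:
        if died:
            # frozenset makes the combination hashable and order-independent.
            counts[cluster][frozenset(diagnoses)] += 1
    return {c: ctr.most_common(1)[0] for c, ctr in counts.items()}

top = top_fatal_combinations(patients)
for cluster, (combo, n) in sorted(top.items()):
    print(cluster, sorted(combo), n)
```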

Conclusion: Cluster analysis identified subgroups of diabetic patients based on the diagnosis patterns of inpatients. According to domain experts, these clusters are representative of the Pakistani population. The clustering approach provides valuable insights into mortality risk factors, with each cluster showing specific patterns of comorbidities such as hypertension, kidney disease, and cardiovascular disorders. Discrepancies between abnormal lab results and corresponding diagnoses in some patients indicated potential gaps in disease coding, highlighting areas for further investigation. Machine learning made it possible to cluster large datasets and uncover patterns that may inform more targeted treatment strategies. The clustering model optimized in this study has significant potential for tailoring interventions to high-risk groups, particularly in managing diabetes-related complications in Pakistan by optimizing resource allocation. Additionally, the approach is scalable and can be applied to cohorts from other healthcare institutions for validation, making it a robust tool for improving patient outcomes and informing targeted healthcare strategies.

Document Type

Restricted Access

Submission Type

Thesis

Available for download on Saturday, October 31, 2026
