Date of Submission

Fall 2025

Supervisor

Dr. Tariq Mahmood, Professor, School of Mathematics and Computer Science (SMCS)

Committee Member 1

Dr. Tariq Mahmood, Supervisor, Department of Computer Science, School of Mathematics and Computer Science (SMCS), Institute of Business Administration (IBA), Karachi

Committee Member 2

Dr. Sajjad Haider, Examiner – I, Institute of Business Administration (IBA), Karachi

Committee Member 3

Dr. Mohammad Rafi, Examiner – II, Professor & Department Head (AI & DS), FAST National University, Karachi

Degree

Master of Science in Data Science

Department

Department of Computer Science

Faculty/School

School of Mathematics and Computer Science (SMCS)

Keywords

Class imbalance, Conditional Variational Autoencoders, Markov Chain Monte Carlo, Large Language Models, Differential Evolution, Synthetic data

Abstract

Class imbalance in tabular data is a persistent obstacle to machine learning performance. This thesis introduces and validates a modular framework for investigating how effectively modern generative models address it. Its core contribution is a head-to-head comparison of three generative paradigms: a classical augmentation suite (including SMOTE, ADASYN, and TVAE), a CVAE-based framework, and a semantic LLM-based framework. The CVAE framework introduces novel methods for Differential Evolution (DE) optimization and Markov Chain Monte Carlo (MCMC) latent-space sampling, while the LLM framework uses a fine-tuned Phi-3-mini for direct, prompt-based tabular generation. All methods are validated on five diverse datasets and benchmarked on downstream utility, measured by weighted F1-score across a 12-model evaluation suite.
The results show that classical methods provide a robust baseline, whereas advanced generative models are specialist tools that reach state-of-the-art performance in specific, critical scenarios where classical methods fail. The primary finding is that the LLM framework is a powerful solution for data-scarce environments: on a small, high-imbalance medical dataset, it achieves a relative improvement of roughly 230% over its baseline (0.5647 vs. 0.1707), a result unmatched by any other method, classical or novel. The DE-optimized CVAE framework, in turn, shows strong specialist utility on large, stable data, achieving the study’s highest score (0.9014) on the large-scale Adult Census dataset, and the baseline CVAE proves most effective on the large-scale, extreme-imbalance Credit Card Fraud dataset.
Beyond empirical performance, the framework contributes three innovations: the first systematic computational-intelligence optimization of CVAE architectures for tabular data, novel MCMC implementations with distribution-aware sampling strategies, and a pioneering application of LLMs to structured data synthesis. Its modular design supports systematic exploration of hybrid generation strategies, adaptation to domain-specific requirements, and further theoretical work on synthetic data generation for imbalanced learning.

Document Type

Restricted Access

Submission Type

Thesis
