Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Muhammad Atif Tahir, Professor and Program Coordinator, Graduate & Postgraduate Programs (CS), Department of Computer Science

Keywords

Advertisement Memorability, Multimodal Learning, Artificial Intelligence, Computer Vision, Natural Language Processing, Retrieval-Augmented Generation

Abstract

This project presents a multimodal advertisement intelligence system that analyzes and predicts ad memorability using visual, audio, and textual representations extracted from video content. It leverages deep learning–based embeddings and modality-specific models to generate a unified memorability score along with interpretable contributions from each modality. In parallel, the system uses a large language model to generate and refine advertisement ideas based on retrieved similar ads through a retrieval-augmented generation approach, enabling creative ad synthesis grounded in real examples. Its objective is to bridge the gap between creative content generation and measurable audience recall by integrating deep learning–based memorability prediction with retrieval-augmented large language model generation.

Tools and Technologies Used

Python, PyTorch, Scikit-learn, SentenceTransformers (all-mpnet-base-v2), OpenAI Whisper, Google Gemini API, pgvector, PostgreSQL (Supabase), FastAPI, NumPy, Pandas, OpenCV, VGGish, ResNeXt, BERT, Hugging Face Transformers, React

Methodology

The system processes video advertisements by extracting visual, audio, and textual information using specialized models and feature extractors. Visual features are obtained through deep convolutional architectures such as ResNeXt, and BERT-based embeddings capture the semantics of scene-level descriptions. Audio is handled by VGGish for acoustic representation and by Whisper for speech-to-text transcription. These heterogeneous features are encoded into dense embeddings and used to train separate regression models, alongside traditional machine learning models such as XGBoost and Random Forest, to predict modality-specific memorability scores. A weighted fusion mechanism then combines the individual modality predictions into a final memorability score and provides interpretability through modality-level contribution analysis.
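The weighted fusion step can be illustrated with a minimal sketch. The record does not give the project's actual weights or fusion formula, so the function name `fuse_memorability`, the normalized weighted average, and all numbers below are assumptions chosen only to show how a fused score and per-modality contributions might be computed.

```python
from typing import Dict, Tuple


def fuse_memorability(
    scores: Dict[str, float], weights: Dict[str, float]
) -> Tuple[float, Dict[str, float]]:
    """Hypothetical late fusion: normalized weighted average of modality scores.

    Returns the fused score and each modality's fractional share of it.
    """
    total_weight = sum(weights[m] for m in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each modality's weighted share of the final score.
    weighted = {m: weights[m] * s / total_weight for m, s in scores.items()}
    final = sum(weighted.values())
    # Fraction of the final score attributable to each modality.
    contributions = {m: (w / final if final else 0.0) for m, w in weighted.items()}
    return final, contributions


# Illustrative values only; not taken from the project.
scores = {"visual": 0.80, "audio": 0.60, "text": 0.70}
weights = {"visual": 0.5, "audio": 0.2, "text": 0.3}
final, contrib = fuse_memorability(scores, weights)
print(round(final, 2), {m: round(c, 3) for m, c in contrib.items()})
```

In practice the weights could be tuned on a validation set or learned by a small meta-model; the fixed weights here are only the simplest option.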

In parallel, a retrieval-augmented generation (RAG) pipeline is built using MPNet-based embeddings stored in a vector database (pgvector), enabling the system to retrieve similar high-performing advertisements as contextual grounding. A large language model (Gemini) then generates new advertisement scripts based on user input and retrieved examples, producing structured, scene-wise ad narratives. This combined approach enables both analytical evaluation of advertisement effectiveness and data-driven creative generation within a single integrated framework.
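The retrieval and prompt-assembly steps described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's code: the three-dimensional vectors stand in for MPNet embeddings, the in-memory list stands in for the pgvector table (where a cosine-distance query would do the ranking), and `retrieve_top_k` and `build_prompt` are hypothetical helper names. The call to the language model is left out.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k ad descriptions whose embeddings are closest to the query."""
    ranked = sorted(
        corpus, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]


def build_prompt(brief: str, examples: List[str]) -> str:
    """Assemble a grounding prompt from the user brief and retrieved ads."""
    lines = ["Write a scene-wise advertisement script.", "Reference ads:"]
    lines += [f"- {example}" for example in examples]
    lines.append(f"Brief: {brief}")
    return "\n".join(lines)


# Toy corpus with made-up embeddings (assumption: real ones come from MPNet).
corpus = [
    ("Energy drink ad with a skateboard stunt", [0.9, 0.1, 0.0]),
    ("Family insurance ad with a quiet home scene", [0.0, 0.2, 0.9]),
    ("Sports shoe ad with a city sprint", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]
examples = retrieve_top_k(query_vec, corpus, k=2)
prompt = build_prompt("Launch ad for a running watch", examples)
print(prompt)
```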

The system is deployed as a full-stack application. The React frontend provides an interactive interface where users enter prompts, view generated advertisements, and inspect memorability predictions and explanations. The FastAPI backend serves as the core orchestration layer connecting the feature extraction modules, the machine learning models, the retrieval-augmented generation pipeline, and the database layer. It handles API requests for ad generation and prediction, manages communication with the LLM (Gemini), and integrates retrieval from the vector database (pgvector).
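The orchestration role of the backend can be sketched without any web framework. The class below is a hypothetical, framework-agnostic illustration: `AdService`, its method names, and the stub components are all assumptions, and in a FastAPI application each method would typically be called from a route handler. The real retrieval, generation, and prediction components are replaced by simple callables so the wiring is visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AdService:
    """Hypothetical orchestration layer wiring the system's components together."""

    retrieve: Callable[[str], List[str]]        # vector-database lookup
    generate: Callable[[str, List[str]], str]   # LLM call
    predict: Callable[[str], Dict[str, float]]  # memorability models

    def generate_ad(self, brief: str) -> Dict[str, object]:
        """Ad generation request: retrieve references, then call the generator."""
        references = self.retrieve(brief)
        script = self.generate(brief, references)
        return {"script": script, "references": references}

    def predict_memorability(self, video_id: str) -> Dict[str, float]:
        """Prediction request: delegate to the memorability models."""
        return self.predict(video_id)


# Stub components standing in for pgvector, Gemini, and the trained models.
service = AdService(
    retrieve=lambda brief: ["Sports shoe ad with a city sprint"],
    generate=lambda brief, refs: f"Scene 1: {brief} ({len(refs)} reference ad used)",
    predict=lambda video_id: {"visual": 0.8, "audio": 0.6, "text": 0.7},
)
generated = service.generate_ad("Running watch launch")
predicted = service.predict_memorability("ad_001")
print(generated["script"], predicted)
```

Injecting the components as callables keeps the orchestration logic testable without a database or an API key, which is one common reason to separate it from the route definitions.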

Document Type

Restricted Access

Submission Type

BSCS Final Year Project

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
