All Theses and Dissertations
Degree
Doctor of Philosophy in Computer Science
Faculty / School
School of Mathematics and Computer Science (SMCS)
Department
Department of Computer Science
Date of Award
Fall 2025
Advisor
Dr. Sayeed Ghani, Professor, Department of Computer Science, School of Mathematics and Computer Science (SMCS)
Second Advisor
Dr. Muhammad Saeed
Committee Member 1
Dr. Arif Mahmood, Examiner - I, ITU, Lahore
Committee Member 2
Dr. Waheed Iqbal, Examiner - II, PUCIT, Lahore
Committee Member 3
Dr. Shakeel Khoja, Professor and Dean, School of Mathematics and Computer Science
Project Type
Dissertation
Access Type
Restricted Access
Document Version
Full
Pages
154
Keywords
Image Captioning, Context-aware captioning, Capsule Networks, Dynamic Routing, Knowledge Graphs, Transformer Neural Networks, Dynamic Corpus.
Subjects
Computer pattern recognition, Computer Science
Abstract
Biological vision is achieved through the combined work of specialized photoreceptor cells, which help us view and understand our surroundings. It is one of the most vital senses a human has. In general terms, human vision combines a light-capturing sensor (the eye) with a signal transmission pathway that carries the light signals to the visual cortex. Working together, these systems let us see and understand the world.
Significant research has been conducted over the last 30 years to give machines and computers a similar kind of vision. Early work focused on capturing light to form an image (photography). Technology has since advanced to meet higher expectations, such as perceiving and understanding what is in an image, just as humans do. However, achieving something like human perception, understanding, and recognition requires tremendous experience, which the human brain has accumulated over the course of evolution. For a computer or machine to have a similar capability, it would need the same amount of knowledge and algorithms as capable of processing an image as our visual cortex. An image as we observe it is quite different from how a machine perceives it: to a machine, an image is a matrix of rows and columns of picture elements. An ideal computer vision system would have an algorithm that interprets this image matrix the way our brain does; however, that remains a distant goal, even in this modern age.
A general approach to computer vision is to train machines to understand the fundamental geometry, shapes, and orientation of objects in images. Work in the 1990s relied on hand-crafted descriptions of shape, size, and other fundamental object properties to train machines to understand images and their contents autonomously. Despite significant success, these approaches were far from practical, since even minor variations of an object can produce millions of distinct viewpoints. For example, rotating an object by a single degree can require a completely new feature map for the machine to learn. In the 2000s, machine learning made it possible to train machines on a set of given feature matrices and handle many variations. Even with machine learning, however, manually created feature matrices were insufficient for success on a broad scale.
The advance of deep learning opened a new frontier in training machines for computer vision. Deep learning enables machines to learn representative features automatically, rather than relying on manually generated feature matrices, which broadens the scope of what computer vision systems can learn. As a result, machines are now better trained for various computer vision tasks and sometimes rival humans in specific challenges, such as detecting and segmenting objects within images. Despite this level of success, true mimicry of human-level vision remains an unsolved challenge. The gap is likely due to the amount of data required for training, the computational resources involved, and the need for more advanced algorithms, as well as variability in factors such as lighting, color, angle, and shape.
Computer vision encompasses a wide range of tasks that correspond to different aspects of human-level vision, such as object detection, object segmentation, object tracking, image classification, scene classification, object classification, image captioning, and video captioning, among others. These primary, essential tasks form the foundation of computer vision. In terms of complexity, image captioning is one of the most demanding tasks in computer vision research, since it involves detecting objects, classifying them, generating relationships among them, and then describing the entire image in human language. Over the past six years, many innovative and significant studies have produced state-of-the-art results, generating fairly human-like captions for images. However, they do not focus on the content and context of the story the image tells, as we humans do.
In the Storyteller, we present a more human-like storytelling system that captions images from the perspective of content, context, and knowledge. We aim to provide a working solution for several applications, including generating datasets for training self-driving vehicles, subtitle generation for videos, and suggestive reasoning over MRI images. Our methodology combines capsule networks for image encoding, knowledge graphs for context and content, and transformer neural networks for text generation. Capsule networks extract spatial and orientational details from the images during feature extraction. A knowledge graph serves as the knowledge engine, retrieving content, context, and semantics from the corpus against the features produced by the encoding stage. The decoding phase comprises transformer neural networks fed by the knowledge graph-driven annotation iterator. We use dynamic multi-headed attention in the transformer networks to make the model more tractable in terms of time and memory. Transformer networks handle long-range dependencies well, which is essential for carrying the referential context of a sentence over from the previous sentence.
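To make the encode-retrieve-decode pipeline concrete, the following is a minimal sketch of how such a system could be wired together, assuming PyTorch. ToyCapsuleEncoder, knowledge_graph_lookup, and CaptionDecoder are hypothetical stand-ins: the encoder here does not perform dynamic routing, and the knowledge engine is reduced to a toy triple lookup; only the decoding stage uses a standard multi-head-attention transformer decoder. It illustrates the data flow, not the dissertation's actual implementation.

import torch
import torch.nn as nn


class ToyCapsuleEncoder(nn.Module):
    """Stand-in for the capsule-network encoder: maps an image to a set of
    feature vectors (one per "capsule"); real capsules would use dynamic routing."""
    def __init__(self, d_model=256, num_capsules=16):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, num_capsules * d_model)
        self.num_capsules, self.d_model = num_capsules, d_model

    def forward(self, images):                        # images: (B, 3, 32, 32)
        caps = self.proj(images.flatten(1))
        return caps.view(-1, self.num_capsules, self.d_model)


def knowledge_graph_lookup(detected_labels, triples):
    """Toy knowledge engine: return relation triples touching the detected objects."""
    return [t for t in triples if t[0] in detected_labels or t[2] in detected_labels]


class CaptionDecoder(nn.Module):
    """Transformer decoder with multi-head attention over the encoded image features."""
    def __init__(self, vocab_size=1000, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, memory):             # memory: capsule features
        tgt = self.embed(token_ids)                   # (B, T, d_model)
        T = token_ids.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                       # next-token logits (B, T, vocab)


if __name__ == "__main__":
    triples = [("dog", "chases", "ball"), ("ball", "on", "grass")]  # toy graph
    encoder, decoder = ToyCapsuleEncoder(), CaptionDecoder()
    memory = encoder(torch.randn(2, 3, 32, 32))       # stage 1: encode images
    print(knowledge_graph_lookup({"dog"}, triples))   # stage 2: retrieve context
    tokens = torch.randint(0, 1000, (2, 7))           # caption generated so far
    print(decoder(tokens, memory).shape)              # stage 3: torch.Size([2, 7, 1000])

In the full system, the retrieved triples would condition the decoder (for example, as additional memory tokens) rather than simply being printed; this sketch only shows how the three stages hand data to one another.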
We used the MS COCO and Flickr 8k/30k datasets for training and validation. For testing, MS COCO, Flickr 8k, Flickr 30k, NoCaps, and Conceptual Captions 3M/12M were used. MS COCO and Flickr 8k/30k have been used by state-of-the-art research in image captioning and serve as the benchmark datasets. The generated results carry content and contextual information, with metric scores of B4: 71.93, M: 39.14, C: 136.53, and R: 94.32. The use of adverbs and adjectives within the generated sentences, in line with the objects' geometric and semantic relationships, is phenomenal. The results also reflect an in-depth understanding of positional information within the generated text, owing to the positional-understanding encoding engine. One of the key attributes of the Storyteller's results is dense captioning: a single image is captioned with up to three sentences that demonstrate complete cohesiveness and conciseness. The captioned sentences remain tightly connected to one another and flow naturally.
Recommended Citation
Haque, A. u. (2025). The storyteller: Computer vision driven context and content transcription system for images (Unpublished doctoral dissertation). Institute of Business Administration, Pakistan. Retrieved from https://ir.iba.edu.pk/etd/95
The full text of this document is only accessible to authorized users.
