Degree
Bachelor of Science (Computer Science)
Department
Department of Computer Science
School
School of Mathematics and Computer Science (SMCS)
Advisor
Dr. Muhammad Saeed, Visiting Faculty, Department of Computer Science
Co-Advisor
Umair Nazir, Manager at Securiti.ai
Keywords
Multimodal AI, Video Intelligence, Retrieval-Augmented Generation, Speech Recognition, Natural Language Processing, Machine Learning, Content Analysis
Abstract
VidSense is a multimodal AI platform that leverages Retrieval-Augmented Generation (RAG), Large Language Models (LLMs), and advanced Natural Language Processing techniques to extract intelligent insights from video content. The platform addresses the growing need for automated video content analysis at a time when video consumption is rising rapidly across educational, corporate, and entertainment domains. The project's core innovation lies in its integration of multiple AI technologies: OpenAI Whisper for robust speech-to-text transcription, custom sentence transformers for semantic understanding, and a RAG architecture with vector embeddings for context-aware video question answering. The system supports both YouTube video URLs and local file uploads, making it versatile across use cases. Key capabilities include automated video summarization, extraction of key moments with timestamps, creation of highlight reels for social media, AI-powered podcast generation from video content, intelligent dubbing and subtitling, and meeting minutes generation. The platform advances multimodal AI by seamlessly bridging audio, visual, and textual content understanding. Experimental validation shows that the system processes videos in multiple languages, maintains temporal coherence in generated content, and provides contextually relevant responses to user queries. The RAG implementation enables precise, timestamp-accurate information retrieval, while the modular architecture ensures scalability and maintainability. Results indicate superior performance on content extraction tasks compared to traditional video analysis tools, with processing times optimized through a hybrid of algorithmic and LLM-based approaches.
Tools and Technologies Used
Programming Languages:
- Python 3.9+
AI/ML Frameworks:
- OpenAI Whisper Large-v3-Turbo (speech recognition)
- Sentence Transformers (semantic embeddings)
- Ollama with DeepSeek-R1 7B (local LLM processing)
- HuggingFace Transformers
- Chroma Vector Database
- PyTorch
- CUDA Toolkit 11.8+
Web Development:
- FastAPI (REST API framework)
- Uvicorn (ASGI server)
- Pydantic (data validation)
- asyncio (asynchronous processing; Python standard library)
Media Processing:
- FFmpeg (video/audio manipulation)
- MoviePy 2.0+ (video editing)
- yt-dlp (YouTube downloading)
Text-to-Speech:
- Google Text-to-Speech (gTTS)
- Edge TTS
- Multi-engine TTS integration
Data Processing:
- NumPy (numerical computations)
- Pandas (data manipulation)
- Scikit-learn (machine learning utilities)
- NLTK/spaCy (natural language processing)
Development Tools:
- Git (version control)
Methodology
The VidSense project employed a software engineering methodology combining agile development practices with rigorous AI system design principles.
System Architecture Approach: The development followed a modular microservices architecture using FastAPI, enabling independent scaling and optimization of individual components. The core design was event-driven, with a central orchestration engine coordinating specialized AI components through asynchronous processing.
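As a minimal sketch of this pattern (the route, request model, and process_video coroutine are hypothetical names, not VidSense's actual API), a FastAPI service can expose an asynchronous endpoint that delegates to the orchestration layer:

    # Sketch of the modular, asynchronous FastAPI pattern described above.
    # Route names, models, and the processing coroutine are assumptions.
    import asyncio
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI(title="VidSense-style orchestration service")

    class VideoRequest(BaseModel):
        url: str              # YouTube URL or local file path
        task: str = "summary"

    async def process_video(url: str, task: str) -> dict:
        # Placeholder for the orchestration engine: the real system would
        # dispatch to transcription, RAG, or summarization components here.
        await asyncio.sleep(0)  # yield control; keeps the endpoint non-blocking
        return {"url": url, "task": task, "status": "queued"}

    @app.post("/analyze")
    async def analyze(req: VideoRequest):
        # Each feature sits behind its own component so it can be scaled
        # and optimized independently, per the microservices design.
        return await process_video(req.url, req.task)

Such a service would be served with Uvicorn (for example, uvicorn app_module:app), matching the ASGI stack listed under Tools and Technologies.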
AI Integration Methodology: A hybrid processing approach combined algorithmic methods with Large Language Model (LLM) processing to optimize both speed and quality. The system employed a cascading transcription strategy, prioritizing YouTube's native transcript API for accuracy before falling back to OpenAI Whisper for local processing.
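A sketch of that cascade, assuming the youtube-transcript-api package for the fast path and the openai-whisper package for the fallback (the exact libraries and model name are assumptions, not confirmed implementation details):

    # Cascading transcription: try YouTube's native transcript first,
    # fall back to local Whisper transcription if none is available.
    import whisper
    from youtube_transcript_api import YouTubeTranscriptApi

    _whisper_model = None  # loaded lazily so the fallback cost is paid once

    def transcribe(video_id: str, audio_path: str) -> list[dict]:
        try:
            # Fast path: YouTube's own transcript, when one exists.
            return YouTubeTranscriptApi.get_transcript(video_id)
        except Exception:
            # Fallback: local Whisper transcription of the downloaded audio.
            global _whisper_model
            if _whisper_model is None:
                _whisper_model = whisper.load_model("large-v3-turbo")
            result = _whisper_model.transcribe(audio_path)
            return [{"text": s["text"], "start": s["start"],
                     "duration": s["end"] - s["start"]}
                    for s in result["segments"]]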
Retrieval-Augmented Generation (RAG) Implementation: The RAG framework was designed with multi-tier semantic matching, utilizing sentence transformers to create dense vector representations of transcript segments. These were stored in optimized vector indexes (Chroma database) for rapid semantic retrieval during query processing, addressing hallucination problems common in pure LLM approaches.
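A minimal sketch of that indexing-and-retrieval tier, assuming the chromadb and sentence-transformers packages (the embedding model and collection name are illustrative):

    # Embed transcript segments, store them in Chroma, and retrieve by
    # semantic similarity at query time.
    import chromadb
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.Client()
    segments_db = client.create_collection("transcript_segments")

    def index_segments(segments: list[dict]) -> None:
        # Each segment keeps its timestamp so answers can cite exact moments.
        texts = [s["text"] for s in segments]
        segments_db.add(
            ids=[str(i) for i in range(len(segments))],
            documents=texts,
            embeddings=encoder.encode(texts).tolist(),
            metadatas=[{"start": s["start"]} for s in segments],
        )

    def retrieve(query: str, k: int = 5) -> dict:
        # Dense retrieval grounds the LLM's answer in actual transcript
        # text, which is what mitigates hallucination during generation.
        return segments_db.query(
            query_embeddings=encoder.encode([query]).tolist(), n_results=k
        )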
Development and Testing Framework: The project used a testing methodology spanning unit testing, integration testing, and performance benchmarking. Evaluation datasets comprised 120+ videos across educational, corporate, and entertainment domains in multiple languages, with ground truth established through manual annotation by domain experts and professional transcription services.
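A representative (hypothetical) unit test in this style; chunk_transcript is an assumed helper, not a documented VidSense function:

    # Unit-test sketch: verify that transcript chunking preserves the
    # start timestamp of the first segment in each merged chunk.
    def chunk_transcript(segments, max_chars=200):
        chunks, buf, start = [], "", None
        for seg in segments:
            if start is None:
                start = seg["start"]
            if buf and len(buf) + len(seg["text"]) > max_chars:
                chunks.append({"text": buf, "start": start})
                buf, start = "", seg["start"]
            buf = (buf + " " + seg["text"]).strip()
        if buf:
            chunks.append({"text": buf, "start": start})
        return chunks

    def test_chunking_preserves_first_timestamp():
        segments = [{"text": "hello", "start": 0.0},
                    {"text": "world", "start": 1.5}]
        chunks = chunk_transcript(segments, max_chars=200)
        assert chunks[0]["start"] == 0.0
        assert "hello" in chunks[0]["text"]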
Performance Optimization Strategy: Caching strategies included MD5-based response caching, model singleton patterns, and intelligent memory management. The system employed parallel processing through ThreadPoolExecutor and asyncio integration for non-blocking operations, achieving faster-than-real-time processing (total processing time below the duration of the input video).
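A condensed sketch of these patterns; the cache-key scheme and function names are illustrative assumptions:

    # MD5-keyed response cache plus thread-pool offload so the asyncio
    # event loop stays responsive during heavy processing.
    import asyncio
    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    _executor = ThreadPoolExecutor(max_workers=4)
    _response_cache: dict[str, str] = {}

    def cache_key(video_id: str, task: str, params: str = "") -> str:
        # MD5 over the request fingerprint gives a compact, stable key.
        return hashlib.md5(f"{video_id}:{task}:{params}".encode()).hexdigest()

    async def cached_call(video_id: str, task: str, blocking_fn, *args):
        key = cache_key(video_id, task)
        if key in _response_cache:
            return _response_cache[key]      # cache hit: skip recomputation
        loop = asyncio.get_running_loop()
        # Offload blocking work to the pool (non-blocking for the loop).
        result = await loop.run_in_executor(_executor, blocking_fn, *args)
        _response_cache[key] = result
        return result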
Quality Assurance Methodology: Validation covered accuracy metrics (Word Error Rate against reference transcripts), efficiency metrics (processing time per video minute), and quality metrics (semantic similarity between generated and human-created content). Performance remained consistent across five languages, with 85-94% feature parity.
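The two headline metrics can be computed as sketched below; the pure-Python WER implementation and embedding model choice are illustrative:

    # Word Error Rate via word-level edit distance, and embedding cosine
    # similarity as a proxy for generated-content quality.
    from sentence_transformers import SentenceTransformer, util

    def wer(reference: str, hypothesis: str) -> float:
        r, h = reference.split(), hypothesis.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(r)][len(h)] / max(len(r), 1)

    def semantic_similarity(generated: str, human: str) -> float:
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
        emb = model.encode([generated, human])
        return float(util.cos_sim(emb[0], emb[1]))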
Document Type
Restricted Access
Submission Type
BSCS Final Year Project
Recommended Citation
Taha, M., Arif, M. M., Wasay, M., Iqbal, A., & Ali, S. B. (2025). VidSense. Retrieved from https://ir.iba.edu.pk/fyp-bscs/15