Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Sajjad Haider, Professor, Department of Computer Science

Keywords

Voice Interview Automation, Multilingual AI, Speech-to-Text, Text-to-Speech, Large Language Models, Urdu NLP, Conversational Agents, LiveKit, Analytics Dashboard

Abstract

Conducting interviews at scale is often time-consuming, resource-intensive, and difficult to manage, particularly when participants are geographically dispersed or communicate in different languages. Existing interview solutions frequently rely on text-based interactions, limited language support, or specialized infrastructure, reducing accessibility and effectiveness across diverse populations. This creates challenges for researchers, educators, organizations, and human resource professionals seeking efficient methods for collecting qualitative and quantitative data. This project presents Vocalytics, a multilingual AI-powered voice interview platform designed to automate interview administration, response collection, and result analysis. The platform enables structured interviews to be conducted through natural voice conversations in both Urdu and English by integrating speech recognition, large language model-based conversational processing, text-to-speech synthesis, and automated data management within a unified web-based system. Interview creators can define questions, customize interviewer behavior, select an interview language, and distribute interviews through shareable links. Respondents complete interviews directly through their web browsers, eliminating the need for telephony infrastructure or manual administration. The platform automatically transcribes responses, stores collected data, and generates customizable analytics dashboards for both open-ended and multiple-choice questions. Functional testing demonstrated successful end-to-end interview execution in both supported languages, confirming the feasibility of the proposed approach.

Tools and Technologies Used

Frontend

  • Next.js (App Router) – v16.2.4
  • TypeScript
  • LiveKit React Components (@livekit/components-react)

Backend

  • Python
  • Flask – v3.1.0
  • Gunicorn (WSGI server) – v23.0.0
  • LiveKit Agents SDK – v1.5.4

AI / Speech Pipeline

  • Whisper Large V3 – Speech-to-Text (Urdu/English)
  • Llama 3.3 70B Versatile – Language Model (via livekit-plugins-groq)
  • Azure Cognitive Services – Text-to-Speech (ur-PK-UzmaNeural, en-US-AriaNeural)
  • Silero VAD – Voice Activity Detection

Database & Auth

  • Supabase (PostgreSQL) – v2.10.0 Python client
  • Supabase Auth – Email/password + JWT
  • Row-Level Security (RLS)

Real-Time Communication

  • LiveKit Cloud (Germany 2 region)

Deployment/Infrastructure

  • Vercel – Frontend hosting
  • Koyeb – Backend/agent hosting (Frankfurt, free tier)
  • GitHub – Version control

Other

  • Groq (LPU inference engine, used to serve Llama 3.3)

Methodology

Vocalytics was developed using an iterative, component-integration-based approach, combining existing speech, language, and cloud infrastructure technologies into a unified voice interview pipeline rather than building AI models from scratch. The development followed these key stages:

1. System Architecture Design: A three-tier architecture was adopted, separating the system into a frontend web application (Next.js), a backend API and conversational logic layer (Flask), and a database layer (Supabase/PostgreSQL). When a respondent opens an interview link, a session is created and a real-time voice channel is established via LiveKit, which then manages the conversation flow.

2. Conversational Agent Design (Perceive–Reason–Act Pipeline): The core AI interviewer was built around a three-stage pipeline:

  • Perceive (STT): Whisper Large V3 converts spoken Urdu/English into text in real time.
  • Reason (LLM): Llama 3.3 70B Versatile (served via Groq) processes the transcribed input, manages interview flow, interprets responses, and generates contextually relevant follow-up dialogue based on a dynamically generated system prompt (containing interview questions, creator instructions, and conversation history).
  • Act (TTS): Azure Cognitive Services converts the LLM's output into natural speech using language-specific neural voices (ur-PK-UzmaNeural / en-US-AriaNeural).
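
The three stages above can be sketched as a single conversational turn. This is a minimal illustration under stated assumptions, not the project's code: `InterviewContext`, `build_system_prompt`, and `run_turn` are hypothetical names, the prompt wording is invented for the example, and the `stt`, `llm`, and `tts` arguments are plain callables standing in for Whisper, Llama 3.3 (via Groq), and Azure TTS respectively.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class InterviewContext:
    """Hypothetical container for the inputs of the system prompt."""
    questions: List[str]
    creator_instructions: str
    language: str  # e.g. "ur" or "en"
    history: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)


def build_system_prompt(ctx: InterviewContext) -> str:
    """Assemble a prompt from questions, creator instructions, and history."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(ctx.questions, start=1))
    transcript = "\n".join(f"{who}: {text}" for who, text in ctx.history) or "(none yet)"
    return (
        f"You are a voice interviewer. Respond in language: {ctx.language}.\n"
        f"Creator instructions: {ctx.creator_instructions}\n"
        f"Questions:\n{numbered}\n"
        f"Conversation so far:\n{transcript}"
    )


def run_turn(
    audio: bytes,
    ctx: InterviewContext,
    stt: Callable[[bytes], str],
    llm: Callable[[str, str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """One Perceive-Reason-Act turn using injected stand-in services."""
    text = stt(audio)                            # Perceive: speech to text
    ctx.history.append(("respondent", text))
    reply = llm(build_system_prompt(ctx), text)  # Reason: prompt is rebuilt each turn
    ctx.history.append(("interviewer", reply))
    return tts(reply)                            # Act: text to speech
```

Passing the three services in as callables keeps the sketch independent of any vendor SDK; in the deployed system these roles are filled by the LiveKit Agents plugins listed under Tools and Technologies Used.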

3. Interview State Management: A tool-based function-calling mechanism was implemented so the LLM explicitly invokes a function after each valid response, which stores the answer and advances the interview to the next question. This was chosen over transcript-parsing approaches to improve reliability and reduce desynchronization between conversation state and stored data. Silero VAD was integrated to detect speech boundaries and improve conversational turn-taking, with synchronization controls added to prevent duplicate submissions.
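
The idea of a tool call that both stores an answer and advances the interview can be illustrated with a small state object. This is a sketch under assumptions: the class `InterviewState`, the method `record_answer`, and the returned status strings are hypothetical, and the duplicate-submission guard shown here (a lock plus an index check) is one plausible form of the synchronization controls mentioned above, not necessarily the one used in the project.

```python
import threading
from typing import Dict, List, Optional


class InterviewState:
    """Tracks the current question and stored answers for one session."""

    def __init__(self, questions: List[str]) -> None:
        self._questions = questions
        self._index = 0
        self._answers: Dict[int, str] = {}
        self._lock = threading.Lock()

    def current_question(self) -> Optional[str]:
        """Return the pending question, or None once the interview is done."""
        with self._lock:
            if self._index < len(self._questions):
                return self._questions[self._index]
            return None

    def record_answer(self, question_index: int, answer: str) -> str:
        """Function the LLM would invoke as a tool after a valid response."""
        with self._lock:
            # Reject stale or repeated calls so an answer is stored only once.
            if question_index != self._index:
                return "ignored: not the current question"
            self._answers[question_index] = answer
            self._index += 1
            if self._index >= len(self._questions):
                return "interview complete"
            return f"next question: {self._questions[self._index]}"

    def answers(self) -> Dict[int, str]:
        """Return a copy of the stored answers keyed by question index."""
        with self._lock:
            return dict(self._answers)
```

Because the stored data changes only inside `record_answer`, the conversation state and the saved answers cannot drift apart the way they might if answers were recovered by parsing a transcript afterwards.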

4. Interview Configuration Framework: A configurable design layer was built to let creators define interview title, description, language, interviewer instructions/tone, and an ordered question set (open-ended and multiple-choice), without needing to modify backend code.
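
A configuration of this kind could be represented as plain data and validated before an interview is published. The sketch below is illustrative only: the field names, the language codes "ur" and "en", the question kinds "open" and "mcq", and the validation rules are assumptions rather than the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

SUPPORTED_LANGUAGES = {"ur", "en"}   # assumed codes for Urdu and English
QUESTION_KINDS = {"open", "mcq"}     # open-ended and multiple-choice


@dataclass
class Question:
    text: str
    kind: str = "open"
    options: List[str] = field(default_factory=list)  # used only for "mcq"


@dataclass
class InterviewConfig:
    title: str
    description: str
    language: str
    instructions: str          # interviewer instructions / tone
    questions: List[Question]  # list order is the asking order


def validate(config: InterviewConfig) -> List[str]:
    """Return human-readable problems; an empty list means the config is usable."""
    errors: List[str] = []
    if not config.title.strip():
        errors.append("title is required")
    if config.language not in SUPPORTED_LANGUAGES:
        errors.append(f"unsupported language: {config.language}")
    if not config.questions:
        errors.append("at least one question is required")
    for i, q in enumerate(config.questions, start=1):
        if q.kind not in QUESTION_KINDS:
            errors.append(f"question {i}: unknown kind {q.kind}")
        elif q.kind == "mcq" and len(q.options) < 2:
            errors.append(f"question {i}: multiple-choice needs at least two options")
    return errors
```

Keeping the configuration as data is what allows creators to change questions or tone without touching backend code.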

5. Data Modeling: A relational schema was designed in Supabase (PostgreSQL) with four core tables: profiles, interviews, interview_sessions, and responses. Row-Level Security (RLS) was applied to creator-facing tables, while session/response tables are managed server-side using a service role key to bypass RLS during interview execution.
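
The relationships between the four tables can be illustrated with a small in-memory database. This sketch makes several assumptions: it uses SQLite from the Python standard library as a stand-in for PostgreSQL, every column name is hypothetical (only the table names come from the project), and the ownership filter in `interviews_for` merely imitates at the application level the kind of restriction an RLS policy would enforce inside the database.

```python
import sqlite3
from typing import List

# Hypothetical columns; only the four table names are taken from the project.
SCHEMA = """
CREATE TABLE profiles (
    id    TEXT PRIMARY KEY,
    email TEXT NOT NULL
);
CREATE TABLE interviews (
    id         INTEGER PRIMARY KEY,
    creator_id TEXT NOT NULL REFERENCES profiles(id),
    title      TEXT NOT NULL,
    language   TEXT NOT NULL
);
CREATE TABLE interview_sessions (
    id           INTEGER PRIMARY KEY,
    interview_id INTEGER NOT NULL REFERENCES interviews(id),
    status       TEXT NOT NULL DEFAULT 'in_progress'
);
CREATE TABLE responses (
    id             INTEGER PRIMARY KEY,
    session_id     INTEGER NOT NULL REFERENCES interview_sessions(id),
    question_index INTEGER NOT NULL,
    answer         TEXT NOT NULL,
    UNIQUE (session_id, question_index)  -- one stored answer per question
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def interviews_for(conn: sqlite3.Connection, creator_id: str) -> List[str]:
    """Titles owned by one creator (application-level stand-in for RLS)."""
    rows = conn.execute(
        "SELECT title FROM interviews WHERE creator_id = ? ORDER BY id",
        (creator_id,),
    )
    return [row[0] for row in rows]
```

In the deployed system the equivalent filtering happens in PostgreSQL itself, so a creator-facing query cannot return another creator's rows even if the application forgets the filter.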

6. Analytics Dashboard Development: An automated visualization layer was built to process stored responses immediately after interview completion. It generates word clouds, frequency charts, and tables for open-ended questions, and bar, pie, and doughnut charts plus frequency tables for multiple-choice questions, with theme customization support.
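
The aggregation behind such charts can be sketched with the standard library alone. The example below is an assumption-laden illustration: the function names, the tiny English stopword list, and the regex tokenizer are hypothetical, and a crude `\w+` tokenizer is unlikely to be adequate for Urdu text, which would probably need a language-aware tokenizer and stopword list.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Tiny illustrative stopword list; a real one would be per language.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "to", "of", "i"}


def word_frequencies(answers: Iterable[str], top_n: int = 10) -> List[Tuple[str, int]]:
    """Count words across open-ended answers (input for a word cloud)."""
    counts: Counter = Counter()
    for answer in answers:
        for word in re.findall(r"\w+", answer.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


def mcq_distribution(answers: Iterable[str], options: List[str]) -> Dict[str, Dict[str, float]]:
    """Count and percentage per option (input for bar, pie, or doughnut charts)."""
    counts = Counter(answers)
    total = sum(counts[option] for option in options)
    return {
        option: {
            "count": counts[option],
            "percent": round(100 * counts[option] / total, 1) if total else 0.0,
        }
        for option in options
    }
```

Iterating over the configured `options` rather than the observed answers keeps options that nobody chose in the output with a count of zero, so the chart still shows every choice.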

7. Deployment and Testing: The system was deployed using a distributed cloud setup (Vercel for frontend, Koyeb for backend/agent, Supabase for database/auth, LiveKit Cloud for real-time voice) to simulate a realistic production environment. Development and local testing were conducted on a Windows 11 machine.

8. Functional Evaluation: Rather than formal user studies, the system was validated through a set of functional test scenarios covering the full interview lifecycle: interview creation/configuration, English and Urdu interview execution, response collection (both open-ended and MCQ), automated storage/retrieval, and dashboard generation. Together, these scenarios verified correct end-to-end integration across all major features.

Document Type

Restricted Access

Submission Type

BSCS Final Year Project

Creative Commons License

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License