Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Imran Khan, Assistant Professor, Institute of Business Administration, Karachi

Keywords

HealthTech, AI Agent, Automation, Chatbot, Voice Agent

Abstract

Modern healthcare systems face significant operational inefficiencies, with professionals spending substantial time on manual documentation and administrative coordination rather than direct patient care. This paper proposes an AI agent-orchestrated digital shadow framework for healthcare workflow automation, designed to streamline routine clinical operations while maintaining reliability and transparency. The framework integrates multi-agent orchestration with a Digital Shadow, a real-time digital representation of administrative workflows that can observe, coordinate, and automate routine tasks. At its core is a modular orchestration layer where a centralized agent functions as an intent classifier and workflow coordinator, dynamically routing requests to specialized sub-agents: a triage agent conducting structured patient interviews and generating History of Present Illness (HPI) notes, an appointment booking agent managing scheduling through real-time database and API integration, and a general query agent utilizing Retrieval-Augmented Generation (RAG) against a verified medical knowledge base. To ensure long-term robustness, the framework incorporates a reflective monitoring component that continuously analyzes interaction patterns to detect model drift and workflow inefficiencies, enabling incremental improvement without disrupting live operations. To evaluate the framework’s core architectural claims, we conducted an offline benchmark on a dataset of 50 expert-annotated bilingual queries. Results demonstrate that the multi-agent pipeline yields substantially improved intent classification on Urdu and code-mixed inputs.

Tools and Technologies Used

Python, LangChain, LangGraph, FastAPI, Supabase, React/Next.js, LLM

Methodology

A. Framework Overview

The framework instantiates the Digital Shadow as a coordinated multi-agent system with two principal components: (1) the Agent Orchestration Layer executing all live interactions, and (2) the Reflective Monitoring Component operating as a passive background observer that logs metrics, flags low-confidence interactions, and surfaces unrecognized local vocabulary to an administrator dashboard for future model updates. No live fine-tuning or closed-loop retraining is performed in the current implementation. Figure 1 illustrates the overall design.

B. Local Language Dictionary (RAG Normalization)

A persistent challenge in Pakistani clinical NLP is the gap between local patient vocabulary and the standard medical terminology expected by downstream models. Patients routinely describe symptoms using dialectal Urdu phrases (e.g., “pet mein dard” for abdominal pain, “aankhein jal rahi hain” for eye irritation) or informal English equivalents that standard LLMs either misclassify or fail to extract as structured symptoms.

To address this, the framework incorporates a RAG-based local language dictionary as a preprocessing stage before the orchestrator interprets the user’s request. Each incoming utterance is matched against a curated index of dialectal expressions, regional slang, and common code-switched phrases, each mapped to a canonical SNOMED-aligned clinical term. If a match is found above a confidence threshold, the normalized term replaces the dialectal expression in the text forwarded to the leader agent.

When expressions are unmatched, a secondary high-capacity LLM evaluates the surrounding conversational context to infer a candidate clinical meaning for the unknown term. However, to maintain strict clinical safety, these inferred mappings are not immediately deployed. Instead, they are placed in a human-in-the-loop verification queue on the administrator dashboard.
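The match-or-queue behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the standard-library `difflib.SequenceMatcher` stands in for the RAG vector index, the canonical labels are plain-English placeholders rather than SNOMED codes, the threshold value is invented, and the secondary LLM inference step is omitted (unmatched utterances are simply queued). All function and variable names are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical dictionary: the two phrases appear in the text above; the
# canonical labels are placeholders, not real SNOMED-aligned identifiers.
LOCAL_DICTIONARY = {
    "pet mein dard": "abdominal pain",
    "aankhein jal rahi hain": "eye irritation",
}

MATCH_THRESHOLD = 0.85  # assumed value; the source does not state the threshold


def normalize(utterance, review_queue):
    """Replace the best-matching dialectal phrase, or queue the utterance."""
    tokens = utterance.lower().split()
    best_score, best_span, best_term = 0.0, None, None

    for phrase, canonical in LOCAL_DICTIONARY.items():
        width = len(phrase.split())
        # Slide a window of the phrase's word count across the utterance.
        for start in range(0, max(len(tokens) - width, 0) + 1):
            window = " ".join(tokens[start:start + width])
            score = SequenceMatcher(None, window, phrase).ratio()
            if score > best_score:
                best_score = score
                best_span = (start, start + width)
                best_term = canonical

    if best_span is not None and best_score >= MATCH_THRESHOLD:
        start, end = best_span
        return " ".join(tokens[:start] + [best_term] + tokens[end:])

    # No confident match: hand the raw utterance to human reviewers.
    review_queue.append(utterance)
    return utterance
```

With these assumptions, `normalize("mujhe pet mein dard hai", queue)` yields `"mujhe abdominal pain hai"`, while an utterance containing only unknown vocabulary is returned unchanged and added to `queue` for review.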
Once a clinician or domain expert validates the LLM’s inference, the new mapping is appended to the RAG vector store. This design enables scalable, AI-assisted vocabulary expansion while preserving human oversight, simultaneously reducing the cognitive load on the orchestration model and improving downstream routing accuracy.

C. Agent Orchestration and Tool Integration

A core design principle of the framework is minimizing redundant large language model (LLM) reasoning cycles to reduce latency. Rather than delegating deterministic database operations to independent sub-agents, the system utilizes a streamlined “Orchestrator-and-Tools” architecture, reserving autonomous sub-agents exclusively for complex, multi-turn clinical reasoning tasks.

1) Leader Agent (Bilingual Orchestrator): The leader agent serves as the sole patient-facing interface, accepting normalized input in Urdu, English, or code-switched text. It operates as the central cognitive hub, utilizing native LLM tool-calling integrated with the Model Context Protocol (MCP). By evaluating the full conversational context, the model dynamically selects the appropriate tool to execute user requests. This design eliminates the brittleness of hardcoded routing rules and allows the orchestrator to gracefully handle ambiguous or multi-intent utterances by composing tool calls in sequence.

2) Integrated MCP Tools (Recommendation & Booking): Instead of deploying separate agents for scheduling and recommendations, these capabilities are exposed directly to the Leader Agent as typed MCP tools. When a patient requests a specialist, the Orchestrator triggers a RAG tool against the clinic’s knowledge base to retrieve and rank physician profiles. When a patient requests an appointment, the Orchestrator triggers an SQL tool to query live availability, allocate urgency-prioritized slots, and confirm the booking.
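As a rough illustration of exposing booking and recommendation as directly callable tools, the sketch below uses a plain Python registry. It does not use any MCP library; the tool names, argument names, and return strings are all hypothetical placeholders, and the list of `ToolCall` objects stands in for whatever the orchestrator model would emit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolCall:
    """One tool invocation chosen by the orchestrator (hypothetical shape)."""
    name: str
    arguments: dict


def recommend_physician(specialty: str) -> str:
    # Placeholder for the RAG lookup over physician profiles.
    return f"top-ranked {specialty} profile"


def book_appointment(physician: str, urgency: str) -> str:
    # Placeholder for the SQL availability query and slot allocation.
    return f"booked {physician} ({urgency} priority)"


# Registry mapping tool names to plain functions: no sub-agent hand-off.
TOOLS: Dict[str, Callable[..., str]] = {
    "recommend_physician": recommend_physician,
    "book_appointment": book_appointment,
}


def execute(calls: List[ToolCall]) -> List[str]:
    """Run the selected tool calls in order and collect their results."""
    results = []
    for call in calls:
        tool = TOOLS[call.name]
        results.append(tool(**call.arguments))
    return results
```

A multi-intent request would then be a two-element list, for example a `recommend_physician` call followed by a `book_appointment` call, executed in a single pass without an additional agent turn between them.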
Treating these operations as direct tool calls rather than agent hand-offs significantly reduces the system’s Total Turn-Around Time (TTAT) and token overhead.

3) Symptom Elicitation Sub-Agent: Unlike scheduling operations, clinical triage requires a stateful, multi-turn interview loop guided by strict medical safety guardrails. When the Orchestrator detects a symptom-reporting intent, it temporarily transfers conversational control to this specialized sub-agent. The Elicitation Agent administers a condition-specific branching questionnaire, dynamically adapting subsequent questions based on prior answers to ensure clinically relevant follow-up. Once the interview is complete, extracted information is mapped to structured HPI fields (Table I), persisted to the database, and control is returned to the Orchestrator.

TABLE I
EXAMPLE HPI DATA STRUCTURE

Field | Extracted Value
chief complaint | Chest pain ( )
onset | 3 days ago
character | Sharp, stabbing
severity | 7/10
modifying factors | Worse with deep breathing
associated symptoms | Shortness of breath
language of interaction | Urdu/English

4) Diagnostic Summarization Model: The structured HPI output is passed to a dedicated diagnostic summarization model that generates a pre-consultation clinical report organizing the patient’s symptom profile, flagging clinical concerns, and suggesting examination priorities. This model runs asynchronously and its output is immediately available on the physician dashboard before the patient enters the consultation room.

Fig. 1. AI Agent-Orchestrated Digital Shadow Framework. The bilingual leader agent coordinates four specialized sub-agents via standardized interfaces, with MCP-connected live database access and background reflective monitoring.

D. Model Selection: On-Premise Deployment Rationale

The framework deliberately employs a localized pipeline of open-weights models with an aggregate system footprint of 40B to 50B parameters (encompassing the orchestrator LLM, the ASR model, and RAG embeddings), rather than relying on cloud-hosted proprietary APIs. This is a deliberate design decision driven by the extreme sensitivity of clinical data. Patient records, symptom histories, and appointment details constitute protected health information; routing this data through third-party cloud inference endpoints introduces unacceptable data governance risks for healthcare providers [12].

An aggregate 40B–50B parameter budget represents the practical sweet spot for localized deployment. By utilizing 4-bit quantization, the entire multi-model pipeline can be comfortably self-hosted on entry-level, commodity clinic hardware (e.g., a local server with dual 24 GB VRAM GPUs). This ensures that the orchestrator retains the reasoning capacity required for dynamic tool selection and structured JSON generation, while guaranteeing absolute data sovereignty: no patient data ever leaves the local network. Furthermore, this localized architecture makes the framework financially viable for resource-constrained healthcare settings in South Asia, where recurring cloud API costs and data privacy compliance are significant barriers to AI adoption.

E. Evaluation Setup

Prior work demonstrates that multi-agent LLM systems coordinating specialized agents outperform monolithic approaches on complex, multi-step tasks, particularly where subtasks benefit from role specialization and focused system prompts [13], [14]. However, this advantage is task-dependent: recent benchmarks including MedAgentBoard confirm that for deterministic or retrieval-bound operations, a single agent with a well-structured prompt can match multi-agent performance [15].
Our experimental design is structured to surface exactly this distinction across three task categories. We conducted an offline evaluation on a dataset of 50 spoken patient queries recorded to reflect realistic clinical intake scenarios: 15 English-only, 15 Urdu-only, and 20 code-mixed utterances. Each entry includes the gold-standard tool selection label and gold-standard extracted symptom fields.

Both conditions use the same quantized open-weights model to isolate architectural differences rather than model capacity. Both are granted identical access to the same MCP tool-calling capabilities, RAG-based local language dictionary, and underlying databases. The monolithic baseline receives the raw audio and processes transcription, language normalization, symptom extraction, clinical judgment, and structured output formatting entirely within a single model pass. The multi-agent pipeline distributes these same tools across specialized stages: Whisper first transcribes the audio, the Orchestrator handles tool-bound requests via typed MCP calls, and conversational control transfers to the Symptom Elicitation Sub-Agent for clinical triage, whose structured HPI output feeds the Diagnostic Summarization Model asynchronously.

To ensure a fair and scientifically valid evaluation, the transcription and clinical reasoning categories are tested in two separate phases. The transcription phase feeds raw audio to both conditions. The clinical reasoning phase provides both conditions with identical gold-standard text transcripts as input, completely removing transcription quality as a variable. Any performance gap in Category 3 therefore reflects purely architectural differences in reasoning, not differences in what the model heard.

F. Multi-Agent Pipeline vs. Monolithic Baseline

We evaluate both conditions across three task categories, each tested on English and Urdu/code-mixed queries separately.

Category 1: Tool-bound tasks (booking & recommendation). Both conditions have identical MCP tool access. The Orchestrator calls run_raw_sql for availability and a RAG tool for physician matching regardless of architecture. As shown in Table II, performance is near-identical across both language groups. This is an expected result, as tool-calling accuracy depends on the model’s native function-calling ability rather than orchestration topology.

Category 2: Transcription (ASR, audio input). In this phase, raw audio is fed to both conditions. The monolithic baseline handles transcription within the same model pass as all downstream tasks, receiving raw audio and producing a transcript, normalized clinical terms, and structured output simultaneously. The multi-agent pipeline delegates transcription to Whisper, a model specifically trained on multilingual speech, before any clinical reasoning begins. This specialization is particularly consequential on Urdu and code-mixed utterances, where a general-purpose LLM’s in-context transcription degrades significantly compared to a dedicated ASR model. Pakistani-accented English also introduces measurable degradation in the monolithic condition.

Category 3: Clinical reasoning (text input, reasoning isolated). To isolate architectural reasoning ability from transcription noise, both conditions in this category receive identical gold-standard text transcripts as input. Any performance gap here therefore reflects purely the difference between monolithic and decomposed prompting strategies. The monolithic baseline must conduct a multi-turn symptom interview and produce a structured HPI summary in a single context window under one composite prompt. The multi-agent pipeline decomposes this: the Elicitation Sub-Agent operates under strict clinical guardrails with a focused single-purpose prompt, and its structured output is passed to a dedicated Summarization Model. This decomposition eliminates context interference. The performance gap is substantial, particularly on Urdu input, where the monolith simultaneously manages code-switching, symptom extraction, and JSON formatting, a cognitive load that causes field omissions and symptom conflation.

TABLE II
MULTI-AGENT VS. MONOLITHIC BASELINE BY TASK CATEGORY (OFFLINE EVALUATION, n=50)

Metric | Multi-Agent | Monolithic
Category 1: Tool-bound tasks (audio input)
Tool Sel. Accuracy (English) | 96.4% | 96.4%
Tool Sel. Accuracy (Urdu) | 93.8% | 93.8%
DB Query Success Rate | 98.1% | 98.1%
Doc. Rec. Relevance@3 | 87.3% | 87.3%
Category 2: Transcription accuracy (audio input)
Transcription Acc. (English) | 93.2% | 76.8%
Transcription Acc. (Urdu) | 90.0% | 54.4%
Transcription Acc. (Mixed) | 85.0% | 52.1%
Category 3: Clinical reasoning (gold-standard text input)
HPI Completeness (English) | 94.2% | 76.5%
HPI Completeness (Urdu) | 91.8% | 61.2%
HPI Clinical Accuracy | 88.6% | 64.1%

G. Impact of Local Language Normalization (Ablation Study)

To validate the necessity of the RAG-based local language dictionary, we conducted an ablation study on the 35 non-English queries under two conditions: (1) Raw Input, bypassing the normalization layer entirely, and (2) Normalized Input, using the RAG dictionary to map dialectal expressions to standard clinical vocabulary before routing.

As shown in Table III, removing normalization causes severe degradation. Without it, the Orchestrator frequently misroutes code-switched symptom descriptions as general information queries. HPI completeness drops by 22.8 points, as the extraction stage cannot map colloquial terms (e.g., “pet mein dard,” “sir bhaari”) to standard JSON fields. The safety override rate nearly triples, since low-confidence routing on unrecognized vocabulary defaults conservatively to human referral. This confirms that the multi-agent architecture alone is insufficient for Pakistani clinical deployments; the vocabulary grounding layer is a necessary prerequisite.
TABLE III
ABLATION: RAG NORMALIZATION ON URDU AND CODE-MIXED QUERIES (n=35)

Metric | With Norm. | Without Norm.
Tool Sel. Accuracy (Urdu) | 93.8% | 88.2%
Tool Sel. Accuracy (Mixed) | 91.7% | 88.5%
HPI Field Completeness | 92.4% | 87.6%

H. Clinical Safety and Architectural Guardrails

Clinical safety is managed through system-level architectural boundaries rather than relying on model capacity alone. The framework is designed strictly as an administrative and triage assistant, not a diagnostic replacement. The Symptom Elicitation Sub-Agent is hard-prompted to never offer medical advice, diagnoses, or treatment plans to the patient. Its sole function is structured data collection. The Diagnostic Summarization Model does perform preliminary differential synthesis, matching extracted symptoms against known disease profiles, but this output is routed exclusively to the physician’s dashboard as a pre-consultation aid. A licensed physician reviews and verifies all AI-generated summaries prior to the consultation, enforcing a mandatory human-in-the-loop (HITL) checkpoint at every clinical decision point.

Hallucination risk is mitigated structurally rather than measured as a discrete metric, as reliable hallucination detection in clinical text requires expert annotation beyond the scope of this evaluation. Both the patient-facing elicitation agent and the backend summarization model are constrained by the verified RAG knowledge base, preventing generation of unsupported clinical correlations. Prior work demonstrates that decomposing complex clinical queries into single-purpose prompts reduces the context interference that induces hallucinations in monolithic models [13].

When the Orchestrator’s routing confidence falls below a defined threshold, the system defaults to a conservative safety override: the patient receives the message “Please consult a human doctor immediately” and the interaction is flagged for clinician review. On the 50-query evaluation set, 6% of interactions triggered this override, all corresponding to genuinely ambiguous or out-of-vocabulary inputs.
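The confidence-gated safety override can be sketched in a few lines. This is an illustrative sketch only: the threshold value, the `RoutingDecision` and `ReviewLog` types, and the `dispatch:` return format are assumptions introduced here; the source states only that a defined threshold exists and gives the wording of the patient message.

```python
from dataclasses import dataclass, field
from typing import List

SAFETY_MESSAGE = "Please consult a human doctor immediately"
CONFIDENCE_THRESHOLD = 0.6  # assumed value; the source does not specify it


@dataclass
class RoutingDecision:
    """Hypothetical orchestrator output: chosen tool plus a confidence score."""
    tool: str
    confidence: float


@dataclass
class ReviewLog:
    """Hypothetical store of interactions flagged for clinician review."""
    flagged: List[str] = field(default_factory=list)


def route_or_override(utterance: str, decision: RoutingDecision, log: ReviewLog) -> str:
    """Dispatch to the chosen tool, or fall back to the safety message."""
    if decision.confidence < CONFIDENCE_THRESHOLD:
        # Low confidence: do not act; flag for review and refer the patient.
        log.flagged.append(utterance)
        return SAFETY_MESSAGE
    return f"dispatch:{decision.tool}"
```

For scale, a 6% override rate on the 50-query evaluation set corresponds to 3 flagged interactions.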

Document Type

Restricted Access

Submission Type

BSCS Final Year Project
