Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Rizwan Ahmed Khan, Professor, Department of Computer Science

Co-Advisor

Hamza Usman Ghani, Lead Data Scientist

Keywords

Enterprise Document Automation, Intelligent Document Processing, Data Extraction, Human-in-the-Loop Learning

Abstract

Enterprise document processing still relies heavily on manual data entry, while many existing Intelligent Document Processing (IDP) systems are expensive, cloud-dependent, or unreliable for financial-grade auditing. This project presents FADE (Financial Agentic Document Extraction), an end-to-end intelligent document processing platform designed to convert heterogeneous business documents into validated, structured, and auditable data. The system follows a “fast by default, intelligent by exception” architecture. A deterministic high-speed extraction layer using OCR and spatial heuristics processes the majority of well-structured documents locally and efficiently, while a multimodal reconciliation layer is invoked only when validation checks detect inconsistencies such as missing values or decimal-slip errors. This design is intended to ensure financial accuracy, avoid unnecessary AI inference costs, and allow the system to scale. Every processing step is recorded in audit logs to support reliability and compliance. FADE was developed iteratively. Early iterations based on OCR + LLM pipelines showed that local document extraction is feasible, but they also exposed the limitations of flat-text processing and vision-only approaches for structured financial documents. These findings shaped the final two-tier design. Beyond extraction, the platform includes role-based access control, human-in-the-loop review, and analytics. It is also multi-domain and schema-driven, so the same platform can support invoices, contracts, HR records, and other enterprise documents without code modifications.
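
The “fast by default, intelligent by exception” routing described in the abstract can be illustrated with a minimal Python sketch. It is not the FADE implementation: the `Result` class, the `REQUIRED_FIELDS` schema, and the `fast_extract` and `reconcile` callables are hypothetical names introduced only for illustration, and the validation shown covers missing values only.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the field names are assumptions, not FADE's real schema.
REQUIRED_FIELDS = ("invoice_number", "total")


@dataclass
class Result:
    fields: dict
    tier: int  # 1 = deterministic path only, 2 = reconciliation was invoked
    audit_log: list = field(default_factory=list)


def find_issues(fields):
    """Return one issue string per required field that is missing or empty."""
    return [f"missing:{name}" for name in REQUIRED_FIELDS if not fields.get(name)]


def process(document, fast_extract, reconcile):
    """Run the fast extractor first and escalate only when validation fails."""
    log = ["tier1:extract"]
    fields = fast_extract(document)
    issues = find_issues(fields)
    if not issues:
        log.append("validate:ok")
        return Result(fields, tier=1, audit_log=log)
    # Exception path: record why the document was escalated, then reconcile.
    log.append("validate:failed " + ",".join(issues))
    log.append("tier2:reconcile")
    fields = reconcile(document, fields, issues)
    return Result(fields, tier=2, audit_log=log)
```

In this sketch the expensive `reconcile` callable is never invoked for documents that pass validation, and the log records which path each document took.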

Tools and Technologies Used

Python, JavaScript, SQL, Bash, FastAPI, SQLAlchemy, Pydantic, LangGraph, BAML, uvicorn, JWT, passlib/bcrypt, PostgreSQL 16, PaddleOCR, Tesseract, EasyOCR, Docling, pdfplumber, poppler, LayoutParser, Detectron2, spaCy, HuggingFace Transformers, Gemini 2.5 Flash, Qwen2-VL-7B, bitsandbytes, vLLM, pandas, NumPy, scikit-learn, statsmodels, Plotly, ARIMA, Prophet, Random Forest, Isolation Forest, React, react-router-dom, recharts, Context API, plain CSS (BEM-lite), nginx, Docker, Docker Compose, BuildKit, Git, GitHub, pytest, VS Code.

Methodology

The project was executed in two phases combining research exploration and iterative development. The first phase explored Intelligent Document Processing, inspired by JPMorgan's DocLLM for layout-aware financial extraction. Since DocLLM was not deployable, its core ideas were replicated using Open-DocLLM (OCR + LLM). An offline pipeline was built using Tesseract OCR, Mistral 7B via Ollama, and FastAPI orchestration. Experiments included pypdfium2 for PDF parsing and LLaVA 7B/13B for vision-language inference. The OCR + LLM pipeline performed reasonably on English receipts, but layout information was lost in the flat-text representation, multilingual performance degraded, and the vision models hallucinated on financial data. A zone-aware 3×3 spatial extraction step using TSV coordinates was then introduced, which improved structural awareness but remained sensitive to parsing errors. In the second phase, we explored GOT-OCR 2.0, IBM Docling, and Qwen2-VL-7B to build a more advanced pipeline. However, this single-tier design was inefficient because every document was processed through a heavy VLM, making it costly and slow for simple cases. This highlighted the need for a hybrid pipeline rather than a fully model-driven one. The final architecture therefore follows a “fast by default, intelligent by exception” design. Tier 1 uses PaddleOCR PP-OCRv5 Mobile (CPU), spatial heuristics, regex, and bounding-box matching to extract fields in under 800 ms. A Financial Auditor then performs decimal-safe arithmetic, consistency checks, and anomaly detection. If discrepancies are detected, Tier 2 triggers Gemini 2.5 Flash with BAML types for targeted reconciliation. LangGraph orchestrates a deterministic state machine (preprocess, OCR, audit, reconcile, persist) with JSONB traceability logs, while HITL corrections become labeled data for calibration and future fine-tuning.
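
The decimal-safe arithmetic check performed by the Financial Auditor can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `audit_totals` function, the `TOLERANCE` value, and the power-of-ten test for decimal slips are all hypothetical choices made for this example.

```python
from decimal import Decimal

# Assumed rounding tolerance of one cent; the real system's tolerance is unknown.
TOLERANCE = Decimal("0.01")


def audit_totals(line_items, tax, total):
    """Compare the stated total with line items plus tax using exact decimals.

    Amounts are passed as strings so that no binary floating-point rounding
    is introduced before the comparison.
    """
    items = [Decimal(value) for value in line_items]
    expected = sum(items, Decimal("0")) + Decimal(tax)
    stated = Decimal(total)
    if abs(expected - stated) <= TOLERANCE:
        return {"status": "ok", "expected": expected}
    # Assumed heuristic: a decimal slip leaves the stated total off by a
    # power of ten, for example 2750.00 read instead of 27.50.
    for factor in (Decimal("10"), Decimal("100"), Decimal("1000")):
        too_small = abs(stated * factor - expected) <= TOLERANCE
        too_large = abs(stated / factor - expected) <= TOLERANCE
        if too_small or too_large:
            return {"status": "decimal_slip", "expected": expected}
    return {"status": "mismatch", "expected": expected}
```

In a two-tier design such as the one described above, any status other than `ok` would be the kind of discrepancy that triggers Tier 2 reconciliation.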

Document Type

Restricted Access

Submission Type

BSCS Final Year Project
