Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Tasbiha Fatima, Lecturer, Department of Computer Science, Institute of Business Administration, Karachi

Co-Advisor

Wajeeha Javed, Lead Data Scientist at VentureDive

Keywords

Talent Acquisition, Candidate Ranking, Semantic Matching, Natural Language Processing, Vector Embeddings, Resume Parsing, Information Retrieval

Abstract

This project presents an AI-driven talent matching platform that automates the end-to-end recruitment pipeline, from resume parsing to candidate-job ranking, through a combination of large language model (LLM) extraction, multi-entity semantic search, and hybrid scoring. The system addresses the inefficiency of manual resume screening by decomposing candidate profiles and job descriptions into five semantic entities (core summary, skills, experience, projects, education), embedding each independently using sentence transformers, and performing weighted vector retrieval via a dedicated vector database. A layout-aware ETL pipeline handles multi-column PDF resumes through automatic column boundary detection, while BAML-enforced schema validation guarantees deterministic, structured extraction from the LLM without post-hoc JSON parsing. The ranking layer combines semantic similarity scores with seven configurable business features, including skill coverage, experience fit, education level, location, and employment type compatibility, under a recruiter-configurable weight system with hard-rule gating for disqualification criteria. All ranking parameters are runtime-configurable through a three-tier priority hierarchy (inline request overrides, saved recruiter configurations, system defaults), eliminating the need for code changes when tuning matching behavior. The platform provides full explainability through per-entity score breakdowns, feature contribution logs, and a persistent audit trail that tracks every ranking run for reproducibility. A built-in evaluation framework compares ranking strategies using standard information retrieval metrics (NDCG@K, Precision@K, MRR).
The key contributions are: (1) a multi-entity vector decomposition approach enabling granular semantic matching beyond single-embedding similarity, (2) a hybrid ranking architecture that balances semantic relevance with domain-specific business rules under full recruiter control, and (3) end-to-end transparency from extraction through scoring, supporting informed hiring decisions.

Tools and Technologies Used

Python, TypeScript, JavaScript, FastAPI, Express.js, Next.js, React, Node.js, PostgreSQL, Prisma ORM, Qdrant (Vector DB), Docker, Sentence-Transformers, PyTorch, BAML, Groq LLM API, Pydantic, Redux Toolkit, Tailwind CSS, Shadcn UI, MLflow, JWT, pdfplumber

Methodology

The project follows an incremental and iterative development methodology, structured into distinct phases that each deliver a functional, testable component before the next phase begins.

Phase 1: Data Collection & ETL Pipeline (Feb 2026): Development began with collecting candidate resume data (PDF/DOCX) and job descriptions, followed by building a multi-stage ETL pipeline: text extraction (pdfplumber, python-docx), layout analysis with automatic multi-column detection, text cleaning, and LLM-based structured extraction using BAML schema enforcement with the GPT-OSS 120B model served through the Groq LLM API. Each stage was independently validated using Pydantic models before integration.
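The multi-column detection step can be illustrated with a minimal sketch. It assumes word boxes shaped like the dictionaries returned by pdfplumber's `extract_words()` (keys `text`, `x0`, `x1`, `top`). The function names, the `min_gap` threshold, and the widest-gap heuristic are illustrative assumptions, not the project's actual algorithm.

```python
from typing import Any, Dict, List, Optional, Tuple

Word = Dict[str, Any]  # assumed keys: "text", "x0", "x1", "top"


def find_column_gutter(words: List[Word], min_gap: float = 15.0) -> Optional[float]:
    """Return the x-centre of the widest empty vertical band, or None.

    Hypothetical heuristic: project every word onto the x-axis and look
    for the widest stretch that no word covers.
    """
    spans = sorted((w["x0"], w["x1"]) for w in words)
    if not spans:
        return None
    best_gap, best_mid = 0.0, None
    covered_right = spans[0][1]  # right edge of the region covered so far
    for x0, x1 in spans[1:]:
        gap = x0 - covered_right
        if gap > best_gap:
            best_gap, best_mid = gap, covered_right + gap / 2
        covered_right = max(covered_right, x1)
    # Gaps narrower than min_gap are treated as ordinary word spacing.
    return best_mid if best_gap >= min_gap else None


def split_columns(words: List[Word], gutter: float) -> Tuple[List[Word], List[Word]]:
    """Split words at the gutter and sort each side into reading order."""
    def reading_order(w: Word) -> Tuple[float, float]:
        return (w["top"], w["x0"])

    left = [w for w in words if w["x1"] <= gutter]
    right = [w for w in words if w["x1"] > gutter]
    return sorted(left, key=reading_order), sorted(right, key=reading_order)


sample_words = [
    {"text": "Skills", "x0": 50, "x1": 100, "top": 10},
    {"text": "Python", "x0": 110, "x1": 200, "top": 10},
    {"text": "Experience", "x0": 320, "x1": 400, "top": 10},
    {"text": "Engineer", "x0": 410, "x1": 500, "top": 30},
]
gutter = find_column_gutter(sample_words)
left_col, right_col = split_columns(sample_words, gutter)
```

A word that straddles the gutter, such as part of a full-width header, would hide the gap in this simplified version, so a real pipeline would likely need to handle such lines separately.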

Phase 2: Semantic Matching Engine (Mar 2026): A multi-entity vector decomposition approach was designed, where each candidate/job is split into five semantic entities (core, skills, experience, projects, education). Sentence-transformer embeddings are generated per entity and stored in the Qdrant vector database. Weighted vector search with configurable entity weights enables retrieval of semantically relevant candidates for a given job, or of relevant jobs for a given candidate.
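One way to picture the weighted multi-entity matching is as a weighted average of per-entity cosine similarities. The sketch below uses plain Python lists in place of real sentence-transformer embeddings and does not call Qdrant. The weight values, the function names, and the choice to renormalize over entities present on both sides are assumptions made for illustration.

```python
import math
from typing import Dict, List

# The five semantic entities named in the project description.
ENTITIES = ("core", "skills", "experience", "projects", "education")

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_entity_score(
    job_vecs: Dict[str, Vector],
    cand_vecs: Dict[str, Vector],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-entity similarities.

    Assumption: entities missing on either side are skipped and the
    remaining weights are renormalized, so a resume without a projects
    section is not penalized for it.
    """
    score, total_weight = 0.0, 0.0
    for entity in ENTITIES:
        weight = weights.get(entity, 0.0)
        if weight > 0 and entity in job_vecs and entity in cand_vecs:
            score += weight * cosine(job_vecs[entity], cand_vecs[entity])
            total_weight += weight
    return score / total_weight if total_weight else 0.0


# Toy 2-dimensional vectors standing in for real embeddings.
job = {"core": [0.0, 1.0], "skills": [1.0, 0.0]}
candidate = {"core": [1.0, 0.0], "skills": [1.0, 0.0]}
example_weights = {"core": 0.25, "skills": 0.75}  # hypothetical weights
match_score = weighted_entity_score(job, candidate, example_weights)
```

In this toy case the skills vectors are identical and the core vectors are orthogonal, so the score equals the skills weight.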

Phase 3: Ranking & Scoring Layer (Mar–Apr 2026): A hybrid ranking service was implemented combining semantic similarity scores with seven business feature scores (including skill coverage, experience match, education level, location, employment type, and workspace preference). Multiple ranking strategies (semantic-only, weighted hybrid, RRF, cross-encoder, LLM ranker) were developed with a strategy evaluation framework using NDCG@K, Precision@K, and MRR metrics tracked via MLflow.
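A weighted hybrid strategy with hard-rule gating might look roughly like the following sketch. The `Candidate` fields, the feature names, the weight values, and the use of location as the gating rule are hypothetical; the project's actual scorer and disqualification criteria are not described in enough detail here to reproduce.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    semantic: float  # semantic similarity, assumed to lie in [0, 1]
    features: Dict[str, float] = field(default_factory=dict)  # business scores in [0, 1]
    location: str = ""


def hybrid_score(
    cand: Candidate,
    weights: Dict[str, float],
    required_location: Optional[str] = None,
) -> Optional[float]:
    """Return a weighted score, or None if a hard rule disqualifies."""
    # Hard-rule gate: a failed rule removes the candidate outright
    # instead of merely lowering the score.
    if required_location is not None and cand.location != required_location:
        return None
    total_weight = weights.get("semantic", 0.0)
    score = total_weight * cand.semantic
    for feature, value in cand.features.items():
        weight = weights.get(feature, 0.0)
        score += weight * value
        total_weight += weight
    return score / total_weight if total_weight else 0.0


def rank(
    candidates: List[Candidate],
    weights: Dict[str, float],
    required_location: Optional[str] = None,
) -> List[Tuple[str, float]]:
    """Score all candidates, drop gated ones, and sort best first."""
    scored = []
    for cand in candidates:
        result = hybrid_score(cand, weights, required_location)
        if result is not None:
            scored.append((cand.name, result))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


pool = [
    Candidate("A", semantic=0.8, features={"skill_coverage": 0.4}, location="Karachi"),
    Candidate("B", semantic=0.6, features={"skill_coverage": 1.0}, location="Karachi"),
    Candidate("C", semantic=0.9, features={"skill_coverage": 0.9}, location="Lahore"),
]
ranking = rank(pool, {"semantic": 0.5, "skill_coverage": 0.5}, required_location="Karachi")
```

Candidate C has the highest raw scores but is excluded by the gate, which is the point of treating disqualification criteria separately from weighted features.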

Phase 4: Backend API & Authentication (Apr 2026): A Node.js/Express REST API was built with Prisma ORM for PostgreSQL, implementing JWT-based authentication (access + refresh tokens), role-based authorization (candidate, recruiter, admin), rate limiting, and structured request-response logging with correlation IDs across services.

Phase 5: Frontend Development (Apr–May 2026): Role-specific dashboards were developed using Next.js and React with TypeScript: candidate profile management and job discovery; recruiter job posting and applicant ranking; and admin system monitoring with ranking configuration and strategy evaluation interfaces.

Phase 6: Integration, Deployment & Testing (Jun 2026): All services were containerized using Docker Compose, deployed on AWS EC2, and integration-tested end-to-end. Unit tests cover the text cleaning pipeline, semantic builders, business scorer, and API middleware. The system was iterated upon based on production performance observations (worker tuning, timeout adjustments).

Development Practices:
- Microservice architecture with clear separation (backend API, AI service, frontend)
- Configuration-driven design, all scoring parameters runtime-configurable without code changes
- Explainability-first approach, every ranking result includes full score breakdowns and audit trails
- Docker-based development environment ensuring reproducibility across machines
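The configuration-driven design rests on the three-tier priority hierarchy described in the abstract (inline request overrides, then saved recruiter configurations, then system defaults). A minimal sketch of such a resolution step follows. The default values, the key names, and the per-key merge behavior are assumptions; the actual system may replace whole configurations rather than merge individual keys.

```python
from typing import Dict, Optional

# Hypothetical system defaults; the real parameter names and values are not given.
SYSTEM_DEFAULTS: Dict[str, float] = {
    "semantic": 0.6,
    "skill_coverage": 0.3,
    "location": 0.1,
}


def resolve_weights(
    inline: Optional[Dict[str, float]] = None,
    saved: Optional[Dict[str, float]] = None,
    defaults: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
    """Merge the three tiers so that higher-priority tiers win per key."""
    resolved = dict(SYSTEM_DEFAULTS if defaults is None else defaults)
    # Apply lower-priority layers first so later updates override them:
    # saved recruiter configuration, then inline request overrides.
    for layer in (saved, inline):
        if layer:
            resolved.update(layer)
    return resolved


effective = resolve_weights(
    inline={"location": 0.0},
    saved={"semantic": 0.5, "location": 0.2},
)
```

Here `semantic` comes from the saved configuration, `skill_coverage` falls through to the default, and `location` is taken from the inline override.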

Document Type

Restricted Access

Submission Type

BSCS Final Year Project
