Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Muhammad Atif Tahir, Professor and Program Coordinator, Graduate & Postgraduate Programs (CS), Department of Computer Science

Keywords

Face-Voice Association, Cross-Modal Verification, Deepfake Detection, Multilingual Biometrics

Abstract

Biometric verification systems increasingly benefit from the integration of multiple modalities. However, most existing face-voice association (FVA) approaches assume monolingual settings and do not address the growing threat of synthetic media. Performance degradation across languages and vulnerability to deepfakes remain significant challenges for real-world deployment. This report presents EchoMatch, a deepfake-aware cross-modal biometric verification framework that determines whether a given face image and voice sample belong to the same individual. The system adopts a two-stage architecture consisting of a deepfake integrity gate followed by an FVA module. Three association approaches are evaluated under a common multilingual protocol using the MAV-Celeb English-Urdu dataset: XM-ALIGN, RFOP, and a dual-encoder architecture based on DINO and HuBERT. Evaluation is conducted using the unseen-unheard protocol, with Equal Error Rate (EER) as the primary performance metric. Experimental results show that XM-ALIGN achieved the best performance with an EER of 0.3043, followed by the proposed DINO+HuBERT approach with an EER of 0.3145 and RFOP with an EER of 0.3210. The findings indicate that shared-classifier alignment remains highly effective for multilingual face-voice verification, while self-supervised representations provide competitive cross-lingual performance. The integration of image and audio deepfake detection further enhances system reliability by filtering synthetic inputs prior to verification. The work contributes a unified evaluation of contemporary FVA methods, a self-supervised dual-encoder baseline, and a complete deepfake-aware verification pipeline for multilingual biometric authentication.

Tools and Technologies Used

PyTorch, Python, IResNet-18, ECAPA-TDNN, DINO ViT-B/16, HuBERT-base, Xception, WavLM, VGGFace, NT-Xent contrastive loss, Orthogonal Projection (OP) loss, MAV-Celeb V1-EU dataset, DFDC dataset, AUDETER, ASVspoof2019, Kaggle

Methodology

EchoMatch was developed using a two-stage, deepfake-aware verification methodology that combines an authenticity check with a comparative evaluation of cross-modal face-voice association techniques. Input media first passes through a deepfake integrity gate, where an Xception network fine-tuned on DFDC screens the image stream and a WavLM-based detector fine-tuned on AUDETER, VoxCeleb, and a public deepfake-audio corpus screens the audio stream. Only media that passes both checks proceeds to identity verification.
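The two-stage flow described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's code: the function `verify`, the `Decision` record, the detector and matcher callables, and both threshold values are hypothetical stand-ins for the Xception image detector, the WavLM audio detector, and whichever face-voice association model is plugged in.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Decision:
    accepted: bool                 # True only if both gate checks pass and the pair matches
    reason: str                    # short explanation of the outcome
    score: Optional[float] = None  # association score, set only when stage 2 runs


def verify(
    face: Any,
    voice: Any,
    image_fake_prob: Callable[[Any], float],   # hypothetical image deepfake detector
    audio_fake_prob: Callable[[Any], float],   # hypothetical audio deepfake detector
    match_score: Callable[[Any, Any], float],  # hypothetical face-voice association model
    fake_threshold: float = 0.5,               # assumed value, not taken from the report
    match_threshold: float = 0.5,              # assumed value, not taken from the report
) -> Decision:
    # Stage 1: integrity gate. Either stream being flagged stops the pipeline.
    if image_fake_prob(face) >= fake_threshold:
        return Decision(False, "image flagged as synthetic")
    if audio_fake_prob(voice) >= fake_threshold:
        return Decision(False, "audio flagged as synthetic")

    # Stage 2: identity verification runs only on media that passed both checks.
    score = match_score(face, voice)
    if score >= match_threshold:
        return Decision(True, "face and voice match", score)
    return Decision(False, "face and voice do not match", score)
```

Placing the gate first means the association model never scores media that either detector has rejected, which is the ordering the methodology describes.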

For the verification task, three cross-modal association methods were implemented and evaluated under identical conditions. XM-ALIGN pairs an IResNet-18 face encoder with an ECAPA-TDNN voice encoder, using a shared classifier for implicit alignment along with a weak MSE loss, and is trained with cross-entropy loss plus MUSAN and RIR augmentation. The proposed DINO+HuBERT method uses a DINO ViT-B/16 face encoder and a HuBERT-base speech encoder, with projection heads producing L2-normalised embeddings, trained using a combined cross-entropy, NT-Xent contrastive, and MSE loss with differential learning rates. RFOP pairs a VGGFace encoder with ECAPA-TDNN and focuses on attention-weighted fusion combined with an Orthogonal Projection loss, MSE alignment, and cross-entropy.

All three methods were evaluated using the unseen-unheard protocol from the FAME challenge on the MAV-Celeb V1-EU dataset. Each model was trained on one language and tested on both a heard-language set and an unheard-language set, with disjoint train and test identities. This allowed the team to separately assess identity generalization and language generalization. Equal Error Rate served as the primary evaluation metric, with ROC-AUC and accuracy tracked as secondary diagnostics.

Models were implemented in PyTorch and trained on Kaggle's free GPU tier using NVIDIA T4 and P100 instances. Training used the AdamW optimizer with a cosine learning-rate schedule and gradient clipping. Pretrained encoders were reused wherever possible, most notably an ECAPA-TDNN voice encoder pretrained on VoxCeleb and shared across XM-ALIGN and RFOP, so that performance differences would reflect alignment strategy rather than encoder quality.

The final design was also shaped through iterative experimentation. Approaches such as direct cosine-similarity matching, multi-task language conditioning, triplet-loss training, and voice-to-face generation were explored and mostly abandoned. These experiments informed key choices in the final design, including the preferred face and voice backbones and the eventual adoption of the NT-Xent contrastive formulation for DINO+HuBERT.
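Since Equal Error Rate is the primary metric, a minimal sketch of how it can be computed from verification scores may help. This is an assumption-laden illustration rather than the evaluation script used in the project: the function name `equal_error_rate` is hypothetical, scores are assumed to be similarities where higher means "more likely the same person", and labels are assumed to be 1 for genuine pairs and 0 for impostor pairs.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a decision threshold over the observed scores.

    scores: similarity per face-voice pair (higher = more likely same identity)
    labels: 1 for a genuine pair, 0 for an impostor pair
    Returns the error rate at the threshold where the false-accept rate and
    false-reject rate are closest, as a fraction in [0, 1].
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one genuine and one impostor pair")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    for t in sorted(set(scores)) + [max(scores) + 1.0]:
        # A pair is accepted when its score is at or above the threshold.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        far = false_accepts / n_neg
        frr = false_rejects / n_pos
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Perfectly separated genuine and impostor scores give an EER of 0.0.
print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
```

Lower values are better: an EER of 0.5 corresponds to chance-level verification on balanced pairs. Library implementations typically interpolate along the ROC curve instead of picking the nearest threshold, so results on small sets can differ slightly from this sketch.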

Document Type

Restricted Access

Submission Type

BSCS Final Year Project

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
