Degree

Bachelor of Science (Computer Science)

Department

Department of Computer Science

School

School of Mathematics and Computer Science (SMCS)

Advisor

Dr. Sajjad Haider, Professor, Department of Computer Science

Co-Advisor

Syed Mohammad Sualeh Ali, Visiting Faculty, Department of Computer Science

Keywords

Vision Language Models, Multimodal Privacy Detection, Personally Identifiable Information, Visual Privacy, Multimodal Evaluation, Cross-Modal Amplification, Privacy Benchmarking

Abstract

The proliferation of social media has created a new class of privacy risk that existing automated tools are ill-equipped to handle: compound disclosure arising from the co-occurrence of an image and its accompanying caption. Image-level benchmarks classify sensitive visual attributes in isolation; text-based tools infer personal attributes from written content alone. Neither captures the cross-modal amplification effect, where a photograph and a caption together reveal information that neither would disclose individually. This gap matters because captions routinely introduce personal disclosures - names, employers, relationships, locations - that are entirely absent from the image itself. This project addresses that gap through two complementary contributions. First, we construct PRIVCAP, a multimodal privacy corpus spanning three established visual privacy benchmarks: VISPR (8,000 test images; 68 attributes), PrivacyAlert (1,554 images), and DIPA2 (265 high-confidence images), totalling 9,819 image-caption pairs. For each image, a two-stage Gemini pipeline generates a social-media-style explicit caption embedding privacy-leaking phrases, and a contrastive no-leak caption preserving scene context without personal disclosure. Annotations operate at two levels: XML span tags around leaking phrases and bounding-box labels for visually grounded disclosures. Second, building on the evaluation protocol of Tsaprazlis et al. [2024], we benchmark four Vision-Language Models - Ministral-3B, Gemma-3-4B, Qwen3-VL-8B, and LLaMA-3.2-11B - across four tasks and three input conditions. Image-only results reveal three consistent failure modes: pervasive safe-bias, universal Biometric Data blindness, and taxonomy-guided response instability. Qwen3-VL-8B achieves 47.9% macro F1 on attribute recognition, a substantial generational improvement over prior baselines of 18% and 27%.
Image-conditioned evaluation (Tasks 1 to 3) is reported in full; Task 4, caption-source attribution, is defined as part of the evaluation framework and left for future work.

Tools and Technologies Used

Python, Google Colab, LM Studio, HuggingFace Transformers, PyTorch, CUDA, NVIDIA RTX 4050, A100 GPU, T4 GPU, L4 GPU, Qwen-VL/Qwen3-VL, LLaMA-3.2-11B-Vision-Instruct, Gemma-3-4B-IT, Ministral-3-3B, Gemini API, Florence-2, VISPR dataset, PrivacyAlert dataset, DIPA2 dataset, NumPy, pandas, scikit-learn, openpyxl, Google Drive, JSON annotations, Excel result export, bfloat16 inference, majority voting, temperature sampling

Methodology

The project follows an experimental benchmarking methodology structured in two phases. In the first phase, we mapped annotations from three datasets (VISPR, PrivacyAlert, and DIPA2) to a shared 14-category privacy taxonomy derived from Tsaprazlis et al. (2025). Each image was prepared with ground-truth privacy labels, and four VLMs were evaluated on three tasks: direct privacy-risk detection (Task 1), taxonomy-guided binary detection (Task 2), and taxonomy-guided multi-label attribute recognition (Task 3). Each image was passed to the model with task-specific hierarchical prompts, and predictions were collected across three runs and combined by majority voting to improve reliability. Two temperature settings (0.1 and 1.0) were tested to study response consistency. Outputs were evaluated using macro F1, precision, recall, per-category F1, and accuracy, and exported to JSON and Excel for analysis. In the second phase, a multimodal dataset was curated by pairing VISPR images with structured captions generated through a multi-stage Gemini Flash-Lite pipeline, producing image-text pairs at two PII leakage levels: explicit and no-leak. Ground truth was reconstructed to reflect both image-grounded and caption-induced privacy labels, enabling evaluation of cross-modal signal integration. Models were re-evaluated on Tasks 1 through 3 with captions injected into prompts, and a novel Task 4 was introduced requiring models to attribute privacy violations to the image channel, the caption channel, or both. Results across both phases were compared to establish where zero-shot VLMs succeed and fail under multimodal conditions, motivating future fine-tuning work.
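
The majority-voting and macro F1 steps described above can be illustrated with a minimal sketch. This is not the project's code: the function names, the strict more-than-half voting rule, the zero-F1 convention for categories with no gold or predicted labels, and every category name other than Biometric Data are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(runs):
    """Keep a label if it appears in more than half of the runs (assumed rule)."""
    counts = Counter(label for run in runs for label in set(run))
    threshold = len(runs) / 2
    return {label for label, n in counts.items() if n > threshold}


def macro_f1(gold, pred, categories):
    """Unweighted mean of per-category F1 over multi-label sets."""
    scores = []
    for cat in categories:
        tp = sum(1 for g, p in zip(gold, pred) if cat in g and cat in p)
        fp = sum(1 for g, p in zip(gold, pred) if cat not in g and cat in p)
        fn = sum(1 for g, p in zip(gold, pred) if cat in g and cat not in p)
        denom = 2 * tp + fp + fn
        # Assumed convention: a category with no gold and no predicted
        # labels contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical category names; only "Biometric Data" appears in the abstract.
categories = ["Biometric Data", "Location", "Identity"]

# Three runs per image, each run being the set of predicted categories.
runs_per_image = [
    [{"Location", "Identity"}, {"Location"}, {"Location", "Biometric Data"}],
    [{"Identity"}, {"Identity"}, set()],
]
pred = [majority_vote(runs) for runs in runs_per_image]

# Hypothetical ground truth for the same two images.
gold = [{"Location", "Biometric Data"}, {"Identity"}]

score = macro_f1(gold, pred, categories)
```

In this toy example the voted prediction for the first image keeps only Location, so Biometric Data scores zero and pulls the macro average down even though the other two categories are predicted perfectly. Macro averaging weights every category equally, which is why a single systematically missed category is visible in the aggregate score.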

Document Type

Restricted Access

Submission Type

BSCS Final Year Project
