Membership Inference Attack on Fine-tuned VLMs

Python
VLM
NLP
Privacy & Security
Multimodal
Ensemble Methods
Multimodal attack pipeline combining CLIP embeddings, loss-ratio signals, and neighborhood perturbation to detect training data membership in vision-language models
Author

Tianhao Cao

Published

April 18, 2026

Project Overview

A multimodal privacy attack research project investigating whether (image, caption) pairs were included in the fine-tuning dataset of a vision-language model — extending membership inference attack (MIA) research from text-only LLMs to the VLM setting. Target model: SmolVLM-256M-Instruct.

Key Concepts Applied

  • Three-Layer MIA Attack: Probed the fine-tuned model with three signals: loss ratio (fine-tuned vs. base model loss), caption contrast (original vs. randomly shuffled captions), and image corruption sensitivity (clean vs. Gaussian-blurred image). The loss ratio alone achieved AUC 0.815, serving as the dominant single signal.
  • Adapted M⁴I Framework: Modernized the M⁴I attack (Hu et al., NeurIPS 2022) for CLIP + SmolVLM — replaced legacy ResNet-152/LSTM shadow models with off-the-shelf CLIP ViT-B/32 as a cross-modal feature extractor, achieving Kaggle AUC 0.8056 with no shadow model training required. Identified clip_img_text_sim as the single most important feature (48% of XGBoost importance).
  • Dual-Modal Neighborhood Attack: Extended the text-only neighborhood attack to both modalities — generating text perturbations (10% word dropout) and image perturbations (Gaussian noise, patch masking, random crop) across fine-tuned and base models to extract 17 cross-modal sensitivity features.
  • Combined Ensemble: Merged the three-layer MIA (3 features) and dual-modal neighborhood features (17 features) via Logistic Regression, demonstrating that orthogonal signal families — absolute memorization vs. perturbation sensitivity — are complementary.
  • Key Finding: VLM generation-based signals add near-zero discriminative value (Δ Kaggle AUC = +0.00023 vs. CLIP-only), indicating the membership signal is encoded in data-level image-text alignment rather than model output behavior.
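
The loss-ratio and text-neighborhood signals described above can be sketched in a few lines. This is a minimal illustration, not the project's code: `loss_fn` is a hypothetical callable that returns a model's caption loss for a fixed image, `toy_loss` is a stand-in for a real fine-tuned VLM, and the ratio direction (fine-tuned over base) and the neighbor count are assumptions.

```python
import random
from statistics import mean
from typing import Callable


def drop_words(caption: str, rate: float, rng: random.Random) -> str:
    """Drop each word independently with probability `rate`."""
    words = caption.split()
    kept = [w for w in words if rng.random() >= rate]
    # Avoid an empty caption: fall back to the first word if all were dropped.
    return " ".join(kept) if kept else " ".join(words[:1])


def loss_ratio(ft_loss: float, base_loss: float, eps: float = 1e-8) -> float:
    """Fine-tuned loss over base loss (assumed direction).

    A ratio well below 1 means fine-tuning lowered the loss on this sample,
    which is the kind of evidence a membership attack looks for.
    """
    return ft_loss / (base_loss + eps)


def neighborhood_gap(
    loss_fn: Callable[[str], float],
    caption: str,
    n_neighbors: int = 8,   # assumed; the source does not state a count
    rate: float = 0.1,      # 10% word dropout, as in the text above
    seed: int = 0,
) -> float:
    """Mean loss over perturbed captions minus the loss on the original."""
    rng = random.Random(seed)
    neighbors = [drop_words(caption, rate, rng) for _ in range(n_neighbors)]
    return mean(loss_fn(n) for n in neighbors) - loss_fn(caption)


# Toy stand-in for a fine-tuned model: one caption is "memorized".
MEMORIZED = "a brown dog runs across the wet sand"


def toy_loss(caption: str) -> float:
    return 0.5 if caption == MEMORIZED else 2.0
```

With the toy loss, the memorized caption gives a non-negative gap because any perturbed neighbor loses the low-loss exact match, while an unrelated caption such as "two cats sleep" gives a gap of zero. A real attack would compute the same gap on both the fine-tuned and base models and for image perturbations as well.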

Technical Stack

Python · PyTorch · Hugging Face Transformers · CLIP (ViT-B/32) · XGBoost · scikit-learn · Pandas · NumPy

Results Summary

Method                         Val AUC   Val TPR @ FPR=0.1
Baseline (Raw Loss)            0.501     0.120
Neighborhood Attack            0.599     0.187
M⁴I-CLIP Attack                0.809     0.300
Three-Layer MIA only           0.815     0.355
Hybrid MIA + LiRA              0.806     0.263
Neighborhood + MIA Combined    0.849     0.388
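
For readers unfamiliar with the two metrics in the table, the sketch below shows one way to compute them from membership labels (1 = member) and attack scores. It is a dependency-free illustration with made-up scores, not the project's evaluation code; in practice scikit-learn's `roc_auc_score` and `roc_curve` would typically be used.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random member outscores a random
    non-member, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float],
               max_fpr: float = 0.1) -> float:
    """Highest true-positive rate of any threshold whose false-positive
    rate does not exceed `max_fpr`."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    # Lowering the threshold can only raise FPR, so stop once it is too high.
    for t in sorted(set(scores), reverse=True):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr > max_fpr:
            break
        best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


# Made-up example: 4 members followed by 10 non-members.
labels = [1] * 4 + [0] * 10
scores = [0.9, 0.8, 0.6, 0.3,
          0.85, 0.5, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05, 0.02]
```

TPR at a fixed low FPR is reported alongside AUC because it measures how many members an attacker can identify while wrongly flagging only a small share of non-members, which AUC alone can hide.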
