Membership Inference Attack on Fine-tuned LLMs

Python · LLM · NLP · Privacy & Security · Ensemble Methods
Dual-model feature fusion to detect training data membership in fine-tuned language models
Author: Tianhao Cao

Published: March 18, 2026

Project Overview

A privacy-attack research project that investigates whether a given text sample was part of a language model's fine-tuning dataset, a core problem in AI safety and LLM privacy auditing. Submitted as a Kaggle competition entry through UBC COLX 531.

Key Concepts Applied

  • Dual-Model Feature Engineering: Extracted a 16-dimensional feature vector per sample from four models: two fine-tuned SmolLM2 variants (135M, 360M) and their base checkpoints. The features capture reference ratios, neighborhood perturbation gaps, and cross-model agreement signals.
  • Improved Neighborhood Attack: Replaced slow BERT mask-and-fill neighbor generation with fast token-drop perturbation (10% word dropout, 5 neighbors per sample), enabling large-scale inference without sacrificing signal quality.
  • Supervised Ensemble Classifier: Trained a GradientBoosting classifier (300 estimators, max depth 4) on standardized features to distinguish members from non-members based on learned memorization patterns.
  • Results: Achieved an AUC of 0.907 and a TPR of 0.697 at FPR = 0.1 on the validation split, relative improvements of +55% in AUC and +349% in TPR over the raw-loss baseline (AUC 0.584, TPR 0.155).
  • Feature Importance Analysis: Identified that cross-model reference signals (loss_diff_ft, ref_diff_2) account for 86% of the classifier's feature importance. This suggests that the memorization signatures of the larger fine-tuned models are better captured by cross-model reference signals than by neighborhood-level perturbation.
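
The token-drop perturbation described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: it assumes whitespace tokenization, drops each word independently with probability 0.10 (the project may instead drop an exact 10% of words), and defines the neighborhood gap as mean neighbor loss minus the sample's own loss. The names `token_drop_neighbors`, `neighborhood_gap`, and `loss_fn` are hypothetical; `loss_fn` stands in for whatever per-sample language-model loss is used.

```python
import random
from statistics import mean
from typing import Callable, List


def token_drop_neighbors(text: str, n_neighbors: int = 5,
                         drop_rate: float = 0.10, seed: int = 0) -> List[str]:
    """Create perturbed copies of `text` by independently dropping words."""
    rng = random.Random(seed)
    words = text.split()  # assumption: simple whitespace tokenization
    neighbors = []
    for _ in range(n_neighbors):
        kept = [w for w in words if rng.random() >= drop_rate]
        # Never emit an empty neighbor: fall back to the original words.
        neighbors.append(" ".join(kept if kept else words))
    return neighbors


def neighborhood_gap(text: str, loss_fn: Callable[[str], float],
                     **kwargs) -> float:
    """Mean loss over the neighbors minus the loss of the sample itself.

    Assumed sign convention: a larger positive gap means the model fits the
    exact sample much better than its perturbed neighbors.
    """
    neighbors = token_drop_neighbors(text, **kwargs)
    return mean(loss_fn(n) for n in neighbors) - loss_fn(text)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog again and again"
    for neighbor in token_drop_neighbors(sample):
        print(neighbor)
    # Toy stand-in loss (hypothetical): shorter strings get a higher "loss".
    print(neighborhood_gap(sample, lambda s: 1.0 / len(s)))
```

Because no masked language model is needed to propose replacements, generating the neighbors costs almost nothing. The remaining cost is one loss evaluation per neighbor and per model.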
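
The supervised classifier described above (standardized features, GradientBoosting with 300 estimators and max depth 4) could be assembled with scikit-learn roughly as follows. This is a sketch under assumptions: the data is synthetic and merely shaped like a 16-column feature matrix, every hyperparameter other than `n_estimators` and `max_depth` is left at the library default, and `build_membership_classifier` is a hypothetical helper name.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_membership_classifier(random_state: int = 0):
    """Standardize the features, then fit gradient-boosted trees."""
    return make_pipeline(
        StandardScaler(),
        GradientBoostingClassifier(
            n_estimators=300, max_depth=4, random_state=random_state
        ),
    )


# Synthetic stand-in data (NOT the project's features): 16 columns,
# label 1 = member, label 0 = non-member.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 16))
noise = rng.normal(scale=0.5, size=400)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

clf = build_membership_classifier()
clf.fit(X[:300], y[:300])

# Membership score per held-out sample: predicted probability of class 1.
scores = clf.predict_proba(X[300:])[:, 1]

# Per-feature importances, the kind of quantity a feature importance
# analysis would inspect.
importances = clf.named_steps["gradientboostingclassifier"].feature_importances_
```

Tree ensembles are insensitive to feature scaling, so the `StandardScaler` step mainly mirrors the described setup and keeps the pipeline reusable if the final estimator is swapped.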

Technical Stack

Python · PyTorch · Hugging Face Transformers · scikit-learn · Pandas · NumPy

Results Summary

| Method | AUC | TPR @ FPR=0.1 |
|---|---|---|
| Baseline (Raw Loss) | 0.584 | 0.155 |
| Improved Neighborhood Attack | 0.907 | 0.697 |
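
For readers unfamiliar with the two metrics in the table, the sketch below shows one way to compute them from labels and membership scores using only the standard library, plus the relative-gain arithmetic behind the percentage improvements. It is illustrative only: the thresholding convention (highest TPR among thresholds whose FPR does not exceed the target, with no ROC interpolation) is an assumption and may differ slightly from the competition's scorer, and all function names are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a member outscores a non-member.

    Ties count as half a win. Quadratic in the number of samples, which is
    fine for a sketch but slow for large validation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float],
               max_fpr: float = 0.1) -> float:
    """Highest TPR over thresholds whose FPR does not exceed `max_fpr`."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        # A sample is predicted "member" when its score is >= t.
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, as a fraction."""
    return (new - base) / base


if __name__ == "__main__":
    # Values taken from the results table.
    print(f"AUC gain: {relative_gain(0.907, 0.584):.1%}")
    print(f"TPR gain: {relative_gain(0.697, 0.155):.1%}")
```

TPR at a low fixed FPR is reported alongside AUC because a membership attack is mainly useful when it identifies members while raising few false alarms.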