Membership Inference Attack on Fine-tuned LLMs

Python

LLM

NLP

Privacy & Security

Ensemble Methods

Dual-model feature fusion to detect training data membership in fine-tuned language models

Author

Tianhao Cao

Published

March 18, 2026

Project Overview

A privacy-attack research project that tests whether a given text sample was part of a language model's fine-tuning dataset, a core problem in AI safety and LLM privacy auditing. Submitted as an entry to a Kaggle competition run through UBC COLX 531.

Key Concepts Applied

Dual-Model Feature Engineering: Extracted a 16-dimensional feature vector per sample from four models: two fine-tuned SmolLM2 variants (135M, 360M) and their base checkpoints. The features capture reference ratios, neighborhood perturbation gaps, and cross-model agreement signals.
Improved Neighborhood Attack: Replaced slow BERT mask-and-fill neighbor generation with fast token-drop perturbation (10% word dropout, 5 neighbors per sample), enabling large-scale inference without sacrificing signal quality.
Supervised Ensemble Classifier: Trained a GradientBoosting classifier (300 estimators, max depth 4) on standardized features to distinguish members from non-members based on learned memorization patterns.
Results: Achieved an AUC of 0.907 and a TPR@FPR=0.1 of 0.697 on the validation split, a 55% relative gain in AUC (from 0.584) and a 349% relative gain in TPR (from 0.155) over the raw-loss baseline.
Feature Importance Analysis: Identified that cross-model reference signals (loss_diff_ft, ref_diff_2) account for 86% of the classifier's total feature importance, indicating that these reference-based signals expose memorization far more strongly than the neighborhood-perturbation features, and consistent with larger fine-tuned models leaving stronger memorization signatures.
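
To make the token-drop neighborhood idea above concrete, here is a minimal sketch. It assumes whitespace tokenization and uses the 10% drop rate and 5 neighbors quoted above. The names `token_drop_neighbors`, `neighborhood_gap`, and `loss_fn` are hypothetical and not taken from the project code; `loss_fn` stands in for whatever per-sample loss a fine-tuned model would return.

```python
import random
from statistics import mean
from typing import Callable, List


def token_drop_neighbors(text: str, n_neighbors: int = 5,
                         drop_rate: float = 0.10, seed: int = 0) -> List[str]:
    """Create perturbed copies of `text` by dropping each word with
    probability `drop_rate` (hypothetical helper, whitespace tokens)."""
    rng = random.Random(seed)  # seeded so the neighbors are reproducible
    words = text.split()
    neighbors = []
    for _ in range(n_neighbors):
        kept = [w for w in words if rng.random() >= drop_rate]
        # Avoid an empty neighbor: fall back to the original words.
        neighbors.append(" ".join(kept if kept else words))
    return neighbors


def neighborhood_gap(text: str, loss_fn: Callable[[str], float],
                     n_neighbors: int = 5, drop_rate: float = 0.10,
                     seed: int = 0) -> float:
    """Mean loss over the neighbors minus the loss of the original text.
    A large positive gap suggests the model fits the exact wording
    unusually well, which is the signal a neighborhood attack looks for."""
    neighbors = token_drop_neighbors(text, n_neighbors, drop_rate, seed)
    return mean(loss_fn(n) for n in neighbors) - loss_fn(text)
```

In a real pipeline, `loss_fn` would presumably wrap a forward pass through one of the SmolLM2 checkpoints and return the mean token-level cross-entropy. Because the perturbation needs no second model, generating neighbors costs only a list comprehension per sample, which is where the speedup over mask-and-fill would come from.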

Technical Stack

Python · PyTorch · Hugging Face Transformers · scikit-learn · Pandas · NumPy

Results Summary

Method	AUC	TPR @ FPR=0.1
Baseline (Raw Loss)	0.584	0.155
Improved Neighborhood Attack	0.907	0.697
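
For readers unfamiliar with the second metric, the sketch below shows one way TPR at a fixed FPR could be computed, along with the relative-gain arithmetic behind the reported percentages. It is a plain-Python illustration under the assumption that higher scores mean "more likely a member"; the function names are hypothetical, and the project itself may well have used scikit-learn's `roc_curve` instead.

```python
from typing import Sequence


def tpr_at_fpr(labels: Sequence[int], scores: Sequence[float],
               max_fpr: float = 0.1) -> float:
    """Best true-positive rate over all thresholds whose
    false-positive rate stays at or below `max_fpr`.
    labels: 1 = member, 0 = non-member."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    # Try each distinct score as a threshold: predict "member" if score >= thr.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= thr)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= thr)
        if fp / n_neg <= max_fpr:
            best = max(best, tp / n_pos)
    return best


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Relative gains implied by the table above.
auc_gain = relative_gain(0.907, 0.584)   # about 55.3
tpr_gain = relative_gain(0.697, 0.155)   # about 349.7
```

Reporting TPR at a low FPR alongside AUC is useful for membership inference because an attack matters most when it can flag members confidently while raising few false alarms.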

Link to repo

GitHub Repository