FraudCorpus: Multilingual Social Engineering Detection

Python · NLP · Machine Learning · Security · Corpus Linguistics

An end-to-end NLP pipeline for detecting phishing, spam, and social engineering attacks across English and Chinese text

Author

Tianhao Cao

Published

April 28, 2026

Project Overview

FraudCorpus is a multilingual annotated corpus and machine learning system for detecting social engineering fraud, developed as a course project for UBC COLX 523/581. The project spans the full NLP pipeline, from raw data collection and multi-layer human annotation to model training, active learning, and a live search interface.

Key Components

Corpus Construction: Sourced and preprocessed 250,000+ phishing/spam emails from Kaggle, supplemented with live Reddit data collected by a custom multi-modal scraper (text + OCR). The final gold-standard corpus contains 105 human-annotated samples across English and Mandarin Chinese.
Three-Layer Annotation Scheme: Each document is labeled with (1) a coarse category (Ham / Phish / Spam), (2) a social engineering scenario (8 types: HR_RECRUITING, FINANCE_TRADING, IT_SYSTEM, etc.), and (3) a psychological tactic (5 types: AUTHORITY, URGENCY, THREAT, REWARD, HELPFUL_SERVICE), plus 12-class Named Entity Recognition (NER) spanning PERSON, ORG, EMAIL, URL, MONEY, and more.
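Ensemble Model: Combined DistilBERT and TF-IDF + LinearSVC via weighted soft voting (30%/70%), achieving Macro-F1 = 0.8778 on the test set. This is a 37% relative improvement over the standalone TF-IDF + LinearSVC baseline (0.6405) and a 61% relative improvement over standalone fine-tuned DistilBERT (0.5444).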
Multi-Task Learning: Shared DistilBERT encoder trained jointly on classification (3 classes) and NER (21 BIO tags), with auxiliary NER loss weight λ = 0.3 and silver labels generated by spaCy.
Active Learning: Applied query-by-uncertainty (margin sampling) over 1,104 bootstrapped pseudo-labeled samples to identify the most informative examples for human review, correcting labeling errors before final model training.
Full-Text Search Interface: FastAPI backend with Whoosh indexing supports bilingual search (English and Chinese via jieba segmentation), document upload, and corpus statistics.
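
The ensemble, multi-task, and active-learning components above each rest on a small piece of arithmetic. The following is a minimal sketch of those three steps in plain Python, not the project's actual code. All function names are hypothetical. It assumes that the 30%/70% weights map to DistilBERT and TF-IDF + LinearSVC in the order listed, that the LinearSVC scores have already been converted to class probabilities (LinearSVC itself produces decision scores only, and the source does not say how this was handled), and that the joint loss is the classification loss plus λ times the NER loss.

```python
from typing import Dict, List, Sequence

LABELS = ["Ham", "Phish", "Spam"]  # coarse categories from the annotation scheme


def soft_vote(p_bert: Sequence[float], p_svm: Sequence[float],
              w_bert: float = 0.3, w_svm: float = 0.7) -> List[float]:
    """Weighted average of two class-probability vectors.

    Assumption: 0.3 goes to DistilBERT and 0.7 to TF-IDF + LinearSVC.
    """
    return [w_bert * a + w_svm * b for a, b in zip(p_bert, p_svm)]


def multitask_loss(cls_loss: float, ner_loss: float, lam: float = 0.3) -> float:
    """Assumed joint objective: classification loss + lam * auxiliary NER loss."""
    return cls_loss + lam * ner_loss


def margin(probs: Sequence[float]) -> float:
    """Gap between the two most probable classes; a small gap means uncertainty."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_for_review(pool: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Margin sampling: return the k document ids with the smallest margin."""
    return sorted(pool, key=lambda doc_id: margin(pool[doc_id]))[:k]


# Toy probabilities, invented for illustration only.
p_bert = [0.2, 0.5, 0.3]   # DistilBERT leans towards Phish
p_svm = [0.1, 0.2, 0.7]    # TF-IDF + LinearSVC leans towards Spam
combined = soft_vote(p_bert, p_svm)
pred = LABELS[combined.index(max(combined))]  # the heavier-weighted model wins

pool = {
    "doc_a": [0.90, 0.05, 0.05],  # confident prediction, large margin
    "doc_b": [0.40, 0.38, 0.22],  # near tie, smallest margin
    "doc_c": [0.55, 0.30, 0.15],
}
to_review = select_for_review(pool, k=2)  # ids to send to a human annotator
```

In this toy example the combined vote picks Spam because the higher-weighted model favours it, and `doc_b` is queued for review first because its top two class probabilities are almost tied.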

Technical Stack

Python · PyTorch · HuggingFace Transformers (DistilBERT) · scikit-learn · spaCy · FastAPI · Whoosh · jieba · Pandas

Results Summary

Sprint	Method	Test Macro-F1
S5	TF-IDF + LinearSVC (baseline)	0.6405
S5	DistilBERT fine-tuned	0.5444
S6	Motivated Ensemble (soft vote)	0.8778
S7	Multi-Task DistilBERT (λ=0.3)	0.4353
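
Macro-F1 averages the per-class F1 scores without weighting by class frequency, so a rare class counts as much as a common one. The sketch below, in plain Python with hypothetical helper names and toy labels invented for illustration, shows the metric and the relative gains implied by the table above.

```python
from typing import Dict, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# Toy labels, not taken from the corpus.
toy_true = ["Ham", "Ham", "Phish", "Spam"]
toy_pred = ["Ham", "Phish", "Phish", "Spam"]
toy_score = macro_f1(toy_true, toy_pred)  # (2/3 + 2/3 + 1) / 3

# Test Macro-F1 values copied from the results table.
RESULTS: Dict[str, float] = {
    "tfidf_svc": 0.6405,
    "distilbert": 0.5444,
    "ensemble": 0.8778,
    "multitask": 0.4353,
}
gain_vs_svc = relative_gain(RESULTS["ensemble"], RESULTS["tfidf_svc"])
gain_vs_bert = relative_gain(RESULTS["ensemble"], RESULTS["distilbert"])
```

With the table's values, the ensemble's relative gain works out to about 37% over TF-IDF + LinearSVC and about 61% over fine-tuned DistilBERT.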

Paper

Link to Repo