FraudCorpus: Multilingual Social Engineering Detection
Python · NLP · Machine Learning · Security · Corpus Linguistics
An end-to-end NLP pipeline for detecting phishing, spam, and social engineering attacks across English and Chinese text
Project Overview
A multilingual annotated corpus and machine learning system for detecting social engineering fraud, developed as a course project for UBC COLX 523/581. The project spans the full NLP pipeline — from raw data collection and multi-layer human annotation to model training, active learning, and a live search interface.
Key Components
- Corpus Construction: Sourced and preprocessed 250,000+ phishing/spam emails from Kaggle, alongside live Reddit data collected with a custom multi-modal scraper (text + OCR). The final gold-standard corpus contains 105 human-annotated samples across English and Mandarin Chinese.
- Three-Layer Annotation Scheme: Each document is labeled with (1) a coarse category (Ham / Phish / Spam), (2) a social engineering scenario (8 types: HR_RECRUITING, FINANCE_TRADING, IT_SYSTEM, etc.), and (3) a psychological tactic (5 types: AUTHORITY, URGENCY, THREAT, REWARD, HELPFUL_SERVICE), plus 12-class Named Entity Recognition (NER) spanning PERSON, ORG, EMAIL, URL, MONEY, and more.
- Ensemble Model: Combined DistilBERT and TF-IDF + LinearSVC via weighted soft voting (30%/70%), achieving Macro-F1 = 0.8778 on the test set, a 37% relative improvement over the TF-IDF + LinearSVC baseline (0.6405) and a 61% relative improvement over standalone DistilBERT (0.5444).
- Multi-Task Learning: Shared DistilBERT encoder trained jointly on classification (3 classes) and NER (21 BIO tags), with auxiliary NER loss weight λ = 0.3 and silver labels generated by spaCy.
- Active Learning: Applied query-by-uncertainty (margin sampling) over 1,104 bootstrapped pseudo-labeled samples to identify the most informative examples for human review, correcting labeling errors before final model training.
- Full-Text Search Interface: FastAPI backend with Whoosh indexing supports bilingual search (English and Chinese via jieba segmentation), document upload, and corpus statistics.
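The weighted soft voting and margin sampling steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function names and toy scores are invented, the 0.3/0.7 weights are assumed to map to DistilBERT and TF-IDF + LinearSVC in that order, and because `LinearSVC` does not expose class probabilities, a softmax over its decision margins is assumed here as a stand-in (a calibrated classifier would be another option).

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax: turns logits or SVM margins into pseudo-probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def weighted_soft_vote(p_neural, p_linear, w_neural=0.3, w_linear=0.7):
    """Blend two (n_samples, n_classes) probability matrices, then take the argmax.

    The weight-to-model assignment is an assumption based on the order in the text.
    """
    blended = w_neural * p_neural + w_linear * p_linear
    return blended, blended.argmax(axis=1)


def margin_ranking(probs):
    """Margin sampling: rank samples by the gap between their top two classes.

    A small gap means the model is torn between two labels, so those
    samples are placed first in the review queue.
    """
    top2 = np.sort(probs, axis=1)[:, -2:]
    margins = top2[:, 1] - top2[:, 0]
    return np.argsort(margins), margins


# Toy scores invented for illustration: 3 documents, classes Ham / Phish / Spam.
distilbert_logits = np.array([[2.0, 0.5, 0.1], [0.2, 0.3, 0.25], [0.1, 1.5, 1.4]])
svc_margins = np.array([[1.2, -0.4, -0.9], [-0.1, 0.2, 0.1], [-1.0, 0.3, 0.9]])

blended, labels = weighted_soft_vote(softmax(distilbert_logits), softmax(svc_margins))
review_order, margins = margin_ranking(blended)

print("predicted class ids:", labels.tolist())
print("review order (most uncertain first):", review_order.tolist())
```

On this toy data the third document shows why the blend matters: the neural scores slightly favour Phish, the linear scores favour Spam, and the 70% weight on the linear model decides the vote. The second document has the smallest margin, so it would be sent for human review first.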
Technical Stack
Python · PyTorch · HuggingFace Transformers (DistilBERT) · scikit-learn · spaCy · FastAPI · Whoosh · jieba · Pandas
Results Summary
| Sprint | Method | Test Macro-F1 |
|---|---|---|
| S5 | TF-IDF + LinearSVC (baseline) | 0.6405 |
| S5 | DistilBERT fine-tuned | 0.5444 |
| S6 | Motivated Ensemble (soft vote) | 0.8778 |
| S7 | Multi-Task DistilBERT (λ=0.3) | 0.4353 |
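The relative gains implied by the table can be recomputed directly from the reported scores. The snippet below is a small arithmetic check; the dictionary keys and the helper name `relative_gain` are illustrative choices, while the numbers are copied from the table.

```python
# Test Macro-F1 values copied from the results table.
scores = {
    "tfidf_linearsvc": 0.6405,
    "distilbert": 0.5444,
    "ensemble": 0.8778,
    "multitask_distilbert": 0.4353,
}


def relative_gain(new: float, old: float) -> float:
    """Relative change of `new` over `old`, expressed as a percentage."""
    return 100.0 * (new - old) / old


gain_vs_svc = relative_gain(scores["ensemble"], scores["tfidf_linearsvc"])
gain_vs_bert = relative_gain(scores["ensemble"], scores["distilbert"])

print(f"Ensemble vs TF-IDF + LinearSVC: {gain_vs_svc:.1f}%")
print(f"Ensemble vs DistilBERT:         {gain_vs_bert:.1f}%")
```

The ensemble gains roughly 37% over the TF-IDF + LinearSVC baseline and roughly 61% over the fine-tuned DistilBERT model, both measured as relative change in test Macro-F1.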