Fraud and Anomaly Detection with Deep Learning

DSS5104

Overview

Fraud detection and anomaly detection are critical applications of machine learning in finance, cybersecurity, and healthcare. These problems share a common structure: the events you care about (fraudulent transactions, network intrusions) are rare, and the cost of missing them is high. This creates a set of challenges (extreme class imbalance, threshold sensitivity, and asymmetric error costs) that standard classification approaches often handle poorly.

Classical methods such as isolation forests, one-class SVM, and gradient boosting with class weights have been the workhorses of anomaly detection. More recently, deep learning approaches, particularly autoencoders and variational autoencoders (VAEs), have been proposed as alternatives that can learn richer representations of “normal” behavior and flag deviations from it.

In this assignment, you will build and evaluate both classical and deep learning-based anomaly detection systems on real-world imbalanced datasets. The central question is practical: do deep learning methods provide meaningful improvements over simpler approaches, and at what cost?


Assignment Objectives

  • Understand the specific challenges of learning from imbalanced data
  • Implement and compare classical anomaly detection methods with deep learning approaches
  • Engineer domain-specific features and measure their impact on performance
  • Evaluate models using metrics appropriate for imbalanced settings (not just accuracy)
  • Analyze the practical trade-offs between false positives and false negatives through cost-sensitive evaluation
  • Develop intuition for threshold selection in realistic deployment scenarios

Methods

You should implement and compare methods from both paradigms.

Classical Baselines

  • Isolation Forest: tree-based anomaly scoring
  • Local Outlier Factor (LOF): density-based anomaly detection
  • Gradient Boosting (e.g., XGBoost or LightGBM) with appropriate class weighting or oversampling (SMOTE)
  • Optionally: One-Class SVM or Elliptic Envelope
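To make the scoring conventions concrete, here is a minimal sketch of the first two baselines using scikit-learn on synthetic data (the dataset here is invented for illustration; your real features will differ). Note that both estimators return scores where *lower* means *more anomalous*, so we negate them to get the usual "higher = more anomalous" convention:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data: 500 "normal" points plus 10 far-away outliers.
X_normal = rng.normal(0, 1, size=(500, 4))
X_outlier = rng.normal(8, 1, size=(10, 4))
X = np.vstack([X_normal, X_outlier])

# Isolation Forest: negate score_samples so higher = more anomalous.
iso = IsolationForest(n_estimators=200, random_state=0).fit(X)
iso_scores = -iso.score_samples(X)

# LOF (default novelty=False) scores the data it was fit on;
# negative_outlier_factor_ is more negative for outliers, so negate it.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_

# The injected outliers (rows 500-509) should receive the highest scores.
print(iso_scores[:500].mean(), iso_scores[500:].mean())
```

Keeping every method on the same "higher = more anomalous" scale makes the later threshold and cost analyses directly comparable across models.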

Deep Learning Approaches (choose at least two)

  • Autoencoder: train on normal data, flag high-reconstruction-error samples as anomalous
  • Variational Autoencoder (VAE): use the reconstruction error (or a combination of reconstruction error and KL divergence) as an anomaly score. Note that the standard approach draws several latent samples from the encoder's approximate posterior for each input, decodes each sample, and averages the resulting reconstruction errors into a single score, which is then thresholded.
  • Supervised deep learning: a standard neural network classifier with class weighting, focal loss, or oversampling
  • Deep SVDD (optional, stretch goal): a neural network variant of one-class classification. Be warned that Deep SVDD is notoriously sensitive to hyperparameters and architecture choices. Only attempt this if you have time after completing the required components.

For autoencoders and VAEs, you should compare two training regimes:

  1. Semi-supervised: train only on “normal” (non-fraudulent) samples, then score all test samples by reconstruction error. The idea is that the model learns what normal looks like and flags anything it cannot reconstruct well.
  2. Supervised: train on the full labeled training set (both normal and anomalous), potentially using class weights or a modified loss.

Compare the two regimes and discuss which works better and why.
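As a rough illustration of the semi-supervised regime, the sketch below uses scikit-learn's MLPRegressor as a stand-in autoencoder (a network trained to reproduce its own input through a narrow hidden layer); in your assignment you would likely use PyTorch for more control over the architecture. The data here is synthetic and the column structure hypothetical:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: "normal" samples cluster near 0, anomalies near 6.
X_normal = rng.normal(0, 1, size=(1000, 8))
X_anom = rng.normal(6, 1, size=(20, 8))

# Semi-supervised regime: fit scaler and autoencoder on NORMAL data only.
scaler = StandardScaler().fit(X_normal)
Xn = scaler.transform(X_normal)

# The 4-unit bottleneck forces a compressed representation of "normal".
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=2000, random_state=0)
ae.fit(Xn, Xn)  # target = input: learn to reconstruct

def anomaly_score(X):
    """Per-sample reconstruction MSE: high error = poorly modeled."""
    Xs = scaler.transform(X)
    return ((ae.predict(Xs) - Xs) ** 2).mean(axis=1)

print(anomaly_score(X_normal).mean(), anomaly_score(X_anom).mean())
```

The supervised regime replaces this with an ordinary classifier trained on both classes; the interesting comparison is how each degrades when fraud patterns shift between train and test.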

You are free to use existing implementations (e.g., PyOD, scikit-learn, PyTorch) but must demonstrate understanding of the methods you use.


Datasets

Work with two datasets: one primary dataset for thorough analysis, and one secondary dataset to test whether your findings generalize.

Primary Dataset (required)

IEEE-CIS Fraud Detection (Kaggle): a large-scale e-commerce fraud dataset with raw features (transaction amount, card info, device info, email domain, etc.). This dataset is the recommended primary choice because its raw features allow meaningful feature engineering.

Secondary Dataset (choose one)

  • Credit Card Fraud (Kaggle): ~284,000 transactions, 492 fraudulent (0.17%). Note that all features are PCA-transformed and anonymized, so feature engineering is not possible on this dataset. It serves as a modeling-only benchmark.
  • CICIDS2017: modern network intrusion detection dataset with labeled attack types, reflecting current attack patterns.
  • PaySim: a synthetic mobile money dataset with realistic transaction features and full transparency about the generation process.

All of these are publicly available. Choose datasets with genuine class imbalance; do not artificially balance them before modeling.


Feature Engineering

On your primary dataset (IEEE-CIS), you must create aggregate and domain-specific features and measure their impact on model performance. Examples include:

  • Transaction velocity: number of transactions per user/card in the last N hours
  • Spending pattern deviations: ratio of current transaction amount to the user’s average
  • Time-based features: hour of day, time since last transaction
  • Aggregated statistics: mean, max, and standard deviation of transaction amounts per user over rolling windows
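A sketch of how these features can be computed with pandas, using a tiny invented transaction log (column names like `card_id`, `ts`, and `amount` are placeholders; the IEEE-CIS schema differs):

```python
import pandas as pd

# Hypothetical minimal transaction log, sorted per card by time.
df = pd.DataFrame({
    "card_id": ["A", "A", "B", "A", "B", "A"],
    "ts": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:30", "2024-01-01 11:00",
        "2024-01-01 12:10", "2024-01-02 09:00", "2024-01-02 09:05",
    ]),
    "amount": [20.0, 25.0, 300.0, 22.0, 310.0, 400.0],
}).sort_values(["card_id", "ts"])

g = df.groupby("card_id")

# Time since the card's previous transaction (seconds).
df["secs_since_prev"] = g["ts"].diff().dt.total_seconds()

# Transaction velocity: this card's transaction count in the last 2 hours.
df["txn_last_2h"] = (
    df.set_index("ts").groupby("card_id")["amount"]
      .rolling("2h").count().values
)

# Spending deviation: amount relative to the card's running mean.
# (Expanding mean includes the current row; shift it if you want to
# exclude the current transaction from its own baseline.)
df["amount_vs_mean"] = df["amount"] / g["amount"].transform(
    lambda s: s.expanding().mean()
)

print(df[["card_id", "txn_last_2h", "secs_since_prev", "amount_vs_mean"]])
```

When computing these on a temporally split dataset, make sure rolling statistics for test-set rows only look backward in time, never across the split into the future.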

Run your best classical model (e.g., gradient boosting) with and without these engineered features and report the difference in AUPRC. This comparison is a required component of your report.

Note: the Credit Card Fraud dataset has anonymized PCA features, so feature engineering applies only to IEEE-CIS (or PaySim/CICIDS2017 if chosen as secondary).


Data Splitting

You must split data chronologically. Random splits are not acceptable for transaction data. Fraud data is sequential, and random splits leak future information into the training set, producing inflated metrics that do not reflect real-world performance.

Use the timestamp or time-ordering information available in your dataset to create a temporal train/test split. Justify your choice of split point in the report (e.g., “we used the first 80% of transactions by time for training and the last 20% for testing”).

If your secondary dataset lacks explicit timestamps, state this and explain your splitting strategy.
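The mechanics of a chronological split are simple; the sketch below (on synthetic data with an assumed `ts` timestamp column) sorts by time and cuts at the 80% mark so that every training row strictly precedes every test row:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in: 1000 transactions over ~90 days.
df = pd.DataFrame({
    "ts": pd.Timestamp("2024-01-01")
          + pd.to_timedelta(rng.integers(0, 90, 1000), unit="D"),
    "amount": rng.exponential(50, 1000),
})

# Sort by time, then take the first 80% of rows for training so that
# no future information leaks into the training set.
df = df.sort_values("ts").reset_index(drop=True)
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

print("train ends:", train["ts"].max(), "| test starts:", test["ts"].min())
```

One subtlety worth discussing in your report: with ties at the boundary (many transactions sharing a timestamp), a row-count cut can split a single timestamp across train and test; cutting at a timestamp instead avoids this.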


Evaluation Metrics

Standard accuracy is not appropriate for imbalanced datasets. Report the following:

  • Area Under the Precision-Recall Curve (AUPRC): the primary metric, as it focuses on the minority class
  • ROC AUC: useful for overall discrimination ability
  • Precision, Recall, and F1 at a chosen operating threshold
  • Precision-Recall curve: plotted for your best models, showing the trade-off explicitly
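All of these are available in scikit-learn; the sketch below computes them on synthetic scores with a 1% positive rate (`average_precision_score` is scikit-learn's standard summary of the precision-recall curve, a close relative of trapezoidal AUPRC):

```python
import numpy as np
from sklearn.metrics import (average_precision_score, roc_auc_score,
                             precision_recall_curve, f1_score)

rng = np.random.default_rng(0)
# Synthetic evaluation: ~1% positives, informative but imperfect scores.
y = (rng.random(10000) < 0.01).astype(int)
scores = rng.normal(0, 1, 10000) + 2.0 * y

auprc = average_precision_score(y, scores)   # primary metric
auroc = roc_auc_score(y, scores)
prec, rec, thr = precision_recall_curve(y, scores)  # for the PR plot

# Point metrics require an operating threshold, e.g. flag the top 1%.
t = np.quantile(scores, 0.99)
f1 = f1_score(y, (scores >= t).astype(int))
print(f"AUPRC={auprc:.3f}  ROC AUC={auroc:.3f}  F1@top1%={f1:.3f}")
```

Note that a random classifier achieves AUPRC roughly equal to the positive rate (here ~0.01), not 0.5; always report this baseline so readers can calibrate your numbers.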

Cost-Sensitive Evaluation

Beyond standard metrics, perform the following exercise for your primary dataset. Assume:

  • A false negative (missed fraud) costs $500 on average
  • A false positive (legitimate transaction flagged) costs $2 in customer friction

For your best model, compute the expected cost per transaction as a function of the decision threshold. Find the threshold that minimizes total expected cost. Compare this optimal threshold to the one you would choose by maximizing F1. Are they different? Discuss why.
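The threshold sweep can be done directly from model scores; here is a sketch on synthetic scores using the assumed $500/$2 costs (in your report, run this on your actual test-set scores):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for test-set scores, ~1% fraud.
y = (rng.random(20000) < 0.01).astype(int)
scores = rng.normal(0, 1, 20000) + 2.5 * y

C_FN, C_FP = 500.0, 2.0   # assignment's assumed per-error costs

def expected_cost(threshold):
    """Average dollar cost per transaction at a given decision threshold."""
    pred = scores >= threshold
    fn = np.sum((y == 1) & ~pred)   # missed fraud
    fp = np.sum((y == 0) & pred)    # legitimate transactions flagged
    return (C_FN * fn + C_FP * fp) / len(y)

# Sweep thresholds across the score distribution and pick the minimum.
thresholds = np.quantile(scores, np.linspace(0.01, 0.999, 200))
costs = np.array([expected_cost(t) for t in thresholds])
t_cost = thresholds[np.argmin(costs)]
print(f"cost-optimal threshold {t_cost:.2f}, "
      f"expected cost ${costs.min():.3f}/txn")
```

Because a missed fraud costs 250 times more than a false alarm under these assumptions, the cost-optimal threshold typically sits well below the F1-optimal one, trading many cheap false positives for a few fewer expensive false negatives.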


Error Analysis

For your best model on the primary dataset, manually inspect 10-20 false positives and 10-20 false negatives. For each group:

  • Are there common patterns? (e.g., false positives concentrated in a particular transaction type, false negatives involving small amounts)
  • Can you categorize the errors into a few types?
  • Do the errors suggest missing features or model limitations?

Report your findings. This analysis often reveals more about model behavior than aggregate metrics alone.
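Selecting the samples to inspect is straightforward once you have scores and labels side by side; one reasonable convention, sketched below on synthetic data with invented feature columns, is to look at the most *confident* mistakes first, since they tend to be the most informative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test-set results: a couple of features, label, model score.
test = pd.DataFrame({
    "amount": rng.exponential(50, 5000),
    "hour": rng.integers(0, 24, 5000),
    "label": (rng.random(5000) < 0.02).astype(int),
})
test["score"] = rng.normal(0, 1, 5000) + 2.0 * test["label"]

t = test["score"].quantile(0.98)          # example operating threshold
test["pred"] = (test["score"] >= t).astype(int)

# Highest-scoring legitimate transactions and lowest-scoring frauds:
# the model's most confident errors in each direction.
false_pos = (test[(test["pred"] == 1) & (test["label"] == 0)]
             .nlargest(15, "score"))
false_neg = (test[(test["pred"] == 0) & (test["label"] == 1)]
             .nsmallest(15, "score"))

print(false_pos[["amount", "hour", "score"]].describe())
print(false_neg[["amount", "hour", "score"]].describe())
```

The code only selects the samples; the actual analysis, reading the rows and looking for shared transaction types, amounts, or time patterns, is manual and belongs in your report.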


Key Questions to Address

  1. Performance: do deep learning models detect anomalies more effectively than classical methods? On which datasets and under what conditions?
  2. Feature engineering impact: how much do engineered features improve performance compared to using raw features alone?
  3. Semi-supervised vs. supervised: for autoencoders/VAEs, does training only on normal data outperform training on the full labeled set? Under what conditions?
  4. Representation learning: do autoencoders/VAEs learn useful representations of “normal” behavior? Can you visualize the learned latent space?
  5. Threshold sensitivity: how sensitive are your results to the choice of decision threshold? How does the cost-optimal threshold differ from the F1-optimal one?
  6. Practical deployment: given your results, what approach would you recommend to a fintech company building a fraud detection system? What factors beyond raw performance matter (inference speed, interpretability, maintenance)?

Deliverables

You must submit:

  1. A written report (PDF)
    • Maximum length: 10 pages
    • No code in the report
    • Clear explanation of your datasets, feature engineering, temporal splitting strategy, methodology, results, and analysis
    • Include precision-recall curves, performance comparison tables, cost-threshold analysis, error analysis findings, and any relevant visualizations
  2. Your code
    • Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
    • All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
    • Well organized, with a requirements.txt or environment.yml

Why This Matters

Fraud detection is one of the highest-impact applications of machine learning in industry. Financial institutions, e-commerce platforms, and insurance companies all rely on automated systems to flag suspicious activity. The challenges you will encounter in this assignment (class imbalance, temporal data splitting, feature engineering, threshold tuning, asymmetric costs, and the tension between precision and recall) are exactly the challenges faced by ML engineers working on these problems in production. Learning to navigate them with both classical and deep learning tools is a valuable and transferable skill.