Deep Learning for Recommender Systems
Overview
Recommender systems are among the most widely deployed machine learning applications in industry. Every time you see “Recommended for you” on Netflix, Amazon, Spotify, or YouTube, a recommender system is at work. Despite their ubiquity, recommender systems are rarely covered in depth in ML courses.
Traditionally, collaborative filtering methods such as matrix factorization have been the backbone of recommendation. In recent years, deep learning approaches, including neural collaborative filtering, two-tower retrieval models, and session-based architectures, have been proposed as alternatives that can capture richer user-item interactions. Whether these more complex models actually outperform well-tuned classical baselines remains an active and sometimes contentious debate. Dacrema, Cremonesi, and Jannach (2019) showed that many published deep learning recommender models failed to beat properly tuned classical baselines, a finding that should inform how you approach this assignment.
In this assignment, you will build and evaluate both classical and deep learning-based recommender systems on public datasets. Your goal is to critically assess whether the added complexity of deep learning translates into meaningful improvements in recommendation quality.
Industry context: the multi-stage pipeline
Production recommender systems at companies like YouTube, Netflix, or TikTok are not single models. They typically follow a multi-stage pipeline: (1) candidate generation, where a fast model (often a two-tower architecture) retrieves a few hundred candidates from millions of items; (2) ranking, where a more expressive model scores and orders these candidates; and (3) re-ranking, where business rules, diversity constraints, and freshness considerations produce the final list. In this assignment, you will focus on the ranking stage and offline evaluation. Keep the full pipeline in mind, though: understanding where your model fits in a production system is something interviewers and hiring managers care about.
Assignment Objectives
- Understand and implement collaborative filtering approaches (both classical and neural)
- Compare deep learning-based recommender models against well-tuned classical baselines
- Evaluate recommendation quality using ranking metrics with a proper temporal evaluation protocol
- Analyze the trade-offs between model complexity, training cost, and recommendation quality
- Reflect on when deep learning adds genuine value in recommendation settings
Implicit vs. Explicit Feedback
Before diving into methods, it is worth clarifying a distinction that shapes the entire modelling pipeline.
Explicit feedback consists of direct user ratings (e.g., 1-5 stars on MovieLens). The signal is clear but sparse: most users rate only a small fraction of items.
Implicit feedback consists of behavioural signals like clicks, purchases, or play counts (e.g., Last.fm listening history, Amazon purchase data). Implicit feedback is far more abundant, and most production systems rely on it, but it introduces a fundamental asymmetry: you observe positive interactions, but the absence of an interaction does not mean the user dislikes the item. She may simply not have seen it. This asymmetry affects loss function design (pointwise vs. pairwise vs. BPR loss), requires a negative sampling strategy during training (since you cannot treat all unobserved items as negatives), and changes how you evaluate predictions. We require you to work with both types of feedback in this assignment; see the Datasets section below.
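To make the loss-design point concrete, here is a minimal NumPy sketch of the BPR (pairwise) loss, under the assumption that users and items are represented by embedding vectors; all names and dimensions are illustrative:

```python
import numpy as np

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """BPR pairwise loss: push the score of an observed (positive) item
    above that of a sampled unobserved (negative) item for the same user."""
    pos_scores = np.sum(user_emb * pos_item_emb, axis=-1)  # dot products
    neg_scores = np.sum(user_emb * neg_item_emb, axis=-1)
    # -log sigmoid(pos - neg), written as log1p(exp(-x)) for stability
    x = pos_scores - neg_scores
    return float(np.mean(np.log1p(np.exp(-x))))

# Toy batch: 4 users, embedding dimension 8
rng = np.random.default_rng(0)
loss = bpr_loss(rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)),
                rng.normal(size=(4, 8)))
```

Pointwise losses (e.g., binary cross-entropy over sampled negatives) are an equally valid choice; the point is that with implicit feedback the choice must be deliberate.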
Methods
You should implement and compare models from both paradigms. We require one non-personalized baseline, one classical collaborative filtering model, and at least two deep learning models. Keeping the model count small frees up time for deeper analysis: hyperparameter sensitivity, ablation studies, or embedding visualizations.
Baselines
- Popularity baseline: recommend the most popular items (non-personalized). This is a sanity check; any personalized model that cannot beat it has a problem.
- Matrix Factorization: SVD or ALS-based collaborative filtering (e.g., using Surprise for explicit data, or the implicit library for implicit data)
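The popularity baseline needs nothing beyond counting. A minimal sketch (interaction format and names are illustrative):

```python
from collections import Counter

def popularity_recommend(train_interactions, k=10):
    """Non-personalized baseline: rank items by interaction count in the
    training data and recommend the same top-k list to every user.

    train_interactions: iterable of (user_id, item_id) pairs.
    """
    counts = Counter(item for _, item in train_interactions)
    return [item for item, _ in counts.most_common(k)]

train = [(1, "a"), (2, "a"), (3, "a"), (1, "b"), (2, "b"), (3, "c")]
top2 = popularity_recommend(train, k=2)  # → ['a', 'b']
```

In practice you would also filter out items a user has already interacted with before scoring the list.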
Deep Learning Models (choose at least two)
- Neural Collaborative Filtering (NCF): the original NCF model from He et al. (2017) is a hybrid that combines a Generalized Matrix Factorization (GMF) branch with an MLP branch. The GMF branch performs element-wise product of user and item embeddings (a generalization of the standard dot product), while the MLP branch operates on concatenated embeddings to learn nonlinear interactions. The two branches are fused for the final prediction. Note that subsequent work (Dacrema, Cremonesi, and Jannach 2019) has questioned whether this architecture reliably outperforms well-tuned dot-product baselines.
- Two-tower model: separate user and item encoder networks that produce fixed-size embeddings, combined via dot product or cosine similarity. The use of a simple inner product (rather than a learned cross-attention) is deliberate: it enables fast approximate nearest neighbour retrieval at inference time, which is why two-tower models are the standard architecture for candidate generation in industry.
- Session-based models: GRU4Rec or SASRec (Transformer-based), which model sequential user behaviour. These require timestamped interaction data with meaningful temporal ordering.
- Autoencoders: e.g., MultVAE, which treats recommendation as a generative modelling problem.
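To fix the NCF two-branch design in mind, here is a minimal PyTorch sketch; embedding dimensions and layer sizes are illustrative, not those of the paper:

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """Minimal NCF: a GMF branch (element-wise product of embeddings)
    fused with an MLP branch (concatenated embeddings), following the
    structure of He et al. (2017)."""

    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        # Separate embedding tables per branch, as in the original model
        self.user_gmf = nn.Embedding(n_users, dim)
        self.item_gmf = nn.Embedding(n_items, dim)
        self.user_mlp = nn.Embedding(n_users, dim)
        self.item_mlp = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim // 2), nn.ReLU(),
        )
        # Final layer fuses the two branches into a single score
        self.out = nn.Linear(dim + dim // 2, 1)

    def forward(self, users, items):
        gmf = self.user_gmf(users) * self.item_gmf(items)          # GMF branch
        mlp = self.mlp(torch.cat([self.user_mlp(users),
                                  self.item_mlp(items)], dim=-1))  # MLP branch
        return self.out(torch.cat([gmf, mlp], dim=-1)).squeeze(-1)

model = NCF(n_users=100, n_items=50)
scores = model(torch.tensor([0, 1]), torch.tensor([3, 7]))  # shape (2,)
```

With a sigmoid on the output and binary cross-entropy over sampled negatives, this trains as a pointwise implicit-feedback model; compare it against a plain dot-product baseline before crediting the MLP branch.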
Libraries such as RecBole, LensKit, Surprise, or plain PyTorch are all suitable. You are free to use existing implementations but must demonstrate understanding of the model you are using.
Datasets
Use two datasets: one with explicit feedback and one with implicit feedback. This pairing forces you to adapt your training pipeline and evaluation to both settings.
- MovieLens 1M or 100K (explicit feedback): movie ratings, the classic benchmark. Well-documented, clean, and small enough for Colab.
- Last.fm (implicit feedback): music listening counts. A natural complement to MovieLens since it requires you to handle implicit signals.
- Amazon Product Reviews (implicit feedback): purchase and review data. Use a manageable category subset such as Digital Music or Video Games. Avoid “Books”, which is very large even after subsampling. Use the 2023 version of the dataset.
Avoid Goodreads (access has become unreliable) and Yelp (very large, requires significant preprocessing that is not the focus of this assignment).
These datasets are small enough to train on Google Colab. If you use a large dataset, subsample it appropriately and document your choices.
Evaluation Protocol
Since recommendation is fundamentally a ranking problem, use ranking-aware metrics. Report the following for each model, computed on a held-out test set:
- Hit Rate at K (HR@K): fraction of users for whom a relevant item appears in the top-K recommendations
- Normalized Discounted Cumulative Gain at K (NDCG@K): measures ranking quality, giving more credit to relevant items ranked higher
- Mean Average Precision (MAP): average precision across users
Use \(K = 10\) as the primary cutoff, and optionally report \(K = 5\) and \(K = 20\) for comparison.
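These metrics are easy to get subtly wrong, so it is worth having a small reference implementation to sanity-check library output against. A sketch under binary relevance (ranked lists and relevant sets are hypothetical):

```python
import math

def hr_at_k(ranked, relevant, k=10):
    """Hit Rate@K: 1 if any relevant item appears in the top-k, else 0."""
    return int(any(item in relevant for item in ranked[:k]))

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@K: DCG of the ranked list over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def average_precision(ranked, relevant):
    """AP for one user: mean of precision@i at each position holding a
    relevant item. MAP is the mean of AP over users."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / len(relevant) if relevant else 0.0

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
hr2 = hr_at_k(ranked, relevant, k=2)       # 1: "a" is in the top-2
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
```

Note that averaging per-user metrics weights every user equally regardless of activity; state this convention in your report.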
Binarization of graded relevance
MAP and HR@K assume binary relevance. For datasets with graded ratings (e.g., MovieLens), you must choose a binarization threshold: for instance, rating \(\geq 4\) counts as “relevant”. State your threshold explicitly and keep it consistent across all models. This choice affects absolute metric values, so it must be documented.
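For example, with ratings in a pandas DataFrame (column names are illustrative), binarization is a one-liner that you should apply identically everywhere:

```python
import pandas as pd

ratings = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 10],
                        "rating": [5.0, 3.0, 4.0]})
THRESHOLD = 4.0  # document this choice and keep it fixed across all models
ratings["relevant"] = (ratings["rating"] >= THRESHOLD).astype(int)
```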
Temporal train/test splitting
Do not split data randomly. Random splitting creates data leakage: you end up “predicting” interactions that happened before your training data. Instead, split by timestamp. A reasonable default: train on the first 80% of each user’s interactions chronologically, validate on the next 10%, and test on the final 10%. Use the validation set for hyperparameter selection.
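The per-user chronological split described above can be sketched with pandas (column names are illustrative):

```python
import pandas as pd

def temporal_split(df, train_frac=0.8, val_frac=0.1):
    """Split each user's interactions chronologically into train/val/test.

    df needs columns: user, item, timestamp.
    """
    df = df.sort_values(["user", "timestamp"])
    # Position of each interaction within its user's history, as a
    # fraction in [0, 1): early interactions go to train, late to test
    pos = df.groupby("user").cumcount()
    size = df.groupby("user")["item"].transform("size")
    frac = pos / size
    train = df[frac < train_frac]
    val = df[(frac >= train_frac) & (frac < train_frac + val_frac)]
    test = df[frac >= train_frac + val_frac]
    return train, val, test
```

One consequence worth noting in your report: users with very short histories may contribute nothing to validation or test under this scheme.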
Negative sampling note
For implicit feedback datasets, you will need a negative sampling strategy during training (e.g., uniform random sampling of unobserved items). The choice of how many negatives to sample per positive, and how to sample them, can significantly affect model performance. Document your choices and, ideally, test sensitivity to the negative sampling ratio.
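A minimal sketch of uniform negative sampling with rejection of observed positives (names and the ratio are illustrative):

```python
import random

def sample_negatives(user_pos_items, all_items, n_neg=4, rng=random):
    """Uniformly sample n_neg unobserved items for one positive interaction.

    user_pos_items: set of item ids the user has interacted with.
    all_items: list of all item ids in the catalogue.
    """
    negatives = []
    while len(negatives) < n_neg:
        candidate = rng.choice(all_items)
        if candidate not in user_pos_items:  # reject observed positives
            negatives.append(candidate)
    return negatives

negs = sample_negatives({1, 2}, list(range(100)), n_neg=4,
                        rng=random.Random(0))
```

Uniform sampling is the simplest strategy; popularity-weighted sampling is a common alternative worth including in a sensitivity test.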
Cold-start evaluation
To evaluate cold-start behaviour, hold out all users with fewer than 5 interactions in the training set. Train your models without these users. Then evaluate your models’ ability to generate recommendations for these held-out users using only their limited interaction history. Report metrics separately for cold-start users and regular users.
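The holdout can be sketched as follows, assuming interactions in a pandas DataFrame with illustrative column names:

```python
import pandas as pd

def split_cold_start(interactions, min_interactions=5):
    """Separate cold-start users (fewer than min_interactions) from
    regular users, so models are trained without the cold-start group
    and metrics can be reported separately for each."""
    counts = interactions.groupby("user")["item"].size()
    cold_users = set(counts[counts < min_interactions].index)
    is_cold = interactions["user"].isin(cold_users)
    return interactions[~is_cold], interactions[is_cold]
```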
Key Questions to Address
- Performance: do deep learning models outperform well-tuned matrix factorization? By how much? Ensure your baselines are properly tuned before drawing conclusions; see Dacrema, Cremonesi, and Jannach (2019) for cautionary context.
- Cold start: how do different models handle cold-start users (those with fewer than 5 interactions)? Report cold-start metrics separately.
- Sequential patterns: if you implement a session-based model (GRU4Rec, SASRec), does modelling temporal dynamics help? Or is the ordering of interactions not very informative for your datasets?
- Scalability: how do training times compare across methods? Is the compute cost of deep models justified by the performance gain?
- Practical recommendation: given your results, which approach would you recommend to a company building a recommender system from scratch? Under what conditions?
Deliverables
You must submit:
- A written report (PDF)
- Maximum length: 10 pages
- No code in the report
- Clear explanation of your datasets, methodology, experimental setup, results, and analysis
- Include tables and/or figures comparing model performance
- Include a “Lessons from Baselines” section: discuss what you learned from the popularity baseline and matrix factorization results before presenting deep learning numbers
- Your code
- Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
- All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
- Well organized, with a requirements.txt or environment.yml
Why This Matters
Recommendation is one of the largest commercial applications of machine learning. Companies like Netflix, Spotify, Amazon, and TikTok invest heavily in their recommender systems because even small improvements in recommendation quality translate directly into user engagement and revenue. Understanding the landscape of recommendation methods, from simple collaborative filtering to deep sequential models, and knowing how to evaluate them rigorously, prepares you for one of the most common ML roles in industry. This assignment gives you the chance to build, evaluate, and reason about these systems from the ground up.