Deep Learning for Tabular Data
Overview
The goal of this assignment is to critically explore and evaluate recent deep learning methods for tabular data prediction. While deep learning has transformed vision and language, tabular data (typically heterogeneous in feature types and small-to-medium in size) remains a domain where simpler methods frequently win. Linear models still dominate in regulated industries (credit scoring, insurance) where interpretability is paramount, and tree-based models such as XGBoost consistently perform well across a wide range of tabular tasks with minimal preprocessing. Yet in recent years, new deep learning architectures designed specifically for tabular data have emerged, including TabNet, NODE, FT-Transformer, and others. These models attempt to bring the benefits of representation learning to structured data.
The empirical picture is nuanced. Grinsztajn et al. (2022) provide evidence that tree-based models still outperform deep learning on typical tabular data, while Gorishniy et al. (2021) show that well-tuned deep models can be competitive on certain benchmarks. Your job is to run your own experiments and form your own view.
In this assignment, you will evaluate the performance, robustness, preprocessing requirements, and inference cost of a selection of these methods using publicly available implementations on a variety of non-trivial tabular prediction tasks. You are not expected to re-implement these models from scratch, but you are expected to understand how to use them properly and interpret their behavior.
Assignment Objectives
- Understand and apply recent deep learning methods for tabular data using existing libraries
- Critically assess their performance across multiple datasets using a rigorous experimental protocol
- Compare their results against classical baselines (gradient boosting, logistic regression, random forests)
- Analyze the practical trade-offs: preprocessing effort, training cost, inference speed, and sensitivity to hyperparameters
- Reflect on when and why these methods may (or may not) be preferable to classical approaches
Methods
You must compare at least two deep learning methods and at least two classical baselines.
Deep learning methods (pick at least 2):
- TabNet
- FT-Transformer
- NODE (Neural Oblivious Decision Ensembles)
- TabTransformer
- SAINT
- Tabular ResNet (as described in Gorishniy et al., 2021)
Classical baselines (pick at least 2):
- XGBoost, LightGBM, or CatBoost
- Logistic regression (classification) or ridge regression (regression)
- Random forest
You may include additional methods beyond this list, but the above sets the minimum.
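Because the classical baselines all expose the scikit-learn fit/predict interface, it is convenient to run every method through one evaluation loop. The sketch below illustrates this with two baselines on synthetic data; deep tabular libraries (e.g. pytorch-tabnet) have different `fit()` signatures, so in practice you would wrap them to expose the same interface. The model choices and dataset here are illustrative, not a required setup.

```python
# Sketch: one evaluation loop over models that share the scikit-learn API.
# The synthetic dataset and model list are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # probability of the positive class
    results[name] = roc_auc_score(y_te, proba)

print(results)
```

Keeping the loop model-agnostic also makes it easy to add XGBoost or a wrapped deep model later without touching the evaluation code.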
Datasets and Prediction Tasks
Choose 3 to 4 datasets satisfying the following constraints:
- Minimum size: at least 5,000 rows per dataset. No toy datasets (Iris, Wine, Boston Housing, Titanic).
- Task diversity: include at least one regression task and at least one classification task.
- Feature diversity: at least one dataset should have significant missing values or high-cardinality categorical features, since these expose important practical differences between model families.
Suggested datasets (you may choose others meeting the criteria above):
| Dataset | Task | Rows | Why it is interesting |
|---|---|---|---|
| California Housing | Regression | ~20K | Well-understood baseline, continuous features |
| Adult Income (Census) | Classification | ~49K | Mixed feature types, moderate size |
| Covertype | Classification | ~580K | Larger scale, tests scalability |
| HIGGS | Classification | ~11M | Known to favor neural nets at scale (subsample if needed) |
| Porto Seguro Safe Driver | Classification | ~595K | Heavy missing values, categorical features |
Experimental Protocol
Fair comparison requires a consistent experimental setup. Follow these rules:
- Data splits: Use either 5-fold cross-validation or a fixed train/validation/test split (e.g., 60/20/20). Use the same splits for all methods.
- Random seeds: Report results as mean and standard deviation across at least 3 random seeds.
- Hyperparameter tuning: Apply a comparable tuning budget to all methods. For example, 50 trials of Optuna per method, or a documented manual search of similar effort. Do not extensively tune one method while using defaults for another.
- Evaluation metrics: Report at least two metrics per task type.
- Classification: accuracy plus AUC-ROC or F1-score
- Regression: RMSE plus MAE or R-squared
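The protocol above can be sketched in a few lines; this is a minimal illustration with a scikit-learn classifier on synthetic data, not a prescribed script. Note that the seeds vary only the model initialization while the splits stay fixed across all methods.

```python
# Sketch: fixed 60/20/20 split shared across methods, metrics averaged over
# at least 3 seeds. Dataset and model here are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
# Fixed splits: reuse the same random_state for every method being compared.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)

accs, aucs = [], []
for seed in (0, 1, 2):  # at least 3 seeds, varying only model initialization
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    accs.append(accuracy_score(y_test, model.predict(X_test)))
    aucs.append(roc_auc_score(y_test, proba))

print(f"accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
print(f"AUC-ROC:  {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

The same mean +/- std numbers feed directly into the summary tables required in the report.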
Key Questions to Address
In your report, aim to answer:
- Performance: How well do deep learning models perform relative to classical baselines? Are there dataset characteristics (size, feature types, missingness) that predict which model family wins?
- Preprocessing effort: How much preprocessing does each model family require? Document the pipeline for each method. Note differences in categorical encoding (one-hot, target encoding, learned embeddings) and missing value handling (imputation vs. native support).
- Inference cost: Report prediction latency (time per sample or per batch) and model size for each method. Would these models be practical to deploy behind an API?
- Sensitivity: How sensitive are results to hyperparameters and random seeds? Which methods are more stable?
- Interpretability: How do feature importance rankings compare across model types? Do deep models and tree-based models agree on which features matter?
- Practical verdict: Given the full picture (accuracy, cost, effort, stability), would you recommend deep tabular models in practice? Under what conditions?
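For the inference-cost question, a simple way to get comparable numbers across methods is to time batched predictions and measure the serialized model. The sketch below uses a random forest as a stand-in; the batch size and repeat count are assumptions you should adjust, and medians over repeated runs are more robust than single timings.

```python
# Sketch: batched prediction latency and serialized model size.
# Model, batch size, and repeat count are illustrative choices.
import pickle
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

batch = X[:256]
model.predict(batch)  # warm-up call so one-time costs don't skew the timing

times = []
for _ in range(10):
    start = time.perf_counter()
    model.predict(batch)
    times.append(time.perf_counter() - start)

per_batch_ms = 1000 * np.median(times)
per_sample_us = 1e6 * np.median(times) / len(batch)
size_mb = len(pickle.dumps(model)) / 1e6  # serialized size as a deployment proxy

print(f"latency: {per_batch_ms:.2f} ms/batch ({per_sample_us:.1f} us/sample)")
print(f"serialized size: {size_mb:.2f} MB")
```

For the deep models, remember to report whether timings were taken on CPU or GPU, since that choice dominates the comparison.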
Deliverables
- A written report in PDF format
- At most 10 pages; shorter is fine if the analysis is complete.
- No code in the report.
- Include summary tables of results (mean +/- std across seeds) and any relevant figures.
- Your code
- Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
- All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
- Clearly organized notebooks/scripts covering:
- Data preparation and preprocessing pipelines
- Model training and hyperparameter tuning
- Metric reporting and result tables
- Visualizations and interpretability analysis
- Include a `requirements.txt` or `environment.yml`
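A minimal `requirements.txt` might look like the following; the version pins are placeholders, so record the exact versions you actually used.

```text
# Pin the exact versions from your environment; numbers below are placeholders.
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.5.0
xgboost==2.0.3
torch==2.3.0
pytorch-tabnet==4.1.0
optuna==3.6.1
```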
Key References
- Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? NeurIPS.
- Gorishniy, Y., Rubachev, I., Khrulkov, V., & Babenko, A. (2021). Revisiting deep learning models for tabular data. NeurIPS.
Final Thoughts
This assignment asks you to engage with a genuinely open question in applied machine learning. The evidence on deep learning for tabular data is mixed, and the answer depends on the dataset, the compute budget, and the practical constraints of the application. Your job is not to prove that one approach is universally better, but to run careful experiments and draw honest conclusions.
By the end, you should be able to advocate for or against deep tabular methods with reasoning grounded in your own experiments, not in hype or received wisdom.