Deep Learning for Time-Series Forecasting

DSS5104

Overview

The goal of this assignment is to critically explore and evaluate recent deep learning methods for time-series prediction. Classical models like ARIMA, ETS, and gradient boosting have long been standard for temporal forecasting, and they remain strong. In fact, recent work by Zeng et al. (2023, “Are Transformers Effective for Time Series Forecasting?”) showed that a simple linear model can match or outperform Transformer-based architectures on common benchmarks. The field is actively debating which inductive biases actually help for forecasting, and many of the best-performing deep learning models (PatchTST, DLinear, TiDE) are not sequential at all: they process patches or use simple linear layers rather than RNN-style recurrence.

In this assignment, you will evaluate the performance, scalability, and robustness of a selection of modern deep learning methods using publicly available implementations on a variety of real-world time-series forecasting tasks. You are not expected to implement models from scratch but must demonstrate an understanding of their proper application and a rigorous experimental methodology.


Assignment Objectives

  • Understand and apply recent deep learning models for time-series forecasting
  • Benchmark their performance on multiple datasets using a sound experimental protocol
  • Compare them against classical and simple baselines (including a linear baseline)
  • Analyze the advantages, limitations, and practical considerations of deep models in time-series settings
  • Reflect on when and why deep models outperform traditional methods, or don’t

Models

You must evaluate the following models. Use publicly available implementations (e.g., from the Nixtla ecosystem, Darts, GluonTS, or the original authors’ code).

Deep learning models (evaluate all four):

  • PatchTST: a Transformer-based model that processes time series as patches
  • N-BEATS: a deep neural architecture based on backward and forward residual links
  • TiDE: a simple MLP-based encoder-decoder model
  • DeepAR: an RNN-based autoregressive model with probabilistic output

Baselines (evaluate all three):

  • Seasonal naive: repeat the last observed seasonal pattern
  • ETS or AutoARIMA: classical statistical model (e.g., via statsforecast or statsmodels)
  • LightGBM with lag features: a gradient-boosted approach using hand-crafted temporal features
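To make the baselines concrete, here is a minimal sketch (illustrative, not required code) of the seasonal naive forecast and of hand-crafted lag features suitable for feeding LightGBM. The `season_length` and the lag set are assumptions; choose values appropriate to each dataset's sampling frequency.

```python
import numpy as np

def seasonal_naive(history: np.ndarray, horizon: int, season_length: int) -> np.ndarray:
    """Repeat the last observed seasonal cycle to cover the forecast horizon."""
    last_season = history[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_season, reps)[:horizon]

def make_lag_features(series: np.ndarray, lags=(1, 2, 3, 24)):
    """Build a (samples, len(lags)) design matrix of lagged values plus targets.
    Rows without complete lag history are dropped."""
    max_lag = max(lags)
    X = np.column_stack([series[max_lag - lag : len(series) - lag] for lag in lags])
    y = series[max_lag:]
    return X, y
```

The `(X, y)` pair can be passed directly to a gradient-boosted regressor; the point is that all temporal structure enters through the engineered lags.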

Simple linear baseline:

  • A DLinear-style model: a single linear layer mapping past values to future values. This is critical for grounding your Transformer comparisons, given the results of Zeng et al.
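A DLinear-style baseline really is this small. The sketch below fits one linear map from a lookback window to the full horizon; for brevity it uses a closed-form least-squares fit rather than SGD, and it omits the trend/seasonal decomposition of the original DLinear paper (both are assumptions, not requirements).

```python
import numpy as np

def fit_linear_forecaster(series: np.ndarray, lookback: int, horizon: int) -> np.ndarray:
    """Fit a single linear layer (with bias) mapping the last `lookback` values
    to the next `horizon` values, via least squares over all training windows."""
    X, Y = [], []
    for t in range(lookback, len(series) - horizon + 1):
        X.append(series[t - lookback : t])
        Y.append(series[t : t + horizon])
    Xb = np.hstack([np.asarray(X), np.ones((len(X), 1))])  # append bias column
    W, *_ = np.linalg.lstsq(Xb, np.asarray(Y), rcond=None)
    return W

def predict_linear(W: np.ndarray, window: np.ndarray) -> np.ndarray:
    """Forecast the horizon from one lookback window."""
    return np.append(window, 1.0) @ W
```

If a Transformer cannot clearly beat this on your datasets, that is itself a finding worth reporting.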

You may add one additional model of your choice (e.g., TimesNet, Informer, Temporal Fusion Transformer) and justify why you selected it.


Datasets

Choose three datasets from the following curated list. You must include at least one univariate and at least one multivariate dataset.

Dataset              Type                        Domain                            Source
ETTh1 / ETTh2        Multivariate                Energy (transformer temperature)  Standard benchmark, widely used
Electricity (UCI)    Multivariate (370 series)   Energy consumption                Good for scalability testing
M4                   Univariate (100k series)    Mixed domains                     Competition dataset with known results
M5                   Hierarchical                Retail sales                      Walmart sales data
Weather              Multivariate                Climate                           Regular sampling, well-behaved
Traffic              Multivariate                Transportation                    San Francisco road occupancy

If you want to use a dataset not on this list, get approval from the instructor first.

A warning about stock prices: stock prices are essentially random walks, and predicting them teaches bad habits (overfitting to noise, survivorship bias). If you include stock data, you must use a random walk as your baseline, and you should expect that beating it is very difficult.
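For reference, the random-walk baseline is trivial: under a random-walk model the optimal point forecast is simply the last observed value, repeated.

```python
import numpy as np

def random_walk_forecast(prices: np.ndarray, horizon: int) -> np.ndarray:
    """Best point forecast under a random walk: the last observed price,
    held flat across the horizon."""
    return np.full(horizon, prices[-1])
```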


Experimental Protocol

A rigorous experimental setup is just as important as the model choice. Follow these requirements.

Train/validation/test splits. Use a temporal split: train on the earliest portion, validate on the next, test on the latest. Never randomly shuffle time-series data. If your dataset has multiple series, the temporal ordering must be respected within each series.
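As a sanity check, a temporal split can be as simple as the sketch below. The fractions are placeholder assumptions; pick splits appropriate to each dataset, and apply the split within each series for multi-series data.

```python
import numpy as np

def temporal_split(n: int, val_frac: float = 0.1, test_frac: float = 0.2):
    """Chronological train/val/test index split. No shuffling, ever."""
    test_start = int(n * (1 - test_frac))
    val_start = int(n * (1 - test_frac - val_frac))
    idx = np.arange(n)
    return idx[:val_start], idx[val_start:test_start], idx[test_start:]
```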

Walk-forward validation. Use expanding-window or sliding-window evaluation. Document which approach you use and why.
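The two variants differ only in whether the training window's start is pinned at the beginning of the series. A minimal sketch (parameter names are illustrative):

```python
def walk_forward_origins(n, initial_train, horizon, step=1, sliding_window=None):
    """Yield (train_slice, test_slice) pairs for walk-forward evaluation.
    With sliding_window=None the training window expands from the start of the
    series; otherwise it slides forward with a fixed length."""
    origin = initial_train
    while origin + horizon <= n:
        start = 0 if sliding_window is None else max(0, origin - sliding_window)
        yield slice(start, origin), slice(origin, origin + horizon)
        origin += step
```

An expanding window uses all available history at each origin; a sliding window keeps the training size constant, which is often preferable when old data is no longer representative.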

Metrics. Report at least:

  • MAE (mean absolute error)
  • One relative or scale-free metric: either MASE (mean absolute scaled error) or sMAPE (symmetric mean absolute percentage error)

You may report additional metrics (RMSE, MAPE, etc.) but you must justify your metric choices in the report: explain what each metric captures and why it is appropriate for your datasets.
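To avoid ambiguity between metric variants in the literature, here is one consistent set of definitions you could adopt (sMAPE in the M4-competition form, MASE scaled by the in-sample seasonal naive error):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def smape(y_true, y_pred):
    """Symmetric MAPE in percent: 200 * |y - yhat| / (|y| + |yhat|), averaged."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return np.mean(np.abs(y_true - y_pred) / denom) * 100

def mase(y_true, y_pred, y_train, season_length=1):
    """Test MAE scaled by the in-sample MAE of the seasonal naive forecast."""
    y_train = np.asarray(y_train, float)
    scale = np.mean(np.abs(y_train[season_length:] - y_train[:-season_length]))
    return mae(y_true, y_pred) / scale
```

Whatever definitions you use, state them explicitly in the report so your numbers are comparable across models.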

Multiple seeds. Run each model with at least 3 different random seeds and report mean and standard deviation of your results. Single-seed results are not acceptable.

Preprocessing. Document your preprocessing pipeline: normalization strategy (per-series vs. global), handling of missing values, differencing or detrending if applied, and any feature engineering. Justify your choices.
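One common pitfall worth guarding against: scaler statistics must come from the training split only, or test information leaks into preprocessing. A minimal per-series standardization sketch:

```python
import numpy as np

def fit_scaler(train: np.ndarray):
    """Compute standardization stats on the training split only."""
    mu, sigma = train.mean(), train.std()
    return mu, (sigma if sigma > 0 else 1.0)  # guard against constant series

def transform(x, mu, sigma):
    return (x - mu) / sigma

def inverse_transform(z, mu, sigma):
    return z * sigma + mu
```

For multi-series datasets, decide (and document) whether `fit_scaler` runs once per series or once globally; the choice can noticeably affect results.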


Key Questions to Address

  1. Performance: How do deep time-series models compare to classical models and the linear baseline in accuracy? On which kinds of data do they excel or struggle? Can they beat the simple baselines?
  2. Computational cost: Report training time per model, hardware used (CPU/GPU, memory), and total GPU hours for the full experiment. Would you pay this computational cost in practice?
  3. Robustness: How sensitive are models to hyperparameters and data volume? Do deep models overfit on smaller datasets?
  4. Practicality: Given your results, would you recommend deep models for time-series forecasting? Under what conditions?

Deliverables

  1. Written Report (PDF)
    • Max 10 pages (shorter is welcome if well-written and to the point)
    • Focus on insights, comparisons, and analysis: no code in the report
    • Must include a description of your preprocessing pipeline and experimental protocol
    • Must include a table of computational costs (training time, hardware) for each model
  2. Your code
    • Create a GitHub repository containing all your code. Include the repository link in your report. Do not submit code on CANVAS.
    • All results reported in the PDF must be reproducible from the repository. Include a README with clear instructions to run your experiments.
    • Organized scripts or notebooks for:
      • Data preprocessing and transformation
      • Model training, evaluation, and tuning
      • Metrics reporting and visualizations
    • Include a requirements.txt or environment.yml

Final Thoughts

This assignment is your opportunity to engage with modern time-series forecasting methods and to evaluate them critically. The central question is not “which model gets the best number” but rather “when is the added complexity of deep learning justified?” Simple models are surprisingly strong in this domain, and a good project will take that seriously.

By the end, you should be able to answer:

  • Are deep models worth the overhead in time-series problems? It is fine if you find that they are not!
  • When are they most beneficial?
  • What trade-offs (accuracy, compute, engineering effort) do they introduce compared to classical approaches?