Pushing the Limits of Linear Regression
Overview
The goal of this assignment is to explore how far one can push the predictive performance of linear regression through creative and rigorous feature engineering, applied to the task of predicting house prices. While modern machine learning tends to emphasize complex models like gradient boosting or neural networks, this assignment intentionally restricts you to linear regression—not to limit you, but to challenge your ability to understand data deeply, model relationships intelligently, and build interpretable models that are robust and insightful.
You will work with the dataset house_dataset.csv, which contains various attributes of houses and their sale prices. Your task is to construct a linear regression model, but you are free—and encouraged—to apply any form of feature engineering that enhances the model’s expressiveness while preserving its linear form.
Modeling Constraint
You are limited to using linear regression as your modeling technique. This family includes:
- Ordinary least squares (OLS)
- Regularized linear models (e.g., ridge or lasso, if properly justified)
- Models built after transformation of inputs or the target (see below)
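As a minimal sketch of this model family in scikit-learn (the matrix `X` and target `y` below are synthetic stand-ins, not the house dataset):

```python
# Sketch of the allowed estimators: OLS plus its regularized variants.
# X and y are toy data; in the assignment they would come from house_dataset.csv.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # stand-in feature matrix
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),    # L2 penalty; tune alpha via cross-validation
    "lasso": Lasso(alpha=0.01),   # L1 penalty; can drive coefficients to zero
}
for model in models.values():
    model.fit(X, y)
```

All three expose the same `fit`/`predict` interface, so swapping between them during model selection is a one-line change.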
Creative Freedom in Feature Engineering
You are free to be as creative as possible in transforming the features and/or the prediction target, provided that the final model remains linear. Possibilities include, but are not limited to:
- Transformations of features: apply log, square root, polynomial terms, splines (e.g., piecewise linear or cubic), binning, or custom thresholds to expose non-linear patterns while retaining a linear modeling structure.
- Interactions: between numerical variables, categorical variables, or combinations thereof.
- Handling categorical data: use one-hot encoding, group rare categories together, or carefully apply target encoding (e.g., average price by category) while preventing data leakage; location variables are natural candidates for such treatment.
- Composite features: engineer meaningful, domain-informed variables such as price per square foot, total number of bathrooms, room density, age of the house at time of sale, or binary indicators of luxury features.
- Distance-based and neighborhood features: use k-nearest neighbors (KNN) to build local or similarity-based features, such as the mean price of the k most similar houses (based on selected features).
- Dimensionality reduction: PCA or clustering-based grouping, if helpful.
- Transforming the target: predicting log(price), price per square foot, or even price normalized by neighborhood average is valid, as long as your final predictions can be mapped back to actual prices for evaluation.
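One way to implement a target transformation that automatically maps predictions back to raw prices is scikit-learn's `TransformedTargetRegressor`. The sketch below fits on log(price) and predicts in dollars; the `sqft`/`price` data is synthetic, generated here only for illustration:

```python
# Sketch: train a linear model on log(price), predict raw prices.
# TransformedTargetRegressor applies func before fitting and inverse_func
# after predicting, so predictions come back in the original price units.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, size=200)                       # synthetic sizes
price = 50_000 * np.exp(0.0004 * sqft) * rng.lognormal(sigma=0.05, size=200)

X = sqft.reshape(-1, 1)
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,           # the model is fit on log(price)
    inverse_func=np.exp,   # predictions are exponentiated back to dollars
)
model.fit(X, price)
preds = model.predict(X)   # already in raw price units
```

Because the inverse transform is applied for you, the same evaluation code (e.g., MAPE on raw prices) works whether or not the target was transformed internally.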
Your feature engineering should be guided by both statistical intuition and exploratory data analysis. Interpretability, parsimony, and understanding of the data should drive your decisions, not just error minimization.
Evaluation Metric
Your model will be evaluated using the Mean Absolute Percentage Error (MAPE):
\[ \text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]
where \(y_i\) is the actual sale price of the \(i\)-th house, \(\hat{y}_i\) is your model’s predicted price and \(n\) is the number of observations.
This metric captures the average relative error, which is especially appropriate when prices vary widely across the dataset. You may choose to model log(price) or other transformed targets internally, but your final predictions must be mapped back to raw price values for evaluation.
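The formula above translates directly into a few lines of NumPy:

```python
# Direct implementation of the MAPE formula defined above.
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Example: predictions off by 10% and 20% average to a MAPE of 15%.
print(mape([100, 200], [110, 160]))  # -> 15.0
```

Note that MAPE is undefined when any true price is zero; sale prices are strictly positive, so this is not a concern here.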
Train/Test Split and Validation Strategy
You are responsible for creating your own train/test split from the provided dataset. The test set must be held out entirely during the feature engineering, model selection, and training phases—it should be used only once, at the end, to evaluate your final model.
Throughout the model development process, you should rely on cross-validation (e.g., k-fold or repeated k-fold) to assess model performance and guide feature selection. This helps ensure that your choices generalize beyond the specific training data and are not overfit to noise.
Be especially mindful of data leakage. This includes:
- Leaking information from the test set into training (e.g., through target encoding applied globally)
- Creating features that implicitly use future or aggregate information not available at prediction time
- Applying data transformations (e.g., scaling, imputing, encoding) using the entire dataset instead of fitting only on the training portion during cross-validation
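The third pitfall is avoided most easily by wrapping all preprocessing in a scikit-learn `Pipeline`, so that imputation and scaling statistics are re-fit inside every cross-validation fold and never see the validation portion. A minimal sketch on synthetic data:

```python
# Sketch: leakage-safe cross-validation. The imputer and scaler are fit
# only on each training fold because they live inside the Pipeline that
# cross_val_score refits per fold.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
X[rng.random(X.shape) < 0.05] = np.nan     # inject a few missing values
y = np.nan_to_num(X).sum(axis=1) + rng.normal(scale=0.1, size=150)

pipe = make_pipeline(
    SimpleImputer(strategy="median"),      # fit per training fold, not globally
    StandardScaler(),
    Ridge(alpha=1.0),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
```

Fitting the imputer or scaler on the full dataset before splitting would leak validation-fold statistics into training; the pipeline pattern makes that mistake structurally impossible.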
Benchmarking Against a Non-Linear Model
To contextualize the performance of your linear model, you should also build a well-tuned XGBoost model as a benchmark. The goal is not to “beat” XGBoost, but to understand how close a thoughtfully engineered linear model can get to a strong non-linear model in terms of predictive accuracy.
The comparison should be done on the same held-out test set, using the same evaluation metric (MAPE). Briefly report the performance of XGBoost alongside your linear model, and reflect on the trade-offs: interpretability vs. accuracy, complexity vs. insight, and any differences in the types of features each model appears to exploit.
This benchmark will help you critically evaluate the value of your linear modeling choices and deepen your understanding of when and why simple models can perform competitively.
Deliverables
You must submit the following:
- A written report in PDF format
  - Maximum length: 10 pages (shorter reports are welcome if concise and clear)
  - The report should contain no code, but should clearly explain your approach, methodology, key decisions, model diagnostics, and findings
  - Include relevant plots or tables where helpful
- Your code
  - Submit either a zip file containing all notebooks, scripts, and necessary resources, or a link to a GitHub repository
  - Your code should be well-organized, reproducible, and allow easy verification of the results discussed in the report
Why This Matters in Practice
In many real-world applications, a linear model built on thoughtful, domain-driven feature engineering is not just a pedagogical exercise—it is often the preferred modeling approach in industry. This is especially true when interpretability, transparency, and robustness are essential.
For example:
- In real estate, analysts and stakeholders often want to understand how specific features (like location, size, or renovations) affect price, not just get a black-box prediction.
- In finance, risk models must often be auditable and explainable for regulatory compliance—making interpretable linear models with engineered features a standard.
- In healthcare, treatment effect models, cost estimation, or hospital resource forecasting require clarity and justifiability in model behavior.
- In public policy or urban planning, decision-makers need models that provide insight into the data, not just accuracy—understanding the role of variables like income, zoning, or infrastructure is critical.
This assignment trains you to think like a data scientist who not only builds accurate models but also extracts value and understanding from data—which is often the more important and lasting contribution.