House Price Prediction with Linear Models
Overview
The goal of this assignment is to explore how far one can push the predictive performance of linear models through careful and creative feature engineering, applied to the task of predicting house prices. We have already seen this dataset in the Chapter on “Applied Linear Regression”. While much of modern machine learning focuses on complex models such as gradient boosting or neural networks, this assignment deliberately emphasizes linear approaches. The objective is not to handicap you, but to challenge you to extract as much signal as possible from the data while preserving interpretability, transparency, and simplicity.
You are expected to demonstrate that, when used thoughtfully, linear models can perform surprisingly well and remain far easier to understand, explain, and maintain than more complex alternatives.
Modeling Constraint
Your main predictive model must be based on linear regression. This includes:
- Ordinary least squares (OLS)
- Regularized linear models (ridge, lasso, elastic net)
- Linear models applied after transforming inputs or the target variable
- Linear models built on top of engineered or learned features
- Generalized linear models (GLMs) with appropriate link functions (e.g., logistic regression)
You are free to choose the quantitative target variable (e.g., price, log-price, or normalized price) and to transform it as needed.
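For illustration, a minimal sketch of such a model in scikit-learn might look as follows; the file name, column names, and alpha grid are placeholders you will need to adapt to the actual dataset.

```python
# Minimal sketch of an allowed model: ridge regression in a scikit-learn
# pipeline. File and column names below are placeholders, not the real dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("houses.csv")                 # placeholder file name
numeric = ["living_area", "lot_size"]          # placeholder numeric columns
categorical = ["neighborhood"]                 # placeholder categorical column

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["price"], test_size=0.2, random_state=0
)

model = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])),
    ("ridge", RidgeCV(alphas=[0.1, 1.0, 10.0])),  # regularization strength chosen by CV
])
model.fit(X_train, y_train)
```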
Forbidden methods
You may not use:
- Neural networks
- Decision trees
- Random forests
- Gradient boosting methods
If in doubt, just ask!
Creative Freedom in Feature Engineering
You are strongly encouraged to be creative and thoughtful in how you construct features. Possible directions include (non-exhaustive):
Feature transformations
Log, square root, polynomial terms, piecewise linear transformations, binning, or custom thresholds to expose non-linear relationships in a linear framework. Basis expansions (e.g., splines) are also often useful: they form the basis for many generalized additive models (GAMs).
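For example, a spline basis expansion keeps the model linear in its coefficients while capturing a smooth non-linear effect. The sketch below uses scikit-learn's SplineTransformer (scikit-learn 1.0+) on synthetic data; in your project the input would be a real column such as living area.

```python
# Sketch: spline basis expansion of one feature feeding a linear model.
# The data below is synthetic, purely for illustration.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(0)
area = rng.uniform(40, 300, size=(200, 1))                    # synthetic living areas
price = 1500 * np.sqrt(area[:, 0]) + rng.normal(0, 500, 200)  # synthetic prices

spline_model = make_pipeline(
    SplineTransformer(degree=3, n_knots=6),   # cubic splines with 6 knots
    Ridge(alpha=1.0),
)
spline_model.fit(area, price)
```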
Interactions
Between numerical variables, categorical variables, or combinations thereof.
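As a quick illustration (assuming the inputs are already numeric, e.g., after one-hot encoding), pairwise interaction terms can be generated with PolynomialFeatures:

```python
# Sketch: pairwise interaction terms between numeric features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[120.0, 3.0],        # e.g., [living_area, n_bathrooms] (illustrative)
              [80.0, 2.0]])
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = inter.fit_transform(X)   # columns: x1, x2, x1 * x2
```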
Categorical feature handling
One-hot encoding, grouping rare categories, frequency-based encodings, or target-based encodings (with appropriate care to avoid data leakage).
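If you go the target-encoding route, the usual safeguard is cross-fitting: each training row is encoded using statistics computed without its own target value. The sketch below assumes scikit-learn 1.3+, whose TargetEncoder does this internally in fit_transform; the data is synthetic.

```python
# Sketch: cross-fitted target encoding of a categorical column.
# Requires scikit-learn >= 1.3; data below is synthetic.
import pandas as pd
from sklearn.preprocessing import TargetEncoder

X_train = pd.DataFrame(
    {"neighborhood": ["A", "A", "B", "B", "C", "C", "A", "B", "C", "A"]}
)
y_train = pd.Series([200_000, 220_000, 350_000, 330_000, 500_000,
                     480_000, 210_000, 340_000, 490_000, 205_000])

enc = TargetEncoder(target_type="continuous")
# fit_transform uses internal cross-fitting, so no row sees its own target;
# transform (for test data) uses the full training statistics.
X_train_enc = enc.fit_transform(X_train[["neighborhood"]], y_train)
```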
Composite and domain-informed features
Examples include house age at sale, renovation indicators, ratios such as living area per room, or indicators of premium or atypical properties.
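Such features usually require only a few lines of pandas; the column names below (yr_sold, yr_built, yr_renovated, living_area, n_rooms) are assumptions for illustration, not the dataset's actual names.

```python
# Sketch: domain-informed composite features. Column names are placeholders.
import pandas as pd

def add_composite_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["house_age"] = out["yr_sold"] - out["yr_built"]            # age at sale
    out["was_renovated"] = (out["yr_renovated"] > 0).astype(int)   # renovation flag
    out["area_per_room"] = out["living_area"] / out["n_rooms"].where(out["n_rooms"] > 0)
    return out
```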
Neighborhood and similarity-based features
You may use K-nearest neighbors (KNN) or related approaches to construct features such as local averages or similarity-based summaries.
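One way to do this is to average the sale prices of each house's nearest training neighbours by location. The sketch below uses synthetic coordinates; note that when applied to the training set itself, each point is its own nearest neighbour, so exclude the point or cross-fit to avoid leakage.

```python
# Sketch: local average sale price of the k nearest training houses by location.
# Coordinates and prices below are synthetic; beware of leakage on training data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_price_feature(coords_train, y_train, coords_query, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(coords_train)
    _, idx = nn.kneighbors(coords_query)          # indices of the k nearest train houses
    return np.asarray(y_train)[idx].mean(axis=1)  # local average price

rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(100, 2))                  # synthetic lat/long
prices = rng.uniform(100_000, 800_000, size=100)           # synthetic prices
local_avg = knn_price_feature(coords, prices, coords, k=10)
```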
Ensembling multiple linear models
You may fit separate linear models on different segments of the data (e.g., by location or property type) and combine their predictions in any reasonable way you deem fit.
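A simple version of this idea fits one ridge model per segment and routes each new observation to its segment's model; the sketch below uses a synthetic segment column.

```python
# Sketch: one ridge model per data segment (e.g., property type), synthetic data.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": rng.choice(["house", "apartment"], size=200),
    "living_area": rng.uniform(30, 250, size=200),
})
df["price"] = 3000 * df["living_area"] + rng.normal(0, 20_000, size=200)

# Fit one model per segment.
models = {
    seg: Ridge(alpha=1.0).fit(part[["living_area"]], part["price"])
    for seg, part in df.groupby("segment")
}

def predict_by_segment(df_new: pd.DataFrame) -> pd.Series:
    """Route each row to its segment's model (segments must exist in `models`)."""
    preds = pd.Series(index=df_new.index, dtype=float)
    for seg, part in df_new.groupby("segment"):
        preds.loc[part.index] = models[seg].predict(part[["living_area"]])
    return preds
```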
Dimensionality reduction or grouping
PCA or clustering-based features may be used if helpful.
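For instance, k-means cluster labels or distances to cluster centres, computed from the numeric features, can be added as extra inputs to the linear model; the sketch below uses synthetic data.

```python
# Sketch: PCA components and clustering-based features from numeric inputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 5))                # synthetic numeric features

X_pca = PCA(n_components=2).fit_transform(X_num)               # 2 principal components
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_num)
cluster_label = km.labels_                        # can be one-hot encoded as a feature
dist_to_centres = km.transform(X_num)             # distances to the 4 cluster centres
```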
Target transformations
You may model transformed versions of the price (e.g., log-price or normalized price), as long as predictions can be converted back to actual prices for evaluation.
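A convenient way to do this without writing the back-transformation by hand is scikit-learn's TransformedTargetRegressor, which fits on the transformed target and inverts the transform at prediction time; the data below is synthetic.

```python
# Sketch: fit on log-price, predict automatically back in price space.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(30, 250, size=(200, 1))                      # synthetic living areas
y = np.exp(11 + 0.005 * X[:, 0] + rng.normal(0, 0.1, 200))   # synthetic prices

log_model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log1p, inverse_func=np.expm1
)
log_model.fit(X, y)
price_pred = log_model.predict(X)    # already in price units, ready for MAPE
```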
Your feature engineering choices should be motivated by exploratory analysis, intuition, and clarity of reasoning. Be sure to document and justify your decisions in your report.
Evaluation Metric
Model performance will be evaluated using the Mean Absolute Percentage Error (MAPE), defined as:
\[ \text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]
where \(y_i\) is the true sale price and \(\hat{y}_i\) is the predicted price for observation \(i\).
This metric emphasizes relative error, making it well-suited to house prices that vary over a wide range. If you use a transformed target internally, your final predictions must be mapped back to price space before computing MAPE.
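For reference, the metric can be computed directly from this definition; it also matches scikit-learn's mean_absolute_percentage_error, which returns the same fraction (multiply by 100 for a percentage).

```python
# MAPE exactly as defined above (a fraction; multiply by 100 for percent).
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def mape(y_true, y_pred) -> float:
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

y_true = np.array([250_000.0, 400_000.0])
y_pred = np.array([275_000.0, 380_000.0])
assert np.isclose(mape(y_true, y_pred),                      # 0.075
                  mean_absolute_percentage_error(y_true, y_pred))
```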
Validation and Model Development
You are responsible for designing a reasonable train/test split and using cross-validation during model development to guide feature selection and regularization choices. You should write the report as if you were presenting your findings to a stakeholder interested in deploying a linear model for house price prediction. In other words, treat this as a real-world data science project. Predictive performance on the test set is the primary criterion for success – overfitting must be avoided at all costs.
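As a sketch of how cross-validation might guide regularization choices with MAPE as the criterion (synthetic data stands in for your own engineered features and prices):

```python
# Sketch: K-fold cross-validation with MAPE as the model-selection criterion.
# Synthetic data stands in for your engineered features and prices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_score

X_train, y_train = make_regression(n_samples=200, n_features=5,
                                   noise=10.0, random_state=0)
y_train = y_train - y_train.min() + 100_000     # keep synthetic "prices" positive for MAPE

model = RidgeCV(alphas=np.logspace(-3, 3, 13))
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv,
                         scoring="neg_mean_absolute_percentage_error")
print("CV MAPE:", -scores.mean())
```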
Benchmark: XGBoost Upper Bound
To contextualize the performance of your linear model, you are asked to train a well-tuned XGBoost model as a benchmark. Its purpose is to provide a practical upper bound on what is achievable with a powerful non-linear method on this dataset. You should report its test-set MAPE and briefly comment on the performance gap between your linear model and XGBoost.
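A minimal benchmark sketch with the xgboost package is shown below; the hyperparameter values are illustrative starting points, not a tuned configuration, and the synthetic split stands in for your own train/test data.

```python
# Sketch: XGBoost benchmark with illustrative (untuned) hyperparameters.
# The synthetic split below stands in for your own train/test data.
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=0)
y = y - y.min() + 100_000                      # keep synthetic "prices" positive
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6,
                   subsample=0.8, colsample_bytree=0.8, random_state=0)
xgb.fit(X_train, y_train)
print("XGBoost test MAPE:",
      mean_absolute_percentage_error(y_test, xgb.predict(X_test)))
```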
Interpretability and Communication
In addition to predictive performance, you are expected to spend time interpreting your final linear model.
You should:
- Identify and explain the most influential features.
- Describe the insights you have gained about the factors affecting house prices.
- Communicate results as if you were a data scientist explaining the model to a non-technical client.
The goal is to demonstrate that your model provides insight, not just predictions.
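A natural starting point for the first item above is to rank the coefficients of your fitted pipeline by magnitude (most meaningful when inputs are standardized); the sketch below uses synthetic data and placeholder feature names.

```python
# Sketch: rank standardized coefficients of a fitted linear pipeline.
# Synthetic data and placeholder feature names, for illustration only.
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X = pd.DataFrame(X, columns=["living_area", "lot_size", "n_rooms", "house_age"])

pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
coefs = pd.Series(pipe[-1].coef_, index=pipe[:-1].get_feature_names_out())
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```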
Deliverables
You must submit:
- A written report (PDF)
  - Maximum length: 10 pages
  - No code in the report
  - Clear explanation of your approach, feature engineering decisions, model performance, and interpretation
  - Include figures or tables where useful
- Your code
  - Provided as a zip file or a link to a GitHub repository
  - Fully reproducible and well organized; I should be able to run it to obtain your results.
Why This Matters
In many real-world settings, well-designed linear models are preferred over complex black-box models due to their interpretability, stability, and ease of maintenance. This is especially true in domains such as real estate, finance, healthcare, and public policy, where understanding why a model makes a prediction can be as important as the prediction itself.
This assignment is designed to train you to think beyond model complexity and focus instead on data understanding, feature design, and clear communication—skills that are central to effective data science in practice.