Critique: House Price Prediction with Linear Models

1. Strengths

The assignment gets several things right.

Correct pedagogical instinct. Constraining students to linear models forces them to invest in feature engineering, EDA, and domain reasoning rather than reaching for XGBoost and calling it a day. This is one of the most transferable skills in applied ML, and it is underweighted in most curricula.

XGBoost benchmark. Asking students to quantify the gap between their linear model and a strong nonlinear baseline is good practice. It teaches them to frame results in context, which is exactly what a hiring manager or stakeholder expects.

Interpretability emphasis. Requiring students to explain their model to a “non-technical client” mirrors real deliverables in consulting, finance, and product data science.

Generous creative freedom. The feature engineering menu (interactions, KNN-based features, target encoding, ensembling segmented models, splines) is broad enough that strong teams can differentiate themselves.

Clean scope boundaries. The forbidden methods list is clear. Allowing GLMs, target transforms, and ensembles of linear models gives enough room without breaking the spirit of the constraint.


2. Gaps for Career Relevance

No mention of data leakage mechanics. Target encoding is listed as a suggested technique with a parenthetical “(with appropriate care to avoid data leakage),” but no guidance on what that care looks like. In industry interviews, data leakage is one of the most common failure modes. The assignment should require students to explicitly describe their leakage prevention strategy (e.g., fold-based target encoding, smoothing, leave-one-out) and penalize submissions that leak.
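One way students could document such a strategy is out-of-fold target encoding with smoothing; a minimal sketch, where the column names, data, and helper name are all hypothetical:

```python
# Sketch of out-of-fold (fold-based) target encoding with smoothing.
# Column names ("neighborhood", "price") and the data are hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, cat_col, target_col, n_splits=5, smoothing=10.0, seed=0):
    """Encode each row using target means computed only from other folds."""
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index)
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Shrink category means toward the global mean for rare categories.
        smooth = (stats["mean"] * stats["count"] + global_mean * smoothing) / (
            stats["count"] + smoothing
        )
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(smooth).to_numpy()
    return encoded.fillna(global_mean)  # unseen categories fall back to the global mean

df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B", "C"] * 10,
    "price": [200, 220, 310, 300, 290, 150] * 10,
})
df["neighborhood_te"] = target_encode_oof(df, "neighborhood", "price")
```

Because each row is encoded using statistics from other folds only, the encoding never sees that row's own target, which is the core of the leakage prevention the assignment should demand.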

No pipeline or reproducibility standards. The deliverable says “fully reproducible,” but does not specify what that means. In practice, this should include: a requirements file, a single entry-point script or notebook, fixed random seeds, and clear instructions for data download. Without these constraints, “fully reproducible” is aspirational.
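The seed-fixing part of such a standard is small enough to spell out; a sketch (the seed value and helper name are arbitrary):

```python
# One concrete way to pin the common sources of randomness for a single
# entry-point run (the seed value and helper name are arbitrary choices).
import os
import random
import numpy as np

def set_seeds(seed=42):
    """Fix Python, NumPy, and hash randomness for a reproducible run."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seeds()
```

Paired with a pinned `requirements.txt` or `environment.yml` and a single run script, this turns "fully reproducible" from an aspiration into a checkable requirement.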

No discussion of production concerns. The “Why This Matters” section mentions stability and maintainability but the assignment never asks students to address them. Consider asking: How would you retrain this model monthly? What happens when a new neighborhood appears in the data? How would you monitor for drift? Even one paragraph on deployment considerations would add significant career value.

Missing: feature selection justification beyond performance. Industry practitioners must justify feature choices not just by predictive lift but by data availability, cost to compute, legal constraints, and staleness. A prompt asking “which of your features would be hardest to maintain in production, and why?” would sharpen thinking.

No error analysis requirement. The assignment asks for MAPE on the test set but does not require students to analyze where the model fails. Residual analysis, error stratification by price segment or property type, and identifying systematic biases are standard practice. This is a missed opportunity.
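Error stratification of the kind intended here is a few lines of code; a sketch on synthetic data (the segmentation scheme and names are assumptions):

```python
# Hypothetical sketch: stratify absolute percentage error by price quartile
# to expose where the model fails. The synthetic data stands in for real sales.
import numpy as np
import pandas as pd

def error_by_segment(y_true, y_pred, n_segments=4):
    """Mean absolute percentage error within each price quantile segment."""
    df = pd.DataFrame({"y": y_true, "ape": np.abs(y_true - y_pred) / y_true})
    df["segment"] = pd.qcut(
        df["y"], q=n_segments, labels=[f"Q{i + 1}" for i in range(n_segments)]
    )
    return df.groupby("segment", observed=True)["ape"].mean()

rng = np.random.default_rng(0)
y = rng.uniform(100_000, 900_000, size=400)
pred = y * rng.normal(1.0, 0.1, size=400)  # simulated model predictions
seg_error = error_by_segment(y, pred)
print(seg_error)
```

A table like this, plus a sentence explaining the worst segment, is exactly the deliverable the assignment should require.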


3. Scope Concerns

For a group of up to 5 students, the project is slightly under-scoped for an end-of-semester deliverable.

The core task (feature engineering + linear regression + XGBoost benchmark + 10-page report) is achievable by a competent pair in roughly two weeks. With five people, there is a real risk of uneven contribution: two students do the work while three write filler.

Suggestions to calibrate scope:

  • Add a structured ablation study requirement: show MAPE after each major feature engineering step, not just the final number. This forces systematic experimentation and gives each team member a clear sub-task.
  • Require a brief appendix (outside the 10-page limit) with a contribution table showing who did what.
  • Consider requiring two distinct modeling strategies (e.g., one global model and one segmented/local model) with a comparison. This naturally splits work across the team.

4. Technical Accuracy

MAPE is a problematic metric, and the assignment should acknowledge this. MAPE is undefined when \(y_i = 0\) and is asymmetric: for non-negative predictions, an under-prediction can contribute at most 100% error, while an over-prediction's error is unbounded, so optimizing MAPE biases models toward under-prediction. For house prices, \(y_i = 0\) is unlikely, but the asymmetry is real and worth discussing. More importantly, optimizing MAPE directly (rather than MSE or MAE) under a linear model is non-trivial, since MAPE is not the loss the model is fitted to minimize. The assignment should at least flag this mismatch and ask students to reason about it.
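Both properties are easy to demonstrate on toy numbers; a minimal sketch:

```python
# A sketch of MAPE's asymmetry on toy numbers (all values hypothetical).
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true)) * 100

y = np.array([100.0])
print(mape(y, np.array([300.0])))  # over-prediction by 200 -> 200% error
print(mape(y, np.array([0.0])))    # worst under-prediction (to zero) -> 100% error

# The MAPE-optimal constant prediction for [100, 400] sits at the small
# value, well below the mean of 250: MAPE rewards under-prediction.
y2 = np.array([100.0, 400.0])
candidates = np.linspace(50, 450, 401)
best = candidates[np.argmin([mape(y2, np.full(2, c)) for c in candidates])]
print(best)  # 100.0
```

Asking students to reproduce and explain a comparison like this would address the metric discussion directly.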

“GLMs with appropriate link functions such as Logistic” is confusing. A logistic link implies a binary or proportional response, not a continuous price. If the intent is to allow a log-link GLM (Gamma regression), say that. Listing “Logistic” here will mislead students into thinking logistic regression applies to this problem, or it will simply confuse them.
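If the intended fix is indeed a Gamma GLM with a log link, it could be illustrated as follows; a sketch on synthetic data, assuming scikit-learn's `GammaRegressor` (which uses a log link by default):

```python
# Hedged sketch of the presumably intended GLM: Gamma regression with a
# log link, which suits a positive, right-skewed response like sale price.
import numpy as np
from sklearn.linear_model import GammaRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
mu = np.exp(1.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1])  # log link: log(mu) is linear in X
y = rng.gamma(shape=2.0, scale=mu / 2.0)          # Gamma-distributed response, E[y] = mu

# GammaRegressor uses a log link by default; alpha is the L2 penalty strength.
reg = GammaRegressor(alpha=1e-4, max_iter=1000).fit(X, y)
print(reg.intercept_, reg.coef_)  # estimates on the log scale
```

Note that the fitted coefficients live on the log scale, which students would need to explain when interpreting the model for a non-technical client.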

“Upper bound” for XGBoost is technically imprecise. XGBoost does not provide an upper bound on achievable performance; it provides a reference point from a strong nonlinear model. A Bayes-optimal predictor or an ensemble of neural networks could do better. This is minor, but “practical reference” or “strong baseline” would be more accurate.


5. Missing Practical Considerations

Data quality and cleaning. The assignment assumes clean data. Real house price datasets have missing values, outliers (e.g., $1 sales between family members), and coding errors. The assignment should explicitly ask students to document their data cleaning decisions and justify them.

Temporal considerations. House prices have temporal dynamics: market cycles, inflation, seasonality. If the dataset spans multiple years, a naive random split will leak future information into training. The assignment should specify whether a temporal split is expected or at least prompt students to think about it.
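A temporal split of the kind suggested here is short enough to show; a sketch where the column names and data are assumptions:

```python
# Hypothetical sketch: split by sale date so that every test-set sale
# occurs after every training-set sale (column names are assumptions).
import pandas as pd

def temporal_split(df, date_col, test_frac=0.2):
    """Hold out the most recent test_frac of rows by date."""
    df = df.sort_values(date_col)
    cutoff = int(len(df) * (1 - test_frac))
    return df.iloc[:cutoff], df.iloc[cutoff:]

sales = pd.DataFrame({
    "sale_date": pd.date_range("2006-01-01", periods=100, freq="W"),
    "price": range(100),
})
train, test = temporal_split(sales, "sale_date")
```

Compared to a random split, this prevents the model from training on sales that happen after the ones it is tested on.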

Multicollinearity. Linear models are sensitive to multicollinearity, especially with aggressive feature engineering (polynomial terms, interactions). The assignment should prompt students to check for and address this, since it directly affects coefficient interpretability, which the assignment also requires.
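The standard diagnostic is the variance inflation factor; a sketch on synthetic data, where a common rule of thumb flags VIF above roughly 5-10 as problematic:

```python
# Sketch of a variance inflation factor (VIF) check on synthetic features.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """VIF for each column: 1 / (1 - R^2) from regressing it on the others."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Column 2 nearly duplicates column 0, creating severe multicollinearity.
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=500)])
print(vif(X))  # the collinear pair shows very large VIFs; b stays near 1
```

Inflated VIFs mean the individual coefficients are unstable, which directly undermines the interpretability deliverable.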

Confidence intervals. No mention of prediction intervals or uncertainty quantification. In real estate, a point estimate of $500K is far less useful than “$500K with a 90% interval of $450K–$550K.” This is a natural strength of linear models and a missed teaching opportunity.


6. Dataset Suggestions

The assignment references “this dataset” from a prior chapter but does not name it or provide a link. This is a problem for anyone reading the assignment in isolation.

If the dataset is Ames Housing (likely, given the context): This is a solid choice. It has rich features, is well-documented, and has enough complexity for meaningful feature engineering. The main limitation is that it covers a single city (Ames, Iowa) from 2006–2010, which includes the housing crisis. Students should be told this.

Alternative or supplementary datasets to consider:

  • King County (Seattle) house sales: larger, includes lat/lon for spatial features, publicly available on Kaggle. Good for spatial feature engineering.
  • California Housing (sklearn): too simple for a 5-person end-of-semester project. Not recommended as the primary dataset.
  • Zillow or Redfin public data: more realistic but requires more cleaning. Good stretch option.
  • UK Land Registry: very large, real transaction data. Good for teams that want a challenge.

If you want all teams to use the same dataset for fair comparison, stick with Ames but provide the link and version explicitly.


7. Suggested Improvements

High priority (low effort, high impact):

  1. Fix the GLM sentence. Replace “such as Logistic” with “such as Gamma regression with a log link” or remove the example entirely.
  2. Name and link the dataset explicitly.
  3. Add an error analysis requirement: “Identify the segments where your model performs worst and explain why.”
  4. Require a MAPE-per-step ablation table showing incremental gains from each feature engineering decision.
  5. Specify reproducibility requirements concretely: requirements.txt or environment.yml, a single run script, fixed seeds.

Medium priority (moderate effort, strong payoff):

  1. Add a short section on the MAPE metric’s properties (asymmetry, relationship to the fitting loss) and ask students to discuss the implications.
  2. Require residual diagnostics: residual plots, QQ plots, heteroscedasticity checks. These are the bread and butter of linear model validation and students need to practice them.
  3. Ask one deployment-oriented question: “If this model were deployed to estimate prices monthly, what would need to change?”
  4. Require a contribution table in the appendix.

Lower priority (nice to have):

  1. Consider requiring prediction intervals from the linear model and comparing coverage to the XGBoost benchmark.
  2. Provide a held-out test set that students submit predictions for, rather than letting them define their own split. This prevents cherry-picking splits and enables cross-team comparison.
  3. Add a brief competitive element: rank teams by test-set MAPE on the held-out set. This motivates effort without changing the assignment structure.