
Housing Market Classifier

Results
Validation accuracy: 87.3%
Test accuracy: 86.5%
Test macro F1: 0.834
Samples: 1,200
Features: 35

Per-class test performance:

Class     Precision  Recall  F1
Hot       0.89       0.93    0.91
Stable    0.73       0.65    0.69
Cooling   0.90       0.92    0.91

Confusion Matrix
[Figure: confusion matrix showing validation and test results for the GradientBoosting classifier]

Validation: 207/237 correct. Test: 154/178 correct. The model reliably identifies Hot and Cooling markets (F1 = 0.91 for both). Stable is the hardest class (F1 = 0.69): most errors come from Stable markets being misclassified as Hot or Cooling.
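
As a quick sanity check, the headline numbers can be recomputed from the counts and the per-class table. This is only an arithmetic sketch: the inputs are the rounded values shown on this page, so the macro F1 it produces can differ slightly from the reported 0.834.

```python
# Counts and per-class scores copied from the results above.
val_correct, val_total = 207, 237
test_correct, test_total = 154, 178

# (precision, recall) per class, rounded to two decimals as published.
per_class = {
    "Hot": (0.89, 0.93),
    "Stable": (0.73, 0.65),
    "Cooling": (0.90, 0.92),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


val_accuracy = val_correct / val_total
test_accuracy = test_correct / test_total
f1_by_class = {name: f1(p, r) for name, (p, r) in per_class.items()}

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

print(f"validation accuracy: {val_accuracy:.1%}")
print(f"test accuracy: {test_accuracy:.1%}")
for name, score in f1_by_class.items():
    print(f"{name} F1: {score:.2f}")
print(f"macro F1 from rounded inputs: {macro_f1:.3f}")
```

The accuracies and per-class F1 values match the figures above. The macro F1 recomputed this way comes out a little above 0.834, which is consistent with the table values being rounded to two decimals.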

Feature Importance
[Figure: top 15 features by Gini importance for the GradientBoosting model]

Seasonal quarter factor dominates (0.39 Gini importance), reflecting strong seasonal patterns in Norwegian housing. Price acceleration, GDP-price interaction, price index level, and price momentum round out the top 5.

Market State Definitions

Housing market conditions are identified by analyzing the balance between supply (inventory) and demand (buyer activity):

Metric              Hot (Seller's)     Stable (Balanced)   Cooling (Buyer's)
Price appreciation  Rapidly rising     Moderate (~3%/yr)   Slowing / declining
Inventory           Very low (<3 mo)   Balanced (4–6 mo)   Rising / high
Days on market      Very short         Average             Longer / rising
Price reductions    Rare               Occasional          Increasing / frequent
Bargaining power    Seller             Balanced            Buyer

In this project, the labels are derived from next-quarter price index changes: HOT (>2%), STABLE (−0.5% to 2%), and COOLING (<−0.5%).
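
The labelling rule above can be sketched as a small function. The thresholds come from the text. The function names, and the choice to treat exactly 2% and exactly −0.5% as STABLE, are assumptions made for illustration rather than the project's actual code.

```python
HOT_THRESHOLD = 2.0       # percent, from the label definition above
COOLING_THRESHOLD = -0.5  # percent, from the label definition above


def label_market(next_quarter_change_pct: float) -> str:
    """Map a next-quarter price index change (in percent) to a market state."""
    if next_quarter_change_pct > HOT_THRESHOLD:
        return "HOT"
    if next_quarter_change_pct < COOLING_THRESHOLD:
        return "COOLING"
    # Assumption: both boundary values fall into STABLE.
    return "STABLE"


def label_series(price_index: list[float]) -> list[str]:
    """Label each quarter by the percentage change to the following quarter.

    The last quarter has no successor, so it gets no label.
    """
    labels = []
    for current, following in zip(price_index, price_index[1:]):
        change_pct = (following - current) / current * 100
        labels.append(label_market(change_pct))
    return labels


# Hypothetical index values for one county, four consecutive quarters.
example_labels = label_series([100.0, 103.0, 103.5, 102.0])
print(example_labels)
```

Because the label looks one quarter ahead, the final quarter of each county's series has to be dropped from training.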

Approach

The project started with a Random Forest built from scratch (no scikit-learn) to understand the fundamentals: Gini impurity, bootstrap sampling, and majority voting. The initial model, trained on 80 manually downloaded samples, achieved 67% validation accuracy and could not predict Stable at all.
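
The three building blocks named above are small enough to sketch with the standard library. This is an illustrative version, not the project's reference implementation. All function names are hypothetical, and the tree-growing step that ties them together is left out.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, one sample per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)

print(gini_impurity(["Hot", "Hot"]))                # pure node
print(gini_impurity(["Hot", "Stable", "Cooling"]))  # evenly mixed node
print(majority_vote(["Hot", "Cooling", "Hot"]))
```

A pure node has impurity 0, and an even three-way mix has impurity 2/3, the maximum for three classes. Each tree is trained on its own bootstrap sample, and the forest's prediction is the majority vote across trees.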

An automated pipeline for the SSB (Statistics Norway) API then fetches 10 statistical tables programmatically, covering 2005–2024 across all 15 Norwegian counties. This expanded the dataset to 1,200 samples with 35 engineered features derived from house prices, CPI, unemployment, GDP, building permits, mortgage rates, population growth, and household income.
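
To illustrate what fetching a table programmatically can look like, here is a minimal sketch that builds a POST request without sending it. It assumes the PxWeb-style JSON query format and the `https://data.ssb.no/api/v0/en/table/<id>` endpoint pattern. The function name `build_ssb_request`, the variable code `Tid`, and the quarter values are assumptions for illustration and are not taken from fetch_ssb_data.py. The actual variable codes differ per table and should be read from the table's metadata.

```python
import json
import urllib.request

# Assumed endpoint pattern for the SSB Statistikkbanken API.
SSB_API_BASE = "https://data.ssb.no/api/v0/en/table"


def build_ssb_request(table_id: str, selections: dict[str, list[str]]) -> urllib.request.Request:
    """Build (but do not send) a POST request for one SSB table.

    `selections` maps a variable code to the values to fetch. The codes
    used here are hypothetical and must match the table's metadata.
    """
    query = {
        "query": [
            {"code": code, "selection": {"filter": "item", "values": values}}
            for code, values in selections.items()
        ],
        "response": {"format": "json-stat2"},
    }
    return urllib.request.Request(
        url=f"{SSB_API_BASE}/{table_id}",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Table 07221 is the house price index in the data source list.
request = build_ssb_request("07221", {"Tid": ["2005K1", "2005K2"]})
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch. Keeping request construction separate from the network call makes the query easy to inspect and test offline.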

RandomForest (balanced class weights) was then compared against GradientBoosting. Gradient Boosting won on macro-F1 and became the final model.
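
A comparison of this kind can be sketched with scikit-learn as follows. The data is a synthetic three-class stand-in, not the housing dataset, and the random split is for brevity only, since the project itself uses a chronological split. The Gradient Boosting settings mirror the Model Configuration section. The Random Forest uses library defaults apart from balanced class weights, which is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with 35 features; NOT the project's housing data.
X, y = make_classification(
    n_samples=600, n_features=35, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

candidates = {
    "RandomForest": RandomForestClassifier(
        class_weight="balanced", random_state=0
    ),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.1,
        subsample=0.8, random_state=0,
    ),
}

# Score every candidate on the same validation set with macro-averaged F1,
# which weights the three classes equally regardless of their size.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_val, model.predict(X_val), average="macro")

best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: macro-F1 = {score:.3f}")
print("selected:", best)
```

Macro-F1 is a reasonable selection metric here because plain accuracy can look good while the hardest class (Stable, in this project) is predicted poorly.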

Pipeline
fetch_ssb_data.py (API) → data_parser.py (JSON→CSV) → enhanced_features.py (35 features) → train_model.py (sklearn)
Data Sources (SSB API)
Table   Feature
03013   Consumer Price Index
10701   Policy rate
01222   Population change
07221   House price index
10187   Property sales volume
13760   Unemployment rate
03723   Building starts
10748   Mortgage interest rates
09171   GDP volume change
06944   Household income

Coverage: 2005–2024, 15 counties, quarterly. All fetched automatically via the SSB Statistikkbanken API.

Model Configuration
Algorithm: Gradient Boosting
Trees: 200 estimators
Max depth: 5
Learning rate: 0.1
Subsample: 0.8
Features: 35 engineered
Data split: 65/20/15 (train/validation/test, chronological)
Compared against: RandomForest (balanced class weights)
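
A chronological 65/20/15 split keeps the oldest data for training and the newest for testing, so the model is never evaluated on quarters that precede its training data. The sketch below shows one way to do this in plain Python. The function name, the row layout, and the row-count-based cut points are assumptions for illustration, not the logic in train_model.py.

```python
def chronological_split(rows, time_key, fractions=(0.65, 0.20, 0.15)):
    """Split rows into train/validation/test by time, oldest first."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ordered = sorted(rows, key=time_key)
    n = len(ordered)
    train_end = round(n * fractions[0])
    val_end = train_end + round(n * fractions[1])
    # Whatever remains after the first two cuts becomes the test set.
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical rows: one per quarter for 2005-2009 (20 rows).
rows = [{"year": y, "quarter": q} for y in range(2005, 2010) for q in range(1, 5)]
train, val, test = chronological_split(
    rows, time_key=lambda r: (r["year"], r["quarter"])
)
print(len(train), len(val), len(test))
```

With several counties per quarter, cutting by row count can separate rows from the same quarter into different sets. Splitting on distinct periods instead avoids that if it matters for the evaluation.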
Tech Stack
  • Python 3.12+
  • scikit-learn
  • Pandas
  • NumPy
  • Matplotlib
  • Seaborn
  • SSB API

The original from-scratch Random Forest is kept as a reference implementation.

Counties Covered
Oslo, Rogaland, Vestland, Trøndelag, Innlandet, Agder, Nordland, Møre og Romsdal, Viken, Vestfold og Telemark, Troms og Finnmark

15 counties, 2005–2024.

Claude Opus 4.6 was used as a development tool throughout the project.
didriksi.com | Course Catalog | © 2026 Didrik Sivertsen