California House Price Prediction Using Linear Regression

Link to GitHub repo

You are working with a real estate analytics team that helps housing agencies and government planners understand how different factors affect property prices across California. Using data from the 1990 California Census, your task is to build a Linear Regression model that predicts the median house value in a given census block, based on numerical features such as median income, housing age, total rooms, population, and more.

Dataset

The dataset includes over 20,000 records and covers diverse regions, including coastal, bay, and inland areas. The model will serve as a simple, interpretable tool to estimate housing prices and identify the most influential factors driving them.

Why is this important?

- Helps housing agencies and planners estimate property values across California using census data.
- Demonstrates how to handle real-world datasets with missing values, capped targets, and categorical features.
- Reinforces the importance of feature engineering, preprocessing, and evaluation in regression tasks.

Features

- longitude, latitude (location coordinates)
- housing_median_age (median age of houses in the district)
- total_rooms, total_bedrooms (number of rooms and bedrooms)
- population (total people in the district)
- households (total number of households)
- median_income (median income of residents)
- ocean_proximity (categorical feature indicating the district’s proximity to the ocean)

Target

- median_house_value (median house price in USD for the block)

Size

- 20,640 entries
- 8 numeric features, 1 categorical feature, 1 numeric target variable
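
As a rough illustration of the schema described above, the sketch below checks that a DataFrame contains the listed columns and reports any missing values. The column names come from the feature list; the sample rows are made up for illustration (they are not real census records), and the `summarize` helper is a hypothetical name, not part of the project.

```python
import pandas as pd

# Column names taken from the Features and Target lists above.
NUMERIC_FEATURES = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
]
CATEGORICAL_FEATURES = ["ocean_proximity"]
TARGET = "median_house_value"


def summarize(df: pd.DataFrame) -> dict:
    """Return the row count, any expected columns that are absent,
    and the number of missing values per column (non-zero only)."""
    expected = NUMERIC_FEATURES + CATEGORICAL_FEATURES + [TARGET]
    missing_columns = [c for c in expected if c not in df.columns]
    null_counts = df.isna().sum()
    return {
        "rows": len(df),
        "missing_columns": missing_columns,
        "null_counts": null_counts[null_counts > 0].to_dict(),
    }


# Tiny made-up sample; one missing value is placed in total_bedrooms
# purely to show how the report looks.
sample = pd.DataFrame({
    "longitude": [-120.0, -121.0, -118.0],
    "latitude": [35.0, 37.0, 34.0],
    "housing_median_age": [10.0, 25.0, 40.0],
    "total_rooms": [1000.0, 2500.0, 1800.0],
    "total_bedrooms": [200.0, None, 350.0],
    "population": [800.0, 1500.0, 1200.0],
    "households": [300.0, 600.0, 450.0],
    "median_income": [3.5, 5.0, 4.2],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"],
    "median_house_value": [150000.0, 300000.0, 250000.0],
})

summary = summarize(sample)
print(summary)
```

With the real data, the same helper could be applied to the result of `pd.read_csv(...)` on the project's CSV file; the file name is not given here, so it is left out.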

Packages Used

- NumPy
- Pandas
- Matplotlib
- Seaborn
- scikit-learn

Steps Taken

- Loading & Exploring the Dataset
- Visualizing the Data
- Handling Missing Values
- Handling Capped Values
- Handling Categorical Values
- Splitting Data into Train & Test Sets
- Feature Scaling
- Training the Linear Regression Model
- Predicting on Test Data
- Evaluating the Model
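
The preprocessing and training steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: it runs on synthetic stand-in data, and the specific choices (median imputation, dropping rows that sit at the target's maximum, one-hot encoding with `pd.get_dummies`, an 80/20 split, `StandardScaler`) are assumptions about one reasonable way to carry out each step. The helper names `make_synthetic` and `prepare` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

TARGET = "median_house_value"


def make_synthetic(n: int = 500, seed: int = 0) -> pd.DataFrame:
    """Build stand-in data with a few of the listed columns (not census data)."""
    rng = np.random.default_rng(seed)
    income = rng.uniform(1.0, 10.0, n)
    age = rng.integers(1, 52, n).astype(float)
    bedrooms = rng.uniform(100.0, 1000.0, n)
    bedrooms[rng.choice(n, size=10, replace=False)] = np.nan  # inject missing values
    proximity = rng.choice(["INLAND", "NEAR OCEAN", "NEAR BAY"], n)
    value = 40_000 * income + 500 * age + rng.normal(0, 20_000, n)
    value = np.minimum(value, 350_000)  # artificial cap to mimic a capped target
    return pd.DataFrame({
        "median_income": income,
        "housing_median_age": age,
        "total_bedrooms": bedrooms,
        "ocean_proximity": proximity,
        TARGET: value,
    })


def prepare(df: pd.DataFrame):
    """Apply the missing-value, capped-value, and categorical steps."""
    df = df.copy()
    # Missing values: fill with the column median (assumed strategy).
    df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
    # Capped values: drop rows at the maximum target value (assumed strategy).
    df = df[df[TARGET] < df[TARGET].max()]
    # Categorical values: one-hot encode, dropping one level to avoid redundancy.
    df = pd.get_dummies(df, columns=["ocean_proximity"], drop_first=True, dtype=float)
    return df.drop(columns=TARGET), df[TARGET]


X, y = prepare(make_synthetic())

# Split before scaling so the scaler only sees training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

model = LinearRegression().fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
print(y_pred[:5])
```

Fitting the scaler on the training split only keeps test-set statistics out of the model. The imputation here follows the step order listed above (before the split); computing the median from the training split alone would be a stricter alternative.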

Conclusion

In this project, we built a Linear Regression model to predict median house values across California using data from the 1990 Census. The model was trained on key numeric and categorical features, including median income, housing age, population, and proximity to the ocean.

Performance Summary:

- Mean Absolute Error (MAE): 44,136
- Root Mean Squared Error (RMSE): 59,669
- R² Score: 0.6156

These metrics suggest that, on average, our predictions deviate from actual house values by around $44,000, and the model explains about 61.6% of the variance in housing prices. While the model captures general trends (especially the strong correlation between income and price), the relatively high error values suggest that housing prices are influenced by non-linear relationships and complex interactions not fully captured by a simple linear model.
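
For reference, the three metrics reported above can be computed from true and predicted values as in the sketch below. It uses plain NumPy and the standard definitions; the `regression_metrics` helper and the example numbers are illustrative only and are unrelated to the project's actual predictions.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE, and R² from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))          # average absolute deviation
    rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large errors more heavily
    ss_res = np.sum(errors ** 2)           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot             # share of variance explained

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Made-up house values in USD, for illustration only.
y_true_example = [100_000, 200_000, 300_000, 400_000]
y_pred_example = [110_000, 190_000, 320_000, 380_000]

metrics = regression_metrics(y_true_example, y_pred_example)
print(metrics)
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE; a wide gap between the two, as in the results above, hints that a subset of predictions is off by a large amount.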

Takeaway

Linear regression provided a simple and interpretable baseline, helping identify influential features like median_income. However, to achieve more accurate predictions, especially in areas with extreme property values or unique geographic characteristics, more advanced models like Random Forests or Gradient Boosting could be explored in future iterations.