Sling Academy

Working with the California Housing Dataset in Scikit-Learn

Last updated: December 17, 2024

Scikit-learn is one of the most popular Python libraries for machine learning, offering a simple, consistent API and a wide range of algorithms for classification, regression, clustering, and more. In this article, we'll focus on the California Housing dataset bundled with scikit-learn, which is well suited to building and evaluating regression models.

Loading the Dataset

Scikit-learn makes it incredibly easy to load this dataset, which we can use for training regression models. The California Housing dataset contains 20,640 samples with eight numeric features (such as median income, house age, and average occupancy), and its target is the median house value in units of $100,000, derived from the 1990 U.S. census. Let's start by loading it using the Scikit-learn API:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the dataset
california_housing = fetch_california_housing()

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    california_housing.data, california_housing.target, test_size=0.2, random_state=42)

In this block of code, we import the necessary modules and load the dataset using the fetch_california_housing() function. Splitting the data into training and test sets is achieved with the train_test_split function, provided by scikit-learn.
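As a quick sanity check, we can confirm the split sizes. The full dataset has 20,640 samples and 8 features, so an 80/20 split should yield 16,512 training rows and 4,128 test rows:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

california_housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    california_housing.data, california_housing.target,
    test_size=0.2, random_state=42)

# 80% of 20,640 samples for training, 20% held out for testing
print(X_train.shape)  # (16512, 8)
print(X_test.shape)   # (4128, 8)
```

Fixing `random_state=42` makes the split reproducible, so metrics computed later can be compared across runs.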

Exploring the Dataset

It's important to explore and understand our dataset before diving into modeling. Here's a brief look at the data:

import pandas as pd

# Convert to DataFrame for easy manipulation
columns = california_housing.feature_names
california_df = pd.DataFrame(california_housing.data, columns=columns)

# Show first few rows of the dataframe
print(california_df.head())

This prints out a preview of the California housing data, giving us an understanding of each feature.
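Beyond a preview of the rows, summary statistics are useful for spotting scale differences between features. A minimal sketch, adding the target as a `MedHouseVal` column (its name in the dataset's documentation) and calling `describe()`:

```python
import pandas as pd
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing()
california_df = pd.DataFrame(california_housing.data,
                             columns=california_housing.feature_names)

# Append the target: median house value in units of $100,000
california_df["MedHouseVal"] = california_housing.target

# Per-feature count, mean, std, min/max, and quartiles; note the very
# different scales of e.g. Population versus AveBedrms
print(california_df.describe())
```

The large scale differences visible here are one motivation for the feature scaling applied later in this article.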

Building a Regression Model

Regression models are ideal for our task, as we’re trying to predict continuous values (in this case, house prices). Let's use a simple Linear Regression model to start:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Instantiate the model
model = LinearRegression()

# Fit the model on training data
model.fit(X_train, y_train)

# Predict house values using the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error: ", mse)
print("R2 Score: ", r2)

In the code above, we first import a linear regression model and two metrics for evaluating its performance. The model is fitted on the training set and then used to predict house values for the test set. Performance is measured with the mean squared error (MSE, lower is better) and the R2 score (closer to 1 is better).
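Once fitted, a linear model is also easy to interpret: it learns one coefficient per feature plus an intercept. A small sketch pairing each feature name with its learned coefficient (retraining the same model as above for self-containment):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

california_housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    california_housing.data, california_housing.target,
    test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# One learned weight per input feature, plus an intercept term
for name, coef in zip(california_housing.feature_names, model.coef_):
    print(f"{name}: {coef:.4f}")
print("Intercept:", model.intercept_)
```

Keep in mind that raw coefficients are not directly comparable across features with different scales; the scaling step below helps with that.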

Feature Scaling and Enhancements

To improve our workflow, we can add feature scaling. Standardizing puts all features on a comparable scale, which matters for regularized or gradient-based models and makes coefficients easier to compare (plain least squares predictions are largely unaffected). Scikit-learn offers pipelines to chain such preprocessing with a model:

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Create a pipeline with scaling and modeling
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# Fit and evaluate as before
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error after scaling: ", mse)
print("R2 Score after scaling: ", r2)

With a pipeline, the preprocessing and modeling steps are consolidated into a single estimator, which makes the code easier to maintain and ensures the scaler is fit only on the training data, avoiding leakage into the test set. After scaling, each feature has zero mean and unit variance on the training set.
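Because the pipeline behaves like a single estimator, it also plugs directly into cross-validation. A minimal sketch using 5-fold cross-validation, where each fold refits the scaler on its own training split so no information leaks from the held-out fold:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

california_housing = fetch_california_housing()
pipeline = make_pipeline(StandardScaler(), LinearRegression())

# 5-fold cross-validation; the scaler is refit inside every fold
scores = cross_val_score(pipeline, california_housing.data,
                         california_housing.target, cv=5, scoring="r2")
print("R2 per fold:", scores)
print("Mean R2:", scores.mean())
```

Cross-validated scores give a more stable estimate of generalization than a single train/test split.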

Conclusion

The California Housing dataset serves as an excellent foundation for experimenting with regression in scikit-learn. By following this article, you'll gain an understanding of loading datasets, exploring the data, developing regression models, and enhancing your models through preprocessing techniques. Building and refining models iteratively using tools like Scikit-learn provides invaluable insights and paves the way towards mastery in machine learning.
