Sling Academy

How to Perform Calibration with Scikit-Learn's `CalibratedClassifierCV`

Last updated: December 17, 2024

In many machine learning applications, accurate probability estimates are crucial. Whether you're making decisions directly from predicted probabilities or need well-calibrated probabilities for further analysis, the CalibratedClassifierCV in Scikit-Learn can be an effective tool. Calibration improves how closely your predicted probabilities match the observed frequency of outcomes across your samples. This guide walks you through the steps to perform calibration using Scikit-Learn.

Understanding Calibration

Calibration of probabilities refers to the process of aligning the probabilities predicted by a model with the true probabilities. For a perfectly calibrated model, a predicted probability of 70% for a certain class means that, in a large number of instances, this event will occur about 70% of the time.

Scikit-Learn's CalibratedClassifierCV helps achieve this by fitting a calibration model on top of any base classifier: either Platt scaling (a parametric sigmoid fit) or isotonic regression (a non-parametric, monotonic fit).

Getting Started with CalibratedClassifierCV

Step 1: Install and Import Necessary Libraries

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss, precision_score, recall_score, f1_score, confusion_matrix

Ensure that Scikit-Learn is installed in your environment. If not, you can install it using pip:

pip install scikit-learn

Step 2: Generate or Load Your Data

Start by generating a synthetic dataset or using your own data:

X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 3: Train a Base Classifier

Select a classifier that you want to calibrate. Here, we'll use a RandomForestClassifier. Fitting it now lets you compare its raw probabilities with the calibrated ones later; note that when CalibratedClassifierCV is used with an integer cv (as below), it refits clones of the base estimator internally.

base_clf = RandomForestClassifier(n_estimators=100, random_state=42)
base_clf.fit(X_train, y_train)
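Before calibrating, it can be useful to record the Brier score of the raw, uncalibrated probabilities so you have a baseline to compare against. A minimal, self-contained sketch (repeating the setup from the steps above; variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Same synthetic data and split as above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

base_clf = RandomForestClassifier(n_estimators=100, random_state=42)
base_clf.fit(X_train, y_train)

# Brier score of the raw (uncalibrated) probabilities; lower is better
base_prob = base_clf.predict_proba(X_test)[:, 1]
base_brier = brier_score_loss(y_test, base_prob)
print(f"Uncalibrated Brier score: {base_brier:.4f}")
```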

Step 4: Calibrate the Classifier

Wrap the base classifier using CalibratedClassifierCV. You can choose between calibration methods: 'sigmoid' for Platt scaling or 'isotonic' for isotonic regression. Note that in scikit-learn 1.2 and later the wrapper's parameter is named estimator (the older base_estimator name was removed in version 1.4).

calibrated_clf = CalibratedClassifierCV(estimator=base_clf, method='sigmoid', cv=5)
calibrated_clf.fit(X_train, y_train)

Evaluation

After training, evaluate the calibrated classifier both in terms of performance metrics and calibration metrics such as the Brier score.

y_pred = calibrated_clf.predict(X_test)
y_prob = calibrated_clf.predict_proba(X_test)[:, 1]

# Performance metrics
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Calibration metric
brier_score = brier_score_loss(y_test, y_prob)

print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)
print("Confusion Matrix:\n", conf_matrix)
print("Brier Score: ", brier_score)

The Brier score is particularly useful here: it is the mean squared difference between the predicted probabilities and the actual outcomes, ranging from 0 to 1, where lower values indicate better-calibrated predictions.

Key Takeaway

Model calibration is a crucial step when dealing with probability predictions in machine learning. By using Scikit-Learn's CalibratedClassifierCV, you can ensure your models' probability estimates are well calibrated and reliable, thereby improving decision-making processes and analytical insights.

Series: Scikit-Learn Tutorials
