.. _calibration:

=======================
Probability calibration
=======================

.. currentmodule:: sklearn.calibration


When performing classification you often want not only to predict the class
label, but also obtain a probability of the respective label. This probability
gives you some kind of confidence on the prediction. Some models can give you
poor estimates of the class probabilities and some even do not support
probability prediction (e.g., some instances of
:class:`~sklearn.linear_model.SGDClassifier`).
The calibration module allows you to better calibrate
the probabilities of a given model, or to add support for probability
prediction.

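As a minimal sketch of the latter use case (the estimator choice and synthetic data
below are only illustrative), an :class:`~sklearn.linear_model.SGDClassifier` fitted
with the hinge loss exposes no :term:`predict_proba`, but wrapping it in
:class:`CalibratedClassifierCV` adds calibrated probability estimates::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # SGDClassifier with the default hinge loss has no predict_proba method
    base = SGDClassifier(loss="hinge", random_state=0)
    calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X, y)
    calibrated.predict_proba(X[:5])  # calibrated probabilities in [0, 1]
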
Well calibrated classifiers are probabilistic classifiers for which the output
of the :term:`predict_proba` method can be directly interpreted as a confidence
level.
For instance, a well calibrated (binary) classifier should classify the samples such
that among the samples to which it gave a :term:`predict_proba` value close to, say,
0.8, approximately 80% actually belong to the positive class.

Before we show how to re-calibrate a classifier, we first need a way to detect how
well a classifier is calibrated.

.. note::
    Strictly proper scoring rules for probabilistic predictions like
    :func:`sklearn.metrics.brier_score_loss` and
    :func:`sklearn.metrics.log_loss` assess calibration (reliability) and
    discriminative power (resolution) of a model, as well as the randomness of the
    data (uncertainty) at the same time. This follows from the well-known Brier score
    decomposition of Murphy [1]_. As it is not clear which term dominates, the score
    is of limited use for assessing calibration alone (unless one computes each term
    of the decomposition). A lower Brier loss, for instance, does not necessarily
    mean a better calibrated model; it could also mean a worse calibrated model with
    much more discriminatory power, e.g. using many more features.

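    As a minimal sketch of these two metrics (the labels and predicted probabilities
    below are made up), both scores can be computed directly::

        from sklearn.metrics import brier_score_loss, log_loss

        y_true = [0, 0, 1, 1]             # hypothetical binary labels
        y_prob = [0.1, 0.4, 0.35, 0.8]    # hypothetical predicted probabilities
        brier_score_loss(y_true, y_prob)  # mean squared difference to the labels
        log_loss(y_true, y_prob)          # negative mean log-likelihood
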
.. _calibration_curve:

Calibration curves
------------------

Calibration curves, also referred to as *reliability diagrams* (Wilks 1995 [2]_),
compare how well the probabilistic predictions of a binary classifier are calibrated.
A calibration curve plots the frequency of the positive label (to be more precise, an
estimation of the *conditional event probability*
:math:`P(Y=1|\text{predict_proba})`) on the y-axis against the predicted probability
:term:`predict_proba` of a model on the x-axis.
The tricky part is to get values for the y-axis.
In scikit-learn, this is accomplished by binning the predictions such that the x-axis
represents the average predicted probability in each bin.
The y-axis is then the *fraction of positives* given the predictions of that bin, i.e.
the proportion of samples whose class is the positive class (in each bin).

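A minimal sketch of computing these two axes with :func:`calibration_curve` (the
classifier and synthetic data below are only for illustration)::

    from sklearn.calibration import calibration_curve
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_train, y_train)
    prob_pos = clf.predict_proba(X_test)[:, 1]
    # fraction of positives (y-axis) and mean predicted probability (x-axis), per bin
    prob_true, prob_pred = calibration_curve(y_test, prob_pos, n_bins=10)
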
The top calibration curve plot is created with
:func:`CalibrationDisplay.from_estimator`, which uses :func:`calibration_curve` to
calculate the per-bin average predicted probabilities and fraction of positives.
:func:`CalibrationDisplay.from_estimator`
takes as input a fitted classifier, which is used to calculate the predicted
probabilities. The classifier thus must have a :term:`predict_proba` method. For
the few classifiers that do not have a :term:`predict_proba` method, it is
possible to use :class:`CalibratedClassifierCV` to calibrate the classifier
outputs to probabilities.

The bottom histogram gives some insight into the behavior of each classifier
by showing the number of samples in each predicted probability bin.

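For a single classifier, a calibration curve plot can be drawn in one call; the sketch
below reuses `clf`, `X_test` and `y_test` from the earlier snippet::

    from sklearn.calibration import CalibrationDisplay

    # plots the calibration curve of the fitted classifier on held-out data
    disp = CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)
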
.. figure:: ../auto_examples/calibration/images/sphx_glr_plot_compare_calibration_001.png
   :target: ../auto_examples/calibration/plot_compare_calibration.html
   :align: center

.. currentmodule:: sklearn.linear_model

:class:`LogisticRegression` is more likely to return well calibrated predictions by itself as it has a
canonical link function for its loss, i.e. the logit-link for the :ref:`log_loss`.
In the unpenalized case, this leads to the so-called **balance property**, see [8]_ and :ref:`Logistic_regression`.
In the plot above, the data is generated according to a linear mechanism, which is
consistent with the :class:`LogisticRegression` model (the model is 'well specified'),
and the value of the regularization parameter `C` is tuned to be
appropriate (neither too strong nor too weak). As a consequence, this model returns
accurate predictions from its `predict_proba` method.
In contrast, the other models shown return biased probabilities, with
different biases for each model.

.. currentmodule:: sklearn.naive_bayes

:class:`GaussianNB` (Naive Bayes) tends to push probabilities to 0 or 1 (note the counts
in the histograms). This is mainly because it makes the assumption that
features are conditionally independent given the class, which is not the
case for this dataset, which contains 2 redundant features.

.. currentmodule:: sklearn.ensemble

:class:`RandomForestClassifier` shows the opposite behavior: the histograms
show peaks at probabilities approximately 0.2 and 0.9, while probabilities
close to 0 or 1 are very rare. An explanation for this is given by
Niculescu-Mizil and Caruana [3]_: "Methods such as bagging and random
forests that average predictions from a base set of models can have
difficulty making predictions near 0 and 1 because variance in the
underlying base models will bias predictions that should be near zero or one
away from these values. Because predictions are restricted to the interval
[0,1], errors caused by variance tend to be one-sided near zero and one. For
example, if a model should predict p = 0 for a case, the only way bagging
can achieve this is if all bagged trees predict zero. If we add noise to the
trees that bagging is averaging over, this noise will cause some trees to
predict values larger than 0 for this case, thus moving the average
prediction of the bagged ensemble away from 0. We observe this effect most
strongly with random forests because the base-level trees trained with
random forests have relatively high variance due to feature subsetting." As
a result, the calibration curve shows a characteristic sigmoid shape, indicating that
the classifier could trust its "intuition" more and typically return probabilities
closer to 0 or 1.

.. currentmodule:: sklearn.svm

:class:`LinearSVC` (SVC) shows an even more sigmoid curve than the random forest, which
is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [3]_), as
they focus on difficult-to-classify samples that are close to the decision boundary
(the support vectors).

Calibrating a classifier
------------------------

.. currentmodule:: sklearn.calibration

Calibrating a classifier consists of fitting a regressor (called a
*calibrator*) that maps the output of the classifier (as given by
:term:`decision_function` or :term:`predict_proba`) to a calibrated probability
in [0, 1]. Denoting the output of the classifier for a given sample by :math:`f_i`,
the calibrator tries to predict the conditional event probability
:math:`P(y_i = 1 | f_i)`.

Ideally, the calibrator is fit on a dataset independent of the training data used to
fit the classifier in the first place.
This is because the performance of the classifier on its training data would be
better than on novel data. Using the classifier's output on its training data
to fit the calibrator would thus result in a biased calibrator that maps to
probabilities closer to 0 and 1 than it should.

Usage
-----

The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.

:class:`CalibratedClassifierCV` uses a cross-validation approach to ensure
unbiased data is always used to fit the calibrator. The data is split into k
`(train_set, test_set)` couples (as determined by `cv`). When `ensemble=True`
(default), the following procedure is repeated independently for each
cross-validation split: a clone of `base_estimator` is first trained on the
train subset. Then its predictions on the test subset are used to fit a
calibrator (either a sigmoid or isotonic regressor). This results in an
ensemble of k `(classifier, calibrator)` couples where each calibrator maps
the output of its corresponding classifier into [0, 1]. Each couple is exposed
in the `calibrated_classifiers_` attribute, where each entry is a calibrated
classifier with a :term:`predict_proba` method that outputs calibrated
probabilities. The output of :term:`predict_proba` for the main
:class:`CalibratedClassifierCV` instance corresponds to the average of the
predicted probabilities of the `k` estimators in the `calibrated_classifiers_`
list. The output of :term:`predict` is the class that has the highest
probability.

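A minimal sketch of this ensemble behavior (the choice of
:class:`~sklearn.svm.LinearSVC` and the synthetic data are only illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, random_state=0)
    calibrated = CalibratedClassifierCV(
        LinearSVC(), method="sigmoid", cv=3, ensemble=True
    )
    calibrated.fit(X, y)
    # one (classifier, calibrator) couple per cross-validation split
    len(calibrated.calibrated_classifiers_)  # 3
    # predict_proba averages the three calibrated predictions
    calibrated.predict_proba(X[:2])
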
When `ensemble=False`, cross-validation is used to obtain 'unbiased'
predictions for all the data, via
:func:`~sklearn.model_selection.cross_val_predict`.
These unbiased predictions are then used to train the calibrator. The attribute
`calibrated_classifiers_` consists of only one `(classifier, calibrator)`
couple where the classifier is the `base_estimator` trained on all the data.
In this case the output of :term:`predict_proba` for
:class:`CalibratedClassifierCV` is the predicted probabilities obtained
from the single `(classifier, calibrator)` couple.

The main advantage of `ensemble=True` is to benefit from the traditional
ensembling effect (similar to :ref:`bagging`). The resulting ensemble should
both be well calibrated and slightly more accurate than with `ensemble=False`.
The main advantage of using `ensemble=False` is computational: it reduces the
overall fit time by training only a single base classifier and calibrator
pair, decreases the final model size and increases prediction speed.

Alternatively, an already fitted classifier can be calibrated by setting
`cv="prefit"`. In this case, the data is not split and all of it is used to
fit the regressor. It is up to the user to
make sure that the data used for fitting the classifier is disjoint from the
data used for fitting the regressor.

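A minimal sketch of this prefit workflow (classifier choice and data are only
illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=2000, random_state=0)
    # the classifier and the calibrator are fit on disjoint subsets
    X_fit, X_calib, y_fit, y_calib = train_test_split(X, y, random_state=0)
    clf = LinearSVC().fit(X_fit, y_fit)
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
    calibrated.fit(X_calib, y_calib)
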
:class:`CalibratedClassifierCV` supports the use of two regression techniques
for calibration via the `method` parameter: `"sigmoid"` and `"isotonic"`.

.. _sigmoid_regressor:

Sigmoid
^^^^^^^

The sigmoid regressor, `method="sigmoid"`, is based on Platt's logistic model [4]_:

.. math::
    p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)} \,,

where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i`
is the output of the un-calibrated classifier for sample :math:`i`. :math:`A`
and :math:`B` are real numbers to be determined when fitting the regressor via
maximum likelihood.

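As a purely numerical illustration of this mapping (the values of :math:`A` and
:math:`B` below are made up; in practice they are fitted to data)::

    import numpy as np

    def platt_sigmoid(f, A, B):
        # p(y = 1 | f) = 1 / (1 + exp(A * f + B))
        return 1.0 / (1.0 + np.exp(A * f + B))

    scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # hypothetical classifier outputs
    # with A < 0 the mapping is increasing in the score, as expected for calibration
    platt_sigmoid(scores, A=-1.5, B=0.0)
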
The sigmoid method assumes the :ref:`calibration curve <calibration_curve>`
can be corrected by applying a sigmoid function to the raw predictions. This
assumption has been empirically justified in the case of :ref:`svm` with
common kernel functions on various benchmark datasets in section 2.1 of Platt
1999 [4]_ but does not necessarily hold in general. Additionally, the
logistic model works best if the calibration error is symmetrical, meaning
the classifier output for each binary class is normally distributed with
the same variance [7]_. This can be a problem for highly imbalanced
classification problems, where outputs do not have equal variance.

In general this method is most effective for small sample sizes or when the
un-calibrated model is under-confident and has similar calibration errors for both
high and low outputs.

Isotonic
^^^^^^^^

Setting `method="isotonic"` fits a non-parametric isotonic regressor, which outputs
a step-wise non-decreasing function, see :mod:`sklearn.isotonic`. It minimizes:

.. math::
    \sum_{i=1}^{n} (y_i - \hat{f}_i)^2

subject to :math:`\hat{f}_i \geq \hat{f}_j` whenever
:math:`f_i \geq f_j`. :math:`y_i` is the true
label of sample :math:`i` and :math:`\hat{f}_i` is the output of the
calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
This method is more general when compared to 'sigmoid' as the only restriction
is that the mapping function is monotonically increasing. It is thus more
powerful as it can correct any monotonic distortion of the un-calibrated model.
However, it is more prone to overfitting, especially on small datasets [6]_.

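A minimal sketch of such a monotonic fit using
:class:`~sklearn.isotonic.IsotonicRegression` directly (the scores and labels below
are synthetic)::

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.RandomState(0)
    scores = np.sort(rng.uniform(size=200))                    # un-calibrated scores
    labels = (rng.uniform(size=200) < scores**2).astype(int)   # synthetic labels
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrated = iso.fit_transform(scores, labels)  # step-wise, non-decreasing in scores
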
Overall, 'isotonic' will perform as well as or better than 'sigmoid' when
there is enough data (greater than ~ 1000 samples) to avoid overfitting [3]_.

.. note:: Impact on ranking metrics like AUC

    It is generally expected that calibration does not affect ranking metrics such as
    ROC-AUC. However, these metrics might differ after calibration when using
    `method="isotonic"` since isotonic regression introduces ties in the predicted
    probabilities. This can be seen as within the uncertainty of the model
    predictions. If you strictly want to keep the ranking and thus the AUC scores,
    use `method="sigmoid"`, which is a strictly monotonic transformation and thus
    keeps the ranking.

Multiclass support
^^^^^^^^^^^^^^^^^^

Both isotonic and sigmoid regressors only support 1-dimensional data (e.g., binary
classification output) but are extended for multiclass classification if the
`base_estimator` supports multiclass predictions. For multiclass predictions,
:class:`CalibratedClassifierCV` calibrates for each class separately in a
:ref:`ovr_classification` fashion [5]_. When predicting probabilities, the calibrated
probabilities for each class are predicted separately. As those probabilities do not
necessarily sum to one, a post-processing step is performed to normalize them.

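A minimal multiclass sketch (the base estimator and synthetic data are only
illustrative)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(
        n_samples=1500, n_classes=3, n_informative=6, random_state=0
    )
    calibrated = CalibratedClassifierCV(
        RandomForestClassifier(random_state=0), method="isotonic", cv=3
    )
    calibrated.fit(X, y)
    proba = calibrated.predict_proba(X[:3])
    proba.sum(axis=1)  # rows sum to 1 after the normalization step
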
.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_calibration_plot_calibration_curve.py`
* :ref:`sphx_glr_auto_examples_calibration_plot_calibration_multiclass.py`
* :ref:`sphx_glr_auto_examples_calibration_plot_calibration.py`
* :ref:`sphx_glr_auto_examples_calibration_plot_compare_calibration.py`

.. rubric:: References

.. [1] Allan H. Murphy (1973).
       :doi:`"A New Vector Partition of the Probability Score"
       <10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2>`
       Journal of Applied Meteorology and Climatology

.. [2] `On the combination of forecast probabilities for
       consecutive precipitation periods.
       <https://journals.ametsoc.org/waf/article/5/4/640/40179>`_
       Wea. Forecasting, 5, 640–650, Wilks, D. S., 1990a

.. [3] `Predicting Good Probabilities with Supervised Learning
       <https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf>`_,
       A. Niculescu-Mizil & R. Caruana, ICML 2005

.. [4] `Probabilistic Outputs for Support Vector Machines and Comparisons
       to Regularized Likelihood Methods.
       <https://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf>`_
       J. Platt, (1999)

.. [5] `Transforming Classifier Scores into Accurate Multiclass
       Probability Estimates.
       <https://dl.acm.org/doi/pdf/10.1145/775047.775151>`_
       B. Zadrozny & C. Elkan, (KDD 2002)

.. [6] `Predicting accurate probabilities with a ranking loss.
       <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180410/>`_
       Menon AK, Jiang XJ, Vembu S, Elkan C, Ohno-Machado L.
       Proc Int Conf Mach Learn. 2012;2012:703-710

.. [7] `Beyond sigmoids: How to obtain well-calibrated probabilities from
       binary classifiers with beta calibration
       <https://projecteuclid.org/euclid.ejs/1513306867>`_
       Kull, M., Silva Filho, T. M., & Flach, P. (2017).

.. [8] Mario V. Wüthrich, Michael Merz (2023).
       :doi:`"Statistical Foundations of Actuarial Learning and its Applications"
       <10.1007/978-3-031-12409-9>`
       Springer Actuarial