Calibration

Post-hoc probability calibration and diagnostic tools for evaluating calibration quality across outer folds.

PostHocCalibrator

class nestkit.calibration.PostHocCalibrator(method='sigmoid')

Bases: object

Unified interface for post-hoc probability calibration.

Binary only. For multiclass problems, fit one calibrator per class using a one-vs-rest (OvR) decomposition at the NestedCVClassifier level.

Parameters:

method ({"sigmoid", "isotonic", "beta", "venn_abers"}) – Calibration method. "sigmoid" applies logistic recalibration on probability logits (sometimes called temperature scaling on probabilities). This differs from classical Platt scaling, which operates on raw decision function scores rather than probabilities.
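The "sigmoid" description above can be sketched in plain Python. This is a minimal illustration, assuming the mapping has the form sigmoid(a * logit(p) + b) with a slope and an intercept fitted by gradient descent on log loss. The helper names, the optimizer, and the two-parameter form are assumptions made for illustration, not nestkit's implementation.

```python
import math


def logit(p, eps=1e-12):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_sigmoid_on_logits(y_proba, y_true, lr=0.1, n_iter=2000):
    """Fit p_cal = sigmoid(a * logit(p) + b) by gradient descent on log loss.

    Hypothetical helper for illustration; not part of nestkit.
    """
    z = [logit(p) for p in y_proba]
    a, b = 1.0, 0.0  # start from the identity mapping
    n = len(z)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for zi, yi in zip(z, y_true):
            err = sigmoid(a * zi + b) - yi  # gradient of log loss w.r.t. a*z + b
            grad_a += err * zi
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_sigmoid_on_logits(y_proba, a, b):
    """Map uncalibrated probabilities through the fitted logistic curve."""
    return [sigmoid(a * logit(p) + b) for p in y_proba]


# Overconfident scores: 0.05 and 0.95 are each right only 3 times out of 4.
raw = [0.05] * 4 + [0.95] * 4
labels = [0, 0, 0, 1, 1, 1, 1, 0]
a, b = fit_sigmoid_on_logits(raw, labels)
# The fitted map pulls 0.05 and 0.95 toward the observed frequencies 0.25 and 0.75.
print(apply_sigmoid_on_logits([0.05, 0.95], a, b))
```

Because the inputs are probabilities, they are first converted to log-odds. Classical Platt scaling would instead fit the same curve directly on decision-function scores.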

fit(y_proba, y_true)

Fit the calibration mapping from uncalibrated to calibrated probabilities.

Parameters:
  • y_proba (array of shape (n_samples,) or (n_samples, 2)) – Uncalibrated predicted probabilities.

  • y_true (array of shape (n_samples,)) – True binary labels.

Return type:

PostHocCalibrator

predict_proba(y_proba)

Apply calibration mapping.

Parameters:

y_proba (array of shape (n_samples,) or (n_samples, 2)) – Uncalibrated probabilities.

Returns:

calibrated_proba

Return type:

array of shape (n_samples, 2)

CalibrationDiagnostics

class nestkit.calibration.CalibrationDiagnostics

Bases: object

Assess calibration quality before and after post-hoc calibration.

All methods are static and operate on arrays of true labels and predicted probabilities. No fitting or state is required.

See also

nestkit.calibration.calibrators.PostHocCalibrator

Apply post-hoc calibration to predicted probabilities.

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1, 1])
>>> y_proba = np.array([0.1, 0.3, 0.6, 0.8, 0.9])
>>> CalibrationDiagnostics.brier_score(y_true, y_proba)
0.062...

static expected_calibration_error(y_true, y_proba, n_bins=10, strategy='quantile')

Compute the Expected Calibration Error (ECE).

ECE is the weighted average of the absolute difference between observed accuracy and mean predicted confidence within each probability bin.

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities. If 2-D, the positive-class column is extracted.

  • n_bins (int, default 10) – Number of bins.

  • strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy. "quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].

Returns:

Expected Calibration Error in [0, 1].

Return type:

float

Notes

ECE is defined as:

\[\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|\]

where B is the number of bins, \(n_b\) the number of samples in bin b, N the total number of samples, \(\text{acc}(b)\) the observed accuracy in bin b, and \(\text{conf}(b)\) the mean predicted probability in bin b.
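The formula above can be sketched in plain Python. This is a minimal illustration that assumes equal-width bins over [0, 1] (the "uniform" strategy) and takes acc(b) to be the observed fraction of positives in bin b. The function name and the handling of empty bins are assumptions, not nestkit's implementation.

```python
def expected_calibration_error_sketch(y_true, y_proba, n_bins=10):
    """ECE with equal-width bins over [0, 1]; hypothetical helper for illustration."""
    n = len(y_true)
    # Per bin: [sample count, sum of labels, sum of predicted probabilities].
    stats = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for y, p in zip(y_true, y_proba):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        stats[b][0] += 1
        stats[b][1] += y
        stats[b][2] += p

    ece = 0.0
    for n_b, sum_y, sum_p in stats:
        if n_b == 0:
            continue  # empty bins carry zero weight
        acc = sum_y / n_b   # observed fraction of positives in the bin
        conf = sum_p / n_b  # mean predicted probability in the bin
        ece += (n_b / n) * abs(acc - conf)
    return ece


# Two bins: [0, 0.5) holds 0.2 and 0.3, [0.5, 1] holds 0.7 and 0.8.
# Each bin has |acc - conf| = 0.25 and weight 1/2, so ECE = 0.25.
print(expected_calibration_error_sketch([0, 0, 1, 1], [0.2, 0.3, 0.7, 0.8], n_bins=2))
```

The result depends on the binning: the same predictions can yield a different ECE under a different n_bins or strategy, so values are only comparable when both are held fixed.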

References

[1]

Naeini, M.P., Cooper, G.F., and Hauskrecht, M. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles.” AAAI.

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> y_proba = np.array([0.2, 0.3, 0.7, 0.8])
>>> CalibrationDiagnostics.expected_calibration_error(
...     y_true, y_proba, n_bins=5
... )
0.0...

static maximum_calibration_error(y_true, y_proba, n_bins=10, strategy='quantile')

Compute the Maximum Calibration Error (MCE).

MCE is the worst-case (maximum) absolute difference between observed accuracy and mean predicted confidence across all bins.

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.

  • n_bins (int, default 10) – Number of bins.

  • strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy. "quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].

Returns:

Maximum Calibration Error in [0, 1].

Return type:

float

Notes

MCE is defined as:

\[\text{MCE} = \max_{b \in \{1, \ldots, B\}} \left| \text{acc}(b) - \text{conf}(b) \right|\]

This is useful for identifying the single worst-calibrated probability region.
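As a rough sketch of the definition above, the following plain-Python function forms equal-count bins by sorting on the predicted probability and returns the largest per-bin gap. Splitting the sorted samples into contiguous chunks is an assumed stand-in for the "quantile" strategy; the function name and the tie handling are hypothetical and may differ from nestkit's bin edges.

```python
def maximum_calibration_error_sketch(y_true, y_proba, n_bins=10):
    """MCE over equal-count bins; hypothetical helper for illustration."""
    pairs = sorted(zip(y_proba, y_true))  # order samples by predicted probability
    n = len(pairs)
    n_bins = min(n_bins, n)  # never more bins than samples, so no bin is empty
    worst = 0.0
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = pairs[lo:hi]
        conf = sum(p for p, _ in chunk) / len(chunk)  # mean predicted probability
        acc = sum(y for _, y in chunk) / len(chunk)   # observed fraction of positives
        worst = max(worst, abs(acc - conf))
    return worst


y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.2, 0.6, 0.7, 0.9, 0.8]
# Sorted into three bins of two samples, the per-bin gaps are
# 0.15, 0.35, and 0.35, so the maximum is 0.35 (up to float rounding).
print(maximum_calibration_error_sketch(y_true, y_proba, n_bins=3))
```

Unlike ECE, the maximum is not weighted by bin size, so a single sparsely populated bin can dominate the result.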

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> y_proba = np.array([0.2, 0.3, 0.7, 0.8])
>>> CalibrationDiagnostics.maximum_calibration_error(
...     y_true, y_proba, n_bins=5
... )
0.0...

static brier_score(y_true, y_proba)

Compute the Brier score (mean squared error of probabilities).

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.

Returns:

Brier score in [0, 1]. Lower is better; 0 indicates perfect probabilistic predictions.

Return type:

float

Notes

The Brier score is defined as:

\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2\]

It can be decomposed into reliability, resolution, and uncertainty (see brier_decomposition()).
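The definition above is short enough to work through by hand. The sketch below is a plain-Python illustration with a hypothetical function name, not nestkit's implementation.

```python
def brier_score_sketch(y_true, y_proba):
    """Mean squared difference between predicted probability and outcome."""
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, y_proba)) / n


y_true = [0, 1, 1, 0]
y_proba = [0.2, 0.7, 0.9, 0.4]
# Squared errors: 0.04, 0.09, 0.01, 0.16; their mean is 0.30 / 4 = 0.075.
print(round(brier_score_sketch(y_true, y_proba), 3))  # 0.075

# A constant 0.5 forecast always scores 0.25, a useful reference point.
print(brier_score_sketch(y_true, [0.5] * 4))  # 0.25
```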

References

[1]

Brier, G.W. (1950). “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review, 78(1), 1–3.

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> CalibrationDiagnostics.brier_score(
...     np.array([0, 1]), np.array([0.0, 1.0])
... )
0.0

See also

brier_decomposition

Reliability–resolution–uncertainty decomposition of the Brier score.

static brier_decomposition(y_true, y_proba, n_bins=10, strategy='quantile')

Decompose the Brier score into reliability, resolution, and uncertainty.

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.

  • n_bins (int, default 10) – Number of bins.

  • strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy. "quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].

Returns:

Dictionary with keys:

  • "reliability" – Calibration component (lower is better). Measures how close the predicted probabilities are to the observed frequencies within each bin.

  • "resolution" – Resolution component (higher is better). Measures how much the per-bin frequencies deviate from the overall base rate.

  • "uncertainty" – Uncertainty component. Equal to p_bar * (1 - p_bar) where p_bar is the base rate. This is independent of the model.

Return type:

dict

Notes

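The decomposition satisfies the following identity, which is exact when all predictions within a bin are equal and approximate otherwise: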

\[\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}\]
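The identity can be checked numerically. The sketch below is a plain-Python illustration that groups samples by distinct forecast value rather than by bin, an assumption chosen so that the identity holds exactly. The function name is hypothetical; nestkit bins the predictions according to n_bins and strategy instead.

```python
from collections import defaultdict


def brier_decomposition_sketch(y_true, y_proba):
    """Murphy decomposition, grouping samples by distinct forecast value."""
    n = len(y_true)
    base_rate = sum(y_true) / n
    groups = defaultdict(list)
    for y, p in zip(y_true, y_proba):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        freq = sum(ys) / len(ys)  # observed frequency of positives for this forecast
        reliability += len(ys) * (p - freq) ** 2
        resolution += len(ys) * (freq - base_rate) ** 2
    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1.0 - base_rate),
    }


y_true = [0, 0, 1, 0, 1, 1, 1, 0]
y_proba = [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8]
parts = brier_decomposition_sketch(y_true, y_proba)
brier = sum((p - y) ** 2 for y, p in zip(y_true, y_proba)) / len(y_true)

# Reliability 0.0025, resolution 0.0625, uncertainty 0.25:
# 0.0025 - 0.0625 + 0.25 = 0.19, which matches the directly computed Brier score.
recombined = parts["reliability"] - parts["resolution"] + parts["uncertainty"]
print(round(recombined, 6), round(brier, 6))
```

When forecasts vary within a bin, the three binned terms generally no longer sum exactly to the Brier score, so a small residual is expected.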

References

[1]

Murphy, A.H. (1973). “A New Vector Partition of the Probability Score.” Journal of Applied Meteorology, 12(4), 595–600.

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> decomp = CalibrationDiagnostics.brier_decomposition(
...     np.array([0, 0, 1, 1]),
...     np.array([0.1, 0.2, 0.8, 0.9]),
... )
>>> decomp.keys()
dict_keys(['reliability', 'resolution', 'uncertainty'])

See also

brier_score

The scalar Brier score.

static reliability_diagram_data(y_true, y_proba, n_bins=10, strategy='quantile')

Compute binned data for reliability (calibration) diagrams.

Returns a DataFrame with one row per bin, suitable for plotting a reliability diagram (mean predicted probability vs. observed fraction of positives).

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.

  • n_bins (int, default 10) – Number of bins.

  • strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy. "quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].

Returns:

DataFrame with columns:

  • bin_lower – Lower edge of the bin.

  • bin_upper – Upper edge of the bin.

  • bin_mid – Midpoint of the bin.

  • mean_predicted – Mean predicted probability in the bin (NaN if the bin is empty).

  • fraction_positive – Observed fraction of positive samples in the bin (NaN if empty).

  • count – Number of samples in the bin.

Return type:

pandas.DataFrame
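The column layout described above can be sketched without pandas as a list of dictionaries, one per bin. This is a minimal illustration assuming equal-width bins (the "uniform" strategy); the function name is hypothetical, and nestkit's bin edges may differ, particularly under the default "quantile" strategy.

```python
import math


def reliability_rows_sketch(y_true, y_proba, n_bins=10):
    """One dict per equal-width bin, mirroring the documented column names."""
    counts = [0] * n_bins
    sum_p = [0.0] * n_bins
    sum_y = [0.0] * n_bins
    for y, p in zip(y_true, y_proba):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        counts[b] += 1
        sum_p[b] += p
        sum_y[b] += y

    rows = []
    for b in range(n_bins):
        lower, upper = b / n_bins, (b + 1) / n_bins
        empty = counts[b] == 0
        rows.append({
            "bin_lower": lower,
            "bin_upper": upper,
            "bin_mid": (lower + upper) / 2,
            # Empty bins report NaN rather than a misleading 0.0.
            "mean_predicted": math.nan if empty else sum_p[b] / counts[b],
            "fraction_positive": math.nan if empty else sum_y[b] / counts[b],
            "count": counts[b],
        })
    return rows


for row in reliability_rows_sketch([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9], n_bins=4):
    print(row)
```

Plotting mean_predicted against fraction_positive for the non-empty rows gives the reliability curve; points on the diagonal indicate well-calibrated bins.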

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> df = CalibrationDiagnostics.reliability_diagram_data(
...     np.array([0, 0, 1, 1]),
...     np.array([0.1, 0.2, 0.8, 0.9]),
... )
>>> df.columns.tolist()
['bin_lower', 'bin_upper', 'bin_mid', 'mean_predicted',
 'fraction_positive', 'count']

static compare_before_after(y_true, raw_proba, cal_proba)

Side-by-side comparison of calibration metrics before and after calibration.

Computes ECE, MCE, and Brier score for both the raw and calibrated predicted probabilities and returns them in a two-row DataFrame.

Parameters:
  • y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).

  • raw_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Raw (uncalibrated) predicted probabilities.

  • cal_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Calibrated predicted probabilities.

Returns:

DataFrame with columns stage ("raw" or "calibrated"), ece, mce, brier.

Return type:

pandas.DataFrame

Examples

>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> raw = np.array([0.3, 0.4, 0.6, 0.7])
>>> cal = np.array([0.2, 0.3, 0.7, 0.8])
>>> df = CalibrationDiagnostics.compare_before_after(
...     y_true, raw, cal
... )
>>> df["stage"].tolist()
['raw', 'calibrated']

See also

expected_calibration_error

ECE computation.

maximum_calibration_error

MCE computation.

brier_score

Brier score computation.