Calibration
Post-hoc probability calibration and diagnostic tools for evaluating calibration quality across outer folds.
PostHocCalibrator
- class nestkit.calibration.PostHocCalibrator(method='sigmoid')
Bases: object
Unified interface for post-hoc probability calibration.
Binary only. For multiclass problems, use one calibrator per class via one-vs-rest (OVR) decomposition at the NestedCVClassifier level.
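To illustrate what the "sigmoid" method described under Parameters does conceptually, the following is a minimal NumPy sketch of logistic recalibration on probability logits. It is not the PostHocCalibrator implementation: the function names are hypothetical, the fit uses plain Newton iterations without regularization, and the data are synthetic.

```python
import numpy as np

def _logit(p, eps=1e-12):
    """Map probabilities to log-odds, clipping to avoid infinities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def fit_sigmoid_on_logits(y_true, y_proba, n_iter=25):
    """Fit (a, b) in sigmoid(a * logit(p) + b) by Newton's method (hypothetical helper)."""
    y = np.asarray(y_true, dtype=float)
    X = np.column_stack([_logit(y_proba), np.ones(len(y))])
    w = np.zeros(2)
    for _ in range(n_iter):
        q = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (q - y)  # gradient of the log loss
        # Hessian of the log loss, with a tiny ridge for numerical stability
        hess = (X * (q * (1.0 - q))[:, None]).T @ X + 1e-8 * np.eye(2)
        w = w - np.linalg.solve(hess, grad)
    return w

def apply_sigmoid_on_logits(y_proba, w):
    """Apply the fitted slope and intercept to new probabilities."""
    a, b = w
    return 1.0 / (1.0 + np.exp(-(a * _logit(y_proba) + b)))

# Synthetic, overconfident scores: the raw logits are 3x too large.
rng = np.random.default_rng(0)
true_p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.random(5000) < true_p).astype(int)
raw_p = 1.0 / (1.0 + np.exp(-3.0 * _logit(true_p)))

a, b = fit_sigmoid_on_logits(y, raw_p)
cal_p = apply_sigmoid_on_logits(raw_p, (a, b))
print(f"slope={a:.2f}, intercept={b:.2f}")
```

Because the synthetic logits were inflated by a factor of 3, the fitted slope should land near 1/3 and the intercept near 0. The input here is a probability, not a raw decision-function score, which is the distinction from classical Platt scaling drawn in the parameter description.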
- Parameters:
method ({"sigmoid", "isotonic", "beta", "venn_abers"}) – Calibration method.
"sigmoid" applies logistic recalibration on probability logits (sometimes called temperature scaling on probabilities). This differs from classical Platt scaling, which operates on raw decision function scores rather than probabilities.
CalibrationDiagnostics
- class nestkit.calibration.CalibrationDiagnostics
Bases: object
Assess calibration quality before and after post-hoc calibration.
All methods are static and operate on arrays of true labels and predicted probabilities. No fitting or state is required.
See also
nestkit.calibration.calibrators.PostHocCalibrator – Apply post-hoc calibration to predicted probabilities.
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1, 1])
>>> y_proba = np.array([0.1, 0.3, 0.6, 0.8, 0.9])
>>> CalibrationDiagnostics.brier_score(y_true, y_proba)
0.062...
- static expected_calibration_error(y_true, y_proba, n_bins=10, strategy='quantile')
Compute the Expected Calibration Error (ECE).
ECE is the weighted average of the absolute difference between observed accuracy and mean predicted confidence within each probability bin.
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities. If 2-D, the positive-class column is extracted.
n_bins (int, default 10) – Number of bins.
strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy.
"quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].
- Returns:
Expected Calibration Error in [0, 1].
- Return type:
Notes
ECE is defined as:
\[\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \left| \text{acc}(b) - \text{conf}(b) \right|\]
where \(B\) is the number of bins, \(n_b\) the number of samples in bin \(b\), \(N\) the total number of samples, \(\text{acc}(b)\) the observed accuracy in bin \(b\), and \(\text{conf}(b)\) the mean predicted probability in bin \(b\).
References
[1] Naeini, M.P., Cooper, G.F., and Hauskrecht, M. (2015). “Obtaining Well Calibrated Probabilities Using Bayesian Binning into Quantiles.” AAAI.
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> y_proba = np.array([0.2, 0.3, 0.7, 0.8])
>>> CalibrationDiagnostics.expected_calibration_error(
...     y_true, y_proba, n_bins=5
... )
0.0...
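The ECE formula in the Notes above can be traced by hand with a few lines of NumPy. The sketch below is a standalone illustration, not the nestkit implementation: `bin_edges` and `ece_sketch` are hypothetical names, and the treatment of edge values and empty bins is an assumption that may differ from the library.

```python
import numpy as np

def bin_edges(p, n_bins, strategy):
    """Bin edges for the two strategies named in the parameter list (hypothetical helper)."""
    if strategy == "uniform":
        return np.linspace(0.0, 1.0, n_bins + 1)
    if strategy == "quantile":
        return np.quantile(p, np.linspace(0.0, 1.0, n_bins + 1))
    raise ValueError(f"unknown strategy: {strategy!r}")

def ece_sketch(y_true, y_proba, n_bins=10, strategy="quantile"):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    if p.ndim == 2:  # (n_samples, 2): keep the positive-class column
        p = p[:, 1]
    edges = bin_edges(p, n_bins, strategy)
    # Digitize against interior edges only, so the largest value lands in the last bin.
    idx = np.digitize(p, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():  # empty bins contribute nothing
            # mask.mean() is n_b / N; the gap is |acc(b) - conf(b)|
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

y = np.array([0, 0, 1, 1])
p = np.array([0.25, 0.35, 0.65, 0.75])
ece_uniform = ece_sketch(y, p, n_bins=2, strategy="uniform")
ece_quantile = ece_sketch(y, p, n_bins=2, strategy="quantile")
```

With two bins, the low bin has mean prediction 0.3 and no positives, and the high bin has mean prediction 0.7 and only positives, so each bin contributes a gap of 0.3 with weight 1/2. Both strategies split this data at 0.5, giving an ECE of 0.3 in either case.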
- static maximum_calibration_error(y_true, y_proba, n_bins=10, strategy='quantile')
Compute the Maximum Calibration Error (MCE).
MCE is the worst-case (maximum) absolute difference between observed accuracy and mean predicted confidence across all bins.
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.
n_bins (int, default 10) – Number of bins.
strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy.
"quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].
- Returns:
Maximum Calibration Error in [0, 1].
- Return type:
Notes
MCE is defined as:
\[\text{MCE} = \max_{b \in \{1, \ldots, B\}} \left| \text{acc}(b) - \text{conf}(b) \right|\]
This is useful for identifying the single worst-calibrated probability region.
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> y_proba = np.array([0.2, 0.3, 0.7, 0.8])
>>> CalibrationDiagnostics.maximum_calibration_error(
...     y_true, y_proba, n_bins=5
... )
0.0...
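The difference between the maximum and the weighted average is easiest to see on data where one small bin is badly calibrated. The following is a rough sketch with a hypothetical `mce_sketch` function and uniform bins for simplicity (the library default is "quantile"); it is not the nestkit implementation.

```python
import numpy as np

def mce_sketch(y_true, y_proba, n_bins=10):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)  # uniform bins (assumption)
    idx = np.digitize(p, edges[1:-1])
    # One |acc(b) - conf(b)| gap per non-empty bin
    gaps = [
        abs(y[idx == b].mean() - p[idx == b].mean())
        for b in range(n_bins)
        if np.any(idx == b)
    ]
    return float(max(gaps, default=0.0))

# Six well-calibrated low predictions and two poorly calibrated high ones.
y = np.array([0, 0, 0, 0, 0, 0, 1, 0])
p = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9])
mce = mce_sketch(y, p, n_bins=2)
```

Here the low bin has a gap of 0.1 and the high bin a gap of 0.4, so the MCE is 0.4. A sample-weighted average of the same gaps would be 0.175, because the poorly calibrated bin holds only two of the eight samples.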
- static brier_score(y_true, y_proba)
Compute the Brier score (mean squared error of probabilities).
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.
- Returns:
Brier score in [0, 1]. Lower is better; 0 indicates perfect probabilistic predictions.
- Return type:
Notes
The Brier score is defined as:
\[\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2\]
It can be decomposed into reliability, resolution, and uncertainty (see brier_decomposition()).
References
[1] Brier, G.W. (1950). “Verification of Forecasts Expressed in Terms of Probability.” Monthly Weather Review, 78(1), 1–3.
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> CalibrationDiagnostics.brier_score(
...     np.array([0, 1]), np.array([0.0, 1.0])
... )
0.0
See also
brier_decomposition – Reliability–resolution–uncertainty decomposition of the Brier score.
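As a quick check of the Brier score definition, the formula is a one-line mean of squared errors. The sketch below uses a hypothetical `brier_sketch` function and assumes, following the parameter description, that a 2-D input carries the positive class in its second column.

```python
import numpy as np

def brier_sketch(y_true, y_proba):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    if p.ndim == 2:  # (n_samples, 2): keep the positive-class column
        p = p[:, 1]
    return float(np.mean((p - y) ** 2))

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.8, 0.9])

bs_1d = brier_sketch(y, p)                            # 1-D probabilities
bs_2d = brier_sketch(y, np.column_stack([1 - p, p]))  # 2-D, same predictions
bs_coin = brier_sketch(y, np.full(4, 0.5))            # uninformative forecast
```

The squared errors are 0.01, 0.04, 0.04, and 0.01, so both calls give 0.025. A constant forecast of 0.5 scores 0.25 on any binary labels, which is a convenient reference point when reading Brier scores.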
- static brier_decomposition(y_true, y_proba, n_bins=10, strategy='quantile')
Decompose the Brier score into reliability, resolution, and uncertainty.
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.
n_bins (int, default 10) – Number of bins.
strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy.
"quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].
- Returns:
Dictionary with keys:
"reliability" – Calibration component (lower is better). Measures how close the predicted probabilities are to the observed frequencies within each bin.
"resolution" – Resolution component (higher is better). Measures how much the per-bin frequencies deviate from the overall base rate.
"uncertainty" – Uncertainty component. Equal to p_bar * (1 - p_bar), where p_bar is the base rate. This is independent of the model.
- Return type:
Notes
The decomposition satisfies:
\[\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}\]
References
[1] Murphy, A.H. (1973). “A New Vector Partition of the Probability Score.” Journal of Applied Meteorology, 12(4), 595–600.
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> decomp = CalibrationDiagnostics.brier_decomposition(
...     np.array([0, 0, 1, 1]),
...     np.array([0.1, 0.2, 0.8, 0.9]),
... )
>>> decomp.keys()
dict_keys(['reliability', 'resolution', 'uncertainty'])
See also
brier_score – The scalar Brier score.
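The decomposition identity can be checked numerically. The sketch below uses the standard binned Murphy terms with uniform bins; `brier_decomposition_sketch` is a hypothetical name, and the exact binning used by nestkit may differ. Note that the identity holds exactly only when all forecasts within a bin share one value, as in this example. With forecasts that vary inside a bin, the recombined value is generally only an approximation of the Brier score.

```python
import numpy as np

def brier_decomposition_sketch(y_true, y_proba, n_bins=10):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)  # uniform bins (assumption)
    idx = np.digitize(p, edges[1:-1])
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        weight = mask.mean()     # n_b / N
        conf = p[mask].mean()    # mean predicted probability in the bin
        freq = y[mask].mean()    # observed fraction of positives in the bin
        reliability += weight * (conf - freq) ** 2
        resolution += weight * (freq - base_rate) ** 2
    return {
        "reliability": float(reliability),
        "resolution": float(resolution),
        "uncertainty": float(base_rate * (1.0 - base_rate)),
    }

# Forecasts take only two distinct values, so each bin is homogeneous.
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
p = np.array([0.3, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.7])

parts = brier_decomposition_sketch(y, p)
brier = float(np.mean((p - y) ** 2))
recombined = parts["reliability"] - parts["resolution"] + parts["uncertainty"]
```

For this data the terms are reliability 0.0025, resolution 0.0625, and uncertainty 0.25, which recombine to the Brier score of 0.19.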
- static reliability_diagram_data(y_true, y_proba, n_bins=10, strategy='quantile')
Compute binned data for reliability (calibration) diagrams.
Returns a DataFrame with one row per bin, suitable for plotting a reliability diagram (mean predicted probability vs. observed fraction of positives).
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
y_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Predicted probabilities.
n_bins (int, default 10) – Number of bins.
strategy ({"quantile", "uniform"}, default "quantile") – Binning strategy.
"quantile" produces bins with approximately equal sample counts; "uniform" produces equal-width bins over [0, 1].
- Returns:
DataFrame with columns:
bin_lower – Lower edge of the bin.
bin_upper – Upper edge of the bin.
bin_mid – Midpoint of the bin.
mean_predicted – Mean predicted probability in the bin (NaN if the bin is empty).
fraction_positive – Observed fraction of positive samples in the bin (NaN if empty).
count – Number of samples in the bin.
- Return type:
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> df = CalibrationDiagnostics.reliability_diagram_data(
...     np.array([0, 0, 1, 1]),
...     np.array([0.1, 0.2, 0.8, 0.9]),
... )
>>> df.columns.tolist()
['bin_lower', 'bin_upper', 'bin_mid', 'mean_predicted', 'fraction_positive', 'count']
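To make the meaning of each column concrete, here is a NumPy-only sketch that builds the same six fields as a plain dict of arrays rather than a DataFrame. The function name `reliability_table_sketch` is hypothetical, bins are uniform for simplicity, and the handling of values on a bin edge is an assumption.

```python
import numpy as np

def reliability_table_sketch(y_true, y_proba, n_bins=10):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)  # uniform bins (assumption)
    idx = np.digitize(p, edges[1:-1])
    count = np.bincount(idx, minlength=n_bins)
    # Empty bins keep NaN, matching the column descriptions above.
    mean_predicted = np.full(n_bins, np.nan)
    fraction_positive = np.full(n_bins, np.nan)
    for b in np.flatnonzero(count):
        mean_predicted[b] = p[idx == b].mean()
        fraction_positive[b] = y[idx == b].mean()
    return {
        "bin_lower": edges[:-1],
        "bin_upper": edges[1:],
        "bin_mid": (edges[:-1] + edges[1:]) / 2.0,
        "mean_predicted": mean_predicted,
        "fraction_positive": fraction_positive,
        "count": count,
    }

table = reliability_table_sketch(
    np.array([0, 0, 1, 1]),
    np.array([0.1, 0.3, 0.7, 0.9]),
    n_bins=5,
)
```

Plotting `mean_predicted` against `fraction_positive` for the non-empty bins gives the reliability curve; points on the diagonal indicate well-calibrated bins. In this example the middle bin is empty, so its two statistics are NaN and its count is 0.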
- static compare_before_after(y_true, raw_proba, cal_proba)
Side-by-side comparison of calibration metrics before and after calibration.
Computes ECE, MCE, and Brier score for both the raw and calibrated predicted probabilities and returns them in a two-row DataFrame.
- Parameters:
y_true (numpy.ndarray of shape (n_samples,)) – True binary labels (0 or 1).
raw_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Raw (uncalibrated) predicted probabilities.
cal_proba (numpy.ndarray of shape (n_samples,) or (n_samples, 2)) – Calibrated predicted probabilities.
- Returns:
DataFrame with columns stage ("raw" or "calibrated"), ece, mce, brier.
- Return type:
Examples
>>> import numpy as np
>>> from nestkit.calibration.diagnostics import CalibrationDiagnostics
>>> y_true = np.array([0, 0, 1, 1])
>>> raw = np.array([0.3, 0.4, 0.6, 0.7])
>>> cal = np.array([0.2, 0.3, 0.7, 0.8])
>>> df = CalibrationDiagnostics.compare_before_after(
...     y_true, raw, cal
... )
>>> df["stage"].tolist()
['raw', 'calibrated']
See also
expected_calibration_error – ECE computation.
maximum_calibration_error – MCE computation.
brier_score – Brier score computation.
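The two-row comparison can be approximated without the library by computing the three metrics once per stage. The sketch below is illustrative only: `metrics_sketch` and `compare_sketch` are hypothetical names, rows are plain dicts rather than a DataFrame, and bins are uniform with 10 bins, which may not match the defaults used internally by compare_before_after.

```python
import numpy as np

def metrics_sketch(y_true, y_proba, n_bins=10):
    y = np.asarray(y_true, dtype=float)
    p = np.asarray(y_proba, dtype=float)
    idx = np.digitize(p, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    gaps, weights = [], []
    for b in np.unique(idx):  # non-empty bins only
        mask = idx == b
        gaps.append(abs(y[mask].mean() - p[mask].mean()))
        weights.append(mask.mean())
    gaps, weights = np.array(gaps), np.array(weights)
    return {
        "ece": float(gaps @ weights),   # weighted average gap
        "mce": float(gaps.max()),       # worst-case gap
        "brier": float(np.mean((p - y) ** 2)),
    }

def compare_sketch(y_true, raw_proba, cal_proba):
    return [
        {"stage": "raw", **metrics_sketch(y_true, raw_proba)},
        {"stage": "calibrated", **metrics_sketch(y_true, cal_proba)},
    ]

y_true = np.array([0, 0, 1, 1])
raw = np.array([0.3, 0.4, 0.6, 0.7])
cal = np.array([0.2, 0.3, 0.7, 0.8])
rows = compare_sketch(y_true, raw, cal)
```

On this toy data the calibrated probabilities sit closer to the labels, so all three metrics drop from the first row to the second. For example, the Brier score falls from 0.125 to 0.065.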