Architecture#

When the same cross-validation folds are used both to tune hyperparameters and to estimate performance, the reported scores are optimistically biased. Nested cross-validation separates the two concerns: an inner loop tunes hyperparameters, and an outer loop evaluates the tuned model on data that played no part in tuning.

The 4-Phase Pipeline#

For every outer fold, nestkit executes:

Phase	Name	Description
1	Inner CV search	`GridSearchCV` or `RandomizedSearchCV` on the outer training fold to select the best hyperparameters.
2	Post-hoc calibration	(Classification, opt-in.) Fit a calibrator on inner out-of-fold predictions so that calibrated probabilities are leakage-free.
3	Threshold optimization	(Classification, opt-in.) Find the decision threshold that maximizes a chosen criterion on inner out-of-fold predictions.
3b	Conformal prediction	(Opt-in.) Compute per-class conformal quantile thresholds (classification) or per-bin residual quantiles (regression) from inner out-of-fold predictions for coverage-guaranteed prediction sets or intervals.
4	Refit & evaluate	Refit with the best hyperparameters on the full outer training fold, then score on the held-out outer test fold.

The outer test fold is never used for tuning, calibration, thresholding, or conformal quantile estimation; Phases 1 through 3b see only the outer training fold.
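The core of the loop (Phases 1 and 4) can be sketched with plain scikit-learn. This is an illustrative sketch, not nestkit's implementation: the synthetic dataset, the `LogisticRegression` estimator, and the `C` grid are assumptions chosen only to make the example runnable, and the opt-in Phases 2, 3, and 3b are omitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

outer_scores = []
best_params_per_fold = []
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Phase 1: inner CV search, restricted to the outer training fold.
    # refit=True then refits the best configuration on the full outer
    # training fold, which covers the "refit" half of Phase 4.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner_cv,
        refit=True,
    )
    search.fit(X_train, y_train)

    # Phase 4 (evaluate): score once on the held-out outer test fold.
    proba = search.predict_proba(X_test)[:, 1]
    outer_scores.append(roc_auc_score(y_test, proba))
    best_params_per_fold.append(search.best_params_)

print(outer_scores)
print(best_params_per_fold)
```

Each outer fold yields one score and one selected configuration, so the spread of `outer_scores` and the variation in `best_params_per_fold` can both be inspected afterwards.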

Probability Calibration#

Many classifiers produce poorly calibrated probabilities: the predicted confidence does not match the observed frequency of the positive class. nestkit integrates post-hoc calibration directly into Phase 2, fitting the calibrator on inner out-of-fold predictions to avoid leakage.

Supported methods: Platt scaling ("sigmoid"), isotonic regression ("isotonic"), beta calibration ("beta"), and Venn-ABERS ("venn_abers").

ncv = NestedCVClassifier(
    estimator=GradientBoostingClassifier(),
    param_grid={...},
    calibration_method="isotonic",
    ...
)
ncv.fit(X, y)
print(ncv.results_.calibration_summary_)
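The idea of fitting a calibrator on out-of-fold predictions can be sketched with scikit-learn alone. The snippet below is a hedged illustration rather than nestkit's internal code: the synthetic data stands in for one outer training fold, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Stand-in for a single outer training fold (synthetic, for illustration).
X_train, y_train = make_classification(n_samples=400, n_features=10, random_state=0)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold probabilities: every row is predicted by a model that
# was trained without that row.
oof_proba = cross_val_predict(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    X_train,
    y_train,
    cv=inner_cv,
    method="predict_proba",
)[:, 1]

# Fit the calibrator on (OOF probability, label) pairs only.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oof_proba, y_train)

# The fitted map is monotone; it would later be applied to the
# probabilities of the model refit on the full outer training fold.
grid = np.linspace(0.0, 1.0, 11)
calibrated = calibrator.predict(grid)
print(np.round(calibrated, 3))
```

Because the calibrator never sees predictions made on a model's own training rows, its mapping reflects how the classifier behaves on unseen data.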

Threshold Optimization#

The default 0.5 threshold is rarely optimal for imbalanced classes or asymmetric costs. nestkit searches for a better threshold on inner out-of-fold predictions (Phase 3), keeping the outer test fold untouched.

Five built-in criteria: Youden’s J, F-beta, cost-sensitive, balanced accuracy, and precision-at-recall. Two strategies: "pooled" (single threshold) and "fold_specific" (averaged per-fold thresholds).

ncv = NestedCVClassifier(
    estimator=GradientBoostingClassifier(),
    param_grid={...},
    calibration_method="isotonic",
    threshold_strategy="pooled",
    threshold_criterion="youden",
    ...
)
ncv.fit(X, y)
print(ncv.results_.summary_optimized_)
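As a rough illustration of what a pooled search with Youden's J involves, the sketch below scans every observed score as a candidate threshold and keeps the one that maximizes J = TPR - FPR. The function name and the toy predictions are hypothetical, and nestkit's own search may differ in details such as the candidate grid or tie-breaking.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        # Predict positive when the score is at or above the candidate threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:  # strict inequality keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy out-of-fold predictions pooled across inner folds (hypothetical values).
y_oof = [0, 0, 0, 0, 0, 1, 1, 1, 1]
p_oof = [0.1, 0.2, 0.3, 0.4, 0.5, 0.35, 0.6, 0.7, 0.9]

threshold, j_stat = youden_threshold(y_oof, p_oof)
print(threshold, j_stat)
```

With these toy values the search selects a threshold of 0.6 (J = 0.75) instead of the default 0.5, which would admit one false positive.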

Conformal Prediction#

Point predictions and calibrated probabilities do not, on their own, quantify uncertainty with formal guarantees. CV+ Mondrian conformal prediction produces prediction sets (classification) or prediction intervals (regression) with finite-sample coverage guarantees, under the usual assumption that the data are exchangeable.

For classification, conformal prediction computes a per-class quantile threshold (q-hat) from the nonconformity scores of inner out-of-fold (OOF) predictions. At test time, a class is included in the prediction set if its nonconformity score does not exceed that class's threshold. Because each class has its own threshold, the procedure is Mondrian (class-conditional) and targets coverage within each class rather than only on average. With calibration enabled, conformal thresholds are computed from calibrated probabilities.
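The class-conditional thresholding can be sketched in a few lines of plain Python. This is a simplified, split-conformal-style illustration, not nestkit's CV+ procedure: the nonconformity score `1 - p(class)`, the helper names, and the toy probabilities are all assumptions.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def class_thresholds(oof_proba, y_true, alpha):
    """Per-class q-hat from nonconformity scores 1 - p(true class)."""
    by_class = {}
    for probs, label in zip(oof_proba, y_true):
        by_class.setdefault(label, []).append(1.0 - probs[label])
    return {c: conformal_quantile(s, alpha) for c, s in by_class.items()}


def prediction_set(probs, q_hat):
    """Include class c when its score 1 - p_c is at or below q_hat[c]."""
    return [c for c, q in sorted(q_hat.items()) if 1.0 - probs[c] <= q]


# Toy OOF probabilities [p(class 0), p(class 1)] with true labels (hypothetical).
oof_proba = [
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4], [0.3, 0.7],
    [0.05, 0.95], [0.15, 0.85], [0.25, 0.75], [0.35, 0.65], [0.45, 0.55],
]
y_oof = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

q_hat = class_thresholds(oof_proba, y_oof, alpha=0.2)
print(q_hat)
print(prediction_set([0.5, 0.5], q_hat))  # confident enough only for class 0
print(prediction_set([0.4, 0.6], q_hat))  # ambiguous: both classes retained
print(prediction_set([0.2, 0.8], q_hat))  # only class 1
```

An ambiguous input produces a set with more than one class, which is how the method expresses uncertainty instead of forcing a single label.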

For regression, Mondrian binning groups OOF predictions into equal-frequency bins and computes per-bin residual quantiles. This yields tighter intervals in easy-to-predict regions and wider intervals elsewhere.
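A minimal NumPy sketch of the binning idea follows. It assumes that bins are formed on the predicted value and that the nonconformity score is the absolute residual; the function names and the synthetic heteroscedastic data are hypothetical, and nestkit's CV+ variant may compute its quantiles differently.

```python
import numpy as np


def mondrian_half_widths(oof_pred, oof_true, n_bins, alpha):
    """Return (edges, half_widths): per-bin conformal quantiles of |residual|."""
    oof_pred = np.asarray(oof_pred, dtype=float)
    residuals = np.abs(np.asarray(oof_true, dtype=float) - oof_pred)
    # Equal-frequency bins: interior edges sit at quantiles of the predictions.
    edges = np.quantile(oof_pred, np.linspace(0, 1, n_bins + 1)[1:-1])
    bin_ids = np.searchsorted(edges, oof_pred, side="right")
    half_widths = np.empty(n_bins)
    for b in range(n_bins):
        r = np.sort(residuals[bin_ids == b])
        k = int(np.ceil((len(r) + 1) * (1 - alpha)))
        # Fall back to an infinite interval when the bin is too small.
        half_widths[b] = r[k - 1] if k <= len(r) else np.inf
    return edges, half_widths


def predict_interval(y_hat, edges, half_widths):
    """Look up the bin of a new prediction and widen it by that bin's quantile."""
    b = int(np.searchsorted(edges, y_hat, side="right"))
    return y_hat - half_widths[b], y_hat + half_widths[b]


# Synthetic OOF predictions whose noise grows with the predicted value.
rng = np.random.default_rng(0)
oof_pred = rng.uniform(0.0, 10.0, size=1000)
oof_true = oof_pred + rng.normal(0.0, 0.1 + 0.3 * oof_pred)

edges, half_widths = mondrian_half_widths(oof_pred, oof_true, n_bins=5, alpha=0.1)
print(np.round(half_widths, 2))
print(predict_interval(1.0, edges, half_widths))
print(predict_interval(9.5, edges, half_widths))
```

Because the noise here increases with the prediction, the low bins get narrow intervals and the high bins get wide ones, whereas a single global quantile would over-cover the former and under-cover the latter.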

# Classification: conformal prediction sets
ncv = NestedCVClassifier(
    estimator=RandomForestClassifier(),
    param_grid={...},
    calibration_method="isotonic",
    conformal_prediction=True,
    conformal_alpha=0.1,  # 90% target coverage
    ...
)
ncv.fit(X, y)
print(ncv.results_.conformal_report())

# Regression: Mondrian prediction intervals
ncv = NestedCVRegressor(
    estimator=Ridge(),
    param_grid={...},
    prediction_intervals=True,
    mondrian_bins=5,
    ...
)
ncv.fit(X, y)
print(ncv.results_.mondrian_coverage_per_bin_)

Model Comparison#

Comparing models with a naive paired t-test inflates the Type I error rate: training sets overlap across folds, so the per-fold scores are correlated rather than independent. nestkit’s NestedCVComparator provides tests and corrections that account for this.

Nadeau-Bengio corrected *t*-test: adjusts variance for fold overlap.
Bayesian correlated *t*-test with ROPE: posterior probabilities for “A better”, “equivalent”, or “B better”.
Holm-Bonferroni correction: controls family-wise error for multi-model comparisons.

from nestkit.comparison import NestedCVComparator

comparator = NestedCVComparator()
comparator.add("RF", rf_results)
comparator.add("GBM", gbm_results)

print(comparator.corrected_paired_ttest(metric="roc_auc", model_a="GBM", model_b="RF"))
print(comparator.bayesian_comparison(metric="roc_auc", model_a="GBM", model_b="RF", rope=0.01))

Diagnostics#

Hyperparameter stability: if the inner search selects different hyperparameters across outer folds, the model may be overly sensitive to its hyperparameters, or the dataset may be too small to distinguish between configurations. HyperparameterStability reports selection frequency, entropy, agreement rate, and pairwise Jaccard similarity.

Generalization gap: results.generalization_gap_ compares inner CV scores to outer test scores per fold; a large gap signals overfitting during tuning.

from nestkit.diagnostics import HyperparameterStability

stab = HyperparameterStability(ncv.results_.best_params_per_fold_)
print(stab.summary())
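For intuition, the four stability quantities can be approximated in plain Python. The definitions below are assumptions made for illustration (agreement rate as the share of folds choosing the most common configuration, Jaccard similarity over sets of parameter-value pairs) and may not match HyperparameterStability exactly; the per-fold selections are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def stability_metrics(best_params_per_fold):
    """Summarize how consistently the inner search picks the same configuration."""
    # Represent each configuration as a hashable tuple of (name, value) pairs.
    configs = [tuple(sorted(p.items())) for p in best_params_per_fold]
    n = len(configs)
    freqs = {cfg: c / n for cfg, c in Counter(configs).items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    jaccards = [
        len(set(a) & set(b)) / len(set(a) | set(b))
        for a, b in combinations(configs, 2)
    ]
    return {
        "selection_frequency": freqs,
        "entropy_bits": entropy,  # 0 when every fold agrees
        "agreement_rate": max(freqs.values()),
        "mean_pairwise_jaccard": sum(jaccards) / len(jaccards) if jaccards else 1.0,
    }


# Hypothetical selections from five outer folds.
selections = [
    {"max_depth": 3, "n_estimators": 100},
    {"max_depth": 3, "n_estimators": 100},
    {"max_depth": 3, "n_estimators": 100},
    {"max_depth": 5, "n_estimators": 100},
    {"max_depth": 5, "n_estimators": 200},
]

metrics = stability_metrics(selections)
print(metrics["agreement_rate"])
print(round(metrics["entropy_bits"], 3))
print(round(metrics["mean_pairwise_jaccard"], 3))
```

Low agreement combined with low Jaccard similarity suggests that the folds disagree on most parameters, not just on one.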

Feature Importance#

Importance scores computed from a single train/test split can change substantially from one split to another. FeatureImportanceAggregator extracts importances from each outer fold estimator (model-native or SHAP), aggregates them, and reports the Nogueira stability index to measure top-k feature consistency across folds.

from nestkit.importance import FeatureImportanceAggregator

agg = FeatureImportanceAggregator(ncv.results_, method="auto", feature_names=names)
agg.compute()
print(agg.summary_)
print(f"Stability (k=10): {agg.stability_index(top_k=10):.3f}")
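The Nogueira stability index itself is short enough to compute by hand. The sketch below follows the published formula (one minus the average per-feature selection variance, normalized by the variance expected under random selection of the same average size); the helper names and the toy importance values are hypothetical, and this is not nestkit's implementation.

```python
def top_k_set(importances, k):
    """Indices of the k largest importances in one fold."""
    order = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    return set(order[:k])


def nogueira_stability(selected_sets, n_features):
    """Nogueira stability of feature subsets; 1.0 means identical subsets."""
    m = len(selected_sets)
    # Selection frequency of each feature across folds.
    freq = [sum(f in s for s in selected_sets) / m for f in range(n_features)]
    # Unbiased per-feature variance of the selection indicator, averaged.
    mean_var = sum(m / (m - 1) * p * (1 - p) for p in freq) / n_features
    k_bar = sum(len(s) for s in selected_sets) / m
    # Undefined (zero denominator) when every fold selects none or all features.
    denom = (k_bar / n_features) * (1 - k_bar / n_features)
    return 1.0 - mean_var / denom


# Hypothetical importances for 4 features across 3 outer folds.
fold_importances = [
    [0.50, 0.30, 0.10, 0.10],
    [0.40, 0.35, 0.15, 0.10],
    [0.45, 0.20, 0.30, 0.05],
]
subsets = [top_k_set(row, k=2) for row in fold_importances]
stability = nogueira_stability(subsets, n_features=4)
print(subsets, stability)
```

Two folds agree on features {0, 1} and one picks {0, 2}, giving a stability of 1/3; identical subsets in every fold would give 1.0.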

Callbacks#

The callback system hooks into the pipeline lifecycle for progress tracking, checkpointing, and logging.

Built-in callbacks: ProgressCallback, CheckpointCallback, LoggingCallback.

from nestkit.callbacks import ProgressCallback, CheckpointCallback

ncv = NestedCVClassifier(
    ...,
    callbacks=[ProgressCallback(n_outer_folds=5), CheckpointCallback(path="./ckpt")],
)

Plotting#

nestkit includes 25+ plotting functions. All accept an optional ax parameter and return a matplotlib Axes.

Categories: fold scores, ROC/PR curves, confusion matrices, residuals, calibration diagrams, threshold sensitivity, comparison plots, critical difference diagrams, feature importance, and SHAP summaries.

from nestkit.plotting import plot_roc_curves, plot_calibration_curves

plot_roc_curves(ncv.results_)
plot_calibration_curves(ncv.results_)

Install plotting support with pip install "nestkit[plotting]" (the quotes prevent shells such as zsh from interpreting the square brackets). See the API Reference for the full function list.