Architecture ============ Standard cross-validation conflates hyperparameter tuning with performance estimation, producing optimistically biased scores. Nested cross-validation fixes this by separating the two concerns into an outer loop (unbiased evaluation) and an inner loop (tuning). The 4-Phase Pipeline -------------------- For every outer fold, nestkit executes: .. list-table:: :header-rows: 1 :widths: 10 25 65 * - Phase - Name - Description * - 1 - Inner CV search - ``GridSearchCV`` or ``RandomizedSearchCV`` on the outer training fold to select the best hyperparameters. * - 2 - Post-hoc calibration - *(Classification, opt-in.)* Fit a calibrator on inner out-of-fold predictions so that calibrated probabilities are leakage-free. * - 3 - Threshold optimization - *(Classification, opt-in.)* Find the decision threshold that maximizes a chosen criterion on inner out-of-fold predictions. * - 3b - Conformal prediction - *(Opt-in.)* Compute per-class conformal quantile thresholds (classification) or per-bin residual quantiles (regression) from inner out-of-fold predictions for coverage-guaranteed prediction sets or intervals. * - 4 - Refit & evaluate - Refit with the best hyperparameters on the full outer training fold, then score on the held-out outer test fold. The outer test fold is never used for tuning, calibration, or thresholding. Probability Calibration ----------------------- Many classifiers produce poorly calibrated probabilities. The predicted confidence does not match the true positive rate. nestkit integrates post-hoc calibration directly into Phase 2, fitting on inner out-of-fold predictions to avoid leakage. Supported methods: **Platt scaling** (``"sigmoid"``), **isotonic regression** (``"isotonic"``), **beta calibration** (``"beta"``), and **Venn-ABERS** (``"venn_abers"``). .. code-block:: python ncv = NestedCVClassifier( estimator=GradientBoostingClassifier(), param_grid={...}, calibration_method="isotonic", ... ) ncv.fit(X, y) print(ncv.results_.calibration_summary_) Threshold Optimization ---------------------- The default 0.5 threshold is rarely optimal for imbalanced classes or asymmetric costs. nestkit searches for a better threshold on inner out-of-fold predictions (Phase 3), keeping the outer test fold untouched. Five built-in criteria: **Youden's J**, **F-beta**, **cost-sensitive**, **balanced accuracy**, and **precision-at-recall**. Two strategies: ``"pooled"`` (single threshold) and ``"fold_specific"`` (averaged per-fold thresholds). .. code-block:: python ncv = NestedCVClassifier( estimator=GradientBoostingClassifier(), param_grid={...}, calibration_method="isotonic", threshold_strategy="pooled", threshold_criterion="youden", ... ) ncv.fit(X, y) print(ncv.results_.summary_optimized_) Conformal Prediction -------------------- Traditional point predictions and calibrated probabilities do not quantify uncertainty with formal guarantees. **CV+ Mondrian conformal prediction** produces prediction sets (classification) or prediction intervals (regression) with finite-sample coverage guarantees. For **classification**, conformal prediction computes a per-class quantile threshold (q-hat) from inner OOF nonconformity scores. At test time, each class is included in the prediction set if its nonconformity score falls below its threshold. This is Mondrian (class-conditional), providing per-class coverage. With calibration enabled, conformal thresholds are computed from calibrated probabilities. For **regression**, Mondrian binning groups OOF predictions into equal-frequency bins and computes per-bin residual quantiles. This yields tighter intervals in easy-to-predict regions and wider intervals elsewhere. .. code-block:: python # Classification: conformal prediction sets ncv = NestedCVClassifier( estimator=RandomForestClassifier(), param_grid={...}, calibration_method="isotonic", conformal_prediction=True, conformal_alpha=0.1, # 90% target coverage ... ) ncv.fit(X, y) print(ncv.results_.conformal_report()) # Regression: Mondrian prediction intervals ncv = NestedCVRegressor( estimator=Ridge(), param_grid={...}, prediction_intervals=True, mondrian_bins=5, ... ) ncv.fit(X, y) print(ncv.results_.mondrian_coverage_per_bin_) Model Comparison ---------------- Comparing models with a naive paired *t*-test inflates Type I error because CV fold scores are correlated. nestkit's :class:`~nestkit.comparison.NestedCVComparator` provides proper statistical tests. - **Nadeau-Bengio corrected *t*-test**: adjusts variance for fold overlap. - **Bayesian correlated *t*-test with ROPE**: posterior probabilities for "A better", "equivalent", or "B better". - **Holm-Bonferroni correction**: controls family-wise error for multi-model comparisons. .. code-block:: python from nestkit.comparison import NestedCVComparator comparator = NestedCVComparator() comparator.add("RF", rf_results) comparator.add("GBM", gbm_results) print(comparator.corrected_paired_ttest(metric="roc_auc", model_a="GBM", model_b="RF")) print(comparator.bayesian_comparison(metric="roc_auc", model_a="GBM", model_b="RF", rope=0.01)) Diagnostics ----------- **Hyperparameter stability**: if the inner search selects different hyperparameters across folds, the model may be too sensitive or the data too small. :class:`~nestkit.diagnostics.HyperparameterStability` reports selection frequency, entropy, agreement rate, and pairwise Jaccard similarity. **Generalization gap**: ``results.generalization_gap_`` compares inner CV scores to outer test scores per fold; a large gap signals overfitting during tuning. .. code-block:: python from nestkit.diagnostics import HyperparameterStability stab = HyperparameterStability(ncv.results_.best_params_per_fold_) print(stab.summary()) Feature Importance ------------------ Single-split importance scores are unreliable. :class:`~nestkit.importance.FeatureImportanceAggregator` extracts importances from each outer fold estimator (model-native or SHAP), aggregates them, and reports the **Nogueira stability index** to measure top-*k* feature consistency across folds. .. code-block:: python from nestkit.importance import FeatureImportanceAggregator agg = FeatureImportanceAggregator(ncv.results_, method="auto", feature_names=names) agg.compute() print(agg.summary_) print(f"Stability (k=10): {agg.stability_index(top_k=10):.3f}") Callbacks --------- The callback system hooks into the pipeline lifecycle for progress tracking, checkpointing, and logging. Built-in callbacks: :class:`~nestkit.callbacks.ProgressCallback`, :class:`~nestkit.callbacks.CheckpointCallback`, :class:`~nestkit.callbacks.LoggingCallback`. .. code-block:: python from nestkit.callbacks import ProgressCallback, CheckpointCallback ncv = NestedCVClassifier( ..., callbacks=[ProgressCallback(n_outer_folds=5), CheckpointCallback(path="./ckpt")], ) Plotting -------- nestkit includes 25+ plotting functions. All accept an optional ``ax`` parameter and return a matplotlib ``Axes``. Categories: fold scores, ROC/PR curves, confusion matrices, residuals, calibration diagrams, threshold sensitivity, comparison plots, critical difference diagrams, feature importance, and SHAP summaries. .. code-block:: python from nestkit.plotting import plot_roc_curves, plot_calibration_curves plot_roc_curves(ncv.results_) plot_calibration_curves(ncv.results_) Install plotting support with ``pip install nestkit[plotting]``. See the :ref:`API Reference ` for the full function list.