Getting Started with nestkit#

This notebook demonstrates the core functionality of nestkit - a rigorous nested cross-validation toolkit for scikit-learn.

What you’ll learn#

How to run nested CV for binary classification
How to inspect and interpret results
How to create basic visualizations
How to add probability calibration
How to add threshold optimization
How to run nested CV for regression with prediction intervals

Prerequisites#

pip install nestkit[plotting]

[1]:

import warnings

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

from nestkit import NestedCVClassifier, NestedCVRegressor
from nestkit.plotting import (
    plot_calibration_curves,
    plot_calibration_improvement,
    plot_confusion_matrices,
    plot_inner_cv_heatmap,
    plot_outer_scores,
    plot_param_selection,
    plot_precision_recall_curves,
    plot_predicted_vs_actual,
    plot_prediction_intervals,
    plot_residual_qq,
    plot_residuals,
    plot_roc_curves,
    plot_score_stability,
    plot_threshold_comparison,
    plot_threshold_distribution,
    plot_threshold_sensitivity,
)

warnings.filterwarnings("ignore")

%matplotlib inline
plt.rcParams["figure.dpi"] = 100

1. Binary Classification with NestedCVClassifier#

We’ll use the breast cancer dataset (569 samples, 30 features, binary target) to demonstrate nested CV for classification.

Standard cross-validation conflates model selection with performance estimation, producing optimistically biased scores. Nested CV separates the two:

Inner loop - hyperparameter tuning on the outer training set
Outer loop - unbiased evaluation on the held-out outer test fold

[2]:

X, y = load_breast_cancer(return_X_y=True)
feature_names = list(load_breast_cancer().feature_names)
print(f"Shape: {X.shape}")
print(f"Classes: {np.unique(y)}, counts: {np.bincount(y)}")

Shape: (569, 30)
Classes: [0 1], counts: [212 357]

[3]:

ncv = NestedCVClassifier(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    outer_cv=5,
    inner_cv=3,
    scoring="roc_auc",
    random_state=42,
)
ncv.fit(X, y)

[3]:

NestedCVClassifier(estimator=RandomForestClassifier(random_state=42),
                   inner_cv=3,
                   param_grid={'max_depth': [3, 5, 10],
                               'n_estimators': [50, 100, 200]},
                   random_state=42, scoring='roc_auc')

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

NestedCVClassifier

iFitted

Parameters

	estimator	RandomForestC...ndom_state=42)
	param_grid	{'max_depth': [3, 5, ...], 'n_estimators': [50, 100, ...]}
	search_strategy	'grid'
	outer_cv	5
	inner_cv	3
	scoring	'roc_auc'
	refit	True
	return_train_score	False
	return_estimator	True
	error_score	'raise'
	n_jobs_outer	None
	n_jobs_inner	None
	verbose	0
	random_state	42
	callbacks	None
	pre_dispatch	'2*n_jobs'
	calibration_method	None
	threshold_strategy	None
	threshold_criterion	'youden'
	threshold_beta	1.0
	cost_matrix	None
	min_recall	None
	calibration_cv	None

estimator: RandomForestClassifier

RandomForestClassifier(random_state=42)

RandomForestClassifier

?Documentation for RandomForestClassifier

Parameters

	n_estimators	100
	criterion	'gini'
	max_depth	None
	min_samples_split	2
	min_samples_leaf	1
	min_weight_fraction_leaf	0.0
	max_features	'sqrt'
	max_leaf_nodes	None
	min_impurity_decrease	0.0
	bootstrap	True
	oob_score	False
	n_jobs	None
	random_state	42
	verbose	0
	warm_start	False
	class_weight	None
	ccp_alpha	0.0
	max_samples	None
	monotonic_cst	None

2. Inspecting Results#

After fitting, ncv.results_ returns a ClassifierResults object with summary statistics, per-fold predictions, best hyperparameters, and generalization gap analysis.

[4]:

results = ncv.results_
results.summary_default_

[4]:

	metric	mean	std	ci_lower	ci_upper	median	iqr
0	accuracy	0.959618	0.028121	0.907243	1.011993	0.973684	0.043704
1	balanced_accuracy	0.955556	0.032409	0.895194	1.015918	0.969246	0.057960
2	precision	0.964525	0.032206	0.904540	1.024509	0.972603	0.043254
3	recall	0.971909	0.024452	0.926368	1.017450	0.985915	0.014280
4	f1	0.967937	0.022510	0.926013	1.009862	0.979310	0.033333
5	roc_auc	0.990150	0.009537	0.972387	1.007913	0.991733	0.014053

[5]:

print("Best hyperparameters per fold:")
for i, params in enumerate(results.best_params_per_fold_):
    print(f"  Fold {i}: {params}")

print("\nParameter stability:")
results.param_stability_

Best hyperparameters per fold:
  Fold 0: {'max_depth': 10, 'n_estimators': 50}
  Fold 1: {'max_depth': 5, 'n_estimators': 100}
  Fold 2: {'max_depth': 5, 'n_estimators': 200}
  Fold 3: {'max_depth': 5, 'n_estimators': 200}
  Fold 4: {'max_depth': 5, 'n_estimators': 100}

Parameter stability:

[5]:

	parameter	mode	nunique	agreement_rate
0	max_depth	5	2	0.8
1	n_estimators	100	3	0.4

[6]:

# Out-of-fold predictions  -  each sample appears exactly once as a test sample
results.predictions_.head(10)

[6]:

	y_proba_raw_0	y_proba_raw_1
0	0.86	0.14
1	0.94	0.06
2	1.00	0.00
3	0.74	0.26
4	0.84	0.16
5	0.72	0.28
6	1.00	0.00
7	0.92	0.08
8	0.94	0.06
9	0.78	0.22

[7]:

# Generalization gap: difference between inner CV score and outer test score
results.generalization_gap_

[7]:

	fold_idx	best_inner_score	outer_accuracy	gap_accuracy	outer_balanced_accuracy	gap_balanced_accuracy	outer_precision	gap_precision	outer_recall	gap_recall	outer_f1	gap_f1	outer_roc_auc	gap_roc_auc
0	0	0.990890	0.921053	0.069837	0.918277	0.072613	0.942857	0.048032	0.929577	0.061312	0.936170	0.054719	0.976744	0.014145
1	1	0.993727	0.938596	0.055131	0.923190	0.070537	0.921053	0.072675	0.985915	0.007812	0.952381	0.041346	0.984605	0.009122
2	2	0.988471	0.982456	0.006015	0.981151	0.007320	0.986111	0.002360	0.986111	0.002360	0.986111	0.002360	0.999008	-0.010537
3	3	0.993157	0.973684	0.019473	0.969246	0.023911	0.972603	0.020554	0.986111	0.007046	0.979310	0.013847	0.991733	0.001424
4	4	0.990807	0.982301	0.008506	0.985915	0.004891	1.000000	-0.009193	0.971831	0.018976	0.985714	0.005093	0.998659	-0.007852

[8]:

# Pooled classification report across all outer folds
print(results.classification_report_pooled())

              precision    recall  f1-score   support

           0       0.95      0.94      0.95       212
           1       0.96      0.97      0.97       357

    accuracy                           0.96       569
   macro avg       0.96      0.96      0.96       569
weighted avg       0.96      0.96      0.96       569

3. Plotting#

nestkit provides 25+ plotting functions. All accept an optional ax parameter for embedding in custom figure layouts.

[9]:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_outer_scores(results, "roc_auc", ax=axes[0])
plot_outer_scores(results, "accuracy", ax=axes[1])
plt.tight_layout()

[10]:

fig, axes = plt.subplots(1, 2, figsize=(16, 4))
plot_inner_cv_heatmap(results, ax=axes[0])
plot_score_stability(results, ax=axes[1])
plt.tight_layout()

[11]:

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_roc_curves(results, ax=axes[0])
plot_precision_recall_curves(results, ax=axes[1])
plt.tight_layout()

[12]:

plot_confusion_matrices(results, normalize="true")

[12]:

<Axes: title={'center': 'Fold 0'}, xlabel='Predicted', ylabel='True'>

[13]:

# Which hyperparameters were selected across folds?
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_param_selection(results, "n_estimators", ax=axes[0])
plot_param_selection(results, "max_depth", ax=axes[1])
plt.tight_layout()

4. Adding Probability Calibration#

Many classifiers produce poorly calibrated probabilities - a predicted probability of 0.8 doesn’t mean the event occurs 80% of the time. nestkit supports four post-hoc calibration methods:

"sigmoid" - Platt scaling (logistic regression on raw probabilities)
"isotonic" - Isotonic regression (non-parametric)
"beta" - Beta calibration
"venn-abers" - Venn-ABERS prediction

Calibration is performed using out-of-fold (OOF) predictions from the inner CV to avoid data leakage.

[14]:

ncv_cal = NestedCVClassifier(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    outer_cv=5,
    inner_cv=3,
    scoring="roc_auc",
    calibration_method="isotonic",
    random_state=42,
)
ncv_cal.fit(X, y)
results_cal = ncv_cal.results_

[15]:

print(f"Has calibration: {results_cal.has_calibration}")
print("\nCalibration summary (per-fold ECE, Brier scores):")
display(results_cal.calibration_summary_)
print("\nCalibration improvement (delta = raw - calibrated, positive = improvement):")
display(results_cal.calibration_improvement_)

Has calibration: True

Calibration summary (per-fold ECE, Brier scores):

	fold_idx	ece_raw	ece_calibrated	mce_raw	mce_calibrated	brier_raw	brier_calibrated
0	0	0.042456	0.019309	0.152000	0.175926	0.043832	0.045787
1	1	0.040486	0.052332	0.205041	0.344236	0.045089	0.054155
2	2	0.040498	0.016296	0.142264	0.080141	0.019292	0.015467
3	3	0.031108	0.007254	0.139114	0.011426	0.030284	0.028567
4	4	0.042710	0.019833	0.201191	0.244089	0.020169	0.016963


Calibration improvement (delta = raw - calibrated, positive = improvement):

	fold_idx	delta_ece	delta_brier
0	0.0	0.023147	-0.001955
1	1.0	-0.011845	-0.009066
2	2.0	0.024202	0.003825
3	3.0	0.023854	0.001717
4	4.0	0.022878	0.003206

[16]:

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
plot_calibration_curves(results_cal, ax=axes[0])
plot_calibration_improvement(results_cal, ax=axes[1])
plt.tight_layout()

5. Adding Threshold Optimization#

The default decision threshold of 0.5 is often suboptimal. nestkit can select an optimal threshold using several criteria:

"youden" - Maximizes Youden’s J statistic (sensitivity + specificity - 1)
"f_beta" - Maximizes F-beta score
"cost" - Minimizes expected cost given a cost matrix
"balanced_accuracy" - Maximizes balanced accuracy
"precision_at_recall" - Maximizes precision subject to a minimum recall

Two strategies are available:

"pooled" - A single threshold selected from all OOF predictions
"fold_specific" - A separate threshold per outer fold

[17]:

ncv_thr = NestedCVClassifier(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]},
    outer_cv=5,
    inner_cv=3,
    scoring="roc_auc",
    calibration_method="isotonic",
    threshold_strategy="pooled",
    threshold_criterion="youden",
    random_state=42,
)
ncv_thr.fit(X, y)
results_thr = ncv_thr.results_

[18]:

print(f"Threshold optimization: {results_thr.has_threshold_optimization}")
print(f"\nThresholds per fold: {results_thr.thresholds_per_fold_}")
print(f"\nThreshold stability: {results_thr.threshold_stability_}")

Threshold optimization: True

Thresholds per fold: [0.6009697  0.5009899  0.44456566 0.41684848 0.5009899 ]

Threshold stability: {'mean': 0.4928727272727273, 'std': 0.07058680883762491, 'cv': 0.14321508359382815, 'range': 0.18412121212121219}

[19]:

# Side-by-side comparison: default (0.5) vs optimized threshold
results_thr.threshold_comparison()

[19]:

	metric	mean_default	std_default	mean_optimized	std_optimized
0	accuracy	0.959603	0.020154	0.954339	0.028655
1	balanced_accuracy	0.953631	0.024760	0.953184	0.029577
2	precision	0.959211	0.026155	0.969723	0.029835
3	recall	0.977582	0.012588	0.957864	0.041130
4	f1	0.968157	0.015799	0.963120	0.024000
5	roc_auc	0.986868	0.012304	0.986868	0.012304

[20]:

fig, axes = plt.subplots(1, 3, figsize=(16, 4))
plot_threshold_sensitivity(results_thr, fold_idx=0, ax=axes[0])
plot_threshold_distribution(results_thr, bins=10, full_range=True, ax=axes[1])
plot_threshold_comparison(results_thr, ax=axes[2])
plt.tight_layout()

[21]:

# Confusion matrices with optimized threshold
plot_confusion_matrices(results_thr, threshold="optimized")

[21]:

<Axes: title={'center': 'Fold 0'}, xlabel='Predicted', ylabel='True'>

6. Regression with NestedCVRegressor#

nestkit also supports regression tasks. NestedCVRegressor provides the same nested CV framework with optional residual-based prediction intervals - uncertainty estimates derived from the distribution of inner CV residuals.

[22]:

X_reg, y_reg = load_diabetes(return_X_y=True)
print(f"Shape: {X_reg.shape}, target range: [{y_reg.min():.0f}, {y_reg.max():.0f}]")

Shape: (442, 10), target range: [25, 346]

[23]:

ncv_reg = NestedCVRegressor(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    outer_cv=5,
    inner_cv=3,
    prediction_intervals=True,
    random_state=42,
)
ncv_reg.fit(X_reg, y_reg)
results_reg = ncv_reg.results_

[24]:

print("Summary:")
display(results_reg.summary_default_)
print(f"\nResidual stats: {results_reg.residual_stats_}")
print(f"\nPrediction interval coverage: {results_reg.prediction_interval_coverage_}")

Summary:

	metric	mean	std	ci_lower	ci_upper	median	iqr
0	mse	2997.691750	143.786002	2729.890897	3265.492603	3001.755668	104.319504
1	rmse	54.738582	1.313080	52.292975	57.184189	54.788280	0.953475
2	mae	44.294273	2.179699	40.234593	48.353953	43.292787	2.100991
3	r2	0.481443	0.054231	0.380438	0.582447	0.489191	0.092220
4	mape	0.394409	0.032817	0.333288	0.455530	0.391332	0.044576


Residual stats: {'mean': 0.37101618492788285, 'std': 54.80893629831313, 'median': -2.1691076714726307, 'skewness': 0.05688818733292645, 'kurtosis': -0.2869504563698184}

Prediction interval coverage: {'mean': 0.9547242083758938, 'per_fold': [0.9550561797752809, 0.9662921348314607, 0.9659090909090909, 0.9204545454545454, 0.9659090909090909]}

[25]:

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
plot_predicted_vs_actual(results_reg, ax=axes[0, 0])
plot_residuals(results_reg, ax=axes[0, 1])
plot_residual_qq(results_reg, ax=axes[1, 0])
plot_prediction_intervals(results_reg, ax=axes[1, 1])
plt.tight_layout()

[26]:

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_outer_scores(results_reg, "r2", ax=axes[0])
plot_outer_scores(results_reg, "rmse", ax=axes[1])
plt.tight_layout()

Next Steps#

This notebook covered the core nestkit workflow. For advanced topics, see 02 - Advanced Workflows:

Multi-model statistical comparison with NestedCVComparator
Feature importance aggregation with FeatureImportanceAggregator
Hyperparameter stability diagnostics
Callbacks and persistence