Core Estimators
The two main entry points for running nested cross-validation in nestkit.
Classifier
- class nestkit.NestedCVClassifier(estimator, param_grid, *, search_strategy='grid', outer_cv=5, inner_cv=5, scoring=None, refit=True, return_train_score=False, return_estimator=True, error_score='raise', n_jobs_outer=None, n_jobs_inner=None, verbose=0, random_state=None, callbacks=None, pre_dispatch='2*n_jobs', calibration_method=None, threshold_strategy=None, threshold_criterion='youden', threshold_beta=1.0, cost_matrix=None, min_recall=None, calibration_cv=None, conformal_prediction=False, conformal_alpha=0.1)
Bases: _BaseNestedCV
Nested cross-validation for classification tasks.
Supports binary and multiclass classification. Extends _BaseNestedCV with optional post-hoc probability calibration and decision-threshold optimization. Both features are disabled by default and must be explicitly enabled.
When calibration is enabled, out-of-fold (OOF) predictions from the inner CV are used to fit a calibrator, which is then applied to the outer test-set probabilities. When threshold optimization is enabled, the optimal decision boundary is selected on the calibrated (or raw) OOF probabilities.
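The OOF calibration flow described above can be illustrated for a single outer fold with plain scikit-learn. This is a minimal sketch, not nestkit's internal code: the synthetic data, the logistic regression model, and the single train/test split standing in for one outer fold are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)

# Synthetic binary data; one train/test split stands in for a single outer fold.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold positive-class probabilities on the outer training set.
oof_proba = cross_val_predict(
    clf, X_train, y_train, cv=inner_cv, method="predict_proba"
)[:, 1]

# Fit the calibrator on OOF probabilities only, never on the test fold.
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(oof_proba, y_train)

# Refit on the full outer training set, then calibrate the test probabilities.
clf.fit(X_train, y_train)
raw_test_proba = clf.predict_proba(X_test)[:, 1]
calibrated_test_proba = calibrator.predict(raw_test_proba)
```

Because the calibrator only ever sees predictions for samples the model was not trained on, the held-out test fold stays untouched until the final scoring step.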
- Parameters:
estimator (estimator object) – A scikit-learn compatible classifier that implements fit and predict_proba. Cloned for each outer fold.
param_grid (dict or list of dict) – Hyperparameter search space. See GridSearchCV.
search_strategy ({'grid', 'random', 'bayesian'}, default='grid') – Inner hyperparameter search strategy.
outer_cv (int, cross-validation generator, or iterable, default=5) – Outer cross-validation splitting strategy.
inner_cv (int, cross-validation generator, or iterable, default=5) – Inner cross-validation splitting strategy.
scoring (str, callable, list, tuple, or dict, default=None) – Scoring metric(s) for the inner search.
refit (bool or str, default=True) – Whether to refit on the full outer training set.
return_train_score (bool, default=False) – Whether to include training scores in inner CV results.
return_estimator (bool, default=True) – Whether to store fitted estimators per outer fold.
error_score ('raise' or numeric, default='raise') – Value assigned on inner CV fitting errors.
n_jobs_outer (int or None, default=None) – Number of parallel jobs for outer folds.
n_jobs_inner (int or None, default=None) – Number of parallel jobs for inner search.
verbose (int, default=0) – Verbosity level.
random_state (int, RandomState instance, or None, default=None) – Random state for reproducibility.
callbacks (list of callback objects or None, default=None) – FoldCallback instances for monitoring.
pre_dispatch (int or str, default='2*n_jobs') – Controls job dispatch for parallel execution.
calibration_method ({'sigmoid', 'isotonic', 'beta', 'venn_abers'} or None, default=None) – Post-hoc calibration method. If None, no calibration is applied. 'sigmoid' corresponds to Platt scaling, 'isotonic' to isotonic regression, 'beta' to beta calibration, and 'venn_abers' to Venn-ABERS prediction.
threshold_strategy ({'pooled', 'fold_specific'} or None, default=None) – Threshold optimization strategy. If None, no threshold optimization is performed. 'pooled' selects a single threshold from all OOF predictions; 'fold_specific' selects a per-fold threshold.
threshold_criterion (str or callable, default='youden') – Criterion for threshold selection. Built-in options: 'youden', 'f_beta', 'cost', 'balanced_accuracy', 'precision_at_recall'. A custom callable must accept (y_true, y_proba, threshold) and return a float to be maximised.
threshold_beta (float, default=1.0) – Beta parameter for the F-beta criterion. Only used when threshold_criterion='f_beta'.
cost_matrix (array-like of shape (2, 2) or None, default=None) – Cost matrix [[TN_cost, FP_cost], [FN_cost, TP_cost]] for cost-sensitive threshold optimization. Required when threshold_criterion='cost'.
min_recall (float or None, default=None) – Minimum recall constraint for the 'precision_at_recall' criterion. Required when threshold_criterion='precision_at_recall'.
calibration_cv (int, cross-validation generator, or None, default=None) – CV strategy for generating OOF calibration predictions. If None, uses the same inner_cv strategy. Note that when inner_cv is an integer, a new splitter instance is created for the calibration OOF loop, which may produce different fold assignments than the inner hyperparameter search.
conformal_prediction (bool, default=False) – If True, compute CV+ Mondrian conformal prediction sets using inner out-of-fold probabilities (calibrated if calibration is enabled). Each outer fold gets its own per-class q-hat threshold, applied to the held-out test fold.
conformal_alpha (float, default=0.1) – Significance level (miscoverage rate) for conformal prediction. Target coverage is 1 - alpha. Must be in (0, 1).
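To make the per-class q-hat idea behind conformal_prediction and conformal_alpha concrete, here is a simplified sketch of class-conditional (Mondrian) conformal sets. It assumes the nonconformity score is one minus the probability of the true class and pools all OOF scores per class; the function names mondrian_qhat and prediction_sets are hypothetical, and nestkit's actual CV+ procedure may differ in detail.

```python
import numpy as np

def mondrian_qhat(oof_proba, y, alpha):
    """Per-class conformal thresholds from out-of-fold probabilities."""
    oof_proba = np.asarray(oof_proba, dtype=float)
    y = np.asarray(y)
    n_classes = oof_proba.shape[1]
    # Scores are bounded by 1, so q-hat = 1 means "always include this class".
    qhat = np.ones(n_classes)
    for k in range(n_classes):
        # Nonconformity score: one minus the probability of the true class.
        scores = np.sort(1.0 - oof_proba[y == k, k])
        n_k = scores.size
        # Finite-sample corrected rank of the (1 - alpha) quantile.
        rank = int(np.ceil((n_k + 1) * (1.0 - alpha)))
        if 0 < rank <= n_k:
            qhat[k] = scores[rank - 1]
    return qhat

def prediction_sets(test_proba, qhat):
    """Boolean matrix: entry (i, k) is True if class k is in sample i's set."""
    return (1.0 - np.asarray(test_proba, dtype=float)) <= qhat

# Tiny illustrative calibration set (values are made up).
oof = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.4, 0.6]])
labels = np.array([0, 0, 1, 1])
qhat = mondrian_qhat(oof, labels, alpha=0.5)
sets = prediction_sets(oof, qhat)
```

A smaller alpha pushes each q-hat upward, so the prediction sets grow and include more classes per sample.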
Notes
Enabling calibration and/or threshold optimization roughly doubles computation time per outer fold, as the inner CV folds must be re-run to produce OOF probability estimates for the calibrator and threshold optimizer.
For multiclass tasks, calibration is applied independently per class using a one-vs-rest (OVR) decomposition. After calibration the per-class probabilities are renormalized to sum to 1. Because each calibrator is fitted on a marginal binary problem, the resulting multiclass probabilities may not be jointly well-calibrated – this is a known limitation of OVR calibration approaches.
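The renormalization step mentioned above amounts to dividing each row of per-class calibrated scores by its sum. The sketch below shows only that step; the helper name renormalize_ovr and the uniform fallback for all-zero rows are assumptions for illustration, not documented nestkit behaviour.

```python
import numpy as np

def renormalize_ovr(calibrated, eps=1e-12):
    """Rescale per-class calibrated scores so each row sums to 1."""
    calibrated = np.asarray(calibrated, dtype=float)
    row_sums = calibrated.sum(axis=1, keepdims=True)
    n_classes = calibrated.shape[1]
    # Assumed fallback: a uniform distribution if every class was mapped to ~0.
    uniform = np.full_like(calibrated, 1.0 / n_classes)
    scaled = calibrated / np.maximum(row_sums, eps)
    return np.where(row_sums > eps, scaled, uniform)

# Each column comes from an independent one-vs-rest calibrator,
# so rows do not sum to 1 before renormalization.
ovr_scores = np.array([[0.7, 0.4, 0.1], [0.0, 0.0, 0.0]])
probs = renormalize_ovr(ovr_scores)
```

Dividing by the row sum keeps the ranking of classes within a sample but changes the absolute values, which is one reason the renormalized probabilities may not be jointly well-calibrated.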
Examples
Basic classification:
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.ensemble import RandomForestClassifier
>>> from nestkit import NestedCVClassifier
>>> X, y = load_breast_cancer(return_X_y=True)
>>> ncv = NestedCVClassifier(
...     estimator=RandomForestClassifier(random_state=42),
...     param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},
...     outer_cv=5, inner_cv=3, random_state=42,
... )
>>> ncv.fit(X, y)
>>> print(ncv.results_.summary_default_)
With calibration and threshold optimization:
>>> ncv = NestedCVClassifier(
...     estimator=RandomForestClassifier(random_state=42),
...     param_grid={"n_estimators": [50, 100]},
...     outer_cv=5, inner_cv=3,
...     calibration_method="isotonic",
...     threshold_strategy="pooled",
...     threshold_criterion="youden",
...     random_state=42,
... )
>>> ncv.fit(X, y)
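The threshold_criterion parameter is documented as also accepting a callable with the signature (y_true, y_proba, threshold) that returns a float to be maximised. The sketch below writes Youden's J in that form and pairs it with a hypothetical grid search, select_threshold, to show how a pooled threshold could be chosen from OOF probabilities. The grid and the helper are assumptions; nestkit's own search procedure is not described in this page.

```python
import numpy as np

def youden_j(y_true, y_proba, threshold):
    """Sensitivity + specificity - 1 at a given threshold (higher is better)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_proba) >= threshold
    tp = np.sum(y_pred & y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    fp = np.sum(y_pred & ~y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return float(sensitivity + specificity - 1.0)

def select_threshold(y_true, y_proba, criterion, grid=None):
    """Hypothetical helper: pick the grid threshold that maximises `criterion`."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    scores = [criterion(y_true, y_proba, t) for t in grid]
    return float(grid[int(np.argmax(scores))])

# Made-up pooled OOF labels and probabilities.
y_oof = np.array([0, 0, 0, 1, 1, 1])
p_oof = np.array([0.10, 0.20, 0.45, 0.40, 0.80, 0.90])
best = select_threshold(y_oof, p_oof, youden_j)
```

Under the documented contract, a function shaped like youden_j could be passed as threshold_criterion=youden_j in place of the built-in string.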
See also
nestkit.NestedCVRegressor: Regression-specific nested CV.
nestkit.calibration.PostHocCalibrator: Standalone calibrator.
nestkit.thresholding.strategies.PooledThreshold: Pooled threshold strategy.
- fit(X, y, groups=None, **fit_params)
Run nested cross-validation with optional calibration and thresholding.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target labels.
groups (array-like of shape (n_samples,) or None, default=None) – Group labels for group-aware CV splitters.
**fit_params (dict) – Additional keyword arguments forwarded to the estimator’s fit method.
- Returns:
The fitted estimator. Results are accessible via results_.
- Return type:
self
- Raises:
ValueError – If calibration or threshold parameters are invalid.
- set_fit_request(*, groups='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the groups parameter in fit.
self (NestedCVClassifier)
- Returns:
self – The updated object.
- Return type:
Regressor
- class nestkit.NestedCVRegressor(estimator, param_grid, *, search_strategy='grid', outer_cv=5, inner_cv=5, scoring=None, refit=True, return_train_score=False, return_estimator=True, error_score='raise', n_jobs_outer=None, n_jobs_inner=None, verbose=0, random_state=None, callbacks=None, pre_dispatch='2*n_jobs', prediction_intervals=False, confidence_level=0.95, mondrian_bins=None, mondrian_min_bin_size=20)
Bases: _BaseNestedCV
Nested cross-validation for regression tasks.
Extends _BaseNestedCV with support for residual-based prediction intervals. When prediction_intervals=True, inner out-of-fold residuals are collected and their quantiles (with finite-sample correction) are used to construct prediction intervals on the outer test set.
Note
The residuals are collected from OOF models fitted with the best hyperparameters, but the final model is refitted on the full outer training set. The residual distribution may therefore not perfectly match the final model’s errors. These intervals are approximate and do not carry formal conformal coverage guarantees.
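One common way to turn OOF residuals into an interval with a finite-sample correction is to take the ceil((n + 1) * level)-th smallest absolute residual as a symmetric half-width. The sketch below follows that convention as an assumption; the exact quantile formula nestkit uses, and whether its intervals are symmetric, are not stated here, and residual_interval is a hypothetical name.

```python
import numpy as np

def residual_interval(oof_residuals, test_pred, confidence_level=0.95):
    """Symmetric interval from absolute OOF residuals with a finite-sample correction."""
    abs_res = np.sort(np.abs(np.asarray(oof_residuals, dtype=float)))
    n = abs_res.size
    # Corrected rank ceil((n + 1) * level); capped at n as a simplification,
    # so the half-width never exceeds the largest observed residual.
    rank = min(n, int(np.ceil((n + 1) * confidence_level)))
    half_width = abs_res[rank - 1]
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - half_width, test_pred + half_width

# Made-up residuals (y_true - y_pred) from the inner OOF loop.
residuals = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 1.5, 2.5, 3.0, -3.5])
lower, upper = residual_interval(
    residuals, test_pred=[10.0, 20.0], confidence_level=0.75
)
```

With nine residuals and a 0.75 level the corrected rank is 8 rather than 7, which widens the interval slightly to account for the small calibration sample.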
- Parameters:
estimator (estimator object) – A scikit-learn compatible regressor that implements fit and predict. Cloned for each outer fold.
param_grid (dict or list of dict) – Hyperparameter search space.
search_strategy ({'grid', 'random', 'bayesian'}, default='grid') – Inner hyperparameter search strategy.
outer_cv (int, cross-validation generator, or iterable, default=5) – Outer cross-validation splitting strategy.
inner_cv (int, cross-validation generator, or iterable, default=5) – Inner cross-validation splitting strategy.
scoring (str, callable, list, tuple, or dict, default=None) – Scoring metric(s) for the inner search.
refit (bool or str, default=True) – Whether to refit on the full outer training set.
return_train_score (bool, default=False) – Whether to include training scores in inner CV results.
return_estimator (bool, default=True) – Whether to store fitted estimators per outer fold.
error_score ('raise' or numeric, default='raise') – Value assigned on inner CV fitting errors.
n_jobs_outer (int or None, default=None) – Number of parallel jobs for outer folds.
n_jobs_inner (int or None, default=None) – Number of parallel jobs for inner search.
verbose (int, default=0) – Verbosity level.
random_state (int, RandomState instance, or None, default=None) – Random state for reproducibility.
callbacks (list of callback objects or None, default=None) – FoldCallback instances for monitoring.
pre_dispatch (int or str, default='2*n_jobs') – Controls job dispatch for parallel execution.
prediction_intervals (bool, default=False) – Whether to compute residual-based prediction intervals using inner out-of-fold residuals. When enabled, the results contain prediction_interval_lower and prediction_interval_upper per outer fold.
confidence_level (float, default=0.95) – Confidence level for prediction intervals (e.g., 0.95 for 95% intervals). Only used when prediction_intervals=True.
mondrian_bins (int or None, default=None) – Number of Mondrian bins for conditional prediction intervals. When set (and prediction_intervals=True), OOF predictions are grouped into equal-frequency bins and per-bin residual quantiles are used instead of global quantiles. This yields tighter intervals for easy-to-predict regions.
mondrian_min_bin_size (int, default=20) – Minimum number of calibration samples per Mondrian bin. Bins with fewer samples are merged with their nearest neighbour.
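The equal-frequency binning behind mondrian_bins can be sketched as follows: bin edges are quantiles of the OOF predictions, and each test prediction receives the residual quantile of the bin it falls into. This is an illustrative assumption of how such binning may work, with a hypothetical function name; it uses a plain empirical quantile and omits the merging of small bins that mondrian_min_bin_size controls, falling back to the global quantile for an empty bin instead.

```python
import numpy as np

def mondrian_half_widths(oof_pred, oof_residuals, test_pred, n_bins=3, level=0.9):
    """Per-bin absolute-residual quantiles, with bins defined on the predictions."""
    oof_pred = np.asarray(oof_pred, dtype=float)
    abs_res = np.abs(np.asarray(oof_residuals, dtype=float))
    test_pred = np.asarray(test_pred, dtype=float)
    # Equal-frequency bin edges: interior quantiles of the OOF predictions.
    edges = np.quantile(oof_pred, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    cal_bins = np.searchsorted(edges, oof_pred, side="right")
    test_bins = np.searchsorted(edges, test_pred, side="right")
    widths = np.empty(n_bins)
    for b in range(n_bins):
        in_bin = abs_res[cal_bins == b]
        # Simplification: an empty bin falls back to the global quantile.
        widths[b] = np.quantile(in_bin if in_bin.size else abs_res, level)
    return widths[test_bins]

# Made-up heteroscedastic data: the noise grows with the predicted value.
rng = np.random.default_rng(0)
oof_pred = np.linspace(0.0, 10.0, 300)
oof_res = rng.normal(0.0, 0.1 + 0.2 * oof_pred)
half = mondrian_half_widths(oof_pred, oof_res, test_pred=[1.0, 9.0])
```

In this example the low-prediction bin has smaller residuals, so a test point predicted near 1.0 gets a narrower interval than one predicted near 9.0.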
Examples
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import Ridge
>>> from nestkit import NestedCVRegressor
>>> X, y = load_diabetes(return_X_y=True)
>>> ncv = NestedCVRegressor(
...     estimator=Ridge(),
...     param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
...     outer_cv=5, inner_cv=3,
...     prediction_intervals=True,
...     random_state=42,
... )
>>> ncv.fit(X, y)
>>> print(ncv.results_.summary_default_)
See also
nestkit.NestedCVClassifier: Classification-specific nested CV.
- fit(X, y, groups=None, **fit_params)
Run nested cross-validation with optional prediction intervals.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
groups (array-like of shape (n_samples,) or None, default=None) – Group labels for group-aware CV splitters.
**fit_params (dict) – Additional keyword arguments forwarded to the estimator’s fit method.
- Return type:
self
- set_fit_request(*, groups='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
groups (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the groups parameter in fit.
self (NestedCVRegressor)
- Returns:
self – The updated object.
- Return type: