Model Comparison
Statistical comparison of nested cross-validation results from multiple models, including corrected paired tests and multiple-comparison adjustments.
NestedCVComparator
- class nestkit.comparison.NestedCVComparator
Bases: object
Statistically rigorous comparison of nested cross-validation results.
Provides corrected paired t-tests (Nadeau & Bengio, 2003), Bayesian correlated t-tests (Benavoli et al., 2017), and Holm–Bonferroni multiple-comparison correction (Demsar, 2006) for comparing two or more models that were evaluated on identical outer folds.
All registered models must share the same outer-fold split indices; this is validated automatically when a new model is added.
- _results
Mapping from model name to its nested CV results object.
Examples
>>> comparator = NestedCVComparator()
>>> comparator.add("rf", rf_results)
>>> comparator.add("svm", svm_results)
>>> comparator.summary("accuracy")
See also
nestkit.comparison.statistical_tests.nadeau_bengio_corrected_ttest, nestkit.comparison.statistical_tests.bayesian_correlated_ttest
References
[1] Nadeau, C. and Bengio, Y. (2003). “Inference for the Generalization Error.” Machine Learning, 52(3), 239–281.
[2] Benavoli, A. et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” JMLR, 18(77), 1–36.
[3] Demsar, J. (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” JMLR, 7, 1–30.
- add(name, results)
Register a model’s nested cross-validation results.
- Parameters:
name (str) – Unique human-readable identifier for the model (e.g., "random_forest").
results (_BaseNestedCVResults) – Fitted nested CV results object. Must contain per-fold test_indices that match those of every previously registered model.
- Raises:
ValueError – If the outer-fold structure of results does not match that of models already registered.
- Return type:
None
See also
_validate_fold_alignment : Alignment check called internally.
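The alignment requirement can be illustrated with a minimal sketch. The helper below is hypothetical and is not the library's `_validate_fold_alignment`. It assumes each model's outer folds are available as a list with one sequence of test indices per fold, and it treats two folds as identical when they contain the same indices regardless of order.

```python
from typing import Sequence


def check_fold_alignment(
    reference: Sequence[Sequence[int]],
    candidate: Sequence[Sequence[int]],
) -> None:
    """Raise ValueError unless both models were scored on the same outer folds.

    Hypothetical helper: each argument holds one sequence of test indices
    per outer fold, in fold order.
    """
    if len(reference) != len(candidate):
        raise ValueError(
            f"fold count differs: {len(reference)} vs {len(candidate)}"
        )
    for fold, (ref_idx, cand_idx) in enumerate(zip(reference, candidate)):
        # Compare sorted copies so the index order inside a fold is ignored.
        if sorted(ref_idx) != sorted(cand_idx):
            raise ValueError(f"test indices differ in outer fold {fold}")


rf_folds = [[0, 1, 2], [3, 4, 5]]
svm_folds = [[2, 1, 0], [3, 4, 5]]  # same folds, different order within fold 0
check_fold_alignment(rf_folds, svm_folds)  # returns None, no exception
```

Paired tests compare scores fold by fold, so a mismatch in even one fold would pair scores computed on different test samples.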
- summary(metric, threshold='default')
Produce a side-by-side summary table of all registered models.
For each model, the table includes the mean, standard deviation, median, 95% confidence interval (Nadeau–Bengio corrected, based on the t-distribution), minimum, maximum, and interquartile range of the outer-fold scores.
- Parameters:
metric (str) – Scoring metric key to summarise.
threshold ({"default", "optimized"}, default="default") – Which score variant to use.
- Returns:
One row per model with columns model, mean, std, median, ci_lower, ci_upper, min, max, iqr.
- Return type:
Examples
>>> comparator.summary("roc_auc")
- corrected_paired_ttest(metric, model_a, model_b, threshold='default')
Perform the Nadeau–Bengio corrected paired t-test.
Accounts for the non-independence of cross-validation fold scores caused by overlapping training sets.
- Parameters:
- Returns:
Test results including t_statistic, p_value, mean_difference, corrected_std, ci_lower, ci_upper, n_folds, significant_at_005, significant_at_001.
- Return type:
See also
nestkit.comparison.statistical_tests.nadeau_bengio_corrected_ttest
References
[1] Nadeau, C. and Bengio, Y. (2003). Machine Learning, 52(3), 239–281.
- pairwise_corrected_ttest(metric, threshold='default')
Run corrected paired t-tests for every model pair.
All C(n, 2) pairwise Nadeau–Bengio tests are performed and then adjusted for multiple comparisons via the step-down Holm–Bonferroni procedure.
- Parameters:
metric (str) – Scoring metric key.
threshold ({"default", "optimized"}, default="default") – Which score variant to use.
- Returns:
One row per pair with columns model_a, model_b, all keys from nadeau_bengio_corrected_ttest(), and p_value_corrected.
- Return type:
See also
corrected_paired_ttest : Single-pair test.
nestkit.comparison.statistical_tests.holm_bonferroni_correction
References
[1] Demsar, J. (2006). JMLR, 7, 1–30.
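The step-down adjustment can be sketched in a few lines. The function below is an illustrative stand-in, not the library's `holm_bonferroni_correction`. It takes the raw p-values of all pairwise tests and returns adjusted p-values in the same order.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Three pairwise tests, e.g. from three registered models.
print(holm_bonferroni([0.01, 0.04, 0.03]))  # approximately [0.03, 0.06, 0.06]
```

A pair is then declared significant at level alpha when its adjusted p-value is at most alpha, which is equivalent to the usual sequential-rejection formulation of the procedure.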
- bayesian_comparison(metric, model_a, model_b, rope=0.01, threshold='default')
Perform a Bayesian correlated t-test between two models.
Uses a Student-t posterior over the mean score difference and partitions the probability mass into three regions: model A better, practically equivalent (within the ROPE), and model B better.
- Parameters:
- Returns:
Posterior probabilities and diagnostics: p_a_better, p_equivalent, p_b_better, rope, mean_difference, hdi_lower, hdi_upper.
- Return type:
See also
nestkit.comparison.statistical_tests.bayesian_correlated_ttest
References
[1] Benavoli, A. et al. (2017). JMLR, 18(77), 1–36.
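The three-way partition of the posterior can be sketched as follows. This is an illustration under stated assumptions rather than the library's code. It assumes higher scores are better, that differences are taken as A minus B, that the fold correlation is rho = n_test / (n_train + n_test), and that `n_train` and `n_test` are supplied by the caller. The function name and the stdlib-only CDF helper are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Dict, Sequence


def _t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """Student-t CDF via Simpson's rule on the density (stdlib only)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u: float) -> float:
        return norm * (1 + u * u / df) ** (-(df + 1) / 2)

    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + math.copysign(total * h / 3, x)


def correlated_ttest_sketch(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_train: int,
    n_test: int,
    rope: float = 0.01,
) -> Dict[str, float]:
    """Posterior mass left of, inside, and right of the ROPE [-rope, rope]."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    rho = n_test / (n_train + n_test)  # assumed correlation between folds
    # Posterior of the mean difference: Student-t with n - 1 degrees of
    # freedom, centred at d_bar, with this scale. Requires variance > 0.
    scale = math.sqrt((1 / n + rho / (1 - rho)) * variance(diffs))
    below = _t_cdf((-rope - d_bar) / scale, df=n - 1)
    up_to_rope = _t_cdf((rope - d_bar) / scale, df=n - 1)
    return {
        "p_a_better": 1 - up_to_rope,
        "p_equivalent": up_to_rope - below,
        "p_b_better": below,
        "rope": rope,
        "mean_difference": d_bar,
    }


posterior = correlated_ttest_sketch(
    [0.80, 0.82, 0.78, 0.85, 0.81],
    [0.76, 0.80, 0.79, 0.80, 0.78],
    n_train=80,
    n_test=20,
)
print(posterior)
```

Unlike a p-value, the three probabilities sum to one and can be read directly: a large `p_equivalent` supports treating the models as interchangeable for practical purposes.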
- rank_models(metric, threshold='default')
Rank all registered models by mean outer-fold score.
Returns the same summary table as summary(), sorted in descending order of mean, with an additional rank column (1 = best).
- Parameters:
metric (str) – Scoring metric key.
threshold ({"default", "optimized"}, default="default") – Which score variant to use.
- Returns:
Sorted summary table with an extra rank column.
- Return type:
See also
summary : Unsorted summary table.
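The ranking rule itself is simple and can be sketched without the library. The helper below is hypothetical: it takes a plain mapping from model name to outer-fold scores, keeps only the `mean` column of the summary, and breaks ties by input order, which may differ from what `rank_models` does.

```python
from statistics import mean
from typing import Dict, List, Sequence


def rank_by_mean(scores: Dict[str, Sequence[float]]) -> List[dict]:
    """Sort models by mean outer-fold score, best first, and add a rank."""
    rows = [{"model": name, "mean": mean(vals)} for name, vals in scores.items()]
    rows.sort(key=lambda row: row["mean"], reverse=True)  # descending mean
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank  # 1 = best
    return rows


for row in rank_by_mean({
    "rf": [0.80, 0.82, 0.78],
    "svm": [0.76, 0.80, 0.79],
    "logreg": [0.81, 0.84, 0.80],
}):
    print(row["rank"], row["model"], round(row["mean"], 4))
```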