Model Comparison

Statistical comparison of nested cross-validation results from multiple models, including corrected paired tests and multiple-comparison adjustments.

NestedCVComparator

class nestkit.comparison.NestedCVComparator

Bases: object

Statistically rigorous comparison of nested cross-validation results.

Provides the corrected paired t-test (Nadeau & Bengio, 2003), the Bayesian correlated t-test (Benavoli et al., 2017), and Holm–Bonferroni multiple-comparison correction (Demšar, 2006) for comparing two or more models evaluated on identical outer folds.

All registered models must share the same outer-fold split indices; this is validated automatically when a new model is added.

_results

Mapping from model name to its nested CV results object.

Type:

dict[str, _BaseNestedCVResults]

Examples

>>> comparator = NestedCVComparator()
>>> comparator.add("rf", rf_results)
>>> comparator.add("svm", svm_results)
>>> comparator.summary("accuracy")

See also

nestkit.comparison.statistical_tests.nadeau_bengio_corrected_ttest, nestkit.comparison.statistical_tests.bayesian_correlated_ttest

References

[1]

Nadeau, C. and Bengio, Y. (2003). “Inference for the Generalization Error.” Machine Learning, 52(3), 239–281.

[2]

Benavoli, A. et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” JMLR, 18(77), 1–36.

[3]

Demšar, J. (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” JMLR, 7, 1–30.

add(name, results)

Register a model’s nested cross-validation results.

Parameters:
  • name (str) – Unique human-readable identifier for the model (e.g., "random_forest").

  • results (_BaseNestedCVResults) – Fitted nested CV results object. Its per-fold test_indices must match those of every previously registered model.

Raises:

ValueError – If the outer-fold structure of results does not match that of models already registered.

Return type:

None
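
The alignment requirement can be illustrated with a minimal sketch. The helper name folds_aligned and the plain list-of-lists representation of test_indices are assumptions made for illustration; the library's internal check (_validate_fold_alignment) may differ, for example in whether it ignores the order of indices within a fold.

```python
def folds_aligned(folds_a, folds_b):
    """Return True if two models were scored on identical outer test folds.

    Hypothetical helper: each argument is a list of per-fold test index lists.
    """
    if len(folds_a) != len(folds_b):
        return False
    # Compare fold by fold; index order inside a fold is ignored (an assumption).
    return all(sorted(a) == sorted(b) for a, b in zip(folds_a, folds_b))


rf_folds = [[0, 1], [2, 3], [4, 5]]
svm_folds = [[1, 0], [2, 3], [4, 5]]      # same folds, different index order
shifted_folds = [[0, 2], [1, 3], [4, 5]]  # different split

print(folds_aligned(rf_folds, svm_folds))
print(folds_aligned(rf_folds, shifted_folds))
```

Paired tests are only meaningful when the i-th score of each model comes from the same held-out samples, which is why a mismatch is treated as an error rather than a warning.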

See also

_validate_fold_alignment

Alignment check called internally.

summary(metric, threshold='default')

Produce a side-by-side summary table of all registered models.

For each model, the table includes the mean, standard deviation, median, 95% confidence interval (Nadeau–Bengio corrected, t-distribution), min, max, and interquartile range of the outer-fold scores.

Parameters:
  • metric (str) – Scoring metric key to summarise.

  • threshold ({"default", "optimized"}, default="default") – Which score variant to use.

Returns:

One row per model with columns model, mean, std, median, ci_lower, ci_upper, min, max, iqr.

Return type:

pandas.DataFrame

Examples

>>> comparator.summary("roc_auc")

corrected_paired_ttest(metric, model_a, model_b, threshold='default')

Perform the Nadeau–Bengio corrected paired t-test.

Accounts for the non-independence of cross-validation fold scores caused by overlapping training sets.

Parameters:
  • metric (str) – Scoring metric key.

  • model_a (str) – Name of the first model.

  • model_b (str) – Name of the second model.

  • threshold ({"default", "optimized"}, default="default") – Which score variant to use.

Returns:

Test results including t_statistic, p_value, mean_difference, corrected_std, ci_lower, ci_upper, n_folds, significant_at_005, significant_at_001.

Return type:

dict
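
The correction itself can be sketched in a few lines. This is an illustration of the published Nadeau–Bengio statistic, not the library's implementation: the function name corrected_t_statistic and the explicit n_train and n_test arguments are assumptions, and the scores are made-up numbers. The variance of the fold differences is inflated by the factor (1/k + n_test/n_train) instead of the naive 1/k.

```python
import math
import statistics


def corrected_t_statistic(scores_a, scores_b, n_train, n_test):
    """Nadeau-Bengio corrected t statistic for paired per-fold scores.

    Hypothetical helper; returns (t_statistic, mean_difference, corrected_std).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_diff = statistics.fmean(diffs)
    var_diff = statistics.variance(diffs)  # sample variance (n - 1 denominator)
    # The n_test / n_train term accounts for overlapping training sets.
    corrected_std = math.sqrt((1.0 / k + n_test / n_train) * var_diff)
    # Note: raises ZeroDivisionError if all fold differences are identical.
    return mean_diff / corrected_std, mean_diff, corrected_std


# Illustrative outer-fold scores for two models on the same 5 folds.
scores_a = [0.80, 0.82, 0.78, 0.81, 0.79]
scores_b = [0.75, 0.80, 0.76, 0.77, 0.77]

t_stat, mean_diff, corrected_std = corrected_t_statistic(
    scores_a, scores_b, n_train=80, n_test=20
)
print(round(t_stat, 3), round(mean_diff, 3), round(corrected_std, 5))
```

The statistic is referred to a Student-t distribution with k - 1 degrees of freedom; with SciPy available, a two-sided p-value could be obtained as 2 * scipy.stats.t.sf(abs(t_stat), k - 1). For these numbers the corrected statistic is about 3.16, against about 4.74 for the uncorrected paired t-test, which shows how much the naive test overstates the evidence.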

See also

nestkit.comparison.statistical_tests.nadeau_bengio_corrected_ttest

References

[1]

Nadeau, C. and Bengio, Y. (2003). “Inference for the Generalization Error.” Machine Learning, 52(3), 239–281.

pairwise_corrected_ttest(metric, threshold='default')

Run corrected paired t-tests for every model pair.

All C(n, 2) pairwise Nadeau–Bengio tests are performed and then adjusted for multiple comparisons via the step-down Holm–Bonferroni procedure.

Parameters:
  • metric (str) – Scoring metric key.

  • threshold ({"default", "optimized"}, default="default") – Which score variant to use.

Returns:

One row per pair with columns model_a, model_b, all keys from nadeau_bengio_corrected_ttest(), and p_value_corrected.

Return type:

pandas.DataFrame
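
The step-down adjustment can be sketched as follows. The function name holm_adjust is an assumption for illustration and is not the library's holm_bonferroni_correction; the p-values are made up. The raw p-values are sorted in ascending order, the i-th smallest is multiplied by (m - i + 1), and a running maximum keeps the adjusted values monotone.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order.

    Hypothetical helper illustrating the step-down procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p-value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Three pairwise tests, e.g. rf vs svm, rf vs lr, svm vs lr (illustrative values).
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

A pair is then declared significant at level alpha when its adjusted p-value is at most alpha, which controls the family-wise error rate across all C(n, 2) comparisons.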

See also

corrected_paired_ttest

Single-pair test.

nestkit.comparison.statistical_tests.holm_bonferroni_correction

References

[1]

Demšar, J. (2006). “Statistical Comparisons of Classifiers over Multiple Data Sets.” JMLR, 7, 1–30.

bayesian_comparison(metric, model_a, model_b, rope=0.01, threshold='default')

Perform a Bayesian correlated t-test between two models.

Uses a Student-t posterior over the mean score difference and partitions the probability mass into three regions: model A better, practically equivalent (within the ROPE), and model B better.

Parameters:
  • metric (str) – Scoring metric key.

  • model_a (str) – Name of the first model.

  • model_b (str) – Name of the second model.

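  • rope (float, default=0.01) – Half-width of the Region of Practical Equivalence, in units of the metric. Mean differences within [-rope, rope] are treated as practically equivalent.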

  • threshold ({"default", "optimized"}, default="default") – Which score variant to use.

Returns:

Posterior probabilities and diagnostics: p_a_better, p_equivalent, p_b_better, rope, mean_difference, hdi_lower, hdi_upper.

Return type:

dict
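
The three-way partition can be sketched under stated assumptions. The function names, the rho argument (the test-set fraction, 1/k for k-fold cross-validation), and the convention that differences are computed as A minus B with higher scores being better are all assumptions for illustration; the library's implementation may differ. To stay dependency-free, the Student-t CDF is integrated numerically here; scipy.stats.t.cdf would be the natural choice in practice.

```python
import math
import statistics


def student_t_cdf(x, df, n_steps=2000):
    """CDF of the standard Student-t, by Simpson integration of the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

    upper = abs(x)
    h = upper / n_steps  # n_steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, n_steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    half = total * h / 3  # probability mass between 0 and |x|
    return 0.5 + half if x >= 0 else 0.5 - half


def bayesian_correlated_ttest(diffs, rho, rope=0.01):
    """Posterior mass left of, inside, and right of the ROPE (hypothetical helper)."""
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    # The posterior scale uses the same correlation correction as the corrected t-test.
    scale = math.sqrt((1.0 / n + rho / (1.0 - rho)) * statistics.variance(diffs))
    df = n - 1
    cdf_low = student_t_cdf((-rope - mean_diff) / scale, df)
    cdf_high = student_t_cdf((rope - mean_diff) / scale, df)
    return {
        "p_b_better": cdf_low,               # difference below -rope
        "p_equivalent": cdf_high - cdf_low,  # difference inside the ROPE
        "p_a_better": 1.0 - cdf_high,        # difference above +rope
    }


# Illustrative per-fold differences (model A minus model B) from 5-fold CV.
diffs = [0.05, 0.02, 0.02, 0.04, 0.02]
posterior = bayesian_correlated_ttest(diffs, rho=0.2, rope=0.01)
print({key: round(value, 3) for key, value in posterior.items()})
```

Unlike a p-value, the three probabilities sum to one and can be read directly, for instance as the posterior probability that model A beats model B by more than the ROPE.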

See also

nestkit.comparison.statistical_tests.bayesian_correlated_ttest

References

[1]

Benavoli, A. et al. (2017). “Time for a Change: a Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” JMLR, 18(77), 1–36.

rank_models(metric, threshold='default')

Rank all registered models by mean outer-fold score.

Returns the same table as summary(), sorted by mean in descending order, with an additional rank column (1 = best).

Parameters:
  • metric (str) – Scoring metric key.

  • threshold ({"default", "optimized"}, default="default") – Which score variant to use.

Returns:

Sorted summary table with an extra rank column.

Return type:

pandas.DataFrame

See also

summary

Unsorted summary table.