Bootstrapping the AUC difference: is the gap real?
Paired bootstrap on the test set: in every iteration both models are scored on the same resampled rows, so the sampling noise they share cancels out of the difference. The verdict is read from the 95% confidence interval of the AUC difference.
Prerequisites
scikit-learn, numpy
Python
import numpy as np
from sklearn.metrics import roc_auc_score

# modele_a, modele_b: fitted classifiers; X_test, y_test: held-out test set
proba_a = modele_a.predict_proba(X_test)[:, 1]  # challenger
proba_b = modele_b.predict_proba(X_test)[:, 1]  # champion
y_true = np.asarray(y_test)  # works for a pandas Series or a numpy array

rng = np.random.default_rng(42)
n = len(y_true)
deltas = []
for _ in range(2000):
    idx = rng.integers(0, n, n)  # SAME draw for both models
    y_b = y_true[idx]
    # AUC is undefined when the resample holds a single class (0/1 labels)
    if y_b.sum() in (0, len(y_b)):
        continue
    deltas.append(roc_auc_score(y_b, proba_a[idx])
                  - roc_auc_score(y_b, proba_b[idx]))
deltas = np.array(deltas)

# Percentile interval of the bootstrapped differences
lo, hi = np.percentile(deltas, [2.5, 97.5])
print(f"AUC A {roc_auc_score(y_true, proba_a):.4f} | "
      f"AUC B {roc_auc_score(y_true, proba_b):.4f}")
print(f"delta AUC (A - B): {deltas.mean():+.4f}")
print(f"95% CI: [{lo:+.4f}, {hi:+.4f}]")
print("significant:", "YES" if lo > 0 or hi < 0 else "NO")
Result
AUC A 0.8714 | AUC B 0.8590
delta AUC (A - B): +0.0124
95% CI: [+0.0031, +0.0218]
significant: YES
The 95% CI does not contain 0, so the challenger can replace the champion. Because A and B are scored on the same draws, the paired bootstrap measures the difference between the models rather than the sampling noise of the test set.
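To see why the pairing matters, the following sketch compares the interval width of a paired and an unpaired bootstrap on synthetic data. Everything in it is an assumption made for illustration: the synthetic scores, the helper names `auc_rank` and `bootstrap_delta_ci`, and the choice to compute AUC from ranks with numpy alone (equivalent to the Mann-Whitney formulation when positive and negative scores are never tied) so the example runs without a fitted model.

```python
import numpy as np


def auc_rank(y, scores):
    """AUC from the Mann-Whitney rank formula (hypothetical helper).

    Exact as long as no positive and negative share a score. Bootstrap
    duplicates are copies of one row, so they always share a label.
    """
    ranks = np.argsort(np.argsort(scores)) + 1  # 1-based ranks
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_delta_ci(y, p_a, p_b, paired, n_boot=1000, seed=0):
    """95% percentile interval for AUC(A) - AUC(B)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    deltas = []
    for _ in range(n_boot):
        idx_a = rng.integers(0, n, n)
        # Paired: B reuses A's rows. Unpaired: B gets an independent draw.
        idx_b = idx_a if paired else rng.integers(0, n, n)
        if len(np.unique(y[idx_a])) < 2 or len(np.unique(y[idx_b])) < 2:
            continue  # AUC is undefined on a single-class resample
        deltas.append(auc_rank(y[idx_a], p_a[idx_a])
                      - auc_rank(y[idx_b], p_b[idx_b]))
    return np.percentile(deltas, [2.5, 97.5])


# Synthetic test set: two scorers that share most of their noise,
# with A separating the classes slightly better than B.
rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, n)
shared = rng.normal(size=n)  # noise common to both models
p_a = 1.2 * y + shared + 0.3 * rng.normal(size=n)
p_b = 1.0 * y + shared + 0.3 * rng.normal(size=n)

lo_p, hi_p = bootstrap_delta_ci(y, p_a, p_b, paired=True)
lo_u, hi_u = bootstrap_delta_ci(y, p_a, p_b, paired=False)
print(f"paired   CI width: {hi_p - lo_p:.4f}")
print(f"unpaired CI width: {hi_u - lo_u:.4f}")
```

When the two models' scores are strongly correlated, as they usually are for models trained on the same features, the paired interval should come out noticeably narrower, which is what lets a small AUC gap be detected at all.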
Bootstrap, AUC, Comparison, Confidence interval