Machine Learning

Machine Learning in the real world

The notebook of an ML practitioner who learned to distrust their own numbers. Three running themes: hunting down data leakage and inflated scores, statistical rigor in model comparison (paired bootstrap, 5x2cv, McNemar), and the leap from an academic score to a business decision (cost-based threshold, audited calibration). We show the buggy code AND the correct code, and always quantify the gap.

20 featured snippets

Native TargetEncoder: target encoding without leakage
scikit-learn's TargetEncoder applies internal cross-fitting during fit_transform: each row is encoded with target means computed on the other folds, which keeps a row's own target out of its encoding.
Purged cross-validation with embargo (finance)
A temporal fold generator that removes observations adjacent to the test fold: essential when labels span multiple periods (H-horizon returns) and overlap.
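One way such a generator could look, as a sketch rather than the notebook's own code. It assumes samples are ordered in time and that the label of sample t is built from data in [t, t + horizon]; the function name and the `horizon` and `embargo` parameters are illustrative.

```python
import numpy as np

def purged_kfold(n_samples, n_splits=5, horizon=5, embargo=2):
    """Yield (train_idx, test_idx) with purging and embargo around each test fold.

    Assumption: the label of sample t uses data from t to t + horizon.
    """
    indices = np.arange(n_samples)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1]
        # Purge before the fold: drop samples whose label window reaches `start`.
        before = indices[indices < start - horizon]
        # After the fold: test labels run up to stop + horizon, so skip that
        # window and add `embargo` extra samples as a safety margin.
        after = indices[indices > stop + horizon + embargo]
        yield np.concatenate([before, after]), test_idx

splits = list(purged_kfold(100, n_splits=5, horizon=5, embargo=2))
for train_idx, test_idx in splits:
    print(f"test [{test_idx[0]:3d}, {test_idx[-1]:3d}]  train size {len(train_idx)}")
```

The training set shrinks by roughly 2 x horizon + embargo samples per fold, which is the price of label windows that no longer straddle the split.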
Nested CV: estimate performance AFTER tuning
A GridSearchCV score is optimistic because the hyperparameters were chosen on those very folds. The outer loop of a nested CV gives a nearly unbiased estimate of the full procedure, tuning included.
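A short sketch of the nested setup with scikit-learn, on a synthetic dataset. The model, grid, and fold counts are placeholders; the point is only that the whole GridSearchCV object is passed to cross_val_score, so tuning is redone inside every outer training fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=inner,
)

# Optimistic number: the best score on the folds that also picked C.
search.fit(X, y)
print(f"GridSearchCV best_score_: {search.best_score_:.3f}")

# Nested estimate: the search itself is the estimator being cross-validated.
nested_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
print(f"nested CV AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```

On an easy dataset with a small grid the two numbers can be close; the gap tends to grow with the size of the search space and shrink with the number of rows.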
Threshold by minimizing expected business cost
When a false negative costs 50 times more than a false positive, the right threshold can't be read off any standard curve: you minimize the expected total cost directly on the validation set.
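The search itself fits in a few lines. The sketch below assumes a binary problem, a validation set with calibrated-enough scores, and a 50:1 cost ratio taken from the description; `best_cost_threshold` and the synthetic scores are illustrative.

```python
import numpy as np

def best_cost_threshold(y_true, proba, cost_fn=50.0, cost_fp=1.0):
    """Return (threshold, total_cost) minimizing total cost on a validation set."""
    # Candidate thresholds: every observed score, plus "flag all" and "flag none".
    thresholds = np.unique(np.concatenate([[0.0], proba, [1.0 + 1e-9]]))
    costs = []
    for t in thresholds:
        pred = proba >= t
        fn = np.sum((y_true == 1) & ~pred)   # missed positives
        fp = np.sum((y_true == 0) & pred)    # false alarms
        costs.append(cost_fn * fn + cost_fp * fp)
    i = int(np.argmin(costs))
    return float(thresholds[i]), float(costs[i])

rng = np.random.default_rng(0)
y_val = rng.binomial(1, 0.05, size=5000)
# Hypothetical scores: positives centered at 0.6, negatives at 0.2.
proba_val = np.clip(rng.normal(0.2 + 0.4 * y_val, 0.15), 0.0, 1.0)

thr, cost = best_cost_threshold(y_val, proba_val)
print(f"cost-optimal threshold: {thr:.3f}, total cost: {cost:.0f}")
```

Because the threshold is tuned on the validation set, the cost it reports is itself optimistic; the honest figure comes from applying the frozen threshold to a separate test set.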
PSI: detect feature drift in production
The Population Stability Index compares a variable's distribution between training and production. Common rule-of-thumb thresholds: < 0.10 stable, 0.10-0.25 watch, > 0.25 major drift.
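For reference, the index is the sum over bins of (actual share - expected share) x ln(actual share / expected share). A minimal sketch for one numeric feature follows, assuming decile bins taken from the training sample and a small floor `eps` to avoid log(0); both choices are conventions, not part of the definition.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected` (1-D arrays)."""
    # Inner bin edges from the training quantiles; the outer bins are open-ended.
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_counts = np.bincount(np.searchsorted(inner, expected, side="right"),
                           minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(inner, actual, side="right"),
                           minlength=n_bins)
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
prod_ok = rng.normal(0.0, 1.0, 10_000)        # same distribution
prod_shifted = rng.normal(1.0, 1.0, 10_000)   # mean moved by one standard deviation

psi_ok = psi(train, prod_ok)
psi_shift = psi(train, prod_shifted)
print(f"PSI, no drift: {psi_ok:.4f}")
print(f"PSI, shifted:  {psi_shift:.4f}")
```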
Adversarial validation: are train and test comparable?
Train a classifier to tell train from production: an AUC near 0.5 means similar distributions; above 0.7, the most important features point to the source of the drift.
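A sketch of the procedure with a random forest as the train-versus-production classifier. The choice of model, the out-of-fold scoring, and the synthetic drift on column 2 are assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_auc(X_train, X_prod, seed=0):
    """Out-of-fold AUC of a train-vs-production classifier, plus importances."""
    X = np.vstack([X_train, X_prod])
    is_prod = np.r_[np.zeros(len(X_train)), np.ones(len(X_prod))]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 random_state=seed)
    # Out-of-fold probabilities, so the AUC is not inflated by memorization.
    oof = cross_val_predict(clf, X, is_prod, cv=5, method="predict_proba")[:, 1]
    clf.fit(X, is_prod)
    return roc_auc_score(is_prod, oof), clf.feature_importances_

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
X_same = rng.normal(size=(1000, 5))
X_drift = rng.normal(size=(1000, 5))
X_drift[:, 2] += 1.5   # hypothetical drift on a single feature

auc_same, _ = adversarial_auc(X_train, X_same)
auc_drift, importances = adversarial_auc(X_train, X_drift)
print(f"AUC, same distribution: {auc_same:.3f}")
print(f"AUC, drifted:           {auc_drift:.3f}")
print(f"most suspicious feature index: {int(np.argmax(importances))}")
```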
Export a full pipeline to ONNX and verify parity
Converting the entire scikit-learn pipeline (preprocessing included) to ONNX, then asserting numerical parity between the sklearn and onnxruntime outputs: the step you regret skipping.
Serialize the model WITH its traceability metadata
A bare .joblib is a time bomb: embedding version, date, data hash, metrics, and expected columns in the same artifact makes every model auditable.
Monotonicity constraints: inject domain knowledge
Forcing the model to respect known relationships (more debt never lowers risk): free regularization, robustness to noise, and a model you can defend before a committee.
Leak fixed: feature selection on the whole dataset
Selecting features correlated with the target BEFORE cross-validation yields absurdly high AUC on pure noise. A worked, numbers-backed demonstration, then the fix via a pipeline.
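A compact version of that demonstration, under assumed sizes (200 rows, 5000 noise features, top 20 kept). Labels are drawn independently of the features, so the honest AUC is 0.5 and anything well above it is leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # pure noise
y = rng.integers(0, 2, size=200)   # labels independent of X
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Buggy: the selector sees every label, including those of future test folds.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
auc_leaky = cross_val_score(LogisticRegression(max_iter=1000), X_sel, y,
                            cv=cv, scoring="roc_auc").mean()

# Fixed: selection lives inside the pipeline and is refit on each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
auc_clean = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

print(f"selection before CV: AUC = {auc_leaky:.3f}")
print(f"selection inside CV: AUC = {auc_clean:.3f}")
```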
Gradient boosting showdown: XGBoost, LightGBM, CatBoost, HistGB
Benchmark of the four major gradient boosting implementations on the same tabular dataset: AUC, training time, prediction latency and serialized model size, all in a single decision table.
Bootstrapping the AUC difference: is the gap real?
PAIRED bootstrap on the test set: both models are evaluated on the same draws, which isolates the quality difference from plain sampling noise. Verdict by confidence interval.
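A NumPy-only sketch of the paired bootstrap, assuming a percentile confidence interval and a rank-based AUC; `auc`, `paired_bootstrap_auc_diff`, and the two synthetic score vectors are illustrative. The pairing comes from reusing the same resampled indices for both models.

```python
import numpy as np

def auc(y, s):
    """AUC from the Mann-Whitney rank statistic (average ranks for ties)."""
    _, inv, counts = np.unique(s, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inv]
    n_pos = int(np.sum(y == 1))
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)

def paired_bootstrap_auc_diff(y, s_a, s_b, n_boot=1000, alpha=0.05, seed=0):
    """Return (observed AUC_a - AUC_b, CI lower bound, CI upper bound)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)   # one draw, shared by both models
        yb = y[idx]
        if yb.min() == yb.max():           # AUC needs both classes present
            continue
        diffs.append(auc(yb, s_a[idx]) - auc(yb, s_b[idx]))
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return auc(y, s_a) - auc(y, s_b), float(lo), float(hi)

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)
y = (z + rng.normal(size=n) > 0.5).astype(int)
s_a = z + 0.3 * rng.normal(size=n)   # hypothetical model A: low-noise scores
s_b = z + 2.0 * rng.normal(size=n)   # hypothetical model B: noisy scores

diff, lo, hi = paired_bootstrap_auc_diff(y, s_a, s_b)
print(f"AUC difference: {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the gap is unlikely to be sampling noise on this test set; it says nothing about seed-to-seed training variance, which has to be measured separately.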
Seed variance: the same model trained ten times
Ten runs identical except for the seed quantify the model's inherent noise: any tuning gain smaller than this variance is indistinguishable from chance. A guardrail to compute once per project.
SMOTE distorts probabilities: proof and fix
A numbers-backed demonstration: after 50/50 rebalancing, predicted probabilities are 6x too high. The analytic prior correction (Elkan, 2001) brings them back to the true rate without retraining.
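The correction is a reweighting of the odds by the ratio of true to training base rates. A minimal sketch, with `correct_prior` as an illustrative name and a 5% true positive rate as an assumed example; it is exact for plain resampling and only approximate for SMOTE, whose synthetic points also change the shape of the positive class.

```python
import numpy as np

def correct_prior(p_resampled, train_rate, true_rate):
    """Map probabilities learned at base rate `train_rate` back to `true_rate`."""
    pos = p_resampled * (true_rate / train_rate)
    neg = (1.0 - p_resampled) * ((1.0 - true_rate) / (1.0 - train_rate))
    return pos / (pos + neg)

# Scores from a model trained on a 50/50 rebalanced set, true rate assumed 5%.
p_model = np.array([0.1, 0.5, 0.9])
p_true = correct_prior(p_model, train_rate=0.5, true_rate=0.05)
print(np.round(p_true, 4))
```

A score of 0.5 on the balanced set maps back to the base rate itself, which is a quick sanity check for any implementation. The mapping is monotone, so ranking metrics such as AUC are unchanged; only calibration moves.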
Split conformal prediction: a guaranteed 90% interval
Split conformal prediction in 12 lines: the quantile of the residuals from a dedicated calibration set yields an interval whose marginal coverage is guaranteed for any model, provided the calibration and test data are exchangeable.
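A sketch of the recipe for regression with absolute residuals as the conformity score. The linear model fitted with np.polyfit stands in for any regressor, and the data is synthetic. The quantile level uses the usual finite-sample correction ceil((n + 1)(1 - alpha)) / n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000
x = rng.uniform(-3, 3, n)
y = 2.0 * x + rng.normal(0, 1, n)

# Three disjoint parts: fit the model, calibrate the interval, then evaluate.
x_tr, y_tr = x[:1000], y[:1000]
x_cal, y_cal = x[1000:2000], y[1000:2000]
x_te, y_te = x[2000:], y[2000:]

coef = np.polyfit(x_tr, y_tr, deg=1)   # stand-in for any fitted model

def predict(v):
    return np.polyval(coef, v)

alpha = 0.10
scores = np.abs(y_cal - predict(x_cal))   # conformity scores on calibration data
n_cal = len(scores)
level = min(1.0, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)
q = np.quantile(scores, level, method="higher")

lower, upper = predict(x_te) - q, predict(x_te) + q
coverage = float(np.mean((y_te >= lower) & (y_te <= upper)))
print(f"interval half-width: {q:.3f}, empirical coverage: {coverage:.3f}")
```

The guarantee is on average over calibration and test draws, so the empirical coverage of one test set fluctuates around 90%. The interval has constant width here; adapting it to x requires a different score.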
Leak via duplicates: test rows already seen in train
Hashing rows to detect exact duplicates between train and test, then AUC recomputed with and without them: the difference pins down exactly how much the reported score was inflated.
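The detection step can be sketched as follows, assuming numeric feature matrices with identical dtype and column order in train and test (hashing raw bytes is sensitive to both). The helper names and the 30 contaminated rows are illustrative.

```python
import hashlib
import numpy as np

def row_hashes(X):
    """SHA-1 of the raw bytes of each row."""
    X = np.ascontiguousarray(X)
    return [hashlib.sha1(row.tobytes()).hexdigest() for row in X]

def duplicate_mask(X_train, X_test):
    """Boolean mask over test rows that also appear, exactly, in train."""
    seen = set(row_hashes(X_train))
    return np.array([h in seen for h in row_hashes(X_test)])

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
X_test = rng.normal(size=(200, 4))
X_test[:30] = X_train[:30]   # simulate contamination

mask = duplicate_mask(X_train, X_test)
print(f"{int(mask.sum())} of {len(mask)} test rows already seen in train")
```

With the mask in hand, the metric is computed once on all test rows and once on the rows where the mask is False; the gap between the two is the inflation.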
Suspect labels: detection via cross-validated confidence (cleanlab-style)
Out-of-fold predictions give the probability the model assigns to the observed label: rows where this confidence is tiny are candidates for mislabeling, ranked for human review.
Null importances: is the importance significant?
Fifty models trained on a shuffled target build the null distribution of each feature's importance: only a real importance that exceeds the 95th null percentile counts as evidence of genuine signal.
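A sketch of the test with a small random forest and its impurity-based importances; the model, its settings, and the synthetic data (two informative features out of six) are assumptions, and any importance measure could be swapped in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def null_importance_test(X, y, n_null=50, seed=0):
    """Compare real importances with the 95th percentile of shuffled-target runs."""
    rng = np.random.default_rng(seed)

    def importances(target, model_seed):
        model = RandomForestClassifier(n_estimators=50, max_depth=5,
                                       random_state=model_seed)
        return model.fit(X, target).feature_importances_

    real = importances(y, seed)
    # Shuffling y breaks every feature-target link but keeps the class balance.
    null = np.array([importances(rng.permutation(y), seed + 1 + i)
                     for i in range(n_null)])
    threshold = np.percentile(null, 95, axis=0)
    return real, threshold, real > threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# Only features 0 and 1 drive the target; the other four are noise.
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, 500) > 0).astype(int)

real, threshold, significant = null_importance_test(X, y)
for j in range(X.shape[1]):
    print(f"feature {j}: real={real[j]:.3f}  null p95={threshold[j]:.3f}  "
          f"significant={bool(significant[j])}")
```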
Evaluating an ML trading signal: win rate, profit factor, expectancy
AUC doesn't pay the bills: turning a model's probabilities into trading metrics such as number of trades, win rate, profit factor, expectancy and maximum drawdown of the equity curve.
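A minimal sketch of the conversion, assuming a long-only rule (enter when the probability reaches a threshold), one unit per trade, additive returns, and no transaction costs. The function name, the 0.6 threshold, and the five sample trades are illustrative.

```python
import numpy as np

def trading_metrics(proba, returns, threshold=0.6):
    """Trading metrics for a rule that trades whenever proba >= threshold."""
    trades = returns[proba >= threshold]
    gross_profit = trades[trades > 0].sum()
    gross_loss = -trades[trades < 0].sum()
    # Equity curve starts at 0; drawdown is the distance below the running peak.
    equity = np.concatenate([[0.0], np.cumsum(trades)])
    drawdown = np.maximum.accumulate(equity) - equity
    return {
        "n_trades": int(len(trades)),
        "win_rate": float(np.mean(trades > 0)) if len(trades) else float("nan"),
        "profit_factor": float(gross_profit / gross_loss) if gross_loss > 0 else float("inf"),
        "expectancy": float(trades.mean()) if len(trades) else float("nan"),
        "max_drawdown": float(drawdown.max()),
    }

proba = np.array([0.70, 0.20, 0.80, 0.90, 0.65])
returns = np.array([0.02, -0.05, -0.01, 0.03, -0.02])
metrics = trading_metrics(proba, returns)
print(metrics)
```

Profit factor is gross profit divided by gross loss, and expectancy is the mean return per trade taken; both depend on the threshold, so the threshold should be fixed before the evaluation period.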
Recall at a fixed false positive rate: the fraud metric
An operational reading of the ROC curve: for each false positive budget (0.1%, 0.5%, 1%, 5%), the threshold to apply, the recall obtained and the daily alert volume the team will have to absorb.
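The table can be built directly from the scores of the negative class. In the sketch below the budgets come from the description, while `daily_volume` (transactions scored per day), the 1% fraud rate, and the score distributions are assumptions for the example.

```python
import numpy as np

def recall_at_fpr(y_true, scores, budgets=(0.001, 0.005, 0.01, 0.05),
                  daily_volume=100_000):
    """Rows of (budget, threshold, realized FPR, recall, expected alerts per day)."""
    neg = np.sort(scores[y_true == 0])
    pos = scores[y_true == 1]
    prevalence = float(np.mean(y_true))
    rows = []
    for budget in budgets:
        k = int(budget * len(neg))        # negatives we can afford to flag
        thr = neg[len(neg) - k - 1]       # alert when score > thr
        fpr = float(np.mean(neg > thr))
        recall = float(np.mean(pos > thr))
        alerts = daily_volume * (prevalence * recall + (1 - prevalence) * fpr)
        rows.append((budget, float(thr), fpr, recall, int(round(alerts))))
    return rows

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.01, size=50_000)   # hypothetical 1% fraud rate
scores = rng.normal(2.5 * y, 1.0)        # frauds score higher on average

table = recall_at_fpr(y, scores)
for budget, thr, fpr, recall, alerts in table:
    print(f"FPR budget {budget:6.3%}  threshold {thr:6.3f}  "
          f"recall {recall:6.1%}  alerts/day {alerts}")
```

The thresholds are estimated on the same data they are evaluated on; in practice they would be set on a validation set and the realized false positive rate checked on later data.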
