Schema-drift detector between two extracts

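Compares the columns and dtypes of two monthly extracts of the same feed and reports added columns, removed columns, changed dtypes, and a verdict on whether the drift blocks the downstream pipeline. Removed columns and changed dtypes count as blocking; added columns alone do not.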

Prerequisites

Python 3.9+, pandas

Python

import pandas as pd

# Read only the first 500 rows of each extract. That is enough to get the
# column names, but dtypes are inferred from this sample alone.
ancien = pd.read_csv("export_2026-05.csv", nrows=500)   # previous extract
nouveau = pd.read_csv("export_2026-06.csv", nrows=500)  # new extract

# Column-level drift: added, removed, and shared columns.
ajoutees = sorted(set(nouveau.columns) - set(ancien.columns))
retirees = sorted(set(ancien.columns) - set(nouveau.columns))
communes = set(ancien.columns) & set(nouveau.columns)

# Shared columns whose dtype differs: (column, old dtype, new dtype).
types_changes = [(c, str(ancien[c].dtype), str(nouveau[c].dtype))
                 for c in sorted(communes)
                 if ancien[c].dtype != nouveau[c].dtype]

# Report. Removed columns or changed dtypes are blocking; added ones are not.
print("Schema drift: May → June export")
print(f"  added columns   : {ajoutees or 'none'}")
print(f"  removed columns : {retirees or 'none'}")
print(f"  changed dtypes  : {len(types_changes)}")
for col, t0, t1 in types_changes:
    print(f"    - {col:<14} {t0:>8} → {t1}")
print("verdict:", "BLOCKING for the pipeline"
      if retirees or types_changes else "compatible, nothing to do")

Result

Schema drift: May → June export
  added columns   : ['canal_vente', 'code_promo']
  removed columns : ['ancien_ref']
  changed dtypes  : 2
    - code_postal       int64 → object
    - montant         float64 → object
verdict: BLOCKING for the pipeline
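
Reusable variant (sketch)

The comparison above can also be wrapped in a function that takes two DataFrames, which makes it possible to try the logic without the CSV files. This is only a minimal sketch: the function name `schema_drift`, the keys of the returned dict, and the two small in-memory frames are assumptions made for illustration and are not part of the original snippet. It compares dtype names as strings, a slight variation on the dtype comparison used above.

```python
import pandas as pd


def schema_drift(old: pd.DataFrame, new: pd.DataFrame) -> dict:
    """Hypothetical helper: report column and dtype drift between two frames."""
    old_cols, new_cols = set(old.columns), set(new.columns)
    added = sorted(new_cols - old_cols)
    removed = sorted(old_cols - new_cols)

    # Shared columns whose dtype name differs: (column, old dtype, new dtype).
    changed = []
    for col in sorted(old_cols & new_cols):
        t_old, t_new = str(old[col].dtype), str(new[col].dtype)
        if t_old != t_new:
            changed.append((col, t_old, t_new))

    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        # Same rule as the snippet: added columns alone do not block.
        "blocking": bool(removed or changed),
    }


# Made-up frames standing in for the two extracts (illustrative data only).
may = pd.DataFrame({"ancien_ref": ["A1"], "montant": [12.5]})
june = pd.DataFrame({"montant": ["12,5"], "canal_vente": ["web"]})

report = schema_drift(may, june)
print("added   :", report["added"])
print("removed :", report["removed"])
print("changed :", [col for col, _, _ in report["changed"]])
print("blocking:", report["blocking"])
```

Returning a dict rather than printing keeps the verdict usable by a calling script, for example to stop a pipeline run when `report["blocking"]` is true.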

Schema, Data quality, pandas, Pipeline
