True time window with rangeBetween over a timestamp
Aggregate over the last rolling 24 hours (not the last N rows) by converting the timestamp to seconds.
Prerequisites
PySpark 3.x
Python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
DAY = 24 * 3600
w_24h = (
Window.partitionBy("card_id")
.orderBy(F.col("event_ts").cast("long")) # rangeBetween exige un numérique
.rangeBetween(-DAY, 0)
)
df_velocity = (
df.withColumn("tx_24h", F.count("*").over(w_24h))
.withColumn("amount_24h", F.sum("amount").over(w_24h))
.withColumn("suspect", (F.col("tx_24h") > 20) | (F.col("amount_24h") > 5000))
)Result
+--------+-------------------+------+------+----------+-------+ | card_id| event_ts|amount|tx_24h|amount_24h|suspect| +--------+-------------------+------+------+----------+-------+ |CARD-042|2026-06-09 08:12:44| 89.90| 3| 240.40| false| |CARD-042|2026-06-09 21:37:02|450.00| 24| 5840.90| true| |CARD-911|2026-06-09 23:58:51| 12.50| 2| 62.00| false| +--------+-------------------+------+------+----------+-------+
PySparkWindowrangeBetweenFraude