Spark

monotonically_increasing_id: unique yes, consecutive no

The id encodes the partition number in its high bits: values jump by billions between partitions. A global row_number is consecutive but serializes everything.

Prerequisites

PySpark 3.x

Python

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Unique et croissant PAR partition — PAS consécutif globalement
df_id = df.withColumn("mono_id", F.monotonically_increasing_id())
df_id.select("order_id", "mono_id").show(4)

# Consécutif global : row_number sans partitionBy
# => tout le DataFrame passe sur UNE tâche, réserver aux petits volumes
df_seq = df.withColumn(
    "seq_id", F.row_number().over(Window.orderBy("created_at"))
)
# Ne JAMAIS utiliser mono_id comme clé de join entre deux exécutions :
# la valeur dépend du découpage en partitions, donc non reproductible.

Result

+--------+----------+
|order_id|   mono_id|
+--------+----------+
|   A-101|         0|
|   A-102|         1|
|   B-501|8589934592|
|   B-502|8589934593|
+--------+----------+
only showing top 4 rows

PySparkmonotonically_increasing_idSurrogate keyPièges

Related snippets

← Back to the Data Lab