Bucketing: pre-shuffling a table that is joined repeatedly
When two tables are bucketed on the same key with the same number of buckets, later joins and aggregations on that key avoid the shuffle.
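Why the key and the bucket count must both match can be sketched in plain Python. This is only an illustration: `bucket_of` and the sample keys are hypothetical, and CRC32 stands in for Spark's own Murmur3-based hash, so the bucket numbers here will not match what Spark writes.

```python
import zlib

NUM_BUCKETS = 64  # same value on both sides, as in the recipe


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hypothetical helper: map a key to a bucket id with hash(key) mod n.

    CRC32 is a stand-in hash chosen for determinism; Spark uses a different one.
    """
    return zlib.crc32(str(key).encode()) % num_buckets


# Made-up sample keys for the two tables
facts_keys = [101, 202, 303, 101, 202]
customer_keys = [101, 202, 303]

facts_buckets = {k: bucket_of(k) for k in facts_keys}
customer_buckets = {k: bucket_of(k) for k in customer_keys}

# Same key + same bucket count => same bucket id in both tables,
# so bucket i of one table only ever needs bucket i of the other.
same_bucket = all(facts_buckets[k] == customer_buckets[k] for k in customer_keys)

# With a different bucket count on one side, some keys land in different
# bucket ids, so the pairing by bucket id no longer holds.
mismatched = [k for k in range(1000) if bucket_of(k, 64) != bucket_of(k, 48)]

print(same_bucket, len(mismatched) > 0)
```

Because rows with equal keys already sit in buckets with the same id, the engine can join bucket against bucket without moving rows between partitions first.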
Requirements
PySpark 3.x, metastore (Hive/Glue)
Python
# facts and customers are existing DataFrames; spark is the active SparkSession
(
    facts.write
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("silver.facts_bucketed")  # bucketing requires saveAsTable
)
(
    customers.write
    .bucketBy(64, "customer_id")  # same key, same number of buckets
    .mode("overwrite")
    .saveAsTable("silver.customers_bucketed")
)
# The join below should trigger NO Exchange (verify with explain())
j = (
    spark.table("silver.facts_bucketed")
    .join(spark.table("silver.customers_bucketed"), "customer_id")
)
Result
>>> j.explain()
== Physical Plan ==
*(3) SortMergeJoin [customer_id#4], [customer_id#21], Inner
:- *(1) Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- FileScan parquet silver.facts_bucketed ... SelectedBucketsCount: 64 out of 64
+- *(2) Sort [customer_id#21 ASC NULLS FIRST], false, 0
   +- FileScan parquet silver.customers_bucketed ... SelectedBucketsCount: 64 out of 64
No Exchange in the plan: zero shuffle for the join.
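Reading the plan by eye is easy to get wrong on larger queries, so the check can be automated. The sketch below is one possible approach, not part of the recipe: `plan_text` and `has_exchange` are hypothetical helpers, `plan_text` assumes that `DataFrame.explain()` prints the plan to stdout, and the second sample plan is an illustrative line showing what a shuffle operator typically looks like, not output from the tables above.

```python
import contextlib
import io
import re


def plan_text(df) -> str:
    """Hypothetical helper: capture what df.explain() prints.

    Assumes explain() writes the physical plan to stdout.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def has_exchange(plan: str) -> bool:
    """Return True if the plan text contains a standalone Exchange operator.

    The word boundary keeps BroadcastExchange and ReusedExchange from
    matching, so only shuffle exchanges are reported.
    """
    return re.search(r"\bExchange\b", plan) is not None


# Plan copied from the Result section above
BUCKETED_PLAN = """\
*(3) SortMergeJoin [customer_id#4], [customer_id#21], Inner
:- *(1) Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- FileScan parquet silver.facts_bucketed ... SelectedBucketsCount: 64 out of 64
+- *(2) Sort [customer_id#21 ASC NULLS FIRST], false, 0
   +- FileScan parquet silver.customers_bucketed ... SelectedBucketsCount: 64 out of 64
"""

# Illustrative fragment only: the general shape of a shuffle in a plan
SHUFFLED_PLAN = """\
:- *(2) Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(customer_id#4, 200)
"""

print(has_exchange(BUCKETED_PLAN), has_exchange(SHUFFLED_PLAN))
```

With a live session, `assert not has_exchange(plan_text(j))` could serve as a regression check that the join stays shuffle-free after the tables are rewritten.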
PySpark, Bucketing, Shuffle, Metastore