Spark

Bucketing: pre-shuffling a table that is joined repeatedly

When two tables are bucketed on the same key with the same number of buckets, later joins and aggregations on that key can skip the shuffle.
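Why matching buckets remove the shuffle can be sketched in plain Python, with no Spark involved. This is a minimal illustration only: `bucket_of`, `bucketize` and `bucketed_join` are hypothetical helpers, the sample rows are made up, and CRC32 is a stand-in hash (Spark uses its own Murmur3-based hash for bucket assignment). The point is that a given key always lands in the same bucket number on both sides, so bucket i only ever needs to meet bucket i.

```python
import zlib

NUM_BUCKETS = 64  # same bucket count on both sides, as in the snippet


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash (CRC32), NOT the hash Spark uses for bucketing.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    # Pre-partition rows once by the join key (first field of each row).
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets):
    # Join bucket i with bucket i only: no row has to move between buckets.
    out = []
    for left, right in zip(left_buckets, right_buckets):
        right_by_key = {}
        for r in right:
            right_by_key.setdefault(r[0], []).append(r)
        for l in left:
            for r in right_by_key.get(l[0], []):
                out.append(l + r[1:])
    return out


# Made-up sample data: (customer_id, amount) and (customer_id, name).
facts = [(1, 9.5), (2, 3.0), (1, 4.25), (3, 7.0)]
customers = [(1, "Ana"), (2, "Bo"), (4, "Cy")]

joined = bucketed_join(bucketize(facts), bucketize(customers))
```

If the two sides used different bucket counts or different keys, matching rows could sit in different bucket numbers and the bucket-by-bucket join above would miss them, which is why Spark falls back to a shuffle in that case.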

Requirements

PySpark 3.x, metastore (Hive/Glue)

Python
# Assumes `spark` is an active SparkSession and `facts` / `customers` are existing DataFrames.
(
    facts.write
    .bucketBy(64, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("silver.facts_bucketed")     # bucketBy requires saveAsTable, not save()
)

(
    customers.write
    .bucketBy(64, "customer_id")              # same key, same bucket count
    .mode("overwrite")
    .saveAsTable("silver.customers_bucketed")
)

# The following join should trigger NO Exchange (verify with explain())
j = spark.table("silver.facts_bucketed") \
         .join(spark.table("silver.customers_bucketed"), "customer_id")

Result

>>> j.explain()
== Physical Plan ==
*(3) SortMergeJoin [customer_id#4], [customer_id#21], Inner
:- *(1) Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- FileScan parquet silver.facts_bucketed ... SelectedBucketsCount: 64 out of 64
+- *(2) Sort [customer_id#21 ASC NULLS FIRST], false, 0
   +- FileScan parquet silver.customers_bucketed ... SelectedBucketsCount: 64 out of 64

No Exchange in the plan: zero shuffle on the join.
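The same check can be automated instead of read by eye. The sketch below is an assumption-laden helper, not part of the snippet: `plan_text` and `shuffle_exchanges` are hypothetical names, `plan_text` assumes `explain()` writes through Python's `print` (so `redirect_stdout` can capture it), and `FakeDataFrame` plus the hand-written shuffled plan exist only so the example runs without a Spark session. The bucketed plan is abbreviated from the output shown above.

```python
import contextlib
import io
import re


def plan_text(df):
    # Capture whatever df.explain() prints; works with any object exposing explain().
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def shuffle_exchanges(plan):
    # Keep plan lines with a standalone "Exchange" operator.
    # \b excludes BroadcastExchange and ReusedExchange, which are different operators.
    return [ln.strip() for ln in plan.splitlines() if re.search(r"\bExchange\b", ln)]


class FakeDataFrame:
    # Hypothetical stand-in for a Spark DataFrame: explain() just prints a plan.
    def __init__(self, plan):
        self._plan = plan

    def explain(self):
        print(self._plan)


# Abbreviated from the plan in the Result section.
BUCKETED_PLAN = """== Physical Plan ==
*(3) SortMergeJoin [customer_id#4], [customer_id#21], Inner
:- *(1) Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- FileScan parquet silver.facts_bucketed
+- *(2) Sort [customer_id#21 ASC NULLS FIRST], false, 0
   +- FileScan parquet silver.customers_bucketed"""

# Hand-written illustration of a non-bucketed join, not real Spark output.
SHUFFLED_PLAN = """== Physical Plan ==
SortMergeJoin [customer_id#4], [customer_id#21], Inner
:- Sort [customer_id#4 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(customer_id#4, 200)
+- Sort [customer_id#21 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(customer_id#21, 200)"""

bucketed_hits = shuffle_exchanges(plan_text(FakeDataFrame(BUCKETED_PLAN)))
shuffled_hits = shuffle_exchanges(plan_text(FakeDataFrame(SHUFFLED_PLAN)))
```

With a real session, `shuffle_exchanges(plan_text(j))` would be the call to try on the join from the snippet; an empty list means no shuffle operator was found in the printed plan.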
