pyspark.sql.functions.tuple_union_integer

pyspark.sql.functions.tuple_union_integer(col1, col2, lgNomEntries=None, mode=None)

Returns the union of two DataSketches TupleSketch objects with integer summaries.

New in version 4.2.0.

Parameters
col1 : Column or column name

The first TupleSketch column.

col2 : Column or column name

The second TupleSketch column.

lgNomEntries : Column or int, optional

The base-2 logarithm of the nominal number of entries. Must be between 4 and 26; defaults to 12 (2^12 = 4096 nominal entries).

mode : Column or str, optional

The summary mode, which controls how the integer summaries of a key present in both sketches are combined: "sum" (default), "min", "max", or "alwaysone".
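
As a rough illustration of the lgNomEntries parameter, the following sketch converts the exponent into the nominal entry count it implies. The helper name nominal_entries is hypothetical and not part of PySpark, and treating both ends of the 4 to 26 range as inclusive is an assumption.

```python
def nominal_entries(lg_nom_entries: int = 12) -> int:
    """Return 2 ** lg_nom_entries after checking the documented range."""
    # Assumption: both bounds (4 and 26) are inclusive.
    if not 4 <= lg_nom_entries <= 26:
        raise ValueError("lgNomEntries must be between 4 and 26")
    return 1 << lg_nom_entries


print(nominal_entries())    # default of 12 -> 4096
print(nominal_entries(4))   # smallest allowed value -> 16
print(nominal_entries(26))  # largest allowed value -> 67108864
```

Larger values generally trade more memory for a tighter estimate, so the default is a reasonable starting point unless the error bounds prove too loose.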

Returns
Column

The binary representation of the merged TupleSketch.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame(
...     [(1, 10, 3, 30), (2, 20, 4, 40)], ["key1", "v1", "key2", "v2"]
... )
>>> df = df.agg(
...     sf.tuple_sketch_agg_integer("key1", "v1").alias("sketch1"),
...     sf.tuple_sketch_agg_integer("key2", "v2").alias("sketch2")
... )
>>> df.select(
...     sf.tuple_sketch_estimate_integer(sf.tuple_union_integer(df.sketch1, "sketch2"))
... ).show()
+-----------------------------------------------------------------------------+
|tuple_sketch_estimate_integer(tuple_union_integer(sketch1, sketch2, 12, sum))|
+-----------------------------------------------------------------------------+
|                                                                          4.0|
+-----------------------------------------------------------------------------+
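
To make the mode parameter more concrete, here is a minimal pure-Python model of how summaries might be combined when a key appears in both inputs. It uses plain dictionaries in place of sketches, so it is exact rather than approximate and is not Spark's implementation. The function name union_integer_summaries is hypothetical, and the handling of "alwaysone" (every key ends up with summary 1) is an assumption based on the mode's name.

```python
def union_integer_summaries(left, right, mode="sum"):
    """Merge two {key: int summary} dicts, combining shared keys by mode."""
    combiners = {"sum": lambda a, b: a + b, "min": min, "max": max}
    if mode != "alwaysone" and mode not in combiners:
        raise ValueError(f"unsupported mode: {mode!r}")

    merged = {}
    for source in (left, right):
        for key, value in source.items():
            if mode == "alwaysone":
                # Assumption: this mode ignores the input values entirely.
                merged[key] = 1
            elif key in merged:
                # Key seen in both inputs: combine the two summaries.
                merged[key] = combiners[mode](merged[key], value)
            else:
                merged[key] = value
    return merged


# Same keys and values as the example above: the inputs are disjoint,
# so the union holds four distinct keys.
disjoint = union_integer_summaries({1: 10, 2: 20}, {3: 30, 4: 40})
print(len(disjoint))  # 4

# Key 2 appears on both sides, so each mode treats it differently.
a, b = {1: 10, 2: 20}, {2: 5, 3: 30}
for mode in ("sum", "min", "max", "alwaysone"):
    print(mode, union_integer_summaries(a, b, mode))
```

The distinct-key count does not depend on the mode; only the summaries attached to shared keys change.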