Skip to main content

DATASKETCHES_HLL_UNION_AGG

Description

datasketches_hll_union_agg is an aggregate function used to union multiple Apache DataSketches HLL (hll_sketch) serialized values and return the estimated cardinality (approximate distinct count / NDV) after union.

This function expects the input to be serialized bytes of a DataSketches HLL sketch (for example, generated by hll_sketch.serialize_compact() in the DataSketches library). It does not accept arbitrary strings.

Aliases:

  • ds_hll_estimate
  • datasketches_hll_estimate

Syntax

datasketches_hll_union_agg(<sketch>)

Parameters

ParameterDescription
<sketch>The serialized bytes of an Apache DataSketches HLL sketch. Supported types: STRING / VARCHAR / VARBINARY. NULL values are ignored. Empty strings are treated as invalid input and will throw an error.

Return Value

Returns a DOUBLE (Float64) cardinality estimate value.
If there is no valid data in the group (or the input is empty), returns 0.
If the input bytes cannot be deserialized as a valid DataSketches HLL sketch (including empty string), an error is thrown (typically with error code CORRUPTION).

Example

-- setup
CREATE TABLE test_datasketches_hll_union_agg_tbl (
id INT,
sk STRING
)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- The sketch bytes are inserted via Base64 decoding.
INSERT INTO test_datasketches_hll_union_agg_tbl VALUES
(1, from_base64('AgEHCAMIBwjL18IEK/L7BoYv+Q11gWYHgbxdBntl5gj8LUIK')),
(2, from_base64('AwEHCAUIAAkKAAAAIjvrBcS1nwfGGWoEyHokBO8t9wc1qTEENkcJB7hWqQxZf9QNnuSbGA==')),
(3, NULL);
-- The function returns DOUBLE, so use ROUND/CAST if you want an integer display.
SELECT CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT)
FROM test_datasketches_hll_union_agg_tbl;
+-------------------------------------------------------+
| CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT) |
+-------------------------------------------------------+
| 17 |
+-------------------------------------------------------+
-- aliases
SELECT
CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT) AS v1,
CAST(ROUND(ds_hll_estimate(sk)) AS BIGINT) AS v2,
CAST(ROUND(datasketches_hll_estimate(sk)) AS BIGINT) AS v3
FROM test_datasketches_hll_union_agg_tbl;
+------+------+------+
| v1 | v2 | v3 |
+------+------+------+
| 17 | 17 | 17 |
+------+------+------+
-- empty input returns 0
SELECT datasketches_hll_union_agg(sk)
FROM test_datasketches_hll_union_agg_tbl
WHERE sk IS NULL;
+--------------------------------+
| datasketches_hll_union_agg(sk) |
+--------------------------------+
| 0 |
+--------------------------------+
-- invalid sketch bytes will throw
SELECT datasketches_hll_union_agg(from_base64('AA=='));
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[CORRUPTION]HLL sketch data corrupted when add: Attempt to deserialize unknown object type
-- empty string is invalid and will throw
SELECT datasketches_hll_union_agg('');
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[CORRUPTION]HLL sketch data corrupted when add: empty input.