DATASKETCHES_HLL_UNION_AGG
Supported since version 4.1.2.
Description
datasketches_hll_union_agg is an aggregate function used to union multiple Apache DataSketches HLL (hll_sketch) serialized values and return the estimated cardinality (approximate distinct count / NDV) after union.
This function expects the input to be serialized bytes of a DataSketches HLL sketch (for example, generated by hll_sketch.serialize_compact() in the DataSketches library). It does not accept arbitrary strings.
Alias
ds_hll_estimatedatasketches_hll_estimate
Syntax
datasketches_hll_union_agg(<sketch>)
Parameters
| Parameter | Description |
|---|---|
<sketch> | The serialized bytes of an Apache DataSketches HLL sketch. Supported types: STRING / VARCHAR / VARBINARY. NULL values are ignored. Empty strings are treated as invalid input and will throw an error. |
Return Value
Returns a DOUBLE (Float64) cardinality estimate value.
If there is no valid data in the group (or the input is empty), returns 0.
If the input bytes cannot be deserialized as a valid DataSketches HLL sketch (including empty string), an error is thrown (typically with error code CORRUPTION).
Example
-- setup
CREATE TABLE test_datasketches_hll_union_agg_tbl (
id INT,
sk STRING
)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");
-- The sketch bytes are inserted via Base64 decoding.
INSERT INTO test_datasketches_hll_union_agg_tbl VALUES
(1, from_base64('AgEHCAMIBwjL18IEK/L7BoYv+Q11gWYHgbxdBntl5gj8LUIK')),
(2, from_base64('AwEHCAUIAAkKAAAAIjvrBcS1nwfGGWoEyHokBO8t9wc1qTEENkcJB7hWqQxZf9QNnuSbGA==')),
(3, NULL);
-- The function returns DOUBLE, so use ROUND/CAST if you want an integer display.
SELECT CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT)
FROM test_datasketches_hll_union_agg_tbl;
+-------------------------------------------------------+
| CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT) |
+-------------------------------------------------------+
| 17 |
+-------------------------------------------------------+
-- aliases
SELECT
CAST(ROUND(datasketches_hll_union_agg(sk)) AS BIGINT) AS v1,
CAST(ROUND(ds_hll_estimate(sk)) AS BIGINT) AS v2,
CAST(ROUND(datasketches_hll_estimate(sk)) AS BIGINT) AS v3
FROM test_datasketches_hll_union_agg_tbl;
+------+------+------+
| v1 | v2 | v3 |
+------+------+------+
| 17 | 17 | 17 |
+------+------+------+
-- empty input returns 0
SELECT datasketches_hll_union_agg(sk)
FROM test_datasketches_hll_union_agg_tbl
WHERE sk IS NULL;
+--------------------------------+
| datasketches_hll_union_agg(sk) |
+--------------------------------+
| 0 |
+--------------------------------+
-- invalid sketch bytes will throw
SELECT datasketches_hll_union_agg(from_base64('AA=='));
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[CORRUPTION]HLL sketch data corrupted when add: Attempt to deserialize unknown object type
-- empty string is invalid and will throw
SELECT datasketches_hll_union_agg('');
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[CORRUPTION]HLL sketch data corrupted when add: empty input.