HLL Approximate Deduplication
HLL (HyperLogLog) is an approximate deduplication solution based on a probabilistic algorithm. Within a 1%-2% error range, it computes the cardinality (Distinct Count) of massive datasets with very low space and time overhead.
Use Cases
As business data volumes keep growing, the computation and storage costs of exact deduplication rise sharply. Once data reaches a certain scale, exact deduplication is no longer economical.
The HLL algorithm has the following characteristics:
| Dimension | Behavior |
|---|---|
| Space complexity | O(mloglogn) |
| Time complexity | O(n) |
| Error range | About 1%-2% (depends on the dataset and hash function) |
When the business can tolerate the error, using HLL for approximate deduplication is an effective way to reduce computational pressure and accelerate queries.
- Who it suits: scenarios with huge data volumes, tolerance for 1%-2% error, and a need to balance query performance and storage cost.
- How to use it: when creating a table, set the column type to
HLLand the aggregation function toHLL_UNION; during data loading, generate HLL values withHLL_HASH(); at query time, aggregate withHLL_UNION_AGG(). - Common uses: UV statistics, user deduplication, unique device count estimation.
What Is HyperLogLog
HyperLogLog is an upgraded version of the LogLog algorithm. It provides imprecise cardinality (distinct count) estimation, and its mathematical foundation is the Bernoulli trial.
An Intuitive Explanation of the Bernoulli Trial
- Assume a coin has two sides, and a single toss has a 50% probability of landing on either side.
- Keep tossing the coin until heads appears, and record this as one complete trial.
- Repeat for n trials, meaning heads has appeared n times. Let k denote the number of tosses in each trial, and let k_i denote the i-th trial.
- Across the n trials, there must be a maximum number of tosses, denoted as k_max (for example, in one trial it took 12 tosses for heads to appear).
Conclusions Derived
- The number of tosses in each of the n Bernoulli trials is no greater than k_max.
- At least one of the n Bernoulli trials has a toss count equal to k_max.
Combined with maximum likelihood estimation, the relationship between n and k_max can be estimated as: n = 2 ^ k_max.
In other words, as long as you record k_max, you can estimate the total number of records, that is, the cardinality. This is the core idea of the HLL algorithm.
Using HLL for Approximate Deduplication
Step 1: Create the table
Goal: Create an aggregate table that supports HLL deduplication.
Constraints:
- The target column type must be set to
HLL, and the aggregation function must be set toHLL_UNION. - An HLL-type column cannot be used as a Key column.
- You do not need to specify the length or default value. The length of an HLL column is internally controlled by the system based on the degree of data aggregation.
Example:
create table test_hll(
dt date,
id int,
name char(10),
province char(10),
os char(10),
uv hll hll_union
)
Aggregate KEY (dt,id,name,province,os)
distributed by hash(id) buckets 10
PROPERTIES(
"replication_num" = "1",
"in_memory"="false"
);
Step 2: Load data
Goal: Convert raw detail data to an HLL column with HLL_HASH() and write it into the table.
Sample data (test_hll.csv):
2022-05-05,10001,test01,Beijing,windows
2022-05-05,10002,test01,Beijing,linux
2022-05-05,10003,test01,Beijing,macos
2022-05-05,10004,test01,Hebei,windows
2022-05-06,10001,test01,Shanghai,windows
2022-05-06,10002,test01,Shanghai,linux
2022-05-06,10003,test01,Jiangsu,macos
2022-05-06,10004,test01,Shaanxi,windows
Stream Load command
curl --location-trusted -u root: -H "label:label_test_hll_load" \
-H "column_separator:," \
-H "columns:dt,id,name,province,os,uv=hll_hash(id)" -T test_hll.csv http://fe_IP:8030/api/demo/test_hll/_stream_load
Sample load result
# curl --location-trusted -u root: -H "label:label_test_hll_load" -H "column_separator:," -H "columns:dt,id,name,province,os, pv=hll_hash(id)" -T test_hll.csv http://127.0.0.1:8030/api/demo/test_hll/_stream_load
{
"TxnId": 693,
"Label": "label_test_hll_load",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 8,
"NumberLoadedRows": 8,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 320,
"LoadTimeMs": 23,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 1,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 9,
"CommitAndPublishTimeMs": 11
}
Step 3: Query data
Goal: Get approximate deduplication results through HLL aggregation functions.
Note: HLL columns do not allow direct querying of raw values. You can only query them through HLL aggregation functions.
Total UV
mysql> select HLL_UNION_AGG(uv) from test_hll;
+---------------------+
| hll_union_agg(`uv`) |
+---------------------+
| 4 |
+---------------------+
1 row in set (0.00 sec)
Equivalent form:
mysql> SELECT COUNT(DISTINCT uv) FROM test_hll;
+----------------------+
| count(DISTINCT `uv`) |
+----------------------+
| 4 |
+----------------------+
1 row in set (0.01 sec)
UV per day
mysql> select HLL_UNION_AGG(uv) from test_hll group by dt;
+---------------------+
| hll_union_agg(`uv`) |
+---------------------+
| 4 |
| 4 |
+---------------------+
2 rows in set (0.01 sec)
Related Functions
| Function | Purpose |
|---|---|
HLL_UNION_AGG(hll) | Aggregation function used to compute the cardinality estimate of all matching data |
HLL_CARDINALITY(hll) | Computes the cardinality estimate of a single HLL column value |
HLL_HASH(column_name) | Generates an HLL column type, used during Insert or data loading (loading usage above) |
HLL_EMPTY() | Generates an empty HLL column, used to fill default values during insert or data loading |