Hive Bitmap UDF
Hive UDF
Hive Bitmap UDF provides UDFs for generating bitmap and bitmap operations in hive tables. The bitmap in Hive is exactly the same as the Doris bitmap. The bitmap in Hive can be imported into doris through (spark bitmap load).
the main purpose:
- Reduce the time of importing data into doris, and remove processes such as dictionary building and bitmap pre-aggregation;
- Save hive storage, use bitmap to compress data, reduce storage cost;
- Provide flexible bitmap operations in hive, such as: intersection, union, and difference operations, and the calculated bitmap can also be directly imported into doris; imported into doris;
How To Use
Create Bitmap type table in Hive
-- Example: Create Hive Bitmap Table
CREATE TABLE IF NOT EXISTS `hive_bitmap_table`(
`k1` int COMMENT '',
`k2` String COMMENT '',
`k3` String COMMENT '',
`uuid` binary COMMENT 'bitmap'
) comment 'comment'
-- Example:Create Hive Table
CREATE TABLE IF NOT EXISTS `hive_table`(
`k1` int COMMENT '',
`k2` String COMMENT '',
`k3` String COMMENT '',
`uuid` int COMMENT ''
) comment 'comment'
Hive Bitmap UDF Usage:
Hive Bitmap UDF used in Hive/Spark,First, you need to compile fe to get hive-udf-jar-with-dependencies.jar. Compilation preparation:If you have compiled the ldb source code, you can directly compile fe,If you have compiled the ldb source code, you can compile it directly. If you have not compiled the ldb source code, you need to manually install thrift, Reference:Setting Up dev env for FE.
--clone doris code
git clone https://github.com/apache/doris.git
cd doris
git submodule update --init --recursive
--install thrift
--Enter the fe directory
cd fe
--Execute the maven packaging command(All sub modules of fe will be packaged)
mvn package -Dmaven.test.skip=true
--You can also just package the hive-udf module
mvn package -pl hive-udf -am -Dmaven.test.skip=true
After packaging and compiling, enter the hive-udf directory and there will be a target directory,There will be hive-udf.jar package
-- Load the Hive Bitmap Udf jar package (Upload the compiled hive-udf jar package to HDFS)
add jar hdfs://node:9001/hive-udf-jar-with-dependencies.jar;
-- Create Hive Bitmap UDAF function
create temporary function to_bitmap as 'org.apache.doris.udf.ToBitmapUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_union as 'org.apache.doris.udf.BitmapUnionUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
-- Create Hive Bitmap UDF function
create temporary function bitmap_count as 'org.apache.doris.udf.BitmapCountUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_and as 'org.apache.doris.udf.BitmapAndUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_or as 'org.apache.doris.udf.BitmapOrUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
create temporary function bitmap_xor as 'org.apache.doris.udf.BitmapXorUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
-- Example: Generate bitmap by to_bitmap function and write to Hive Bitmap table
insert into hive_bitmap_table
select
k1,
k2,
k3,
to_bitmap(uuid) as uuid
from
hive_table
group by
k1,
k2,
k3
-- Example: The bitmap_count function calculate the number of elements in the bitmap
select k1,k2,k3,bitmap_count(uuid) from hive_bitmap_table
-- Example: The bitmap_union function calculate the grouped bitmap union
select k1,bitmap_union(uuid) from hive_bitmap_table group by k1
Hive Bitmap UDF Description
Hive Bitmap import into Doris
Method 1:Catalog (recommended)
When create a Hive table in the format specified as TEXT, for Binary type, Hive will be saved as a bash64 encoded string. Therefore, the binary data can be directly saved as Bitmap through bitmap_from_base64 function by using Doris's Hive Catalog.
Here is a full example:
- Creating Hive Tables in Hive
CREATE TABLE IF NOT EXISTS `test`.`hive_bitmap_table`(
`k1` int COMMENT '',
`k2` String COMMENT '',
`k3` String COMMENT '',
`uuid` binary COMMENT 'bitmap'
) stored as textfile
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
);
- Create Doris internal table
CREATE TABLE IF NOT EXISTS `test`.`doris_bitmap_table`(
`k1` int COMMENT '',
`k2` String COMMENT '',
`k3` String COMMENT '',
`uuid` BITMAP BITMAP_UNION COMMENT 'bitmap'
)
AGGREGATE KEY(k1, k2, k3)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
- Inserting data from Hive into Doris
insert into doris_bitmap_table select k1, k2, k3, bitmap_from_base64(uuid) from hive.test.hive_bitmap_table;
Method 2: Spark Load
see details: Spark Load -> Basic operation -> Create load(Example 3: when the upstream data source is hive binary type table)