Generating TPC-DS on Hive/Iceberg
Through the Trino Connector compatibility framework, Doris can use the TPCDS Connector to quickly build TPCDS test datasets.
Combined with the data write-back capability of Hive/Iceberg tables, you can quickly build TPCDS test datasets for Doris internal tables, Hive, and Iceberg through Doris.
This document mainly introduces how to deploy and use the TPCDS Connector to build test data sets.
This feature is supported starting from Doris version 3.0.0.
Compile TPCDS Connector
Requires JDK version 17.
git clone https://github.com/trinodb/trino.git
git checkout 435
cd trino/plugin/trino-tpcds
mvn clean install -DskipTests
After compilation, you will get the trino-tpcds-435/ directory under trino/plugin/trino-tpcds/target/.
You can also directly download the precompiled trino-tpcds-435.tar.gz and extract it.
Deploy TPCDS Connector
Place the trino-tpcds-435/ directory under the connectors/ directory of the deployment path of all FE and BE nodes. (If the directory does not exist, create it manually.)
├── bin
├── conf
├── connectors
│   └── trino-tpcds-435
...
After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector can be loaded correctly.
Create TPCDS Catalog
CREATE CATALOG `tpcds` PROPERTIES (
"type" = "trino-connector",
"trino.connector.name" = "tpcds",
"trino.tpcds.split-count" = "32"
);
The trino.tpcds.split-count property controls the data generation concurrency. It is recommended to set it to twice the number of CPU cores on each BE node (for example, 32 for 16-core BE nodes) to achieve optimal concurrency and improve data generation efficiency.
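To change the concurrency later, you can drop and recreate the catalog with a new value. A minimal sketch, where the value 64 is only an illustrative sizing for 32-core BE nodes:

-- 64 = 2 x 32 cores; adjust the value to your own BE hardware.
DROP CATALOG IF EXISTS `tpcds`;
CREATE CATALOG `tpcds` PROPERTIES (
"type" = "trino-connector",
"trino.connector.name" = "tpcds",
"trino.tpcds.split-count" = "64"
);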
Use TPCDS Catalog
The TPCDS Catalog contains pre-built TPCDS datasets of different Scale Factors, which can be viewed using the SHOW DATABASES and SHOW TABLES commands.
mysql> SWITCH tpcds;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql |
| sf1 |
| sf100 |
| sf1000 |
| sf10000 |
| sf100000 |
| sf300 |
| sf3000 |
| sf30000 |
| tiny |
+--------------------+
mysql> USE sf1;
mysql> SHOW TABLES;
+------------------------+
| Tables_in_sf1          |
+------------------------+
| call_center |
| catalog_page |
| catalog_returns |
| catalog_sales |
| customer |
| customer_address |
| customer_demographics |
| date_dim |
| dbgen_version |
| household_demographics |
| income_band |
| inventory |
| item |
| promotion |
| reason |
| ship_mode |
| store |
| store_returns |
| store_sales |
| time_dim |
| warehouse |
| web_page |
| web_returns |
| web_sales |
| web_site |
+------------------------+
You can query these tables directly using SELECT statements.
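For example, a quick query against one of the pre-built tables; the rows are generated on the fly when the query runs:

-- Data is generated at query time, not read from storage.
SELECT * FROM tpcds.sf1.call_center LIMIT 10;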
The data in these pre-built datasets is not actually stored; it is generated in real time during queries. Therefore, these pre-built datasets are not suitable for direct benchmark testing. They are intended to be written into other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources that Doris supports writing to) through INSERT INTO SELECT, after which performance testing is performed on the target tables.
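For example, a minimal sketch of loading one pre-built table into a Doris internal table with INSERT INTO SELECT; the target database tpcds_sf1 and table customer are illustrative and assumed to have been created in advance with a matching schema:

-- Assumes internal.tpcds_sf1.customer already exists with the same schema as tpcds.sf1.customer.
INSERT INTO internal.tpcds_sf1.customer
SELECT * FROM tpcds.sf1.customer;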
Build TPCDS Test Data Set
The following example quickly builds a TPCDS test data set on Hive using the CTAS statement:
CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center ;
CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page ;
CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns ;
CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales ;
CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer ;
CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address ;
CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics ;
CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim ;
CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version ;
CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band ;
CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory ;
CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item ;
CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion ;
CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason ;
CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode ;
CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store ;
CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns ;
CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales ;
CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim ;
CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse ;
CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page ;
CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns ;
CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales ;
CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site ;
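The same CTAS pattern also works for Iceberg tables. A minimal sketch, assuming an Iceberg catalog named iceberg and a database tpcds100 already exist (table properties such as the file format are omitted here and depend on the Iceberg catalog configuration):

-- The catalog name `iceberg` and database `tpcds100` are assumed to exist already.
CREATE TABLE iceberg.tpcds100.call_center AS SELECT * FROM tpcds.sf100.call_center;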
In a Doris cluster with three 16-core BE nodes, building the TPCDS 1000 dataset on Hive takes about 3 to 4 hours.
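After the load completes, a simple row-count check against the target tables can confirm that the data was written, for example:

SELECT COUNT(*) FROM hive.tpcds100.call_center;
SELECT COUNT(*) FROM hive.tpcds100.store_sales;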