Skip to main content

Generating TPC-DS on Hive/Iceberg

Doris supports using the TPCDS Connector to quickly build TPCDS test sets through the Trino Connector compatible framework.

Combined with the data write-back function of Hive/Iceberg tables, you can quickly build TPCDS test data sets for Doris, Hive, and Iceberg tables through Doris.

This document mainly introduces how to deploy and use the TPCDS Connector to build test data sets.

tip

This feature is supported starting from Doris version 3.0.0.

Compile TPCDS Connector​

Requires JDK version 17.

git clone https://github.com/trinodb/trino.git
git checkout 435
cd trino/plugin/trino-tpcds
mvn clean install -DskipTest

After compilation, you will get the trino-tpcds-435/ directory under trino/plugin/trino-tpcds/target/.

You can also directly download the precompiled trino-tpcds-435.tar.gz and extract it.

Deploy TPCDS Connector​

Place the trino-tpcds-435/ directory under the connectors/ directory of all FE and BE deployment paths. (If not present, you can create it manually).

β”œβ”€β”€ bin
β”œβ”€β”€ conf
β”œβ”€β”€ connectors
β”‚Β Β  β”œβ”€β”€ trino-tpcds-435
...

After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector can be loaded correctly.

Create TPCDS Catalog​

CREATE CATALOG `tpcds` PROPERTIES (
"type" = "trino-connector",
"trino.connector.name" = "tpcds",
"trino.tpcds.split-count" = "32"
);

The tpcds.split-count is the concurrency number, which is recommended to be set to 2 times the number of cores per BE machine to achieve optimal concurrency. Improve data generation efficiency.

Use TPCDS Catalog​

The TPCDS Catalog has pre-built TPCDS datasets of different Scale Factors, which can be viewed using the SHOW DATABASES and SHOW TABLES commands.

mysql> SWITCH tpcds;
Query OK, 0 rows affected (0.00 sec)

mysql> SHOW DATABASES;
+--------------------+
| Database |
+--------------------+
| information_schema |
| mysql |
| sf1 |
| sf100 |
| sf1000 |
| sf10000 |
| sf100000 |
| sf300 |
| sf3000 |
| sf30000 |
| tiny |
+--------------------+

mysql> USE sf1;
mysql> SHOW TABLES;
+------------------------+
| Tables_in_sf1 |
+------------------------+
| call_center |
| catalog_page |
| catalog_returns |
| catalog_sales |
| customer |
| customer_address |
| customer_demographics |
| date_dim |
| dbgen_version |
| household_demographics |
| income_band |
| inventory |
| item |
| promotion |
| reason |
| ship_mode |
| store |
| store_returns |
| store_sales |
| time_dim |
| warehouse |
| web_page |
| web_returns |
| web_sales |
| web_site |
+------------------------+

You can directly query these tables using the SELECT statement.

tip

The data of these pre-built datasets is not actually stored but is generated in real-time during queries. Therefore, these pre-built datasets are not suitable for direct Benchmark testing. They are suitable for writing datasets into other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources that Doris supports writing to) through INSERT INTO SELECT, and then performing performance testing on the target tables.

Build TPCDS Test Data Set​

The following example quickly builds a TPCDS test data set on Hive using the CTAS statement:

CREATE TABLE hive.tpcds100.call_center            PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center           ;
CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page ;
CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns ;
CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales ;
CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer ;
CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address ;
CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics ;
CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim ;
CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version ;
CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band ;
CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory ;
CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item ;
CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion ;
CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason ;
CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode ;
CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store ;
CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns ;
CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales ;
CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim ;
CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse ;
CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page ;
CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns ;
CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales ;
CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site ;
tip

In a Doris cluster with 3 16C BE nodes, creating a TPCDS 1000 Hive dataset takes about 3 to 4 hours.