
Using AUTO_INCREMENT

tip

The auto-increment column has been supported in Doris since version 2.1. This V2.0 documentation only introduces its functionality and use cases. If you are interested in this feature, please upgrade to Doris 2.1 and refer to the corresponding section in the V2.1 documentation.

For tables with an AUTO_INCREMENT column, Doris assigns a table-unique value to every row that does not have a specified value in the auto-increment column.

Description

For tables that contain an auto-increment column, during data ingestion:

  • If the auto-increment column is NOT one of the target columns of this import, it will be auto-filled with values generated by Doris. The auto-filled values are unique for each row.
  • If the auto-increment column is one of the target columns of this import, the imported null values in this column will be replaced with values generated by Doris, while non-null values will remain unchanged. Note that the existence of the imported non-null values might compromise the uniqueness of values in the auto-increment columns.
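For illustration, here is a minimal sketch that assumes the first `demo`.`tbl` table from the Usage section below (an auto-increment key column `id` plus a value column `value`):

-- The auto-increment column is not a target column: Doris fills `id` with a generated, table-unique value for every row.
insert into `demo`.`tbl`(`value`) values (10), (20), (30);
-- The auto-increment column is a target column: the NULLs are replaced with generated values,
-- while the explicit value 100 is kept unchanged (and may later collide with a generated value).
insert into `demo`.`tbl`(`id`, `value`) values (null, 40), (100, 50);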

Uniqueness

Doris ensures that the values it generates for the auto-increment column are unique within the table. However, Doris can only guarantee the uniqueness of values generated by itself: if users explicitly insert values into the auto-increment column, Doris does not guarantee that those values are unique across the table.

Density

Doris ensures that the values generated for the auto-increment column are dense. However, it does not guarantee that the values populated during a single load will be completely continuous. This is because, for performance reasons, each backend (BE) caches a batch of pre-allocated auto-increment values, and the cached ranges of different BEs do not overlap. Additionally, because of this caching, Doris cannot guarantee that the values generated by a later load are always larger than those generated by an earlier load. Therefore, the magnitude of auto-increment values cannot be used to infer the chronological order of loads.

Usage

  1. Create a table in the Duplicate Key model, with one of the key columns being an auto-increment column.
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
  2. Create a table in the Duplicate Key model, with one of the value columns being an auto-increment column.
CREATE TABLE `demo`.`tbl` (
`uid` BIGINT NOT NULL,
`name` BIGINT NOT NULL,
`id` BIGINT NOT NULL AUTO_INCREMENT,
`value` BIGINT NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`uid`, `name`)
DISTRIBUTED BY HASH(`uid`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);
  3. Create a table in the Unique Key model, with one of the key columns being an auto-increment column.
CREATE TABLE `demo`.`tbl` (
`id` BIGINT NOT NULL AUTO_INCREMENT,
`name` varchar(65533) NOT NULL,
`value` int(11) NOT NULL
) ENGINE=OLAP
UNIQUE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
  4. Create a table in the Unique Key model, with one of the value columns being an auto-increment column.
CREATE TABLE `demo`.`tbl` (
`text` varchar(65533) NOT NULL,
`id` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`text`)
DISTRIBUTED BY HASH(`text`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);
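
To see the generated values, here is a hedged example assuming the Unique Key table from item 4 above (columns `text` and `id` as defined there) is the one currently created:

insert into `demo`.`tbl`(`text`) values ('a'), ('b'), ('c');
-- Each distinct `text` key gets a Doris-generated, table-unique `id`.
select `text`, `id` from `demo`.`tbl` order by `text`;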

Applicable scenarios

Dictionary encoding

In user profiling, where bitmap is utilized for user group analysis, analysts often need to build a dictionary, in which each user corresponds to a unique integer value. Sequential dictionary values can improve bitmap performance.

Take offline UV (Unique Visitors) and PV (Page Views) analysis as an example, and assume that all the detailed data is stored in the following user behavior table:

CREATE TABLE `demo`.`dwd_dup_tbl` (
`user_id` varchar(50) NOT NULL,
`dim1` varchar(50) NOT NULL,
`dim2` varchar(50) NOT NULL,
`dim3` varchar(50) NOT NULL,
`dim4` varchar(50) NOT NULL,
`dim5` varchar(50) NOT NULL,
`visit_time` DATE NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);

Create a dictionary table using AUTO_INCREMENT:

CREATE TABLE `demo`.`dictionary_tbl` (
`user_id` varchar(50) NOT NULL,
`aid` BIGINT NOT NULL AUTO_INCREMENT
) ENGINE=OLAP
UNIQUE KEY(`user_id`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"enable_unique_key_merge_on_write" = "true"
);

Load the existing user_id into the dictionary table, and create mappings from user_id to integer values:

insert into dictionary_tbl(user_id)
select user_id from dwd_dup_tbl group by user_id;

Or load the incremental user_id into the dictionary table:

insert into dictionary_tbl(user_id)
select dwd_dup_tbl.user_id from dwd_dup_tbl left join dictionary_tbl
on dwd_dup_tbl.user_id = dictionary_tbl.user_id where dwd_dup_tbl.visit_time > '2023-12-10' and dictionary_tbl.user_id is NULL;

You can also use the Flink-Doris-Connector to load data into Doris.
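For example, a rough Flink SQL sketch might look like the following. The connector options follow the Flink Doris Connector conventions, while the FE address, credentials, and the upstream table user_behavior_source are placeholders; only user_id is written, and Doris fills the aid column automatically:

CREATE TABLE doris_dictionary_sink (
user_id STRING
) WITH (
'connector' = 'doris',
'fenodes' = 'FE_HOST:8030',
'table.identifier' = 'demo.dictionary_tbl',
'username' = 'user',
'password' = 'password',
'sink.label-prefix' = 'dict_sink'
);

INSERT INTO doris_dictionary_sink SELECT user_id FROM user_behavior_source;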

Suppose that dim1, dim3, and dim5 are the analytic dimensions to be considered. Create an Aggregate Key table to accommodate the aggregated results:

CREATE TABLE `demo`.`dws_agg_tbl` (
`dim1` varchar(50) NOT NULL,
`dim3` varchar(50) NOT NULL,
`dim5` varchar(50) NOT NULL,
`user_id_bitmap` BITMAP BITMAP_UNION NOT NULL,
`pv` BIGINT SUM NOT NULL
) ENGINE=OLAP
AGGREGATE KEY(`dim1`,`dim3`,`dim5`)
DISTRIBUTED BY HASH(`dim1`) BUCKETS 32
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);

Load the aggregated results into the table:

insert into dws_agg_tbl
select dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5, BITMAP_UNION(TO_BITMAP(dictionary_tbl.aid)), COUNT(1)
from dwd_dup_tbl INNER JOIN dictionary_tbl on dwd_dup_tbl.user_id = dictionary_tbl.user_id
group by dwd_dup_tbl.dim1, dwd_dup_tbl.dim3, dwd_dup_tbl.dim5;

Then query PV/UV using the following statement:

select dim1, dim3, dim5, bitmap_count(user_id_bitmap) as uv, pv from dws_agg_tbl;

Efficient pagination

Pagination is often required in data display. Typically, pagination is implemented with order by combined with limit and offset. Suppose that you need to display data from the following table:

CREATE TABLE `demo`.`records_tbl` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
`address` varchar(41) NOT NULL COMMENT "",
`city` varchar(11) NOT NULL COMMENT "",
`nation` varchar(16) NOT NULL COMMENT "",
`region` varchar(13) NOT NULL COMMENT "",
`phone` varchar(16) NOT NULL COMMENT "",
`mktsegment` varchar(11) NOT NULL COMMENT ""
) DUPLICATE KEY (`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 3"
);

In pagination queries, if each page displays 100 results, you can retrieve the first page using the following SQL statement:

select * from records_tbl order by `key`, name limit 100;

You can retrieve the second page using the following SQL statement:

select * from records_tbl order by `key`, name limit 100 offset 100;

However, when performing deep pagination queries (with a large offset), even if only a few rows are actually requested, this method still reads all the preceding data into memory, performs a full sort, and then continues processing. For higher execution efficiency, you can assign a unique value to each row with an auto-increment column, record the maximum value of the unique_value column from the previous page, and use a where unique_value > max_value order by unique_value limit rows_per_page clause to push down the predicate and filter out a large amount of data in advance, thereby achieving more efficient pagination.

In the above example table, you can add an auto-increment column to the table to assign a unique identifier to each row:

CREATE TABLE `demo`.`records_tbl2` (
`key` int(11) NOT NULL COMMENT "",
`name` varchar(26) NOT NULL COMMENT "",
`address` varchar(41) NOT NULL COMMENT "",
`city` varchar(11) NOT NULL COMMENT "",
`nation` varchar(16) NOT NULL COMMENT "",
`region` varchar(13) NOT NULL COMMENT "",
`phone` varchar(16) NOT NULL COMMENT "",
`mktsegment` varchar(11) NOT NULL COMMENT "",
`unique_value` BIGINT NOT NULL AUTO_INCREMENT
) DUPLICATE KEY (`key`, `name`)
DISTRIBUTED BY HASH(`key`) BUCKETS 10
PROPERTIES (
"replication_num" = "3"
);

In pagination queries, suppose that each page displays 100 results, you can retrieve the first page by:

select * from records_tbl2 order by unique_value limit 100;

Use programs to record the maximum unique_value in the returned result. Suppose that the maximum is 99; you can then query the second page with:

select * from records_tbl2 where unique_value > 99 order by unique_value limit 100;

If you need to query a deeper page directly, for example page 101, where the maximum unique_value of the previous page is not readily available, you can use a statement like the following:

select `key`, name, address, city, nation, region, phone, mktsegment
from records_tbl2, (select unique_value as max_value from records_tbl2 order by unique_value limit 1 offset 9999) as previous_data
where records_tbl2.unique_value > previous_data.max_value
order by unique_value limit 100;