Optimizing Table Schema Design: Table Models, Bucketing Columns, Key Columns, and Field Type Tuning

Table schema design is a critical part of Doris performance tuning. It directly affects data distribution, query parallelism, and sorting efficiency.

An unreasonable schema design often causes the following issues:

Data skew, which prevents query parallelism from being fully utilized
Sort properties become ineffective, slowing down equality and range queries
Inappropriate field types, which raise computation overhead

For more detailed design principles, see the Table Design chapter. This chapter starts from real-world cases to show typical schema design issues and tuning recommendations.

Tuning Checklist

When designing or troubleshooting a table schema, check the following items in order:

Did you choose a table model that matches your business (Duplicate / Unique / Aggregate)?
Are the bucketing columns evenly hashed, with no skew from null or fixed values?
Are the columns frequently used in equality and range queries defined as Key columns?
Do field types follow the principle of "fixed-length first, low-precision first"?

Case 1: Table Model Selection

Doris provides three table models: Duplicate, Unique (MOR/MOW), and Aggregate. Their query performance and feature characteristics differ.

Comparison of the Three Table Models

Table Model	Query Performance	Supports Updates	Typical Scenarios
Duplicate	Highest	No	High-performance queries on logs and detail data
Unique (MOW)	High	Yes	Primary-key deduplication with high query performance requirements
Unique (MOR)	Average	Yes	Primary-key deduplication with frequent writes
Aggregate	Average	Aggregated update	Pre-aggregated reports and metric rollups

Performance ranking: Duplicate > MOW > MOR ≈ Aggregate

Tuning recommendation

When the business has no data update requirements and demands high query performance, prefer the Duplicate table.

Case 2: Bucketing Column Selection

One-sentence definition: The bucketing column determines how data is distributed across buckets. A poor choice causes data skew, which becomes a query performance bottleneck.

A reasonable bucketing column design can:

Prevent data skew and fully utilize parallelism
Maximize the effectiveness of Colocate Join and Bucket Shuffle Join

Bad Example: c2 Column Contains a Large Number of Nulls

In the following example, the bucketing column is set to c2, but c2 is all null in the imported data. As a result, only 1 of the 64 buckets carries all the data:

CREATE TABLE `t1` (
  `c1` INT NULL,
  `c2` INT NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(`c2`) BUCKETS 64
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
insert into t1 select number, null from numbers ('number'='10000000');

Optimization: Switch to a Column with High Cardinality

Change the bucketing column from c2 to c1 so that data is evenly distributed across buckets, improving parallel processing capability.

Command for Troubleshooting Data Skew

Purpose: Confirm whether the bucketing column is skewed. Command:

select c2, count(*) cnt from t1 group by c2 order by cnt desc limit 10;

Explanation: If the cnt of the top values is much larger than the others, the column has serious skew and is not suitable as a bucketing column.

Principles for Bucketing Column Selection

Avoid columns that tend to have null or fixed values in the business
Prefer fields with high business cardinality, such as user ID or order ID
Estimate field value distribution before creating the table, and sample to verify when necessary

Tuning recommendation

Check whether the bucketing column has data skew. If so, replace it with a field with higher cardinality. Up-front design significantly reduces the cost of later diagnosis and correction.

Case 3: Key Column Optimization

One-sentence definition: Doris sorts data by Key columns at the storage layer. Defining frequently queried columns as Key columns can significantly accelerate equality and range queries.

Business Query Examples

select * from t1 where t1.c1 = 1;
select * from t1 where t1.c1 > 1 and t1.c1 < 10;
select * from t1 where t1.c1 in (1, 2, 3);

Optimization: Define c1 as a Key Column

CREATE TABLE `t1` (
  `c1` INT NULL,
  `c2` INT NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
DISTRIBUTED BY HASH(`c2`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

Tuning recommendation

Define columns frequently used in business equality or range queries as Key columns to accelerate query processing.

Case 4: Field Type Optimization

One-sentence definition: Field types directly affect computation complexity. Fixed-length and low-precision types are more efficient to process than variable-length and high-precision types.

Type Selection Principles

Principle	Recommended	Avoid
Fixed-length first	INT, BIGINT, DATE, DATETIME	VARCHAR, STRING
Low-precision first	INT, BIGINT, FLOAT	DECIMAL (in high-precision scenarios)

Common Replacement Scenarios

Use BIGINT to replace VARCHAR / STRING fields used to store numeric values
Use FLOAT / INT / BIGINT to replace unnecessary DECIMAL fields
Use DATETIME to replace string-form time fields

Tuning recommendation

When defining schema types, follow the principle of "fixed-length first, low-precision first" to improve computation efficiency and system performance.

Frequently Asked Questions

Q1: What if I find the bucketing column is unreasonable after creating the table?

The bucketing column cannot be modified directly after the table is created. You need to create a new table and re-import the data, or use ALTER TABLE to create a new Rollup or partitioning scheme. It is recommended to thoroughly evaluate field cardinality before creating the table.

Q2: Are more Key columns always better?

No. Too many Key columns increase storage sorting overhead and write costs. Only set columns that are truly frequently used for equality or range filtering as Key columns.

Q3: When must I use the Unique or Aggregate model?

Need to deduplicate or update data by primary key, use Unique
Need pre-aggregation (SUM, MAX, MIN, etc.), use Aggregate
Append-only detail data with extreme query performance requirements, use Duplicate

Q4: How do I determine whether the current table has data skew?

Run the following SQL to check the bucketing column distribution:

select <bucket_col>, count(*) cnt from <table> group by <bucket_col> order by cnt desc limit 10;

If the count of the top values far exceeds the others, skew exists.

Summary

A carefully designed schema maximizes the use of Doris features and significantly improves query performance. A poor schema may cause global issues such as data skew.

Key tuning points:

Prefer the Duplicate table model (in scenarios with no update requirements)
Choose bucketing columns with high cardinality, avoiding null or fixed values
Define frequently queried columns as Key columns
Follow the principle of "fixed-length first, low-precision first" for field types

Up-front design is always cheaper than after-the-fact tuning. It is recommended to strictly follow the above principles during the schema design phase.

Tuning Checklist​

Case 1: Table Model Selection​

Comparison of the Three Table Models​

Case 2: Bucketing Column Selection​

Bad Example: c2 Column Contains a Large Number of Nulls​

Optimization: Switch to a Column with High Cardinality​

Command for Troubleshooting Data Skew​

Principles for Bucketing Column Selection​

Case 3: Key Column Optimization​

Business Query Examples​

Optimization: Define c1 as a Key Column​

Case 4: Field Type Optimization​

Type Selection Principles​

Common Replacement Scenarios​

Frequently Asked Questions​

Q1: What if I find the bucketing column is unreasonable after creating the table?​

Q2: Are more Key columns always better?​

Q3: When must I use the Unique or Aggregate model?​

Q4: How do I determine whether the current table has data skew?​

Summary​

Tuning Checklist

Case 1: Table Model Selection

Comparison of the Three Table Models

Case 2: Bucketing Column Selection

Bad Example: c2 Column Contains a Large Number of Nulls

Optimization: Switch to a Column with High Cardinality

Command for Troubleshooting Data Skew

Principles for Bucketing Column Selection

Case 3: Key Column Optimization

Business Query Examples

Optimization: Define c1 as a Key Column

Case 4: Field Type Optimization

Type Selection Principles

Common Replacement Scenarios

Frequently Asked Questions

Q1: What if I find the bucketing column is unreasonable after creating the table?

Q2: Are more Key columns always better?

Q3: When must I use the Unique or Aggregate model?

Q4: How do I determine whether the current table has data skew?

Summary