Before You Start the POC
This document highlights common issues that new users may encounter, with the goal of accelerating the POC process.
Table Design
Creating a table in Doris involves four decisions that affect load and query performance.
Data Model
| If your data is... | Use | Why |
|---|---|---|
| Append-only (logs, events, facts) | Duplicate Key (default) | Keeps all rows. Best query performance. |
| Updated by primary key (CDC, upsert) | Unique Key | New rows replace old rows with the same key. |
| Pre-aggregated metrics (PV, UV, sums) | Aggregate Key | Rows are merged with SUM/MAX/MIN at write time. |
Duplicate Key works for most scenarios. See Data Model Overview.
Sort Key
Put the column you filter on most frequently first, with fixed-size types (INT, BIGINT, DATE) before VARCHAR. Doris builds a prefix index on the first 36 bytes of key columns but stops at the first VARCHAR. Add inverted indexes for other columns that need fast filtering.
Partitioning
If you have a time column, use AUTO PARTITION BY RANGE(date_trunc(time_col, 'day')) to enable partition pruning. Doris skips irrelevant partitions automatically.
Bucketing
Default is Random bucketing (recommended for Duplicate Key tables). Use DISTRIBUTED BY HASH(col) if you frequently filter or join on a specific column. See Data Bucketing.
How to choose bucket count:
- Multiple of BE count to ensure even data distribution. When BEs are added later, queries typically scan multiple partitions, so performance holds up.
- As low as possible to avoid small files.
- Compressed data per bucket ≤ 20 GB (≤ 10 GB for Unique Key). Check with
SHOW TABLETS FROM your_table. - No more than 128 per partition. Consider partitioning first if you need more. In extreme cases the upper bound is 1024, but this is rarely needed in production.
Example Templates
Log / Event Analytics
CREATE TABLE app_logs
(
log_time DATETIME NOT NULL,
log_level VARCHAR(10),
service_name VARCHAR(50),
trace_id VARCHAR(64),
message STRING,
INDEX idx_message (message) USING INVERTED PROPERTIES("parser" = "unicode")
)
AUTO PARTITION BY RANGE(date_trunc(`log_time`, 'day'))
()
DISTRIBUTED BY RANDOM BUCKETS 10;
Real-Time Dashboard with Upsert (CDC)
CREATE TABLE user_profiles
(
user_id BIGINT NOT NULL,
username VARCHAR(50),
email VARCHAR(100),
status TINYINT,
updated_at DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
Metrics Aggregation
CREATE TABLE site_metrics
(
dt DATE NOT NULL,
site_id INT NOT NULL,
pv BIGINT SUM DEFAULT '0',
uv BIGINT MAX DEFAULT '0'
)
AGGREGATE KEY(dt, site_id)
AUTO PARTITION BY RANGE(date_trunc(`dt`, 'day'))
()
DISTRIBUTED BY HASH(site_id) BUCKETS 10;
Performance Pitfalls
Load
- Don't use
INSERT INTO VALUESfor bulk data. Use Stream Load or Broker Load instead. See Loading Overview. - Batch writes on the client side. High-frequency small imports cause version accumulation. If not feasible, use Group Commit.
- Break large imports into smaller batches. A failed long-running import must restart from scratch. Use INSERT INTO SELECT with S3 TVF for incremental import.
- Enable
load_to_single_tabletfor Duplicate Key tables with Random bucketing to reduce write amplification.
See Load Best Practices.
Query
- Avoid data skew. Check tablet sizes with
SHOW TABLETS. Switch to Random bucketing or a higher-cardinality bucket column if sizes vary significantly. - Don't over-bucket. Too many small tablets create scheduling overhead and can degrade query performance by up to 50%. See Bucketing for sizing guidelines.
- Don't under-bucket. Too few tablets limit CPU parallelism. See Bucketing for sizing guidelines.
- Put the right columns in the sort key. Unlike systems such as PostgreSQL, Doris only indexes the first 36 bytes of key columns and stops at the first VARCHAR. Columns beyond this prefix won't benefit from the sort key. Add inverted indexes for those columns. See Sort Key.
Use Query Profile to diagnose slow queries.