Skip to main content

Async Materialized View Overview

An async materialized view is a query acceleration solution that combines the flexibility of a view with the high performance of a physical table. By precomputing and storing query results, subsequent queries can hit the materialized result set directly, avoiding the overhead of repeatedly executing complex SQL.

Quick Tour

Before using async materialized views, confirm the following points:

  • Does the business need query acceleration, ETL simplification, lakehouse external table acceleration, or write optimization?
  • Does the query follow the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern?
  • Can the data tolerate eventual consistency (non-real-time synchronization)?
  • Is the Catalog of the base table within the supported scope (Internal/Hive/Iceberg/Paimon/Hudi/JDBC/ES)?
  • Is partitioned incremental refresh needed to lower refresh costs?

Use Cases

Async materialized views mainly target the following four scenarios:

ScenarioValue
Query acceleration and concurrency improvementSignificantly improves query speed, enhances concurrency, and reduces resource consumption
Simplified ETL workflowSimplifies the extract, transform, and load (ETL) process and improves development efficiency
Accelerated lakehouse external table queriesSignificantly improves query speed against external data sources in a lakehouse architecture
Improved write efficiencyReduces resource contention, optimizes the data write process, and ensures consistency and integrity

Limitations

Before using async materialized views, be aware of the following limitations:

  • Data consistency: An async materialized view is eventually consistent with the base table, but it cannot synchronize in real time and does not guarantee real-time consistency.
  • Window functions: Currently, queries that contain window functions are not supported for transparent rewriting into a form based on materialized views.
  • Multi-table scenarios: If the materialized view joins more tables than the query involves (for example, the query only involves t1 and t2, while the materialized view contains t1, t2, and t3), transparent rewriting is currently not supported.
  • Operators not supported by transparent rewriting: If the materialized view contains UNION ALL, LIMIT, ORDER BY, or CROSS JOIN, the materialized view can still be built normally, but it cannot be used for transparent rewriting.
  • Data type restrictions: Creating a materialized view does not currently support the VARBINARY type.

Principles

An async materialized view is essentially an internal table of type MTMV. When the materialized view is created, the system also registers a refresh task. When the task runs, it writes the latest data into the materialized view through an INSERT OVERWRITE statement.

Refresh Mechanism

Unlike the real-time incremental refresh of synchronous materialized views, async materialized views provide more flexible refresh options:

Refresh ModeDescriptionApplicable Scenarios
Full refreshRecomputes all data referenced by the materialized view definition SQL and writes it into the materialized view in fullSmall data volume or when overall consistency must be guaranteed
Partitioned incremental refreshIntelligently identifies changes in base table partition data and refreshes only the affected partitionsLarge data volume with changes concentrated in specific partitions
  • Full refresh: Ensures that the materialized view data is fully consistent with the base table, but may consume more compute resources and time.
  • Partitioned incremental refresh: Significantly reduces the resources and time required for refresh, while ensuring eventual consistency.

Transparent Rewriting

Transparent rewriting is an important means by which a database optimizes query performance. When processing a user query, the system automatically rewrites the SQL to hit a suitable materialized view, thereby improving execution efficiency and reducing computation costs. The entire process is transparent to the user and requires no intervention.

Doris async materialized views use a transparent rewriting algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern:

  1. Performs in-depth analysis of the structural information of the SQL.
  2. Automatically finds and selects a suitable materialized view for transparent rewriting.
  3. When multiple candidate materialized views exist, selects the optimal materialized view to respond to the query based on strategies such as a cost model, further improving performance.

Creating an Async Materialized View on a Data Lake

The syntax for creating an async materialized view on a data lake is identical to creating one on an internal table, but note the following:

  • Metadata source: The partition version and other information required for materialized view refresh come from the metadata cache of the data lake, not directly from the external environment. Therefore, after the refresh completes, the data is consistent with the result of querying the data lake through Doris, but it may be inconsistent with the results from other engines, depending on how the cache has been refreshed.
  • External change awareness: If the underlying Hive data is changed by an external process not controlled by Doris (such as a Spark, Hive, or Flink job) but the metadata does not change (for example, when insert overwrite is executed), the materialized view will incorrectly consider itself consistent with the base table, but the query results will be inconsistent with the results of querying the data lake through Doris. This issue can be resolved by manually forcing a refresh of the materialized view.
  • Iceberg limitations: When creating a partitioned materialized view on Iceberg, only Iceberg tables with a single partition column are supported, and partition evolution is supported to a limited extent. For example, changes to the time range of a time-type partition are supported; if the partition field itself changes, the materialized view refresh will fail.
  • Hudi limitations: When creating a materialized view on Hudi, the system cannot detect whether the base table data has changed. As long as the materialized view (or some of its partitions) has been refreshed, it is considered synchronized with the base table. Therefore, creating a materialized view on Hudi is only suitable for manual on-demand refresh scenarios.

Materialized View Refresh Support on Data Lakes

The following table shows the level of refresh support for different table types and Catalogs:

Table Type Catalog Type Refresh Method Refresh Trigger
Full Refresh Partitioned Refresh Auto Trigger
Internal Table Internal Supported in 2.1 Supported in 2.1 Supported in 2.1.4
Hive Hive Supported in 2.1 Supported in 2.1 Not supported
Iceberg Iceberg Supported in 2.1 Supported in 3.1 Not supported
Paimon Paimon Supported in 2.1 Supported in 3.1 Not supported
Hudi Hudi Supported in 2.1 Supported in 3.1 Not supported
JDBC JDBC Supported in 2.1 Not supported Not supported
ES ES Supported in 2.1 Not supported Not supported

Transparent Rewriting Support on Data Lakes

The transparent rewriting feature of async materialized views currently supports the following table types and Catalogs.

Real-time awareness of base table data: This means that when the base table data used by the materialized view changes, the materialized view can detect the change in real time and use the latest data in queries.

Table Type Catalog Type Transparent Rewriting Support Real-time Awareness of Base Table Data
Internal Table Internal Supported Supported
Hive Hive Supported Supported in 3.1
Iceberg Iceberg Supported Supported in 3.1
Paimon Paimon Supported Supported in 3.1
Hudi Hudi Supported Not supported
JDBC JDBC Supported Not supported
ES ES Supported Not supported

Enabling Transparent Rewriting for External Table Materialized Views

When a materialized view uses external tables, it does not participate in transparent rewriting by default. To enable it, run:

SET materialized_view_rewrite_enable_contain_external_table = true;

Performance Optimization for External Table Transparent Rewriting

Starting from 2.1.11, Doris has optimized the transparent rewriting performance for external tables, mainly by improving the performance of obtaining available materialized views that contain external tables.

If transparent rewriting on a partitioned materialized view that contains external tables is slow, you can adjust the following parameters in fe.conf:

ParameterDefaultDescriptionVersion
max_hive_partition_cache_num10000The maximum number of table-level partition caches in Hive Metastore. When a Hive external table has many partitions, set it to 20000 or higher-
external_cache_expire_time_minutes_after_access10 minutesThe expiration time of a cached object after its last access. It can be increased appropriately (applies to both external table schema cache and Hive metadata cache)-
external_cache_refresh_time_minutes10 minutesThe auto-refresh interval of external table metadata cache objects. It can be increased appropriately3.1+

For more configuration details on external table metadata cache, see Metadata Cache.

Relationship Between Materialized Views and OLAP Internal Tables

There is no restriction on the base table model used in the SQL definition of an async materialized view. It can be a Duplicate model, a Unique model (merge-on-write or merge-on-read), an Aggregate model, and so on.

The underlying implementation of the materialized view itself relies on an OLAP table of the Duplicate model and can in theory support all core features of the Duplicate model. However, to ensure that the materialized view can stably and efficiently execute data refresh tasks, a series of necessary restrictions are placed on its functionality:

  • Partition operations: The partitions of a materialized view are automatically created and maintained based on its base tables. Users cannot perform partition operations on a materialized view.
  • Drop and rename: Because there are associated jobs (JOB) behind a materialized view, the DELETE TABLE or RENAME TABLE commands cannot be used to operate on a materialized view. Use the materialized view's own commands to perform the corresponding operations.
  • Column data types: The column data types of a materialized view are automatically derived from the query statement specified at creation time and cannot be modified, otherwise the refresh task may fail.
  • Property modification: A materialized view has some properties that a Duplicate table does not have. These must be modified through the materialized view's commands. Other common properties are modified using the ALTER TABLE command.

FAQ

Q1: Can an async materialized view guarantee real-time data consistency with the base table?

No. An async materialized view is eventually consistent with the base table, but it cannot synchronize in real time. If real-time consistency is required, consider a synchronous materialized view.

Q2: Why can a query that contains window functions not be rewritten?

The current transparent rewriting algorithm does not yet support window functions. Queries that contain window functions are not rewritten into a form based on materialized views.

Q3: Why is a materialized view based on an external table not hit by rewriting?

External-table-based materialized views do not participate in transparent rewriting by default. Set materialized_view_rewrite_enable_contain_external_table = true to make them participate in rewriting.

Q4: Why are the materialized view query results inconsistent after Hive external table data is modified by an external process?

If Hive data is modified by an external process such as Spark, Hive, or Flink but the metadata does not change (for example, insert overwrite), the materialized view cannot detect the change. This can be resolved by manually forcing a refresh of the materialized view.

Q5: Can a materialized view based on Hudi refresh automatically?

It cannot detect changes in the base table and is only suitable for manual on-demand refresh scenarios.

See Also