Workload Group

Workload Group is an in-process mechanism for isolating workloads. It achieves resource isolation by finely partitioning or limiting resources (CPU, IO, Memory) within the BE process. Its principle is illustrated in the diagram below:

workload_group

The currently supported isolation capabilities include:

Managing CPU resources, with support for both cpu hard limit and cpu soft limit;
Managing memory resources, with support for both memory hard limit and memory soft limit;
Managing IO resources, including IO generated by reading local and remote files.

tip

Workload Group provides in-process resource isolation capabilities, which differ from inter-process resource isolation methods (such as Resource Group and Compute Group) in the following ways:

In-process resource isolation cannot achieve complete isolation. For example, when high-load and low-load queries run within the same process, even if CPU usage for the high-load group is restricted using Workload Group to keep overall CPU usage within a reasonable range, the latency of the low-load group may still be affected. However, it will perform better compared to having no CPU control at all. This limitation arises because certain shared components within the process, such as common caches and shared RPC thread pools, are difficult to isolate entirely.
The choice of a resource isolation strategy depends on the trade-off between isolation and cost. If some degree of latency can be tolerated while prioritizing lower costs, the Workload Group isolation approach may be suitable. On the other hand, if complete isolation is required and higher costs are acceptable, an inter-process resource isolation approach (i.e., placing isolated workloads in separate processes) should be considered. For example, using Resource Group or Compute Group to allocate high-priority workloads to independent BE nodes can achieve a more thorough isolation.

Version Notes

The Workload Group feature has been available since Doris 2.0. In Doris 2.0, the Workload Group feature does not rely on CGroup, but starting with Doris 2.1, it requires CGroup.
Upgrading from Doris 1.2 to 2.0: It is recommended to enable the Workload Group feature only after the entire cluster has been upgraded. If only some follower FE nodes are upgraded, queries on the upgraded follower FE nodes may fail due to the absence of Workload Group metadata on the non-upgraded FE nodes.
Upgrading from Doris 2.0 to 2.1: Since the Workload Group feature in Doris 2.1 relies on CGroup, you need to configure the CGroup environment before upgrading to Doris 2.1.

Configuring Workload Group

Setting Up the CGroup Environment

Workload Group supports managing CPU, memory, and IO. CPU management relies on the CGroup component. To use Workload Group for CPU resource management, you must first configure the CGroup environment.

The following are the steps for configuring the CGroup environment:

First, verify whether the node where the BE is located has CGroup installed. If the output includes cgroup, it indicates that CGroup V1 is installed in the current environment. If it includes cgroup2, it indicates that CGroup V2 is installed. You can determine which version is active in the next step.

cat /proc/filesystems | grep cgroup
nodev	cgroup
nodev	cgroup2
nodev	cgroupfs

The active CGroup version can be confirmed based on the path name.

If this path exists, it indicates that CGroup V1 is currently active.
/sys/fs/cgroup/cpu/


If this path exists, it indicates that CGroup V2 is currently active.
/sys/fs/cgroup/cgroup.controllers

Create a directory named doris under the CGroup path. The directory name can be customized by the user.

If using CGroup V1, create the directory under the cpu directory.
mkdir /sys/fs/cgroup/cpu/doris


If using CGroup V2, create the directory directly under the cgroup directory.
mkdir /sys/fs/cgroup/doris

Ensure that the Doris BE process has read, write, and execute permissions for this directory.

// If using CGroup V1, the command is as follows:
// 1. Modify the directory's permissions to be readable, writable, and executable.
chmod 770 /sys/fs/cgroup/cpu/doris
// 2. Change the ownership of this directory to the doris account.
chown -R doris:doris /sys/fs/cgroup/cpu/doris


// If using CGroup V2, the command is as follows:
// 1.Modify the directory's permissions to be readable, writable, and executable.
chmod 770 /sys/fs/cgroup/doris
// 2. Change the ownership of this directory to the doris account.
chown -R doris:doris /sys/fs/cgroup/doris

If the current environment is using CGroup v2, the following steps are required. If it is CGroup v1, this step can be skipped.

Modify the permissions of the cgroup.procs file in the root directory. This is necessary because CGroup v2 has stricter permission controls, and write permissions to the cgroup.procs file in the root directory are required to move processes between CGroup directories.

chmod a+w /sys/fs/cgroup/cgroup.procs

In CGroup v2, the cgroup.controllers file lists the available controllers for the current directory, and the cgroup.subtree_control file lists the controllers available for the subdirectories. Therefore, it is necessary to check if the doris directory has the cpu controller enabled. If the cgroup.controllers file in the doris directory does not include cpu, it means the cpu controller is not enabled. You can enable it by executing the following command in the doris directory. This command works by modifying the cgroup.subtree_control file in the parent directory to allow the doris directory to use the cpu controller.

// After running this command, you should be able to see the cpu.max file in the doris directory, 
// and the output of cgroup.controllers should include cpu.
// If the command fails, it means that the parent directory of doris also does not have the cpu controller enabled, 
// and you will need to enable the cpu controller for the parent directory.
echo +cpu > ../cgroup.subtree_control

Modify the BE configuration to specify the path of the cgroup.

If using CGroup V1, the configuration path is as follows:
doris_cgroup_cpu_path = /sys/fs/cgroup/cpu/doris

If using CGroup V2, the configuration path is as follows:
doris_cgroup_cpu_path = /sys/fs/cgroup/doris

Restart the BE, and in the log (be.INFO), the phrase 'add thread xxx to group' indicates that the configuration was successful.

tip

It is recommended to deploy only one BE per machine, as the current Workload Group feature does not support deploying multiple BE instances on a single machine.
After a machine is restarted, all configurations under the CGroup path will be cleared. To persist the CGroup configuration, you can use systemd to set the operation as a custom system service, so that the creation and authorization operations can be automatically performed each time the machine restarts.
If using CGroup within a container, the container must have permission to operate on the host machine.

Considerations for Using Workload Group in Containers

Workload's CPU management is based on CGroup. If you want to use Workload Group inside a container, the container needs to be started in privileged mode so that the BE process inside the container has permission to read and write CGroup files on the host machine.

When BE runs inside a container, the CPU resource usage for Workload Group is partitioned based on the available resources of the container. For example, if the host machine has 64 cores and the container is allocated 8 cores, and the Workload Group is configured with a 50% CPU hard limit, the actual available CPU cores for the Workload Group will be 4 (8 cores * 50%).

The memory and IO management functions of Workload Group are implemented internally by Doris and do not rely on external components, so there is no difference in deployment between containers and physical machines.

If you want to use Doris on K8S, it is recommended to deploy it using the Doris Operator, which can shield underlying permission issues.

Create Workload Group

mysql [information_schema]>create workload group if not exists g1
    -> properties (
    ->     "cpu_share"="1024"
    -> );
Query OK, 0 rows affected (0.03 sec)

You can refer to CREATE-WORKLOAD-GROUP。

The CPU limit configured at this point is a soft limit. Since version 2.1, Doris will automatically create a group named normal, which cannot be deleted.

Workload Group Properties

Property	Data type	Default value	Value range	Description
cpu_share	Integer	-1	[1, 10000]	Optional, effective under CPU soft limit mode. The valid range of values depends on the version of CGroup being used, which is described in detail later. cpu_share represents the weight of CPU time that the Workload Group can acquire; the larger the value, the more CPU time it can obtain. For example, if the user creates three Workload Groups, g-a, g-b, and g-c, with cpu_share values of 10, 30, and 40 respectively, and at a certain point g-a and g-b are running tasks while g-c has no tasks, g-a will receive 25% (10 / (10 + 30)) of the CPU resources, and g-b will receive 75% of the CPU resources. If only one Workload Group is running in the system, regardless of the cpu_share value, it will be able to acquire all the CPU resources.
memory_limit	Float	-1	(0%, 100%]	Optional. Enabling memory hard limit represents the maximum available memory percentage for the current Workload Group. The default value means no memory limit is applied. The cumulative value of memory_limit for all Workload Groups cannot exceed 100%, and it is typically used in conjunction with the enable_memory_overcommit attribute. For example, if a machine has 64GB of memory and the memory_limit for a Workload Group is set to 50%, the actual physical memory available for that group would be 64GB * 90% * 50% = 28.8GB, where 90% is the default value for the available memory configuration of the BE process.
enable_memory_overcommit	Boolean	true	true, false	Optional. Used to control whether the memory limit for the current Workload Group is a hard limit or a soft limit, with the default set to true. If set to false, the Workload Group will have hard memory limit, and when the system detects that the memory usage exceeds the limit, it will immediately cancel the tasks with the highest memory usage within the group to release the excess memory. If set to true, the Workload Group will have soft memory limit. If there is free memory available, the Workload Group can continue using system memory even after exceeding the memory_limit. When the system's total memory is under pressure, the system will cancel the tasks with the highest memory usage within the group and release some of the excess memory to alleviate system memory pressure. It is recommended that the total memory_limit of all Workload Groups be kept below 100% to reserve memory for other components of the BE process.
cpu_hard_limit	Integer	-1	[1%, 100%]	Optional. Effective under CPU hard limit mode, it represents the maximum CPU percentage a Workload Group can use. Regardless of whether the machine's CPU resources are fully utilized, the Workload Group's CPU usage cannot exceed the cpu_hard_limit. The cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%. This attribute was introduced in version 2.1 and is not supported in version 2.0.
max_concurrency	Integer	2147483647	[0, 2147483647]	Optional. Specifies the maximum query concurrency. The default value is the maximum value of an integer, meaning no concurrency limit. When the number of running queries reaches the maximum concurrency, new queries will enter a queue.
max_queue_size	Integer	0	[0, 2147483647]	Optional. Specifies the length of the query waiting queue. When the queue is full, new queries will be rejected. The default value is 0, which means no queuing. If the queue is full, new queries will fail directly.
queue_timeout	Integer	0	[0, 2147483647]	Optional. Specifies the maximum waiting time for a query in the waiting queue, in milliseconds. If the query's waiting time in the queue exceeds this value, an exception will be thrown directly to the client. The default value is 0, meaning no queuing; queries will immediately fail upon entering the queue.
scan_thread_num	Integer	-1	[1, 2147483647]	Optional. Specifies the number of threads used for scanning in the current Workload Group. When this property is set to -1, it means it is not active, and the actual scan thread num on the BE will default to the doris_scanner_thread_pool_thread_num configuration in the BE.
max_remote_scan_thread_num	Integer	-1	[1, 2147483647]	Optional. Specifies the maximum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores.
min_remote_scan_thread_num	Integer	-1	[1, 2147483647]	Optional. Specifies the minimum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores.
tag	String	empty	-	This feature has been discontinued and is not recommended for use in production environments; Specifies tags for the Workload Group. The cumulative resource values of Workload Groups with the same tag cannot exceed 100%. To specify multiple values, use commas to separate them.
read_bytes_per_second	Integer	-1	[1, 9223372036854775807]	Optional. Specifies the maximum I/O throughput when reading internal tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied. It is important to note that this value is not tied to individual disks but to directories. For example, if Doris is configured with two directories to store internal table data, the maximum read I/O for each directory will not exceed this value. If both directories are placed on the same disk, the maximum throughput will be doubled (i.e., 2 times read_bytes_per_second). The file directory for spill disk is also subject to this limit.
remote_read_bytes_per_second	Integer	-1	[1, 9223372036854775807]	Optional. Specifies the maximum I/O throughput when reading external tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied.

tip

Currently, the simultaneous use of both cpu hard limit and cpu soft limit is not supported. At any given time, a cluster can only have either a soft limit or a hard limit. The method for switching between them will be described later.
All properties are optional, but at least one property must be specified when creating a Workload Group.
It is important to note that the default values for CPU soft limits differ between CGroup v1 and CGroup v2. The default CPU soft limit for CGroup v1 is 1024, with a valid range from 2 to 262144, while the default for CGroup v2 is 100, with a valid range from 1 to 10000. If a value outside the range is set for the soft limit, it may cause the CPU soft limit modification to fail in BE. If the default value of 100 from CGroup v2 is applied in a CGroup v1 environment, it could result in this Workload Group having the lowest priority on the machine.

Set Workload Group for user

Before binding a user to a specific Workload Group, it is necessary to ensure that the user has the necessary permissions for the Workload Group. You can use the user to query the information_schema.workload_groups system table, and the result will show the Workload Groups that the current user has permission to access. The following query result indicates that the current user has access to the g1 and normal Workload Groups:

SELECT name FROM information_schema.workload_groups;
+--------+
| name   |
+--------+
| normal |
| g1     |
+--------+

If the g1 Workload Group is not visible, you can use the ADMIN account to execute the GRANT statement to authorize the user. For example:

GRANT USAGE_PRIV ON WORKLOAD GROUP 'g1' TO 'user_1'@'%';

This statement means granting the user_1 the permission to use the Workload Group named g1. More details can be found in grant。

Two ways to bind Workload Group to user

By setting the user property, you can bind the user to a default Workload Group. The default is normal. It's important to note that the value here cannot be left empty; otherwise, the statement will fail.

set property 'default_workload_group' = 'g1';

After executing this statement, the current user's queries will default to using the 'g1' Workload Group.

By specifying the Workload Group through a session variable, the default is empty:

set workload_group = 'g1';

When both methods are used to specify a Workload Group for the user, the session variable takes priority over the user property.

Show Workload Group

You can use the SHOW statement to view the Workload Group:

show workload groups;

More details can be found in SHOW-WORKLOAD-GROUPS

You can view the Workload Group through the system table:

mysql [information_schema]>select * from information_schema.workload_groups where name='g1';
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
| ID    | NAME | CPU_SHARE | MEMORY_LIMIT | ENABLE_MEMORY_OVERCOMMIT | MAX_CONCURRENCY | MAX_QUEUE_SIZE | QUEUE_TIMEOUT | CPU_HARD_LIMIT | SCAN_THREAD_NUM | MAX_REMOTE_SCAN_THREAD_NUM | MIN_REMOTE_SCAN_THREAD_NUM | MEMORY_LOW_WATERMARK | MEMORY_HIGH_WATERMARK | TAG  | READ_BYTES_PER_SECOND | REMOTE_READ_BYTES_PER_SECOND |
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
| 14009 | g1   |      1024 | -1           | true                     |      2147483647 |              0 |             0 | -1             |              -1 |                         -1 |                         -1 | 50%                  | 80%                   |      |                    -1 |                           -1 |
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
1 row in set (0.05 sec)

Alter Workload Group

mysql [information_schema]>alter workload group g1 properties('cpu_share'='2048');
Query OK, 0 rows affected (0.00 sec

mysql [information_schema]>select cpu_share from information_schema.workload_groups where name='g1';
+-----------+
| cpu_share |
+-----------+
|      2048 |
+-----------+
1 row in set (0.02 sec)

More details can be found in ALTER-WORKLOAD-GROUP

Drop Workload Group

mysql [information_schema]>drop workload group g1;
Query OK, 0 rows affected (0.01 sec)

More details can be found inDROP-WORKLOAD-GROUP

Explanation of Switching Between CPU Soft and Hard Limit Modes

Currently, Doris does not support running both CPU soft and hard limits simultaneously. At any given time, a Doris cluster can only operate in either CPU soft limit or CPU hard limit mode. Users can switch between these two modes, and the switching method is as follows:

1 If the current cluster configuration is set to the default CPU soft limit and you wish to change it to CPU hard limit, you need to modify the cpu_hard_limit parameter of the Workload Group to a valid value.

alter workload group test_group properties ( 'cpu_hard_limit'='20%' );

All Workload Groups in the cluster need to be modified, and the cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%.

Since CPU hard limits cannot automatically have a valid value, simply enabling the switch without modifying the property will prevent the CPU hard limit from taking effect.

2 Enable the CPU hard limit on all FE nodes

1 Modify the configuration in the fe.conf file on the disk.
experimental_enable_cpu_hard_limit = true


2 Modify the configuration in memory.
ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true");

If the user wishes to switch from CPU hard limit back to CPU soft limit, they need to set the value of enable_cpu_hard_limit to false on all FE nodes. The CPU soft limit property cpu_share will default to a valid value of 1024 (if it was not previously specified). Users can adjust the cpu_share value based on the priority of the group.

Testing

Memory hard limit

Adhoc-type queries typically have unpredictable SQL inputs and uncertain memory usage, which poses the risk of a few queries consuming a large amount of memory. These types of workloads can be allocated to a separate group, and by using the Workload Group's memory hard limit feature, it helps prevent sudden large queries from consuming all memory, which could cause other queries to run out of available memory or result in OOM (Out of Memory) errors. When the memory usage of this Workload Group exceeds the configured hard limit, the system will kill queries to release memory, preventing the process from running out of memory.

Testing environment

1 FE, 1 BE, with BE configured to 96 cores and 375GB of memory.

The test dataset is clickbench, and the testing method involves using JMeter to run query Q29 with three concurrent executions.

Test without enabling memory hard limit for Workload Group

Check the memory usage of the process. The fourth column in the ps command output represents the physical memory usage of the process, in kilobytes (KB). It shows that under the current test load, the process uses approximately 7.7GB of memory.

[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         2.0 7896792
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         2.0 7929692
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         2.0 8101232

Use Doris system tables to check the current memory usage of the Workload Group. The memory usage of the Workload Group is approximately 5.8GB.

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb    |
+-------------------+
| 5797.524360656738 |
+-------------------+
1 row in set (0.01 sec)

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb    |
+-------------------+
| 5840.246627807617 |
+-------------------+
1 row in set (0.02 sec)

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb    |
+-------------------+
| 5878.394917488098 |
+-------------------+
1 row in set (0.02 sec)

Here, we can see that the process memory usage is typically much larger than the memory usage of a Workload Group, even if only one Workload Group is running. This is because the Workload Group only tracks the memory used by queries and loads The memory used by other components within the process, such as metadata and various caches, is not counted as part of the Workload Group's memory usage, nor is it managed by the Workload Group.

Test with the memory hard limit for Workload Group enabled

Execute the SQL command to modify the memory configuration.

alter workload group g2 properties('memory_limit'='0.5%');
alter workload group g2 properties('enable_memory_overcommit'='false');

Run the same test and check the memory usage in the system table; the memory usage is around 1.5G.

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+--------------------+
| wg_mem_used_mb     |
+--------------------+
| 1575.3877239227295 |
+--------------------+
1 row in set (0.02 sec)

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+------------------+
| wg_mem_used_mb   |
+------------------+
| 1668.77405834198 |
+------------------+
1 row in set (0.01 sec)

mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+--------------------+
| wg_mem_used_mb     |
+--------------------+
| 499.96760272979736 |
+--------------------+
1 row in set (0.01 sec)

Use the ps command to check the memory usage of the process; the memory usage is around 3.8G.

[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         1.0 4071364
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         1.0 4059012
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be         1.0 4057068

At the same time, the client will observe a significant number of query failures caused by insufficient memory.

1724074250162,14126,1c_sql,HY000 1105,"java.sql.SQLException: errCode = 2, detailMessage = (127.0.0.1)[MEM_LIMIT_EXCEEDED]GC wg for hard limit, wg id:11201, name:g2, used:1.71 GB, limit:1.69 GB, backend:10.16.10.8. cancel top memory used tracker <Query#Id=4a0689936c444ac8-a0d01a50b944f6e7> consumption 1.71 GB. details:process memory used 3.01 GB exceed soft limit 304.41 GB or sys available memory 101.16 GB less than warning water mark 12.80 GB., Execute again after enough memory, details see be.INFO.",并发 1-3,text,false,,444,0,3,3,null,0,0,0

From the error message, it can be observed that the Workload Group used 1.7G of memory, but the Workload Group's limit is 1.69G. The calculation is as follows:1.69G = Physical machine memory (375G) * mem_limit (value from be.conf, default is 0.9) * 0.5% (Workload Group's configuration). This means the memory percentage configured in the Workload Group is calculated based on the memory available to the BE process.

Recommendations

As demonstrated in the tests above, memory hard limits can control the memory usage of a Workload Group but do so by terminating queries to release memory. This approach can lead to a poor user experience and, in extreme cases, may cause all queries to fail.

Therefore, in production environments, it is recommended to use memory hard limits in conjunction with query queuing functionality. This ensures controlled memory usage while maintaining query success rates.

CPU hard limit

Doris workloads can generally be categorized into three types:

Core Report Queries: These are typically used by company executives to view reports. While the load may not be very high, the availability requirements are strict. These queries can be assigned to a group with a higher-priority soft limit, ensuring they receive more CPU resources when resources are insufficient.
Adhoc queries are typically exploratory and analytical in nature, with random SQL and unpredictable resource consumption. Their priority is usually low. Therefore, CPU hard limits can be used to manage these queries, configuring lower values to prevent excessive CPU resource usage that could reduce cluster availability.
ETL queries typically have fixed SQL and stable resource consumption, although there may occasionally be spikes in resource usage due to increased upstream data. Therefore, CPU hard limits can be configured to manage these queries.

Different workloads have varying CPU consumption, and users have different latency requirements. When the BE CPU is fully utilized, availability decreases, and response times increase. For example, an Adhoc analysis query may fully utilize the CPU of the entire cluster, causing core report queries to experience higher latency, which impacts SLA. Therefore, a CPU isolation mechanism is needed to separate different workloads and ensure cluster availability and SLA.

Workload Group supports both CPU soft limits and hard limits. It is currently recommended to configure Workload Groups with hard limits in production environments. This is because CPU soft limits typically only show priority effects when the CPU is fully utilized. However, when the CPU is fully used, internal Doris components (such as the RPC component) and the operating system’s available CPU are reduced, leading to a significant drop in overall cluster availability. Therefore, in production environments, it is essential to avoid CPU resource exhaustion, and the same logic applies to other resources such as memory.

Test environment

1 FE, 1 BE, 96-core machine. The dataset is clickbench, and the test SQL is q29.

Tesing

Using JMeter to initiate 3 concurrent queries, the CPU usage of the BE process is pushed to a relatively high usage rate. The test machine has 96 cores, and using the top command, we can see that the BE process's CPU usage is 7600%, which means the process is currently using 76 cores.
Modify the CPU hard limit of the currently used Workload Group to 10%.
```
alter workload group g2 properties('cpu_hard_limit'='10%');
```

Switch to CPU hard limit mode.

ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true");

Re-run the load test for queries, and you can see that the current process can only use 9 to 10 cores, which is about 10% of the total cores.

It is important to note that this test is best conducted using query workloads, as they are more likely to reflect the effect. If testing load, it may trigger Compaction, causing the actual observed values to be higher than the values configured in the Workload Group. Currently, Compaction workloads are not managed under the Workload Group.

In addition to using Linux system commands, you can also observe the current CPU usage of the group through Doris's system tables, where the CPU usage is around 10%.

mysql [information_schema]>select CPU_USAGE_PERCENT from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+-------------------+
| CPU_USAGE_PERCENT |
+-------------------+
|              9.57 |
+-------------------+
1 row in set (0.02 sec)

note

When configuring, it's better not to set the total CPU allocation of all groups to exactly 100%. This is mainly to ensure the availability of low-latency scenarios, as some resources need to be reserved for other components. However, for scenarios that are not very sensitive to latency and aim for maximum resource utilization, setting the total CPU allocation of all groups to 100% can be considered.
Currently, the interval for synchronizing Workload Group metadata from FE to BE is 30 seconds. Therefore, changes to Workload Group settings may take up to 30 seconds to take effect.

Limit local IO

In OLAP systems, during ETL operations or large Adhoc queries, a significant amount of data needs to be read. To speed up the data analysis process, Doris uses multi-threaded parallel scanning across multiple disk files, which generates substantial disk IO that can impact other queries (such as report analysis). By using Workload Groups, Doris can group offline ETL data processing and online report queries separately, limiting the offline data processing IO bandwidth. This helps reduce the impact of offline data processing on online report analysis.

Test environment

1 FE, 1 BE, 96-core machine. Dataset: clickbench. Test query: q29.

Testing without enabling IO hard limits

Clear Cache.

// clear OS cache
sync; echo 3 > /proc/sys/vm/drop_caches

// disable BE page cache
disable_storage_page_cache = true

Perform a full table scan on the clickbench table, and execute a single concurrent query.
```
set dry_run_query = true;
select * from hits.hits;
```

Check the maximum throughput of the current Group as 3GB per second through Doris's system table.

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 1146.6208400726318 |
+--------------------+
1 row in set (0.03 sec)

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 3496.2762966156006 |
+--------------------+
1 row in set (0.04 sec)

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 2192.7690029144287 |
+--------------------+
1 row in set (0.02 sec)

Use the pidstat command to check the process IO. The first column is the process ID, and the second column is the read IO throughput (in kb/s). It can be seen that when IO is not restricted, the maximum throughput is 2GB per second.

Test after enabling IO hard limit

Clear cache.

// Clear OS cache.
sync; echo 3 > /proc/sys/vm/drop_caches

// disable BE page cache
disable_storage_page_cache = true

Modify the Workload Group configuration to limit the maximum throughput to 100M per second.
```
alter workload group g2 properties('read_bytes_per_second'='104857600');
```

Use Doris system tables to check that the maximum IO throughput of the Workload Group is 98M per second.

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 97.94296646118164  |
+--------------------+
1 row in set (0.03 sec)

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 98.37584781646729  |
+--------------------+
1 row in set (0.04 sec)

mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec         |
+--------------------+
| 98.06641292572021  |
+--------------------+
1 row in set (0.02 sec)

Use the pid tool to check that the maximum IO throughput of the process is 131M per second.

Note

The LOCAL_SCAN_BYTES_PER_SECOND field in the system table represents the summary value of the current Workload Group's statistics at the process level. For example, if 12 file paths are configured, LOCAL_SCAN_BYTES_PER_SECOND is the maximum IO value of these 12 file paths. If you wish to view the IO throughput for each file path separately, you can check the detailed values in Grafana.
Due to the presence of the operating system and Doris's Page Cache, the IO observed through Linux's IO monitoring scripts is typically smaller than the IO seen in the system table.

Limit remote IO

BrokerLoad and S3Load are commonly used methods for large-scale data load. Users can first upload data to HDFS or S3, and then use BrokerLoad and S3Load to load data in parallel. To speed up the load process, Doris uses multi-threading to pull data from HDFS/S3, which can generate significant pressure on HDFS/S3, potentially making other jobs running on HDFS/S3 unstable.

To mitigate the impact on other workloads, the Workload Group's remote IO limit feature can be used to restrict the bandwidth used during the load process from HDFS/S3. This helps reduce the impact on other business operations.

Test environment

1 FE and 1 BE are deployed on the same machine, configured with 16 cores and 64GB of memory. The test data is the clickbench dataset, and before testing, we need to upload the dataset to S3. Considering the upload time, we will only upload 10 million rows of data, and then use the TVF function to query the data from S3.

After the upload is successful, you can use the command to view the schema information.

```sql
DESC FUNCTION s3 (
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style"="true"
);
```

Test without restricting remote read IO

Initiate a single-threaded test to perform a full table scan on the clickbench table.

// Set the operation to only scan the data without returning results.
set dry_run_query = true;

SELECT * FROM s3(
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style"="true"
);

Use the system table to check the current remote IO throughput. It shows that the remote IO throughput for this query is 837 MB per second. Note that the actual IO throughput here is highly dependent on the environment. If the machine hosting the BE has limited bandwidth to the external storage, the actual throughput may be lower.

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     837 |
+---------+
1 row in set (0.104 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     867 |
+---------+
1 row in set (0.070 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     867 |
+---------+
1 row in set (0.186 sec)

Use the sar command (sar -n DEV 1 3600) to monitor the machine's network bandwidth. It shows that the maximum network bandwidth at the machine level is 1033 MB per second. The first column of the output represents the number of bytes received per second by a specific network interface on the machine, in KB per second.

Test limiting remote read IO

Modify the Workload Group configuration to limit remote read IO throughput to 100M per second.
```
alter workload group normal properties('remote_read_bytes_per_second'='104857600');
```

Initiate a single concurrent full table scan query.

set dry_run_query = true;

SELECT * FROM s3(
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key"= "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style"="true"
);

Use the system table to check the current remote read IO throughput. At this time, the IO throughput is around 100M, with some fluctuations. These fluctuations are influenced by the current algorithm design, typically peaking briefly without persisting for long periods, which is considered normal.

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|      56 |
+---------+
1 row in set (0.010 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     131 |
+---------+
1 row in set (0.009 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     111 |
+---------+
1 row in set (0.009 sec)

Use the sar command (sar -n DEV 1 3600) to monitor the current network card's received traffic. The first column represents the amount of data received per second. The maximum value observed is now 207M per second, indicating that the read IO limit is effective. However, since the sar command reflects machine-level traffic, the observed value is slightly higher than what Doris reports.

Frequently Asked Questions

Why is the CPU hard limit configuration not taking effect?

This is usually caused by the following reasons:
- Environment initialization failed. You need to check the two configuration files under the Doris CGroup path. Here, we take the CGroup V1 version as an example. If the user has specified the Doris CGroup path as /sys/fs/cgroup/cpu/doris/, you should first check if the content of /sys/fs/cgroup/cpu/doris/query/1/tasks contains the thread IDs corresponding to the Workload Group. The "1" in the path represents the Workload Group ID, which can be obtained by running the command top -H -b -n 1 -p pid to find the thread IDs of the Workload Group. After confirming, ensure that the thread IDs of the Workload Group are written into the tasks file. Then, check if the value of /sys/fs/cgroup/cpu/doris/query/1/cpu.cfs_quota_us is -1. If it is -1, it means the CPU hard limit configuration has not taken effect.
- The CPU usage of the Doris BE process is higher than the CPU hard limit configured for the Workload Group. This is expected because the CPU managed by the Workload Group is primarily for query threads and memtable flush threads for Load. However, the BE process typically has other components consuming CPU as well, such as Compaction. Therefore, the CPU usage of the process is generally higher than the configured limit for the Workload Group. You can create a test Workload Group that only stresses the query load and then check the CPU usage of the Workload Group through the system table information_schema.workload_group_resource_usage. This table only records the CPU usage of the Workload Group and has been supported since version 2.1.6.
- Some users have configured the cpu_resource_limit.First, execute show property for jack like 'cpu_resource_limit' to check whether this parameter is set in the properties of the user jack. Then, execute show variables like 'cpu_resource_limit' to verify whether this parameter is set in the session variables. The default value of this parameter is -1, which indicates that it is not set. After configuring this parameter, queries are handled by an independent thread pool that is not managed by the Workload Group. Directly modifying this parameter may affect the stability of the production environment. It is recommended to gradually migrate the query loads that are configured with this parameter to be managed by the Workload Group. The current alternative to this parameter is the session variable num_scanner_threads. The main process is as follows: First, divide the users who have configured cpu_resource_limit into several batches. When migrating the first batch of users, modify the session variable num_scanner_threads for these users to 1. Then, assign a Workload Group to these users. After that, change cpu_resource_limit to -1 and observe the cluster's stability over a period of time. If the cluster remains stable, continue migrating the next batch of users.

Why is the default number of Workload Groups limited to 15?

Workload Groups are primarily used for dividing resources on a single machine. If too many Workload Groups are created on one machine, each Workload Group will only receive a very small portion of the resources. If the user indeed requires creating this many Workload Groups, you can consider dividing the cluster into multiple groups of BEs and then creating different Workload Groups for each group of BEs. You can also temporarily bypass this limit by modifying the FE configuration workload_group_max_num.

Why does the error "Resource temporarily unavailable" occur after configuring many Workload Groups?

Each Workload Group corresponds to an independent thread pool. Creating too many Workload Groups may cause the BE process to attempt to start too many threads, exceeding the maximum number of threads allowed for a process by the operating system. To resolve this issue, you can modify the system environment configuration to allow the BE process to create more threads.

Version Notes​

Configuring Workload Group​

Setting Up the CGroup Environment​

Considerations for Using Workload Group in Containers​

Create Workload Group​

Workload Group Properties​

Set Workload Group for user​

Show Workload Group​

Alter Workload Group​

Drop Workload Group​

Explanation of Switching Between CPU Soft and Hard Limit Modes​

Testing​

Memory hard limit​

CPU hard limit​

Limit local IO​

Limit remote IO​

Frequently Asked Questions​

Version Notes

Configuring Workload Group

Setting Up the CGroup Environment

Considerations for Using Workload Group in Containers

Create Workload Group

Workload Group Properties

Set Workload Group for user

Show Workload Group

Alter Workload Group

Drop Workload Group

Explanation of Switching Between CPU Soft and Hard Limit Modes

Testing

Memory hard limit

CPU hard limit

Limit local IO

Limit remote IO

Frequently Asked Questions