Workload Group
Workload Group is an in-process mechanism for isolating workloads. It achieves resource isolation by finely partitioning or limiting resources (CPU, IO, Memory) within the BE process. Its principle is illustrated in the diagram below:
The currently supported isolation capabilities include:
- Managing CPU resources, with support for both cpu hard limit and cpu soft limit;
- Managing memory resources, with support for both memory hard limit and memory soft limit;
- Managing IO resources, including IO generated by reading local and remote files.
Version Notesβ
-
The Workload Group feature has been available since Doris 2.0. In Doris 2.0, the Workload Group feature does not rely on CGroup, but starting with Doris 2.1, it requires CGroup.
-
Upgrading from Doris 1.2 to 2.0: It is recommended to enable the Workload Group feature only after the entire cluster has been upgraded. If only some follower FE nodes are upgraded, queries on the upgraded follower FE nodes may fail due to the absence of Workload Group metadata on the non-upgraded FE nodes.
-
Upgrading from Doris 2.0 to 2.1: Since the Workload Group feature in Doris 2.1 relies on CGroup, you need to configure the CGroup environment before upgrading to Doris 2.1.
Configuring Workload Groupβ
Setting Up the CGroup Environmentβ
Workload Group supports managing CPU, memory, and IO. CPU management relies on the CGroup component. To use Workload Group for CPU resource management, you must first configure the CGroup environment.
The following are the steps for configuring the CGroup environment:
- First, verify whether the node where the BE is located has CGroup installed. If the output includes cgroup, it indicates that CGroup V1 is installed in the current environment. If it includes cgroup2, it indicates that CGroup V2 is installed. You can determine which version is active in the next step.
cat /proc/filesystems | grep cgroup
nodev cgroup
nodev cgroup2
nodev cgroupfs
- The active CGroup version can be confirmed based on the path name.
If this path exists, it indicates that CGroup V1 is currently active.
/sys/fs/cgroup/cpu/
If this path exists, it indicates that CGroup V2 is currently active.
/sys/fs/cgroup/cgroup.controllers
- Create a directory named doris under the CGroup path. The directory name can be customized by the user.
If using CGroup V1, create the directory under the cpu directory.
mkdir /sys/fs/cgroup/cpu/doris
If using CGroup V2, create the directory directly under the cgroup directory.
mkdir /sys/fs/cgroup/doris
- Ensure that the Doris BE process has read, write, and execute permissions for this directory.
// If using CGroup V1, the command is as follows:
// 1. Modify the directory's permissions to be readable, writable, and executable.
chmod 770 /sys/fs/cgroup/cpu/doris
// 2. Change the ownership of this directory to the doris account.
chown -R doris:doris /sys/fs/cgroup/cpu/doris
// If using CGroup V2, the command is as follows:
// 1.Modify the directory's permissions to be readable, writable, and executable.
chmod 770 /sys/fs/cgroup/doris
// 2. Change the ownership of this directory to the doris account.
chown -R doris:doris /sys/fs/cgroup/doris
- If the current environment is using CGroup v2, the following steps are required. If it is CGroup v1, this step can be skipped.
- Modify the permissions of the cgroup.procs file in the root directory. This is necessary because CGroup v2 has stricter permission controls, and write permissions to the cgroup.procs file in the root directory are required to move processes between CGroup directories.
chmod a+w /sys/fs/cgroup/cgroup.procs
- In CGroup v2, the cgroup.controllers file lists the available controllers for the current directory, and the cgroup.subtree_control file lists the controllers available for the subdirectories. Therefore, it is necessary to check if the doris directory has the cpu controller enabled. If the cgroup.controllers file in the doris directory does not include cpu, it means the cpu controller is not enabled. You can enable it by executing the following command in the doris directory. This command works by modifying the cgroup.subtree_control file in the parent directory to allow the doris directory to use the cpu controller.
// After running this command, you should be able to see the cpu.max file in the doris directory,
// and the output of cgroup.controllers should include cpu.
// If the command fails, it means that the parent directory of doris also does not have the cpu controller enabled,
// and you will need to enable the cpu controller for the parent directory.
echo +cpu > ../cgroup.subtree_control
- Modify the BE configuration to specify the path of the cgroup.
If using CGroup V1, the configuration path is as follows:
doris_cgroup_cpu_path = /sys/fs/cgroup/cpu/doris
If using CGroup V2, the configuration path is as follows:
doris_cgroup_cpu_path = /sys/fs/cgroup/doris
- Restart the BE, and in the log (be.INFO), the phrase 'add thread xxx to group' indicates that the configuration was successful.
- It is recommended to deploy only one BE per machine, as the current Workload Group feature does not support deploying multiple BE instances on a single machine.
- After a machine is restarted, all configurations under the CGroup path will be cleared. To persist the CGroup configuration, you can use systemd to set the operation as a custom system service, so that the creation and authorization operations can be automatically performed each time the machine restarts.
- If using CGroup within a container, the container must have permission to operate on the host machine.
Considerations for Using Workload Group in Containersβ
Workload's CPU management is based on CGroup. If you want to use Workload Group inside a container, the container needs to be started in privileged mode so that the BE process inside the container has permission to read and write CGroup files on the host machine.
When BE runs inside a container, the CPU resource usage for Workload Group is partitioned based on the available resources of the container. For example, if the host machine has 64 cores and the container is allocated 8 cores, and the Workload Group is configured with a 50% CPU hard limit, the actual available CPU cores for the Workload Group will be 4 (8 cores * 50%).
The memory and IO management functions of Workload Group are implemented internally by Doris and do not rely on external components, so there is no difference in deployment between containers and physical machines.
If you want to use Doris on K8S, it is recommended to deploy it using the Doris Operator, which can shield underlying permission issues.
Create Workload Groupβ
mysql [information_schema]>create workload group if not exists g1
-> properties (
-> "cpu_share"="1024"
-> );
Query OK, 0 rows affected (0.03 sec)
You can refer to CREATE-WORKLOAD-GROUPγ
The CPU limit configured at this point is a soft limit. Since version 2.1, Doris will automatically create a group named normal, which cannot be deleted.
Workload Group Propertiesβ
Property | Data type | Default value | Value range | Description |
---|---|---|---|---|
cpu_share | Integer | -1 | [1, 10000] | Optional, effective under CPU soft limit mode. The valid range of values depends on the version of CGroup being used, which is described in detail later. cpu_share represents the weight of CPU time that the Workload Group can acquire; the larger the value, the more CPU time it can obtain. For example, if the user creates three Workload Groups, g-a, g-b, and g-c, with cpu_share values of 10, 30, and 40 respectively, and at a certain point g-a and g-b are running tasks while g-c has no tasks, g-a will receive 25% (10 / (10 + 30)) of the CPU resources, and g-b will receive 75% of the CPU resources. If only one Workload Group is running in the system, regardless of the cpu_share value, it will be able to acquire all the CPU resources. |
memory_limit | Float | -1 | (0%, 100%] | Optional. Enabling memory hard limit represents the maximum available memory percentage for the current Workload Group. The default value means no memory limit is applied. The cumulative value of memory_limit for all Workload Groups cannot exceed 100%, and it is typically used in conjunction with the enable_memory_overcommit attribute. For example, if a machine has 64GB of memory and the memory_limit for a Workload Group is set to 50%, the actual physical memory available for that group would be 64GB * 90% * 50% = 28.8GB, where 90% is the default value for the available memory configuration of the BE process. |
enable_memory_overcommit | Boolean | true | true, false | Optional. Used to control whether the memory limit for the current Workload Group is a hard limit or a soft limit, with the default set to true. If set to false, the Workload Group will have hard memory limit, and when the system detects that the memory usage exceeds the limit, it will immediately cancel the tasks with the highest memory usage within the group to release the excess memory. If set to true, the Workload Group will have soft memory limit. If there is free memory available, the Workload Group can continue using system memory even after exceeding the memory_limit. When the system's total memory is under pressure, the system will cancel the tasks with the highest memory usage within the group and release some of the excess memory to alleviate system memory pressure. It is recommended that the total memory_limit of all Workload Groups be kept below 100% to reserve memory for other components of the BE process. |
cpu_hard_limit | Integer | -1 | [1%, 100%] | Optional. Effective under CPU hard limit mode, it represents the maximum CPU percentage a Workload Group can use. Regardless of whether the machine's CPU resources are fully utilized, the Workload Group's CPU usage cannot exceed the cpu_hard_limit. The cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%. This attribute was introduced in version 2.1 and is not supported in version 2.0. |
max_concurrency | Integer | 2147483647 | [0, 2147483647] | Optional. Specifies the maximum query concurrency. The default value is the maximum value of an integer, meaning no concurrency limit. When the number of running queries reaches the maximum concurrency, new queries will enter a queue. |
max_queue_size | Integer | 0 | [0, 2147483647] | Optional. Specifies the length of the query waiting queue. When the queue is full, new queries will be rejected. The default value is 0, which means no queuing. If the queue is full, new queries will fail directly. |
queue_timeout | Integer | 0 | [0, 2147483647] | Optional. Specifies the maximum waiting time for a query in the waiting queue, in milliseconds. If the query's waiting time in the queue exceeds this value, an exception will be thrown directly to the client. The default value is 0, meaning no queuing; queries will immediately fail upon entering the queue. |
scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the number of threads used for scanning in the current Workload Group. When this property is set to -1, it means it is not active, and the actual scan thread num on the BE will default to the doris_scanner_thread_pool_thread_num configuration in the BE. |
max_remote_scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the maximum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores. |
min_remote_scan_thread_num | Integer | -1 | [1, 2147483647] | Optional. Specifies the minimum number of threads in the scan thread pool for reading external data sources. When this property is set to -1, the actual number of threads is determined by the BE, typically based on the number of CPU cores. |
tag | String | empty | - | Specifies tags for the Workload Group. The cumulative resource values of Workload Groups with the same tag cannot exceed 100%. To specify multiple values, use commas to separate them. |
read_bytes_per_second | Integer | -1 | [1, 9223372036854775807] | Optional. Specifies the maximum I/O throughput when reading internal tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied. It is important to note that this value is not tied to individual disks but to directories. For example, if Doris is configured with two directories to store internal table data, the maximum read I/O for each directory will not exceed this value. If both directories are placed on the same disk, the maximum throughput will be doubled (i.e., 2 times read_bytes_per_second). The file directory for spill disk is also subject to this limit. |
remote_read_bytes_per_second | Integer | -1 | [1, 9223372036854775807] | Optional. Specifies the maximum I/O throughput when reading external tables in Doris. The default value is -1, meaning no I/O bandwidth limit is applied. |
-
Currently, the simultaneous use of both cpu hard limit and cpu soft limit is not supported. At any given time, a cluster can only have either a soft limit or a hard limit. The method for switching between them will be described later.
-
All properties are optional, but at least one property must be specified when creating a Workload Group.
-
It is important to note that the default values for CPU soft limits differ between CGroup v1 and CGroup v2. The default CPU soft limit for CGroup v1 is 1024, with a valid range from 2 to 262144, while the default for CGroup v2 is 100, with a valid range from 1 to 10000. If a value outside the range is set for the soft limit, it may cause the CPU soft limit modification to fail in BE. If the default value of 100 from CGroup v2 is applied in a CGroup v1 environment, it could result in this Workload Group having the lowest priority on the machine.
Set Workload Group for userβ
Before binding a user to a specific Workload Group, it is necessary to ensure that the user has the necessary permissions for the Workload Group. You can use the user to query the information_schema.workload_groups system table, and the result will show the Workload Groups that the current user has permission to access. The following query result indicates that the current user has access to the g1 and normal Workload Groups:
SELECT name FROM information_schema.workload_groups;
+--------+
| name |
+--------+
| normal |
| g1 |
+--------+
If the g1 Workload Group is not visible, you can use the ADMIN account to execute the GRANT statement to authorize the user. For example:
"GRANT USAGE_PRIV ON WORKLOAD GROUP 'g1' TO 'user_1'@'%';"
This statement means granting the user_1 the permission to use the Workload Group named g1. More details can be found in grantγ
Two ways to bind Workload Group to user
- By setting the user property, you can bind the user to a default Workload Group. The default is normal. It's important to note that the value here cannot be left empty; otherwise, the statement will fail.
set property 'default_workload_group' = 'g1';
After executing this statement, the current user's queries will default to using the 'g1' Workload Group.
- By specifying the Workload Group through a session variable, the default is empty:
set workload_group = 'g1';
When both methods are used to specify a Workload Group for the user, the session variable takes priority over the user property.
Show Workload Groupβ
- You can use the SHOW statement to view the Workload Group:
show workload groups;
More details can be found in SHOW-WORKLOAD-GROUPS
- You can view the Workload Group through the system table:
mysql [information_schema]>select * from information_schema.workload_groups where name='g1';
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
| ID | NAME | CPU_SHARE | MEMORY_LIMIT | ENABLE_MEMORY_OVERCOMMIT | MAX_CONCURRENCY | MAX_QUEUE_SIZE | QUEUE_TIMEOUT | CPU_HARD_LIMIT | SCAN_THREAD_NUM | MAX_REMOTE_SCAN_THREAD_NUM | MIN_REMOTE_SCAN_THREAD_NUM | MEMORY_LOW_WATERMARK | MEMORY_HIGH_WATERMARK | TAG | READ_BYTES_PER_SECOND | REMOTE_READ_BYTES_PER_SECOND |
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
| 14009 | g1 | 1024 | -1 | true | 2147483647 | 0 | 0 | -1 | -1 | -1 | -1 | 50% | 80% | | -1 | -1 |
+-------+------+-----------+--------------+--------------------------+-----------------+----------------+---------------+----------------+-----------------+----------------------------+----------------------------+----------------------+-----------------------+------+-----------------------+------------------------------+
1 row in set (0.05 sec)
Alter Workload Groupβ
mysql [information_schema]>alter workload group g1 properties('cpu_share'='2048');
Query OK, 0 rows affected (0.00 sec
mysql [information_schema]>select cpu_share from information_schema.workload_groups where name='g1';
+-----------+
| cpu_share |
+-----------+
| 2048 |
+-----------+
1 row in set (0.02 sec)
More details can be found in ALTER-WORKLOAD-GROUP
Drop Workload Groupβ
mysql [information_schema]>drop workload group g1;
Query OK, 0 rows affected (0.01 sec)
More details can be found inDROP-WORKLOAD-GROUP
Explanation of Switching Between CPU Soft and Hard Limit Modesβ
Currently, Doris does not support running both CPU soft and hard limits simultaneously. At any given time, a Doris cluster can only operate in either CPU soft limit or CPU hard limit mode. Users can switch between these two modes, and the switching method is as follows:
1 If the current cluster configuration is set to the default CPU soft limit and you wish to change it to CPU hard limit, you need to modify the cpu_hard_limit parameter of the Workload Group to a valid value.
alter workload group test_group properties ( 'cpu_hard_limit'='20%' );
All Workload Groups in the cluster need to be modified, and the cumulative value of cpu_hard_limit for all Workload Groups cannot exceed 100%.
Since CPU hard limits cannot automatically have a valid value, simply enabling the switch without modifying the property will prevent the CPU hard limit from taking effect.
2 Enable the CPU hard limit on all FE nodes
1 Modify the configuration in the fe.conf file on the disk.
experimental_enable_cpu_hard_limit = true
2 Modify the configuration in memory.
ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true");
If the user wishes to switch from CPU hard limit back to CPU soft limit, they need to set the value of enable_cpu_hard_limit to false on all FE nodes. The CPU soft limit property cpu_share will default to a valid value of 1024 (if it was not previously specified). Users can adjust the cpu_share value based on the priority of the group.
Testingβ
Memory hard limitβ
Adhoc-type queries typically have unpredictable SQL inputs and uncertain memory usage, which poses the risk of a few queries consuming a large amount of memory. These types of workloads can be allocated to a separate group, and by using the Workload Group's memory hard limit feature, it helps prevent sudden large queries from consuming all memory, which could cause other queries to run out of available memory or result in OOM (Out of Memory) errors. When the memory usage of this Workload Group exceeds the configured hard limit, the system will kill queries to release memory, preventing the process from running out of memory.
Testing environment
1 FE, 1 BE, with BE configured to 96 cores and 375GB of memory.
The test dataset is clickbench, and the testing method involves using JMeter to run query Q29 with three concurrent executions.
Test without enabling memory hard limit for Workload Group
-
Check the memory usage of the process. The fourth column in the ps command output represents the physical memory usage of the process, in kilobytes (KB). It shows that under the current test load, the process uses approximately 7.7GB of memory.
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 2.0 7896792
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 2.0 7929692
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 2.0 8101232 -
Use Doris system tables to check the current memory usage of the Workload Group. The memory usage of the Workload Group is approximately 5.8GB.
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb |
+-------------------+
| 5797.524360656738 |
+-------------------+
1 row in set (0.01 sec)
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb |
+-------------------+
| 5840.246627807617 |
+-------------------+
1 row in set (0.02 sec)
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+-------------------+
| wg_mem_used_mb |
+-------------------+
| 5878.394917488098 |
+-------------------+
1 row in set (0.02 sec)
Here, we can see that the process memory usage is typically much larger than the memory usage of a Workload Group, even if only one Workload Group is running. This is because the Workload Group only tracks the memory used by queries and loads The memory used by other components within the process, such as metadata and various caches, is not counted as part of the Workload Group's memory usage, nor is it managed by the Workload Group.
Test with the memory hard limit for Workload Group enabled
-
Execute the SQL command to modify the memory configuration.
alter workload group g2 properties('memory_limit'='0.5%');
alter workload group g2 properties('enable_memory_overcommit'='false'); -
Run the same test and check the memory usage in the system table; the memory usage is around 1.5G.
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+--------------------+
| wg_mem_used_mb |
+--------------------+
| 1575.3877239227295 |
+--------------------+
1 row in set (0.02 sec)
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+------------------+
| wg_mem_used_mb |
+------------------+
| 1668.77405834198 |
+------------------+
1 row in set (0.01 sec)
mysql [information_schema]>select MEMORY_USAGE_BYTES / 1024/ 1024 as wg_mem_used_mb from workload_group_resource_usage where workload_group_id=11201;
+--------------------+
| wg_mem_used_mb |
+--------------------+
| 499.96760272979736 |
+--------------------+
1 row in set (0.01 sec) -
Use the ps command to check the memory usage of the process; the memory usage is around 3.8G.
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 1.0 4071364
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 1.0 4059012
[ ~]$ ps -eo pid,comm,%mem,rss | grep 1407481
1407481 doris_be 1.0 4057068 -
At the same time, the client will observe a significant number of query failures caused by insufficient memory.
1724074250162,14126,1c_sql,HY000 1105,"java.sql.SQLException: errCode = 2, detailMessage = (127.0.0.1)[MEM_LIMIT_EXCEEDED]GC wg for hard limit, wg id:11201, name:g2, used:1.71 GB, limit:1.69 GB, backend:10.16.10.8. cancel top memory used tracker <Query#Id=4a0689936c444ac8-a0d01a50b944f6e7> consumption 1.71 GB. details:process memory used 3.01 GB exceed soft limit 304.41 GB or sys available memory 101.16 GB less than warning water mark 12.80 GB., Execute again after enough memory, details see be.INFO.",εΉΆε 1-3,text,false,,444,0,3,3,null,0,0,0
From the error message, it can be observed that the Workload Group used 1.7G of memory, but the Workload Group's limit is 1.69G. The calculation is as follows:1.69G = Physical machine memory (375G) * mem_limit (value from be.conf, default is 0.9) * 0.5% (Workload Group's configuration). This means the memory percentage configured in the Workload Group is calculated based on the memory available to the BE process.
Recommendations
As demonstrated in the tests above, memory hard limits can control the memory usage of a Workload Group but do so by terminating queries to release memory. This approach can lead to a poor user experience and, in extreme cases, may cause all queries to fail.
Therefore, in production environments, it is recommended to use memory hard limits in conjunction with query queuing functionality. This ensures controlled memory usage while maintaining query success rates.
CPU hard limitβ
Doris workloads can generally be categorized into three types:
- Core Report Queries: These are typically used by company executives to view reports. While the load may not be very high, the availability requirements are strict. These queries can be assigned to a group with a higher-priority soft limit, ensuring they receive more CPU resources when resources are insufficient.
- Adhoc queries are typically exploratory and analytical in nature, with random SQL and unpredictable resource consumption. Their priority is usually low. Therefore, CPU hard limits can be used to manage these queries, configuring lower values to prevent excessive CPU resource usage that could reduce cluster availability.
- ETL queries typically have fixed SQL and stable resource consumption, although there may occasionally be spikes in resource usage due to increased upstream data. Therefore, CPU hard limits can be configured to manage these queries.
Different workloads have varying CPU consumption, and users have different latency requirements. When the BE CPU is fully utilized, availability decreases, and response times increase. For example, an Adhoc analysis query may fully utilize the CPU of the entire cluster, causing core report queries to experience higher latency, which impacts SLA. Therefore, a CPU isolation mechanism is needed to separate different workloads and ensure cluster availability and SLA.
Workload Group supports both CPU soft limits and hard limits. It is currently recommended to configure Workload Groups with hard limits in production environments. This is because CPU soft limits typically only show priority effects when the CPU is fully utilized. However, when the CPU is fully used, internal Doris components (such as the RPC component) and the operating systemβs available CPU are reduced, leading to a significant drop in overall cluster availability. Therefore, in production environments, it is essential to avoid CPU resource exhaustion, and the same logic applies to other resources such as memory.
Test environment
1 FE, 1 BE, 96-core machine. The dataset is clickbench, and the test SQL is q29.
Tesing
-
Using JMeter to initiate 3 concurrent queries, the CPU usage of the BE process is pushed to a relatively high usage rate. The test machine has 96 cores, and using the top command, we can see that the BE process's CPU usage is 7600%, which means the process is currently using 76 cores.
-
Modify the CPU hard limit of the currently used Workload Group to 10%.
alter workload group g2 properties('cpu_hard_limit'='10%');
-
Switch to CPU hard limit mode.
ADMIN SET FRONTEND CONFIG ("enable_cpu_hard_limit" = "true");
-
Re-run the load test for queries, and you can see that the current process can only use 9 to 10 cores, which is about 10% of the total cores.
It is important to note that this test is best conducted using query workloads, as they are more likely to reflect the effect. If testing load, it may trigger Compaction, causing the actual observed values to be higher than the values configured in the Workload Group. Currently, Compaction workloads are not managed under the Workload Group.
-
In addition to using Linux system commands, you can also observe the current CPU usage of the group through Doris's system tables, where the CPU usage is around 10%.
mysql [information_schema]>select CPU_USAGE_PERCENT from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+-------------------+
| CPU_USAGE_PERCENT |
+-------------------+
| 9.57 |
+-------------------+
1 row in set (0.02 sec)
note
- When configuring, it's better not to set the total CPU allocation of all groups to exactly 100%. This is mainly to ensure the availability of low-latency scenarios, as some resources need to be reserved for other components. However, for scenarios that are not very sensitive to latency and aim for maximum resource utilization, setting the total CPU allocation of all groups to 100% can be considered.
- Currently, the interval for synchronizing Workload Group metadata from FE to BE is 30 seconds. Therefore, changes to Workload Group settings may take up to 30 seconds to take effect.
Limit local IOβ
In OLAP systems, during ETL operations or large Adhoc queries, a significant amount of data needs to be read. To speed up the data analysis process, Doris uses multi-threaded parallel scanning across multiple disk files, which generates substantial disk IO that can impact other queries (such as report analysis). By using Workload Groups, Doris can group offline ETL data processing and online report queries separately, limiting the offline data processing IO bandwidth. This helps reduce the impact of offline data processing on online report analysis.
Test environment
1 FE, 1 BE, 96-core machine. Dataset: clickbench. Test query: q29.
Testing without enabling IO hard limits
-
Clear Cache.
// clear OS cache
sync; echo 3 > /proc/sys/vm/drop_caches
// disable BE page cache
disable_storage_page_cache = true -
Perform a full table scan on the clickbench table, and execute a single concurrent query.
set dry_run_query = true;
select * from hits.hits; -
Check the maximum throughput of the current Group as 3GB per second through Doris's system table.
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 1146.6208400726318 |
+--------------------+
1 row in set (0.03 sec)
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 3496.2762966156006 |
+--------------------+
1 row in set (0.04 sec)
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 2192.7690029144287 |
+--------------------+
1 row in set (0.02 sec) -
Use the pidstat command to check the process IO. The first column is the process ID, and the second column is the read IO throughput (in kb/s). It can be seen that when IO is not restricted, the maximum throughput is 2GB per second.
Test after enabling IO hard limit
-
Clear cache.
// Clear OS cache.
sync; echo 3 > /proc/sys/vm/drop_caches
// disable BE page cache
disable_storage_page_cache = true -
Modify the Workload Group configuration to limit the maximum throughput to 100M per second.
alter workload group g2 properties('read_bytes_per_second'='104857600');
-
Use Doris system tables to check that the maximum IO throughput of the Workload Group is 98M per second.
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 97.94296646118164 |
+--------------------+
1 row in set (0.03 sec)
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 98.37584781646729 |
+--------------------+
1 row in set (0.04 sec)
mysql [information_schema]>select LOCAL_SCAN_BYTES_PER_SECOND / 1024 / 1024 as mb_per_sec from workload_group_resource_usage where WORKLOAD_GROUP_ID=11201;
+--------------------+
| mb_per_sec |
+--------------------+
| 98.06641292572021 |
+--------------------+
1 row in set (0.02 sec) -
Use the pid tool to check that the maximum IO throughput of the process is 131M per second.
Note
-
The LOCAL_SCAN_BYTES_PER_SECOND field in the system table represents the summary value of the current Workload Group's statistics at the process level. For example, if 12 file paths are configured, LOCAL_SCAN_BYTES_PER_SECOND is the maximum IO value of these 12 file paths. If you wish to view the IO throughput for each file path separately, you can check the detailed values in Grafana.
-
Due to the presence of the operating system and Doris's Page Cache, the IO observed through Linux's IO monitoring scripts is typically smaller than the IO seen in the system table.
Limit remote IOβ
BrokerLoad and S3Load are commonly used methods for large-scale data load. Users can first upload data to HDFS or S3, and then use BrokerLoad and S3Load to load data in parallel. To speed up the load process, Doris uses multi-threading to pull data from HDFS/S3, which can generate significant pressure on HDFS/S3, potentially making other jobs running on HDFS/S3 unstable.
To mitigate the impact on other workloads, the Workload Group's remote IO limit feature can be used to restrict the bandwidth used during the load process from HDFS/S3. This helps reduce the impact on other business operations.
Test environment
1 FE and 1 BE are deployed on the same machine, configured with 16 cores and 64GB of memory. The test data is the clickbench dataset, and before testing, we need to upload the dataset to S3. Considering the upload time, we will only upload 10 million rows of data, and then use the TVF function to query the data from S3.
After the upload is successful, you can use the command to view the schema information.
```sql
DESC FUNCTION s3 (
"URI" = "https://bucketname/1kw.tsv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true"
);
```
Test without restricting remote read IO
-
Initiate a single-threaded test to perform a full table scan on the clickbench table.
// Set the operation to only scan the data without returning results.
set dry_run_query = true;
SELECT * FROM s3(
"URI" = "https://bucketname/1kw.tsv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true"
); -
Use the system table to check the current remote IO throughput. It shows that the remote IO throughput for this query is 837 MB per second. Note that the actual IO throughput here is highly dependent on the environment. If the machine hosting the BE has limited bandwidth to the external storage, the actual throughput may be lower.
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 837 |
+---------+
1 row in set (0.104 sec)
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 867 |
+---------+
1 row in set (0.070 sec)
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 867 |
+---------+
1 row in set (0.186 sec) -
Use the sar command (sar -n DEV 1 3600) to monitor the machine's network bandwidth. It shows that the maximum network bandwidth at the machine level is 1033 MB per second. The first column of the output represents the number of bytes received per second by a specific network interface on the machine, in KB per second.
Test limiting remote read IO
-
Modify the Workload Group configuration to limit remote read IO throughput to 100M per second.
alter workload group normal properties('remote_read_bytes_per_second'='104857600');
-
Initiate a single concurrent full table scan query.
set dry_run_query = true;
SELECT * FROM s3(
"URI" = "https://bucketname/1kw.tsv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv",
"use_path_style"="true"
); -
Use the system table to check the current remote read IO throughput. At this time, the IO throughput is around 100M, with some fluctuations. These fluctuations are influenced by the current algorithm design, typically peaking briefly without persisting for long periods, which is considered normal.
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 56 |
+---------+
1 row in set (0.010 sec)
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 131 |
+---------+
1 row in set (0.009 sec)
MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
| 111 |
+---------+
1 row in set (0.009 sec) -
Use the sar command (sar -n DEV 1 3600) to monitor the current network card's received traffic. The first column represents the amount of data received per second. The maximum value observed is now 207M per second, indicating that the read IO limit is effective. However, since the sar command reflects machine-level traffic, the observed value is slightly higher than what Doris reports.