Skip to main content
Skip to main content

Multi-Dimensional Analytics

In a database, ROLLUP, CUBE, and GROUPING SETS are advanced SQL statements used for multidimensional data aggregation. These features significantly enhance the capabilities of the GROUP BY clause, enabling users to obtain multiple levels of summary results in a single query, which is semantically equivalent to using UNION ALL to connect multiple aggregation statements.

  • ROLLUP: ROLLUP is an operation used to generate hierarchical summaries. It aggregates data according to the specified column order, gradually summarizing from the finest granularity to the highest level. For example, in sales data, ROLLUP can be used to summarize by region and time, providing sales for each region per month, total sales for each region, and overall total sales. ROLLUP is suitable for scenarios requiring step-by-step summaries.

  • CUBE: CUBE is a more powerful aggregation operation that generates all possible summary combinations. Unlike ROLLUP, CUBE calculates subsets for all dimensions. For instance, for sales data aggregated by product and region, CUBE will compute sales for each product in each region, total sales for each product, total sales for each region, and overall total sales. CUBE is applicable to scenarios requiring comprehensive multidimensional analysis, such as business analysis and market research.

  • GROUPING SETS: GROUPING SETS offer flexibility in aggregating specific grouping sets. It allows users to specify a set of column combinations for independent aggregation, rather than generating all possible combinations as in ROLLUP and CUBE. For example, one can define summaries for specific combinations of region and time without needing all combinations of each dimension. GROUPING SETS are suitable for scenarios requiring customized summaries, providing flexible aggregation control.

ROLLUP, CUBE, and GROUPING SETS provide powerful multidimensional data summary functions, catering to various data analysis and reporting needs, and making complex aggregation calculations simpler and more efficient. The following sections will detail the usage scenarios, syntax, and examples of these features.

ROLLUP

Use Case

ROLLUP is particularly useful for summarizing data along hierarchical dimensions such as time, geography, and category. For instance, queries can specify ROLLUP(year, month, day) or (country, Province, city).

Syntax and Example

The syntax for ROLLUP is as follows:

SELECTGROUP BY ROLLUP(grouping_column_reference_list)

Here's an example query that analyzes sales sums by year and month:

SELECT  
YEAR(d_date),
MONTH(d_date),
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1
WHERE
d1.d_date_sk = ss_sold_date_sk
AND YEAR(d_date) IN (2001, 2002)
AND MONTH(d_date) IN (1, 2, 3)
GROUP BY
ROLLUP(YEAR(d_date), MONTH(d_date))
ORDER BY
YEAR(d_date), MONTH(d_date);

This query summarizes data by time, calculating subtotals for sales by year, sales by month within each year, and the grand total of sales. The query result is as follows:

+--------------+---------------+-------------+  
| YEAR(d_date) | MONTH(d_date) | total_sum |
+--------------+---------------+-------------+
| NULL | NULL | 54262669.17 |
| 2001 | NULL | 26640320.46 |
| 2001 | 1 | 9982165.83 |
| 2001 | 2 | 8454915.34 |
| 2001 | 3 | 8203239.29 |
| 2002 | NULL | 27622348.71 |
| 2002 | 1 | 11260654.35 |
| 2002 | 2 | 7722750.61 |
| 2002 | 3 | 8638943.75 |
+--------------+---------------+-------------+
9 rows in set (0.08 sec)

CUBE

Use Case

CUBE is best suited for queries involving columns from multiple independent dimensions, rather than columns representing different levels of a single dimension. For example, a common usage scenario is summarizing all combinations of month, region, and product. These are three independent dimensions, and it is common to analyze all possible subtotal combinations. By contrast, cross-tabulating all possible combinations of year, month, and day would include several unnecessary values due to the natural hierarchy in the time dimension. In most analyses, subtotals like profits calculated by month and day are unnecessary. Relatively few users need to ask, "What is the total sales for the 16th of each month throughout the year?"

Syntax and Example

The syntax for CUBE is as follows:

SELECTGROUP BY CUBE(grouping_column_reference_list)

Example usage:

SELECT  
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ("Books", "Electronics")
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ("LA", "AK")
GROUP BY CUBE(YEAR(d_date), i_category, ca_state)
ORDER BY YEAR(d_date), i_category, ca_state;

The query result is as follows, calculating:

  • The total sales sum;

  • Subtotals for sales by year, by product category, and by state;

  • Subtotals for sales by product category within each year, by state for each product, by state within each year, and by product category within each state and year.

+--------------+-------------+----------+------------+  
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| NULL | Books | NULL | 4238177.69 |
| NULL | Books | AK | 1310791.36 |
| NULL | Books | LA | 2927386.33 |
| NULL | Electronics | NULL | 4452196.91 |
| NULL | Electronics | AK | 1364406.97 |
| NULL | Electronics | LA | 3087789.94 |
| 1998 | NULL | NULL | 4369656.14 |
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | NULL | 4320718.46 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
27 rows in set (0.21 sec)

GROUPING FUNCTION

This section introduces how to address two challenges when using ROLLUP and CUBE:

  1. How to programmatically identify which rows in the result set represent subtotals and accurately determine the aggregation level corresponding to a given subtotal. Since subtotals are often needed for calculations such as percentage of totals, we require a convenient method to identify these subtotal rows.

  2. When the query results contain both actual stored NULL values and "NULL" values generated by ROLLUP or CUBE operations, another problem arises: how to distinguish between these two types of NULL values?

GROUPING, GROUPING_ID, and GROUPING SETS can effectively solve the aforementioned challenges.

GROUPING

1. Principles

GROUPING uses a single column as a parameter and returns 1 when encountering a NULL value created by ROLLUP or CUBE operations, indicating that the row is a subtotal. Any other type of value (including NULLs inherently present in the table data) returns 0.

Example:

select  
year(d_date),
month(d_date),
sum(ss_net_paid) as total_sum,
grouping(year(d_date)),
grouping(month(d_date))
from
store_sales,
date_dim d1
where
d1.d_date_sk = ss_sold_date_sk
and year(d_date) in (2001, 2002)
and month(d_date) in (1, 2, 3)
group by
rollup(year(d_date), month(d_date))
order by
year(d_date), month(d_date);
  • The GROUPING function result for the (YEAR(d_date), MONTH(d_date)) group is (0,0) for aggregation by year and month.

  • The GROUPING function result for the (YEAR(d_date)) group is (0,1) for aggregation by year.

  • The GROUPING function result for the () group is (1,1) for the total aggregation.

Query result:

+--------------+---------------+-------------+------------------------+-------------------------+  
| year(d_date) | month(d_date) | total_sum | Grouping(year(d_date)) | Grouping(month(d_date)) |
+--------------+---------------+-------------+------------------------+-------------------------+
| NULL | NULL | 54262669.17 | 1 | 1 |
| 2001 | NULL | 26640320.46 | 0 | 1 |
| 2001 | 1 | 9982165.83 | 0 | 0 |
| 2001 | 2 | 8454915.34 | 0 | 0 |
| 2001 | 3 | 8203239.29 | 0 | 0 |
| 2002 | NULL | 27622348.71 | 0 | 1 |
| 2002 | 1 | 11260654.35 | 0 | 0 |
| 2002 | 2 | 7722750.61 | 0 | 0 |
| 2002 | 3 | 8638943.75 | 0 | 0 |
+--------------+---------------+-------------+------------------------+-------------------------+
9 rows in set (0.06 sec)

2. Usage Scenarios, Syntax, and Examples

The GROUPING function can be used to filter results. Example:

select
year(d_date),
i_category,
ca_state,
sum(ss_net_paid) as total_sum
from
store_sales,
date_dim d1,
item,
customer_address ca
where
d1.d_date_sk = ss_sold_date_sk
and i_item_sk = ss_item_sk
and ss_addr_sk=ca_address_sk
and i_category in ("Books", "Electronics")
and year(d_date) in(1998, 1999)
and ca_state in ("LA", "AK")
group by cube(year(d_date), i_category, ca_state)
having grouping(year(d_date))=1 and grouping(i_category)=1 and grouping(ca_state)=1
or grouping(year(d_date))=0 and grouping(i_category)=1 and grouping(ca_state)=1
or grouping(year(d_date))=1 and grouping(i_category)=1 and grouping(ca_state)=0
order by year(d_date), i_category, ca_state;

Using the GROUPING function in the HAVING clause retains only the total sales, sales summarized by year, and sales summarized by region. Query result:

+---------------------+------------+----------+------------+  
| year(`d1`.`d_date`) | i_category | ca_state | total_sum |
+---------------------+------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| 1998 | NULL | NULL | 4369656.14 |
| 1999 | NULL | NULL | 4320718.46 |
+---------------------+------------+----------+------------+
5 rows in set (0.13 sec)

You can also use the GROUPING function with the IF function to enhance query readability. Example:

select  
if(grouping(year(d_date)) = 1, "Multi-year sum", year(d_date)) as year,
if(grouping(i_category) = 1, "Multi-category sum", i_category) as category,
sum(ss_net_paid) as total_sum
from
store_sales,
date_dim d1,
item,
customer_address ca
where
d1.d_date_sk = ss_sold_date_sk
and i_item_sk = ss_item_sk
and ss_addr_sk = ca_address_sk
and i_category in ("Books", "Electronics")
and year(d_date) in (1998, 1999)
and ca_state in ("LA", "AK")
group by cube(year(d_date), i_category)

Query result:

+----------------+--------------------+------------+  
| year | category | total_sum |
+----------------+--------------------+------------+
| 1998 | Books | 2213703.82 |
| 1998 | Electronics | 2155952.32 |
| 1999 | Electronics | 2296244.59 |
| 1999 | Books | 2024473.87 |
| 1998 | Multi-category sum | 4369656.14 |
| 1999 | Multi-category sum | 4320718.46 |
| Multi-year sum | Books | 4238177.69 |
| Multi-year sum | Electronics | 4452196.91 |
| Multi-year sum | Multi-category sum | 8690374.60 |
+----------------+--------------------+------------+
9 rows in set (0.09 sec)

GROUPING_ID

1. Usage Scenarios

In a database, both GROUPING_ID and GROUPING functions serve as auxiliary functions for handling multidimensional data aggregation queries, such as ROLLUP and CUBE, aiding users in distinguishing between different levels of aggregation results. If you wish to determine the aggregation level of a particular row, you need to use the GROUPING function to compute all GROUP BY columns, as the computation result of a single column alone is insufficient.

The GROUPING_ID function is more powerful than GROUPING because it can detect multiple columns simultaneously. The GROUPING_ID function accepts multiple columns as parameters and returns an integer that represents the aggregation status of these columns through binary bits. When using tables or materialized views to store computation results, using GROUPING to represent different levels of aggregation can consume considerable storage space. In such scenarios, GROUPING_ID is more appropriate.

Taking CUBE(a, b) as an example, its GROUPING_ID can be represented as follows:

Aggregation LevelBit VectorGROUPING_IDGROUPING(a)GROUPING(b)
a,b0 0000
a0 1101
b1 0210
Grand Total1 1311

2. Syntax and Example

Here is an example SQL query:

SELECT    
year(d_date),
i_category,
SUM(ss_net_paid) AS total_sum,
GROUPING(year(d_date)),
GROUPING(i_category),
GROUPING_ID(year(d_date), i_category)
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND year(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(year(d_date), i_category);

The query results are as follows:

+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+    
| year(d_date) | i_category | total_sum | GROUPING(year(d_date)) | GROUPING(i_category) | GROUPING_ID(year(d_date), i_category) |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| 1998 | Electronics | 2155952.32 | 0 | 0 | 0 |
| 1998 | Books | 2213703.82 | 0 | 0 | 0 |
| 1999 | Electronics | 2296244.59 | 0 | 0 | 0 |
| 1999 | Books | 2024473.87 | 0 | 0 | 0 |
| 1998 | NULL | 4369656.14 | 0 | 1 | 1 |
| 1999 | NULL | 4320718.46 | 0 | 1 | 1 |
| NULL | Electronics | 4452196.91 | 1 | 0 | 2 |
| NULL | Books | 4238177.69 | 1 | 0 | 2 |
| NULL | NULL | 8690374.60 | 1 | 1 | 3 |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
9 rows in set (0.12 sec)

GROUPING SETS

1. Usage Scenarios

When there is a need to selectively specify the group sets to create, the GROUPING SETS expression can be used in the GROUP BY clause. This method allows users to precisely specify across multiple dimensions without computing the entire CUBE.

Since CUBE queries typically consume significant resources, using GROUPING SETS can enhance query execution efficiency when only a few dimensions are of interest.

2. Syntax and Examples

The syntax for GROUPING SETS is as follows:

SELECTGROUP BY GROUPING SETS(grouping_column_reference_list)

If you need:

  • Subtotals of sales for each product category per year

  • Subtotals of sales for each state per year

  • Subtotals of sales for each product in each state per year

You can use GROUPING SETS to specify these dimensions and perform the aggregation. Here is an example:

SELECT  
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY GROUPING SETS(
(YEAR(d_date), i_category),
(YEAR(d_date), ca_state),
(YEAR(d_date), ca_state, i_category)
)
ORDER BY YEAR(d_date), i_category, ca_state;

Query Result:

+--------------+-------------+----------+------------+  
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
16 rows in set (0.11 sec)

The above approach is equivalent to using CUBE but specifies concrete grouping_id, thereby reducing unnecessary calculations:

SELECT  
SUM(ss_net_paid) AS total_sum,
YEAR(d_date),
i_category,
ca_state
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(YEAR(d_date), ca_state, i_category)
HAVING grouping_id(YEAR(d_date), ca_state, i_category) = 0
OR grouping_id(YEAR(d_date), ca_state, i_category) = 2
OR grouping_id(YEAR(d_date), ca_state, i_category) = 1;
Note

Using CUBE computes all possible aggregation levels (eight in this case), but in practice, you may only be interested in a few of them.

3. Semantic Equivalents

  • GROUPING SETS vs. GROUP BY UNION ALL

    The GROUPING SETS statement:

    SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());

    Is equivalent in query results to multiple GROUP BY queries connected with UNION ALL:

    SELECT k1, k2, SUM(k3) FROM t GROUP BY k1, k2  
    UNION ALL
    SELECT k1, NULL, SUM(k3) FROM t GROUP BY k1
    UNION ALL
    SELECT NULL, k2, SUM(k3) FROM t GROUP BY k2
    UNION ALL
    SELECT NULL, NULL, SUM(k3) FROM t;

    Using UNION ALL results in a longer query and requires multiple scans of the base table, making it less efficient in both writing and execution.

  • GROUPING SETS vs. ROLLUP

    ROLLUP is an extension of GROUPING SETS. For example:

    SELECT a, b, c, SUM(d) FROM tab1 GROUP BY ROLLUP(a, b, c);

    This ROLLUP is equivalent to the following GROUPING SETS:

    GROUPING SETS (  
    (a, b, c),
    (a, b),
    (a),
    ()
    );
  • GROUPING SETS vs. CUBE

    CUBE(a, b, c) is equivalent to the following GROUPING SETS:

    GROUPING SETS (  
    (a, b, c),
    (a, b),
    (a, c),
    (a),
    (b, c),
    (b),
    (c),
    ()
    );