Group Aggregation Queries in Confluent Cloud for Apache Flink

Confluent Cloud for Apache Flink® enables computing a single result from multiple input rows in a Flink SQL table.

Description

Compute a single result from multiple input rows in a table.

Like most data systems, Apache Flink® supports aggregate functions. An aggregate function computes a single result from multiple input rows. For example, there are aggregates to compute the COUNT, SUM, AVG (average), MAX (maximum) and MIN (minimum) values over a set of rows.

The following example shows how to count the number of rows in a table, by using the COUNT function.

SELECT COUNT(*) FROM orders

For streaming queries, Flink runs continuous queries that never terminate. A continuous query updates the result table according to the updates on its input tables. For the previous query, Flink outputs an updated count each time a new row is inserted into the orders table.

GROUP BY clause

Flink SQL supports the standard GROUP BY clause for aggregating data.

The following example shows how to count the number of rows in a table and group the results by a table column.

SELECT COUNT(*)
FROM orders
GROUP BY order_id

For streaming queries, the required state for computing the query result might grow indefinitely. State size depends on the number of groups and the number and type of aggregation functions.

For example, MIN and MAX are heavy on state size while COUNT is inexpensive.

DISTINCT Aggregation

Distinct aggregates remove duplicate values before applying an aggregation function.

The following example counts the number of distinct order_ids instead of the total number of rows in an orders table.

SELECT COUNT(DISTINCT order_id) FROM orders

For streaming queries, the required state for computing the query result might grow indefinitely. State size depends primarily on the number of distinct rows and the time that a group is maintained. Short-lived GROUP BY windows are not a problem.

GROUPING SETS

Grouping sets enable more complex grouping operations than those you can describe with a standard GROUP BY clause. Rows are grouped separately by each specified grouping set, and aggregates are computed for each group just as for simple GROUP BY clauses.

The following example show how to use GROUPING SETS to

SELECT supplier_id, rating, COUNT(*) AS total
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SETS ((supplier_id, rating), (supplier_id), ())

Results:

+-------------+--------+-------+
| supplier_id | rating | total |
+-------------+--------+-------+
|   supplier1 |      4 |     1 |
|   supplier1 | (NULL) |     2 |
|      (NULL) | (NULL) |     4 |
|   supplier1 |      3 |     1 |
|   supplier2 |      3 |     1 |
|   supplier2 | (NULL) |     2 |
|   supplier2 |      4 |     1 |
+-------------+--------+-------+

Each sublist of GROUPING SETS specifies zero or more columns or expressions and is interpreted as if it were used directly in the GROUP BY clause. An empty grouping set means that all rows are aggregated down to a single group, which is output even if no input rows were present.

References to the grouping columns or expressions are replaced by null values in result rows for grouping sets in which those columns don’t appear.

For streaming queries, the required state for computing the query result might grow indefinitely. State size depends on the number of group sets and type of aggregation functions.

ROLLUP

ROLLUP is a shorthand notation for specifying a common type of grouping set. It represents the given list of expressions and all prefixes of the list, including the empty list.

For example, the following query is equivalent to the previous GROUP BY GROUPING SETS query.

SELECT supplier_id, rating, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY ROLLUP (supplier_id, rating)

CUBE

CUBE is a shorthand notation for specifying a common type of grouping set. It represents the given list and all of its possible subsets, which is also known as the power set.

For example, the following two queries are equivalent.

SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY CUBE (supplier_id, rating, product_id)

SELECT supplier_id, rating, product_id, COUNT(*)
FROM (VALUES
    ('supplier1', 'product1', 4),
    ('supplier1', 'product2', 3),
    ('supplier2', 'product3', 3),
    ('supplier2', 'product4', 4))
AS Products(supplier_id, product_id, rating)
GROUP BY GROUPING SET (
    ( supplier_id, product_id, rating ),
    ( supplier_id, product_id         ),
    ( supplier_id,             rating ),
    ( supplier_id                     ),
    (              product_id, rating ),
    (              product_id         ),
    (                          rating ),
    (                                 )
)

HAVING

The HAVING clause eliminates group rows that don’t satisfy the specified condition.

HAVING is distinct from the WHERE clause, because WHERE filters individual rows before the GROUP BY, while HAVING filters group rows created by GROUP BY. Each column referenced in the condition must unambiguously reference a grouping column, unless it appears within an aggregate function.

SELECT SUM(amount)
FROM orders
GROUP BY users
HAVING SUM(amount) > 50

The presence of a HAVING clause turns a query into a grouped query, even if there is no GROUP BY clause. It’s the same as what happens when the query contains aggregate functions but no GROUP BY clause. The query considers all selected rows to form a single group, and the SELECT list and HAVING clause can reference only table columns from within aggregate functions. Such a query emits a single row if the HAVING condition is true, and zero rows if it’s not true.