Aggregate Functions in Confluent Cloud for Apache Flink¶
Confluent Cloud for Apache Flink®️ provides these built-in functions to aggregate rows in Flink SQL queries:
AVG | COLLECT | COUNT | CUME_DIST |
DENSE_RANK | FIRST_VALUE | LAG | LAST_VALUE |
LEAD | LISTAGG | MAX | MIN |
NTILE | PERCENT_RANK | RANK | ROW_NUMBER |
STDDEV_POP | STDDEV_SAMP | SUM | VAR_POP |
VAR_SAMP | VARIANCE |
The aggregate functions take an expression across all the rows as the input and return a single aggregated value as the result.
AVG¶
- Syntax
AVG([ ALL | DISTINCT ] expression)
- Description
By default or with keyword
ALL
, returns the average (arithmetic mean) ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.- Example
-- returns 1.500000 SELECT AVG(my_values) FROM (VALUES (0.0), (1.0), (2.0), (3.0)) AS my_values;
COLLECT¶
- Syntax
COLLECT([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns a multiset ofexpression
over all input rows.NULL values are ignored.
Use
DISTINCT
to return one unique instance of each value.
COUNT¶
- Syntax
COUNT([ ALL ] expression | DISTINCT expression1 [, expression2]*)
- Description
By default or with
ALL
, returns the number of input rows for which expression isn’t NULL.Use
DISTINCT
to return one unique instance of each value.Use
COUNT(*)
orCOUNT(1)
to return the number of input rows.- Example
-- returns 4 SELECT COUNT(my_values) FROM (VALUES (0), (1), (2), (3)) AS my_values;
CUME_DIST¶
- Syntax
CUME_DIST()
- Description
- Returns the cumulative distribution of a value in a group of values. The result is the number of rows preceding or equal to the current row in the partition ordering divided by the number of rows in the window partition.
DENSE_RANK¶
FIRST_VALUE¶
- Syntax
FIRST_VALUE(expression)
- Description
- Returns the first value in an ordered set of values.
- Example
-- returns first SELECT FIRST_VALUE(my_values) FROM (VALUES ('first'), ('second'), ('third')) AS my_values;
- Related function
LAG¶
- Syntax
LAG(expression [, offset] [, default])
- Description
Returns the value of expression at the offsetth row before the current row in the window.
The default value of
offset
is 1, and the default value of thedefault
argument is NULL.- Example
The following example shows how to use the LAG function to see player scores changing over time.
SELECT $rowtime AS row_time , player_id , game_room_id , points , LAG(points, 1) OVER (PARTITION BY player_id ORDER BY $rowtime) previous_points_value FROM gaming_player_activity;
For the full code example, see Compare Current and Previous Values in a Data Stream.
- Related function
LAST_VALUE¶
- Syntax
LAST_VALUE(expression)
- Description
- Returns the last value in an ordered set of values.
- Example
-- returns third SELECT LAST_VALUE(my_values) FROM (VALUES ('first'), ('second'), ('third')) AS my_values;
- Related function
LEAD¶
- Syntax
LEAD(expression [, offset] [, default])
- Description
Returns the value of the expression at the offsetth row after the current row in the window.
The default value of
offset
is 1, and the default value of thedefault
argument is NULL.- Related function
LISTAGG¶
- Syntax
LISTAGG(expression [, separator])
- Description
Concatenates the values of string expressions and inserts separator values between them.
The separator isn’t added at the end of string.
The default value of separator is
','
.- Example
-- returns first,second,third SELECT LISTAGG(my_values) FROM (VALUES ('first'), ('second'), ('third')) AS my_values;
MAX¶
- Syntax
MAX([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns the maximum value ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.- Examples
-- returns 3 SELECT MAX(my_values) FROM (VALUES (0), (1), (2), (3)) AS my_values;
The following example shows how to use the MAX function to find the highest player score in a tumbling window.
SELECT window_start, window_end, SUM(points) AS total, MIN(points) as min_points, MAX(points) as max_points FROM TABLE(TUMBLE(TABLE gaming_player_activity_source, DESCRIPTOR($rowtime), INTERVAL '10' SECOND)) GROUP BY window_start, window_end;
For the full code example, see Aggregate a Stream in a Tumbling Window.
- Related function
MIN¶
- Syntax
MIN([ ALL | DISTINCT ] expression )
- Description
By default or with the
ALL
keyword, returns the minimum value ofexpression
across all input rows.Use
DISTINCT
to return one unique instance of each value.- Examples
-- returns 0 SELECT MIN(my_values) FROM (VALUES (0), (1), (2), (3)) AS my_values;
The following example shows how to use the MIN function to find the lowest player score in a tumbling window.
SELECT window_start, window_end, SUM(points) AS total, MIN(points) as min_points, MAX(points) as max_points FROM TABLE(TUMBLE(TABLE gaming_player_activity_source, DESCRIPTOR($rowtime), INTERVAL '10' SECOND)) GROUP BY window_start, window_end;
For the full code example, see Aggregate a Stream in a Tumbling Window.
- Related function
NTILE¶
- Syntax
NTILE(n)
- Description
Divides the rows for each window partition into
n
buckets ranging from 1 to at mostn
.If the number of rows in the window partition doesn’t divide evenly into the number of buckets, the remainder values are distributed one per bucket, starting with the first bucket.
For example, with 6 rows and 4 buckets, the bucket values would be:
1 1 2 2 3 4
PERCENT_RANK¶
- Syntax
PERCENT_RANK()
- Description
Returns the percentage ranking of a value in a group of values.
The result is the rank value minus one, divided by the number of rows in the parition minus one.
If the partition only contains one row, the
PERCENT_RANK
function returns 0.
RANK¶
- Syntax
RANK()
- Description
Returns the rank of a value in a group of values.
The result is one plus the number of rows preceding or equal to the current row in the partition ordering.
The values produce gaps in the sequence.
- Related functions
ROW_NUMBER¶
- Syntax
ROW_NUMBER()
- Description
Assigns a unique, sequential number to each row, starting with one, according to the ordering of rows within the window partition.
The
ROW_NUMBER
andRANK
functions are similar.ROW_NUMBER
numbers all rows sequentially, for example,1, 2, 3, 4, 5
.RANK
provides the same numeric value for ties, for example1, 2, 2, 4, 5
.- Related functions
STDDEV_POP¶
- Syntax
STDDEV_POP([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns the population standard deviation ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.- Example
-- returns 0.986154 SELECT STDDEV_POP(my_values) FROM (VALUES (0.5), (1.5), (2.2), (3.2)) AS my_values;
- Related function
STDDEV_SAMP¶
- Syntax
STDDEV_SAMP([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns the sample standard deviation ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.- Example
-- returns 1.138713 SELECT STDDEV_SAMP(my_values) FROM (VALUES (0.5), (1.5), (2.2), (3.2)) AS my_values;
- Related function
SUM¶
- Syntax
SUM([ ALL | DISTINCT ] expression)
By default or with the
ALL
keyword, returns the sum ofexpression
across all input rows.Use
DISTINCT
to return one unique instance of each value.- Examples
-- returns 6 SELECT SUM(my_values) FROM (VALUES (0), (1), (2), (3)) AS my_values;
The following example shows how to use the SUM function to find the total of player scores in a tumbling window.
SELECT window_start, window_end, SUM(points) AS total, MIN(points) as min_points, MAX(points) as max_points FROM TABLE(TUMBLE(TABLE gaming_player_activity_source, DESCRIPTOR($rowtime), INTERVAL '10' SECOND)) GROUP BY window_start, window_end;
For the full code example, see Aggregate a Stream in a Tumbling Window.
VAR_POP¶
- Syntax
VAR_POP([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns the population variance, which is the square of the population standard deviation, ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.- Example
-- returns 0.972500 SELECT VAR_POP(my_values) FROM (VALUES (0.5), (1.5), (2.2), (3.2)) AS my_values;
- Related function
VAR_SAMP¶
- Syntax
VAR_SAMP([ ALL | DISTINCT ] expression)
- Description
By default or with the
ALL
keyword, returns the sample variance, which is the square of the sample standard deviation, ofexpression
over all input rows.Use
DISTINCT
to return one unique instance of each value.The
VARIANCE
function is equivalent toVAR_SAMP
.- Example
-- returns 1.296667 SELECT VAR_SAMP(my_values) FROM (VALUES (0.5), (1.5), (2.2), (3.2)) AS my_values;
- Related functions