Aggregate Functions ksqlDB for Confluent Platform

Important

In an aggregation function, providing a * character or an empty argument list causes the function to return only the ROWTIME column. For example, calling AVG(*) or AVG() returns the average of ROWTIME.

`AVG`

Applies to: stream, table
Since: 0.6.0

AVG(col1)

Returns the average value for col1.

Tip

See AVG in action:

Build a dynamic pricing strategy

`COLLECT_LIST`

Applies to: stream, table

COLLECT_LIST(col1) => ARRAY

Returns an array containing all the values of col1 from each input row (for the specified grouping and time window, if any).

The size of the result ARRAY can be limited to a maximum of ksql.functions.collect_list.limit entries, and any values beyond this limit are ignored silently.

Note

In Confluent Cloud, the ksql.functions.collect_list.limit config is set to 1000 and can’t be changed.

When used with SESSION window, sometimes two session windows are merged together into one, when a out-of-order record with a timestamp between the two windows is processed. In this case, the record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

Tip

See COLLECT_LIST in action:

Automate instant payment verifications

`COLLECT_SET`

Applies to: stream

COLLECT_SET(col1) => ARRAY

Returns an array containing the distinct values of col1 from each input row (for the specified grouping and time window, if any).

The size of the result ARRAY can be limited to a maximum of ksql.functions.collect_set.limit entries, and any values beyond this limit are ignored silently.

Note

In Confluent Cloud, the ksql.functions.collect_set.limit config is set to 1000 and can’t be changed.

When used with a SESSION window, sometimes two session windows are merged together into one, when a out-of-order record with a timestamp between the two windows is processed. In this case, the record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

`CORRELATION`

Applies to: stream, table
Since: 0.29.0

CORRELATION(x, y)

Returns the Pearson correlation coefficient between columns x and y. If either value in x or y is NULL for a particular row, that row is ignored.

If all rows contain NULL for x or y, or if there is only one non-null row, NaN is returned. When there are only two non-null rows, either 1.0 or -1.0 is returned, depending on the sign of the slope of the line that would be drawn between the two points.

If all the values in one column are equal, NaN is returned.

`COUNT`

Applies to: stream, table

COUNT(col1) COUNT(*)

Counts the number of rows.

When col1 is specified, the count returned is the number of rows where col1 is non-null.

When * is specified, the count returned is the total number of rows.

Tip

See COUNT in action:

`COUNT_DISTINCT`

Applies to: stream
Since: 0.7.0

COUNT_DISTINCT(col1)

Returns the approximate number of unique values of col1 in a group.

The function implementation uses HyperLogLog to estimate cardinalities of 10^9 with a typical standard error of 2 percent.

`EARLIEST_BY_OFFSET`

Applies to: stream
Since: 0.10.0

EARLIEST_BY_OFFSET(col1, [ignoreNulls])

Return the earliest value for the specified column.

The earliest value in the partition has the lowest offset.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether NULL values are ignored. The default is to ignore NULL values.

sql title="Applies to: stream<br>Since: 0.13.0" EARLIEST_BY_OFFSET(col1, earliestN, [ignoreNulls])

Returns the earliest N values for the specified column as an ARRAY.

The earliest values in the partition have the lowest offsets.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether NULL values are ignored. The default is to ignore NULL values.

`HISTOGRAM`

Applies to: stream, table

HISTOGRAM(col1)

Returns a map containing the distinct string values of col1 mapped to the number of times each one occurs for the given window.

The HISTOGRAM function limits the number of distinct values which can be counted to 1000, beyond which any additional entries are ignored.

When used with a SESSION window, sometimes two session windows are merged together into one, when a out-of-order record with a timestamp between the two windows is processed. In this case, the 1000-record limit is calculated by first considering all the records from the first window, then the out-of-order record, then the records from the second window in the order they were originally processed.

Tip

See HISTOGRAM in action:

Automate instant payment verifications

`LATEST_BY_OFFSET`

Applies to: stream
Since: 0.8.0

LATEST_BY_OFFSET(col1, [ignoreNulls])

Returns the latest value for the specified column.

The latest value in the partition has the largest offset.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether NULL values are ignored. The default is to ignore NULL values.

sql title="Applies to: stream<br>Since: 0.13.0" LATEST_BY_OFFSET(col1, latestN, [ignoreNulls])

Returns the latest N values for the specified column as an ARRAY.

The latest values have the largest offset.

The optional ignoreNulls parameter, available since version 0.13.0, controls whether NULL values are ignored. The default is to ignore NULL values.

`MAX`

Applies to: stream

MAX(col1)

Returns the maximum value for a given column and window.

Rows that have col1 set to NULL are ignored.

Tip

See MAX in action:

`MIN`

Applies to: stream

MIN(col1)

Returns the minimum value for a given column and window.

Rows that have col1 set to NULL are ignored.

Tip

See MIN in action:

Build a dynamic pricing strategy

`STDDEV_SAMP`

Applies to: stream, table
Since: 0.16.0

STDDEV_SAMP(col1)

Returns the sample standard deviation for the column.

This function returns the square of the standard deviation instead of the standard deviation. Use the STDDEV_SAMPLE function to compute the standard deviation.

`STDDEV_SAMPLE`

Applies to: stream, table
Since: 0.25.0

STDDEV_SAMP(col1)

Returns the sample standard deviation for the column.

`SUM`

Applies to: stream, table

SUM(col1)

Sums the column values.

Rows that have col1 set to NULL are ignored.

Tip

See SUM in action:

`TOPK`

Applies to: stream

TOPK(col1, otherCols..., k)``

Returns the Top K values for the given column and window.

If only col1 is provided, an array of values for that column is returned. If otherCols are provided, a list of STRUCTs is returned. Each STRUCT has a field named sort_col that contains value of col1 in the associated record. otherCols are in fields named col0, col1, col2, etc., in the order they were provided. otherCols do not all have to be of the same type.

Rows that have col1 set to NULL are ignored.

Example

SELECT orderzip_code, TOPK(order_total, 5)
FROM orders WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY order_zipcode
EMIT CHANGES;

`TOPKDISTINCT`

Applies to: stream

TOPKDISTINCT(col1, k)``

Returns the distinct Top K values for the given column and window.

Rows that have col1 set to NULL are ignored.

Example

SELECT pageid, TOPKDISTINCT(viewtime, 5)
FROM pageviews_users
GROUP BY pageid
EMIT CHANGES;``

Aggregate Functions ksqlDB for Confluent Platform

AVG

COLLECT_LIST

COLLECT_SET

CORRELATION

COUNT

COUNT_DISTINCT

EARLIEST_BY_OFFSET

HISTOGRAM

LATEST_BY_OFFSET

MAX

MIN

STDDEV_SAMP

STDDEV_SAMPLE

SUM

TOPK

TOPKDISTINCT

`AVG`

`COLLECT_LIST`

`COLLECT_SET`

`CORRELATION`

`COUNT`

`COUNT_DISTINCT`

`EARLIEST_BY_OFFSET`

`HISTOGRAM`

`LATEST_BY_OFFSET`

`MAX`

`MIN`

`STDDEV_SAMP`

`STDDEV_SAMPLE`

`SUM`

`TOPK`

`TOPKDISTINCT`