Aggregate Functions ksqlDB for Confluent Platform¶
Important
In an aggregation function, providing a *
character or an empty argument
list causes the function to return only the ROWTIME column. For example,
calling AVG(*)
or AVG()
returns the average of ROWTIME.
COLLECT_LIST
¶
- Applies to: stream, table
COLLECT_LIST(col1) => ARRAY
Returns an array containing all the values of col1
from each input
row (for the specified grouping and time window, if any).
The size of the result ARRAY can be limited to a maximum of
ksql.functions.collect_list.limit
entries, and any values beyond
this limit are ignored silently.
Note
In Confluent Cloud, the ksql.functions.collect_list.limit
config is set
to 1000 and can’t be changed.
When used with SESSION
window, sometimes two session windows are
merged together into one, when a out-of-order record with a timestamp
between the two windows is processed. In this case, the record limit is
calculated by first considering all the records from the first window,
then the out-of-order record, then the records from the second window in
the order they were originally processed.
COLLECT_SET
¶
- Applies to: stream
COLLECT_SET(col1) => ARRAY
Returns an array containing the distinct values of col1
from each
input row (for the specified grouping and time window, if any).
The size of the result ARRAY can be limited to a maximum of
ksql.functions.collect_set.limit
entries, and any values beyond this
limit are ignored silently.
Note
In Confluent Cloud, the ksql.functions.collect_set.limit
config is set
to 1000 and can’t be changed.
When used with a SESSION
window, sometimes two session windows are
merged together into one, when a out-of-order record with a timestamp
between the two windows is processed. In this case, the record limit is
calculated by first considering all the records from the first window,
then the out-of-order record, then the records from the second window in
the order they were originally processed.
CORRELATION
¶
- Applies to: stream, table
- Since: 0.29.0
CORRELATION(x, y)
Returns the Pearson correlation coefficient between columns x
and
y
. If either value in x
or y
is NULL
for a particular
row, that row is ignored.
If all rows contain NULL
for x
or y
, or if there is only one
non-null row, NaN
is returned. When there are only two non-null
rows, either 1.0
or -1.0
is returned, depending on the sign of
the slope of the line that would be drawn between the two points.
If all the values in one column are equal, NaN
is returned.
COUNT
¶
- Applies to: stream, table
COUNT(col1) COUNT(*)
Counts the number of rows.
When col1
is specified, the count returned is the number of rows
where col1
is non-null.
When *
is specified, the count returned is the total number of rows.
Tip
See COUNT in action:
COUNT_DISTINCT
¶
- Applies to: stream
- Since: 0.7.0
COUNT_DISTINCT(col1)
Returns the approximate number of unique values of col1
in a
group.
The function implementation uses HyperLogLog to estimate cardinalities of 10^9 with a typical standard error of 2 percent.
EARLIEST_BY_OFFSET
¶
- Applies to: stream
- Since: 0.10.0
EARLIEST_BY_OFFSET(col1, [ignoreNulls])
Return the earliest value for the specified column.
The earliest value in the partition has the lowest offset.
The optional ignoreNulls
parameter, available since version 0.13.0,
controls whether NULL
values are ignored. The default is to ignore
NULL
values.
sql title="Applies to: stream<br>Since: 0.13.0" EARLIEST_BY_OFFSET(col1, earliestN, [ignoreNulls])
Returns the earliest N values for the specified column as an
ARRAY
.
The earliest values in the partition have the lowest offsets.
The optional ignoreNulls
parameter, available since version 0.13.0,
controls whether NULL
values are ignored. The default is to ignore
NULL
values.
HISTOGRAM
¶
- Applies to: stream, table
HISTOGRAM(col1)
Returns a map containing the distinct string values of col1
mapped
to the number of times each one occurs for the given window.
The HISTOGRAM
function limits the number of distinct values which
can be counted to 1000, beyond which any additional entries are ignored.
When used with a SESSION
window, sometimes two session windows are
merged together into one, when a out-of-order record with a timestamp
between the two windows is processed. In this case, the 1000-record
limit is calculated by first considering all the records from the first
window, then the out-of-order record, then the records from the second
window in the order they were originally processed.
LATEST_BY_OFFSET
¶
- Applies to: stream
- Since: 0.8.0
LATEST_BY_OFFSET(col1, [ignoreNulls])
Returns the latest value for the specified column.
The latest value in the partition has the largest offset.
The optional ignoreNulls
parameter, available since version 0.13.0,
controls whether NULL
values are ignored. The default is to ignore
NULL
values.
sql title="Applies to: stream<br>Since: 0.13.0" LATEST_BY_OFFSET(col1, latestN, [ignoreNulls])
Returns the latest N values for the specified column as an ARRAY
.
The latest values have the largest offset.
The optional ignoreNulls
parameter, available since version 0.13.0,
controls whether NULL
values are ignored. The default is to ignore
NULL
values.
MAX
¶
- Applies to: stream
MAX(col1)
Returns the maximum value for a given column and window.
Rows that have col1
set to NULL
are ignored.
MIN
¶
- Applies to: stream
MIN(col1)
Returns the minimum value for a given column and window.
Rows that have col1
set to NULL
are ignored.
STDDEV_SAMP
¶
- Applies to: stream, table
- Since: 0.16.0
STDDEV_SAMP(col1)
Returns the sample standard deviation for the column.
This function returns the square of the standard deviation instead of the
standard deviation. Use the STDDEV_SAMPLE
function to compute the standard
deviation.
STDDEV_SAMPLE
¶
- Applies to: stream, table
- Since: 0.25.0
STDDEV_SAMP(col1)
Returns the sample standard deviation for the column.
SUM
¶
- Applies to: stream, table
SUM(col1)
Sums the column values.
Rows that have col1
set to NULL
are ignored.
TOPK
¶
- Applies to: stream
TOPK(col1, otherCols..., k)``
Returns the Top K values for the given column and window.
If only col1
is provided, an array of values for that column is
returned. If otherCols
are provided, a list of STRUCT
s is
returned. Each STRUCT
has a field named sort_col
that contains
value of col1
in the associated record. otherCols
are in fields
named col0
, col1
, col2
, etc., in the order they were
provided. otherCols
do not all have to be of the same type.
Rows that have col1
set to NULL
are ignored.
- Example
SELECT orderzip_code, TOPK(order_total, 5) FROM orders WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY order_zipcode EMIT CHANGES;
TOPKDISTINCT
¶
- Applies to: stream
TOPKDISTINCT(col1, k)``
Returns the distinct Top K values for the given column and window.
Rows that have col1
set to NULL
are ignored.
- Example
SELECT pageid, TOPKDISTINCT(viewtime, 5) FROM pageviews_users GROUP BY pageid EMIT CHANGES;``