Machine-Learning Preprocessing Functions in Confluent Cloud for Apache Flink¶
The following built-in functions are available for ML preprocessing in Confluent Cloud for Apache Flink®. These functions help transform features into representations more suitable for downstream processors.
ML_BUCKETIZE¶
Bucketizes numerical values into discrete bins based on split points.
- Syntax
ML_BUCKETIZE(value, splitBucketPoints [, bucketNames])
- Description
- The
ML_BUCKETIZEfunction divides numerical values into discrete buckets based on specified split points. Each bucket represents a range of values, and the function returns the bucket index or name for each input value. - Arguments
- value: Numerical expression to be bucketized. If the input value is
NaNorNULL, it is bucketized to theNULLbucket. - splitBucketPoints: Array of numerical values that define the bucket
boundaries, or split points.
- If the
splitBucketPointsarray is empty, an exception is thrown. - Any split points that are
NaNorNULLare removed from thesplitBucketPointsarray. splitBucketPointsmust be in ascending order, or an exception is thrown.- Duplicates are removed from
splitBucketPoints.
- If the
- bucketNames: (Optional) Array of names of the buckets defined in
splitBucketPoints.- If the
bucketNamesarray is not provided, buckets are namedbin_NULL,bin_1,bin_2…bin_n, withnbeing the total number of buckets insplitBucketPoints. - If the
bucketNamesarray is provided, names must be in the same order as in thesplitBucketPointsarray. - Names for all of the buckets must be provided, including the
NULLbucket, or an exception is thrown. - If the
bucketNamesarray is provided, the first name is the name for theNULLbucket.
- If the
- value: Numerical expression to be bucketized. If the input value is
- Example
-- returns 'bin_2' SELECT ML_BUCKETIZE(2, ARRAY[1, 4, 7]); -- returns 'b2' SELECT ML_BUCKETIZE(2, ARRAY[1, 4, 7], ARRAY['b_null','b1','b2','b3','b4']);
ML_CHARACTER_TEXT_SPLITTER¶
Splits text into chunks based on character count and separators.
- Syntax
ML_CHARACTER_TEXT_SPLITTER(text, chunkSize, chunkOverlap, separator, isSeparatorRegex [, trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_CHARACTER_TEXT_SPLITTERfunction splits text into chunks based on character count and specified separators. This is useful for processing large text documents into smaller, manageable pieces.If any argument other than
textisNULL, an exception is thrown.The returned array of chunks has the same order as the input.
The function tries to keep every chunk within the
chunkSizelimit, but if a chunk is more than the limit, it is returned as is.- Arguments
- text: The input text to be split. If the input text is
NULL, it is returned as is. - chunkSize: The size of each chunk. If
chunkSize < 0orchunkOverlap > chunkSize, an exception is thrown. - chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0, an exception is thrown. - separator: The separator used for splitting.
- isSeparatorRegex: Whether the separator is a regex pattern.
- trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE. - keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE. - separatorPosition: (Optional) The position of the separator. Valid
values are
STARTorEND. The default isSTART.STARTmeans place the separator at the start of the following chunk, andENDmeans place the separator at the end of the previous chunk.
- text: The input text to be split. If the input text is
- Example
-- returns ['This is the text I would like to ch', 'o chunk up. It is the example text ', 'ext for this exercise'] SELECT ML_CHARACTER_TEXT_SPLITTER('This is the text I would like to chunk up. It is the example text for this exercise', 35, 4, '', TRUE, FALSE, TRUE, 'END');
ML_FILE_FORMAT_TEXT_SPLITTER¶
Splits text into chunks based on specific file format patterns.
- Syntax
ML_FILE_FORMAT_TEXT_SPLITTER(text, chunkSize, chunkOverlap, formatName, [trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_FILE_FORMAT_TEXT_SPLITTERfunction splits text into chunks based on specific file format patterns. It uses format-specific separators to split code intelligently or structure text.The returned array of chunks has the same order as the input.
The function starts splitting the chunks with the first separator in the separators list. If a chunk is bigger than
chunkSize, the function splits the chunk recursively using the next separator in the separators list for the given file format. If separators are exhausted, and the remaining text is bigger thanchunkSize, the function returns the smallest chunk possible, even though it is bigger thanchunkSize.- Arguments
text: The input text to be split. If the input text is
NULL, it is returned as is.chunkSize: The size of each chunk. If
chunkSize < 0orchunkOverlap > chunkSize, an exception is thrown.chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0, an exception is thrown.formatName: ENUM of the format names. Valid values are:
Valid values for formatName
CCPPCSHARPELIXIRGOHTMLJAVAJAVASCRIPTJSONKOTLINLATEXMARKDOWNPHPPYTHONRUBYRUSTSCALASQLSWIFTTYPESCRIPTXML
- trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE. - keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE. - separatorPosition: (Optional) The position of the separator. Valid
values are
STARTorEND. The default isSTART.STARTmeans place the separator at the start of the following chunk, andENDmeans place the separator at the end of the previous chunk.
- Example
-- returns ['def hello_world():\n print("Hello, World!")', '# Call the function\nhello_world()'] SELECT ML_FILE_FORMAT_TEXT_SPLITTER('def hello_world():\n print("Hello, World!")\n\n# Call the function\nhello_world()\n', 50, 0, 'PYTHON');
ML_LABEL_ENCODER¶
Encodes categorical variables into numerical labels.
- Syntax
ML_LABEL_ENCODER(input, categories [, includeZeroLabel])
- Description
- The
ML_LABEL_ENCODERfunction encodes categorical variables into numerical labels. Each unique category is assigned a unique integer label. - Arguments
- input: Input value to encode.
- If the input value is
NULL,NaN, orInfinity, it is considered in the unknown category, which is given the0label. - If the input value is not one of the
categories, it is labeled as-1or0depending onincludeZeroLabel:-1ifincludeZeroLabelis TRUE and0ifincludeZeroLabelis FALSE.
- If the input value is
- categories: Arrays of category values to encode input value to.
Category values must be the same type as the
inputvalue.- If the
categoriesarray is empty, all inputs are considered to be in the unknown category, which is given the0label. - The
categoriesarray can’t beNULL, or an exception is thrown. - The
categoriesarray can’t haveNULLor duplicate values, or an exception is thrown. - The
categoriesarray must be sorted in ascending lexicographical order, or an exception is thrown.
- If the
- includeZeroLabel: (Optional) The start index for valid categories is
0. The default isFALSE.- If
includeZeroLabelisTRUE, the valid categories index starts at0, and unknown values are labeled as-1. - If
includeZeroLabelisFALSE, the valid categories index starts at1, and unknown values are labeled as0.
- If
- input: Input value to encode.
- Example
-- returns 1 SELECT ML_LABEL_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj']); -- returns 0 SELECT ML_LABEL_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj'], TRUE );
ML_MAX_ABS_SCALER¶
Scales numerical values by their maximum absolute value.
- Syntax
ML_MAX_ABS_SCALER(value, absoluteMax)
- Description
- The
ML_MAX_ABS_SCALERfunction scales numerical values by dividing them by the maximum absolute value. This preserves zero entries in sparse data. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL,NaN, orInfinity, it is returned as is. - absoluteMax: Absolute Maximum value of the feature data seen in the
dataset.
- If
absoluteMaxisNULLorNaN, an exception is thrown. - If
absoluteMaxisInfinity,0is returned. - If
absoluteMaxis0, the scaled value is returned as is.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.2 SELECT ML_MAX_ABS_SCALER(1, 5);
ML_MIN_MAX_SCALER¶
Scales numerical values to a specified range using min-max normalization.
- Syntax
ML_MIN_MAX_SCALER(value, min, max)
- Description
- The
ML_MIN_MAX_SCALERfunction scales numerical values to a specified range using min-max normalization. The function transforms values to the range[0, 1]by default, or to a custom range ifminandmaxare specified. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL,NaN, orInfinity, it is returned as is.- If
value > max, it is set to1.0. - If
value < min, it is set to0.0. - If
max == min, the range is set to1.0to avoid division by zero.
- If
- min: Minimum value of the feature data seen in the dataset. If
minisNULL,NaN, orInfinity, an exception is thrown. - max: Maximum value of the feature data seen in the dataset.
- If
maxisNULL,NaN, orInfinity, an exception is thrown. - If
max < min, an exception is thrown.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.25 SELECT ML_MIN_MAX_SCALER(2, 1, 5);
ML_NGRAMS¶
Generates n-grams from an array of strings.
- Syntax
ML_NGRAMS(input [, nValue] [, separator])
- Description
The
ML_NGRAMSfunction generates n-grams from an array of strings. N-grams are contiguous sequences of n items from a given sample of text.The ordering of the returned output is the same as the
inputarray.- Arguments
- input: Array of CHAR or VARCHAR to return n-gram for.
- If the
inputarray hasNULL, it is ignored while forming N-GRAMS. - If the
inputarray isNULLor empty, an empty N-GRAMS array is returned. - Empty strings in the
inputarray are treated as is. - Strings with only whitespace are treated as empty strings.
- If the
- nValue: (Optional) N value of n-gram function. The default is
2.- If
nValue < 1, an exception is thrown. - If
nValue > input.size(), an empty N-GRAMS array is returned.
- If
- separator: (Optional) Characters to join n-gram values with. The default is whitespace.
- input: Array of CHAR or VARCHAR to return n-gram for.
- Example
-- returns ['ab', 'cd', 'de', 'pwe'] SELECT ML_NGRAMS(ARRAY['ab', 'cd', 'de', 'pwe'], 1, '#'); -- returns ['ab#cd', 'cd#de'] SELECT ML_NGRAMS(ARRAY['ab','cd','de', NULL], 2, '#');
ML_NORMALIZER¶
Normalizes numerical values using p-norm normalization.
- Syntax
ML_NORMALIZER(value, normValue)
- Description
- The
ML_NORMALIZERfunction normalizes numerical values using p-norm normalization. This scales each sample to have unit norm. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL,NaN, orInfinity, it is returned as is. - normValue: Calculated norm value of the feature data using p-norm.
- If
normValueisNULLorNaN, an exception is thrown. - If
normValueisInfinity,0is returned. - If
normValueis0, which is only possible when all the values are0, the input value is returned as is.
- If
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.6 SELECT ML_NORMALIZER(3.0, 5.0);
ML_ONE_HOT_ENCODER¶
Encodes categorical variables into a binary vector representation.
- Syntax
ML_ONE_HOT_ENCODER(input, categories [, dropLast] [, handleUnknown])
- Description
- The
ML_ONE_HOT_ENCODERfunction encodes categorical variables into a binary vector representation. Each category is represented by a binary vector where only one element is 1 and the rest are 0. - Arguments
- input: Input value to encode. If the input value is
NULL, it is considered to be in the unknown category. - categories: Array of category values to encode input value to. The
inputargument must be of same type as thecategoriesarray.- If the
categoriesarray is empty, an exception is thrown. - The
categoriesarray can’t beNULL, or an exception is thrown. - The
categoriesarray can’t haveNULLor duplicate values, or an exception is thrown.
- If the
- dropLast: (Optional) Whether to drop the last category. The
default is
TRUE. By default, the last category is dropped, to prevent perfectly collinear features. - handleUnknown: (Optional)
ERROR,IGNORE,KEEPoptions to indicate how to handle unknown values. The default isIGNORE.- If
handleUnknownisERROR, an exception is thrown when the input is an unknown value. - If
handleUnknownisIGNORE, unknown values are ignored and values of all the columns are 0. - If
handleUnknownisKEEP, the unknown category column has value 1. - If
handleUnknownisKEEP, the last column is for the unknown category.
- If
- input: Input value to encode. If the input value is
- Example
-- returns [1, 0, 0, 0] SELECT ML_ONE_HOT_ENCODER('abc', ARRAY['abc', 'def', 'efg', 'hikj']); -- returns [0, 0, 0, 0, 1] SELECT ML_ONE_HOT_ENCODER('abcd', ARRAY['abc', 'def', 'efg', 'hik'], TRUE, 'KEEP' );
ML_RECURSIVE_TEXT_SPLITTER¶
Splits text into chunks using multiple separators recursively.
- Syntax
ML_RECURSIVE_TEXT_SPLITTER(text, chunkSize, chunkOverlap [, separators] [, isSeparatorRegex] [, trimWhitespace] [, keepSeparator] [, separatorPosition])
- Description
The
ML_RECURSIVE_TEXT_SPLITTERfunction splits text into chunks using multiple separators recursively. It starts with the first separator and recursively applies subsequent separators if chunks are still too large.If any argument other than
textisNULL, an exception is thrown.The returned array of chunks has the same order as the input.
- Arguments
- text: The input text to be split. If the input text is
NULL, it is returned as is. - chunkSize: The size of each chunk. If
chunkSize < 0orchunkOverlap > chunkSize, an exception is thrown. - chunkOverlap: The number of overlapping characters between chunks. If
chunkOverlap < 0, an exception is thrown. - separators: (Optional) The list of separators used for splitting.
The default is
["\n\n", "\n", " ", ""] - isSeparatorRegex: (Optional) Whether the separator is a regex pattern.
The default is
FALSE - trimWhitespace: (Optional) Whether to trim whitespace from chunks.
The default is
TRUE - keepSeparator: (Optional) Whether to keep the separator in the chunks.
The default is
FALSE - separatorPosition: (Optional) The position of the separator. Valid
values are
STARTorEND. The default isSTART.STARTmeans place the separator at the start of the following chunk, andENDmeans place the separator at the end of the previous chunk.
- text: The input text to be split. If the input text is
- Example
-- returns ['Hello', '. world', '!'] SELECT ML_RECURSIVE_TEXT_SPLITTER('Hello. world!', 0, 0, ARRAY['[!]','[.]'], TRUE, TRUE, TRUE, 'START');
ML_ROBUST_SCALER¶
Scales numerical values using statistics that are robust to outliers.
- Syntax
ML_ROBUST_SCALER(value, median, firstQuartile, thirdQuartile [, withCentering, withScaling)
- Description
- The
ML_ROBUST_SCALERfunction scales numerical values using statistics that are robust to outliers. It removes the median and scales the data according to the quantile range. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL,NaN, orInfinity, it is returned as is. - median: Median of the feature data seen in the training dataset. If
medianisNULL,NaN, orInfinity, an exception is thrown. - firstQuartile: First Quartile of feature data seen in the dataset. If
firstQuartileisNULL,NaN, orInfinity, an exception is thrown. - thirdQuartile: Third Quartile of feature data seen in the dataset.
- If
thirdQuartileisNULL,NaN, orInfinity, an exception is thrown. - If
thirdQuartile - firstQuartile = 0, the range is set to1.0to avoid division by zero.
- If
- withCentering: (Optional) Boolean value indicating to center the
numerical value using median before scaling. The default is
TRUE. IfwithCenteringisFALSE, the median value is ignored. - withScaling: (Optional) Boolean value indicating to scale the numerical
value using IQR after centering. The default is
TRUE. IfwithScalingisFALSE, the firstQuartile and thirdQuartile values are ignored.
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.3333333333333333 SELECT ML_ROBUST_SCALER(2, 1, 0, 3, TRUE, TRUE);
ML_STANDARD_SCALER¶
Standardizes numerical values by removing the mean and scaling to unit variance.
- Syntax
ML_STANDARD_SCALER(value, mean, standardDeviation [, withCentering] [, withScaling])
- Description
- The
ML_STANDARD_SCALERfunction standardizes numerical values by removing the mean and scaling to unit variance. This is useful for features that follow a normal distribution. - Arguments
- value: Numerical expression to be scaled. If the input value is
NULL, NaNorInfinity, it is returned as is. - mean: Mean of the feature data seen in the dataset. If
meanisNULL, NaNorInfinity, an exception is thrown. - standardDeviation: Standard Deviation of the feature data seen in the
dataset. If
standardDeviationisNULLorNaN, an exception is thrown.- If
standardDeviationisInfinity,0is returned. - If
standardDeviationis0, the value does not need to be scaled, so it is returned as is.
- If
- withCentering: (Optional) Boolean value indicating to center the
numerical value using mean before scaling. The default is
TRUE. IfwithCenteringisFALSE, the mean value is ignored. - withScaling: (Optional) Boolean value indicating to scale the numerical
value using std after centering. The default is
TRUE. IfwithScalingisFALSE, thestandardDeviationvalue is ignored.
- value: Numerical expression to be scaled. If the input value is
- Example
-- returns 0.2 SELECT ML_STANDARD_SCALER(2, 1, 5, TRUE, TRUE);