Detect PII in Text with Confluent Cloud for Apache Flink
You can detect and protect personally identifiable information (PII) directly in your streaming data. Instead of building custom detection logic or routing data to external services, you call a single SQL function that detects and redacts PII as data flows through your streams.
For example, when customer messages, support tickets, or application logs contain social security numbers, email addresses, or phone numbers, you can detect this sensitive information and redact it before the data reaches downstream systems.
Confluent Cloud for Apache Flink® provides PII detection on your streams through the following function:
AI_DETECT_PII (Early Access Program): Detect and redact personally identifiable information in text.
AI_DETECT_PII
You can use the AI_DETECT_PII function to scan text for personally identifiable information. By default, the function uses Piiranha, a built-in model hosted in Confluent Cloud, to identify PII in your data.
AI_DETECT_PII supports two modes:
Detection mode (default): Identifies PII in the text and returns structured results with PII types and locations.
Redaction mode: Replaces detected PII with redacted values.
Note
The AI_DETECT_PII function is an Early Access Program feature in Confluent Cloud.
An Early Access feature is a component of Confluent Cloud introduced to gain feedback. This feature should be used only for evaluation and non-production testing purposes or to provide feedback to Confluent, particularly as it becomes more widely available in follow-on preview editions.
Early Access Program features are intended for evaluation use in development and testing environments only, and not for production use. Early Access Program features are provided: (a) without support; (b) “AS IS”; and (c) without indemnification, warranty, or condition of any kind. No service level commitment will apply to Early Access Program features. Early Access Program features are considered to be a Proof of Concept as defined in the Confluent Cloud Terms of Service. Confluent may discontinue providing preview releases of the Early Access Program features at any time in Confluent’s sole discretion.
To participate in the Early Access Program, see the sign-up form.
Parameters
The AI_DETECT_PII function accepts the following parameters.
text
Type: STRING
Required: Yes
The text to scan for personally identifiable information. The text must not be null.
config
Type: STRING
Required: No
A JSON object with configuration parameters. The configuration must be constant within a query.
Configuration parameters
You can pass the following configuration parameters in the JSON object for the config parameter.
mode
Type: STRING
Default: detect
The processing mode. Set to 'detect' to identify PII and return structured results, or 'redact' to replace detected PII with redacted values.
model
Type: STRING
Default: piiranha-v1-pii
The model to use for PII detection. The following models are available:
piiranha-v1-pii (aliases: piiranha, piiranha-v1)
en-core-web-lg (alias: presidio)
Note
AI_DETECT_PII uses built-in models hosted in Confluent Cloud. You cannot use your own managed models or remote models from cloud providers with this function.
pii_types
Type: ARRAY<STRING>
Default: All types
The PII types to detect. When specified, the function detects only the listed types. For example, '["PERSON", "EMAIL"]' detects only person names and email addresses.
redaction_strategy
Type: STRING
Default: mask
The strategy to use for redacting detected PII when mode is set to 'redact'. The following strategies are available:
mask
hash
remove
score_threshold
Type: DOUBLE
Default: 0.0
The minimum confidence score for a PII detection to be included in the results. The value must be between 0.0 and 1.0. Detections with scores below this threshold are excluded.
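The configuration parameters above can be combined in a single config value. The following is a minimal sketch, not a verified query: it assumes a customer_messages table with message_id and message columns, like the one used in the Examples section, and it assumes that pii_types is accepted as a nested JSON array built with JSON_ARRAY. The threshold of 0.8 is an arbitrary illustrative value.

```sql
-- Sketch: detect only person names and email addresses,
-- and drop detections with a confidence score below 0.8.
-- Assumption: pii_types is passed as a JSON array inside the config object.
SELECT
  message_id,
  AI_DETECT_PII(
    message,
    JSON_OBJECT(
      'mode' VALUE 'detect',
      'pii_types' VALUE JSON_ARRAY('PERSON', 'EMAIL'),
      'score_threshold' VALUE 0.8
    )
  ) AS pii_result
FROM customer_messages;
```

Because the configuration must be constant within a query, the JSON_OBJECT call here uses only literal values and no column references.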
Output
The AI_DETECT_PII function returns a ROW data type with the following fields.
has_pii
Type: BOOLEAN
TRUE if the text contains detected PII; otherwise FALSE.
entities
Type: ARRAY<ROW<entity_type STRING, text STRING, start INT, end INT, score DOUBLE>>
An array of detected PII entities. Each entry contains the following fields:
entity_type: The type of PII detected (for example, 'PERSON' or 'EMAIL').
text: The matched text.
start: The start index of the match in the input text.
end: The end index of the match in the input text.
score: The confidence score as a double value from 0.0 to 1.0.
redacted_text
Type: STRING
The input text with detected PII replaced according to the redaction_strategy. This field is null in detection mode.
metadata
Type: STRING
A JSON string containing model metadata. This field is null if no metadata is available.
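To work with individual detections, the entities array can be expanded into one row per entity. The following is a hedged sketch, assuming a customer_messages table with message_id and message columns like the one in the Examples section. The column aliases matched_text, start_idx, and end_idx are hypothetical names chosen here to avoid quoting start and end, which are reserved keywords in Flink SQL.

```sql
-- Sketch: flatten the entities array into one row per detected entity.
WITH scanned AS (
  SELECT
    message_id,
    AI_DETECT_PII(message) AS pii_result
  FROM customer_messages
)
SELECT
  s.message_id,
  e.entity_type,
  e.matched_text,
  e.start_idx,
  e.end_idx,
  e.score
FROM scanned AS s
-- UNNEST expands each ROW in the array into columns, renamed positionally.
CROSS JOIN UNNEST(s.pii_result.entities)
  AS e (entity_type, matched_text, start_idx, end_idx, score);
```

With a cross join, a message whose entities array is empty would contribute no rows to the result.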
Examples
The following examples demonstrate how to use the AI_DETECT_PII function.
Detecting PII in customer messages
The following example creates a table for customer messages, inserts sample data, and runs PII detection.
Create a table for customer messages.
CREATE TABLE customer_messages (
  message_id BIGINT,
  message STRING
);
Insert sample messages that contain PII.
INSERT INTO customer_messages VALUES
  (1, 'Customer John Doe, SSN 078-05-1120, contacted us at john.doe@example.com'),
  (2, 'Order confirmed for Jane Smith. Contact: 555-867-5309'),
  (3, 'Shipment tracking update for order #12345');
Run PII detection on the messages.
SELECT
  message_id,
  message,
  AI_DETECT_PII(message) AS pii_result
FROM customer_messages;
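A common follow-up is to keep only the messages in which PII was found. The following sketch filters on the has_pii output field; the subquery alias scanned is a hypothetical name, and which rows pass the filter depends on what the model detects in your data.

```sql
-- Sketch: return only messages flagged as containing PII.
SELECT
  scanned.message_id,
  scanned.message
FROM (
  SELECT
    message_id,
    message,
    AI_DETECT_PII(message) AS pii_result
  FROM customer_messages
) AS scanned
WHERE scanned.pii_result.has_pii;
```

Wrapping the function call in a subquery gives the result a column name, so its fields can be referenced with dot notation in the outer query.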
Redacting PII in text
The following example uses redaction mode to replace detected PII in the redacted_text output field.
SELECT
  message_id,
  message,
  AI_DETECT_PII(
    message,
    JSON_OBJECT(
      'mode' VALUE 'redact',
      'redaction_strategy' VALUE 'mask'
    )
  ) AS pii_result
FROM customer_messages;
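To keep raw PII out of downstream systems, the redacted text can be written to a separate table. The following is a minimal sketch under stated assumptions: the customer_messages_redacted table and its redacted_message column are hypothetical names that do not appear elsewhere in this topic, and the 'hash' strategy is used only to illustrate an alternative to the default 'mask'.

```sql
-- Hypothetical sink table that holds only the redacted text.
CREATE TABLE customer_messages_redacted (
  message_id BIGINT,
  redacted_message STRING
);

-- Sketch: redact each message and write only the redacted_text field.
INSERT INTO customer_messages_redacted
SELECT
  scanned.message_id,
  scanned.pii_result.redacted_text
FROM (
  SELECT
    message_id,
    AI_DETECT_PII(
      message,
      JSON_OBJECT(
        'mode' VALUE 'redact',
        'redaction_strategy' VALUE 'hash'
      )
    ) AS pii_result
  FROM customer_messages
) AS scanned;
```

Selecting only redacted_text, rather than the whole result row, avoids carrying the entities array and its matched text into the sink table.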