Detect PII in Text with Confluent Cloud for Apache Flink

You can detect and redact personally identifiable information (PII) directly in your streaming data. Instead of building custom detection logic or routing data to external services, you call a single SQL function as data flows through your streams.

For example, when customer messages, support tickets, or application logs contain Social Security numbers, email addresses, or phone numbers, you can detect this sensitive information and redact it before the data reaches downstream systems.

Confluent Cloud for Apache Flink® provides PII detection on your streams through the following function:

AI_DETECT_PII (Early Access Program): Detect and redact personally identifiable information in text.

AI_DETECT_PII

You can use the AI_DETECT_PII function to scan text for personally identifiable information. By default, the function uses Piiranha, a built-in model hosted in Confluent Cloud, to identify PII in your data.

AI_DETECT_PII supports two modes:

Detection mode (default): Identifies PII in the text and returns structured results with PII types and locations.
Redaction mode: Replaces detected PII with redacted values.

Note

The AI_DETECT_PII function is an Early Access Program feature in Confluent Cloud.

An Early Access feature is a component of Confluent Cloud introduced to gain feedback. This feature should be used only for evaluation and non-production testing purposes or to provide feedback to Confluent, particularly as it becomes more widely available in follow-on preview editions.

Early Access Program features are intended for evaluation use in development and testing environments only, and not for production use. Early Access Program features are provided: (a) without support; (b) “AS IS”; and (c) without indemnification, warranty, or condition of any kind. No service level commitment will apply to Early Access Program features. Early Access Program features are considered to be a Proof of Concept as defined in the Confluent Cloud Terms of Service. Confluent may discontinue providing preview releases of the Early Access Program features at any time in Confluent’s sole discretion.

To participate in the Early Access Program, see the sign-up form.

Parameters

The AI_DETECT_PII function accepts the following parameters.

text

Type: STRING
Required: Yes

The text to scan for personally identifiable information. The text must not be null.
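
Because the text must not be null, one way to keep null values away from the function is to filter them out first. The following is a minimal sketch that assumes the customer_messages table from the Examples section. How the function behaves on a null input is not described here, so treat the filter as a precaution.

```sql
-- Sketch: skip rows whose message is NULL before scanning for PII.
-- Assumes the customer_messages table from the Examples section.
SELECT
    message_id,
    AI_DETECT_PII(message) AS pii_result
FROM customer_messages
WHERE message IS NOT NULL;
```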

config

Type: STRING
Required: No

A JSON object with configuration parameters. The configuration must be constant within a query.

Configuration parameters

You can pass the following configuration parameters in the JSON object for the config parameter.

mode

Type: STRING
Default: detect

The processing mode. Set to 'detect' to identify PII and return structured results, or 'redact' to replace detected PII with redacted values.

model

Type: STRING
Default: piiranha-v1-pii

The model to use for PII detection. The following models are available:

piiranha-v1-pii (aliases: piiranha, piiranha-v1)
en-core-web-lg (alias: presidio)

Note

AI_DETECT_PII uses built-in models hosted in Confluent Cloud. You cannot use your own managed models or remote models from cloud providers with this function.
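
As an illustration, the model can be selected by passing its name or one of its aliases in the config parameter. This sketch assumes the customer_messages table from the Examples section and uses the presidio alias listed above.

```sql
-- Sketch: select the en-core-web-lg model through its 'presidio' alias.
-- Assumes the customer_messages table from the Examples section.
SELECT
    message_id,
    AI_DETECT_PII(
        message,
        JSON_OBJECT('model' VALUE 'presidio')
    ) AS pii_result
FROM customer_messages;
```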

pii_types

Type: ARRAY<STRING>
Default: All types

The PII types to detect. When specified, the function detects only the listed types. For example, '["PERSON", "EMAIL"]' detects only person names and email addresses.
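
The following sketch restricts detection to person names and email addresses. It assumes the customer_messages table from the Examples section, and it assumes that the list of types can be supplied as a nested JSON array built with JSON_ARRAY. If the function expects a different encoding for pii_types, adjust the config accordingly.

```sql
-- Sketch: detect only PERSON and EMAIL entities.
-- Assumption: pii_types is passed as a JSON array inside the config object.
SELECT
    message_id,
    AI_DETECT_PII(
        message,
        JSON_OBJECT(
            'pii_types' VALUE JSON_ARRAY('PERSON', 'EMAIL')
        )
    ) AS pii_result
FROM customer_messages;
```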

redaction_strategy

Type: STRING
Default: mask

The strategy to use for redacting detected PII when mode is set to 'redact'. The following strategies are available:

mask
hash
remove
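
For illustration, a non-default strategy is chosen by setting redaction_strategy alongside mode in the config. This sketch assumes the customer_messages table from the Examples section. The aliases scanned and pii_result are names chosen for this example only.

```sql
-- Sketch: redact with the 'hash' strategy and return only the redacted text.
-- Assumes the customer_messages table from the Examples section.
SELECT
    message_id,
    pii_result.redacted_text AS redacted_message
FROM (
    SELECT
        message_id,
        AI_DETECT_PII(
            message,
            JSON_OBJECT(
                'mode' VALUE 'redact',
                'redaction_strategy' VALUE 'hash'
            )
        ) AS pii_result
    FROM customer_messages
) AS scanned;
```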

score_threshold

Type: DOUBLE
Default: 0.0

The minimum confidence score for a PII detection to be included in the results. The value must be between 0.0 and 1.0. Detections with scores below this threshold are excluded.
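
As a sketch, the following query keeps only detections that score at least 0.8. The threshold value is arbitrary and chosen for illustration, and the query assumes the customer_messages table from the Examples section.

```sql
-- Sketch: exclude detections with a confidence score below 0.8.
-- The 0.8 value is illustrative, not a recommended setting.
SELECT
    message_id,
    AI_DETECT_PII(
        message,
        JSON_OBJECT('score_threshold' VALUE 0.8)
    ) AS pii_result
FROM customer_messages;
```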

Output

The AI_DETECT_PII function returns a ROW data type with the following fields.

has_pii

Type: BOOLEAN

TRUE if the text contains detected PII; otherwise FALSE.
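
One possible use of has_pii is to keep only the rows in which PII was detected. This sketch assumes the customer_messages table from the Examples section. The aliases scanned and pii_result are names chosen for this example.

```sql
-- Sketch: return only the messages in which PII was detected.
SELECT
    message_id,
    message
FROM (
    SELECT
        message_id,
        message,
        AI_DETECT_PII(message) AS pii_result
    FROM customer_messages
) AS scanned
WHERE pii_result.has_pii;
```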

entities

Type: ARRAY<ROW<entity_type STRING, text STRING, start INT, end INT, score DOUBLE>>

An array of detected PII entities. Each entry contains the following fields:

entity_type: The type of PII detected (for example, 'PERSON' or 'EMAIL').
text: The matched text.
start: The start index of the match in the input text.
end: The end index of the match in the input text.
score: The confidence score as a double value from 0.0 to 1.0.
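
To work with individual detections, the entities array can be expanded into one row per entity. The following sketch uses CROSS JOIN UNNEST for this and assumes the customer_messages table from the Examples section. The aliases scanned and e are hypothetical, and the field names are quoted with backticks because start and end are reserved keywords in Flink SQL.

```sql
-- Sketch: produce one output row per detected PII entity.
SELECT
    scanned.message_id,
    e.entity_type,
    e.`text`,
    e.`start`,
    e.`end`,
    e.score
FROM (
    SELECT
        message_id,
        AI_DETECT_PII(message) AS pii_result
    FROM customer_messages
) AS scanned
CROSS JOIN UNNEST(scanned.pii_result.entities)
    AS e (entity_type, `text`, `start`, `end`, score);
```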

redacted_text

Type: STRING

The input text with detected PII replaced according to the redaction_strategy. This field is null in detection mode.

metadata

Type: STRING

A JSON string containing model metadata. This field is null if no metadata is available.

Examples

The following examples demonstrate how to use the AI_DETECT_PII function.

Detecting PII in customer messages

The following example creates a table for customer messages, inserts sample data, and runs PII detection.

Create a table for customer messages.

CREATE TABLE customer_messages (
    message_id BIGINT,
    message STRING
);

Insert sample messages that contain PII.

INSERT INTO customer_messages VALUES
    (1, 'Customer John Doe, SSN 078-05-1120, contacted us at john.doe@example.com'),
    (2, 'Order confirmed for Jane Smith. Contact: 555-867-5309'),
    (3, 'Shipment tracking update for order #12345');

Run PII detection on the messages.

SELECT
    message_id,
    message,
    AI_DETECT_PII(message) AS pii_result
FROM customer_messages;

Redacting PII in text

The following example uses redaction mode to replace detected PII. The redacted result is returned in the redacted_text output field.

SELECT
    message_id,
    message,
    AI_DETECT_PII(
        message,
        JSON_OBJECT(
            'mode' VALUE 'redact',
            'redaction_strategy' VALUE 'mask'
        )
    ) AS pii_result
FROM customer_messages;
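
To keep raw PII out of downstream systems, the redacted text can be written to a separate table. The following is a minimal sketch under stated assumptions: the redacted_messages table and its column names are hypothetical and are not part of the function's interface.

```sql
-- Sketch: write only redacted text to a sink table.
-- redacted_messages is a hypothetical table created for this example.
CREATE TABLE redacted_messages (
    message_id BIGINT,
    redacted_message STRING
);

INSERT INTO redacted_messages
SELECT
    message_id,
    pii_result.redacted_text
FROM (
    SELECT
        message_id,
        AI_DETECT_PII(
            message,
            JSON_OBJECT('mode' VALUE 'redact')
        ) AS pii_result
    FROM customer_messages
) AS scanned;
```

Because redaction_strategy is not set here, the default mask strategy applies.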