Secure Stream Processing: Query Encrypted Data with Flink on Confluent Cloud

Processing sensitive data like personally identifiable information (PII) or financial records in real-time data streams presents a significant challenge. How do you perform meaningful operations like filtering, joining, or aggregating data while it remains fully encrypted and secure?

Traditionally, you couldn’t. But with Client-Side Field Level Encryption (CSFLE) and deterministic encryption, Confluent Cloud for Apache Flink lets you query and process encrypted data streams directly, unlocking critical use cases while preserving your data’s privacy and compliance.

This powerful combination allows you to leverage the full capabilities of stream processing while your sensitive data remains protected from start to finish.

What is deterministic encryption?

At the core of this capability is deterministic encryption, a method where encrypting the same plaintext value with the same key always produces the exact same ciphertext. This property is what allows Flink to process the encrypted ciphertext directly, effectively performing equality comparisons, joins, and groupings on the original data without ever needing to decrypt it.

Supported operations on encrypted data

Flink does not decrypt the data. It operates on the raw bytes of the ciphertext that CSFLE produces, and because the encryption is deterministic, you can:

  • Process non-encrypted fields in your data stream without any limitations.
  • Run powerful SQL queries that operate directly on your encrypted fields.

Here are some of the key operations possible on deterministically encrypted columns:

  • Filtering and equality: Use encrypted fields in WHERE clauses for exact matches.
  • Grouping and aggregation: Perform GROUP BY operations on encrypted fields. The only aggregation functions that work correctly are those that rely solely on equality comparison, such as COUNT and COUNT(DISTINCT).
  • Joins: Join multiple streams together using an encrypted column (for example, joining a stream of user activity to a stream of user profiles on an encrypted user ID).
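  • Window functions: Use window functions like LEAD and LAG, which return the encrypted value from a neighboring row so that you can compare it for equality with the current row's value.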

Example: Query on an encrypted column

Suppose you want to count the number of distinct active users, using the deterministically encrypted email field to identify each user:

SELECT COUNT(DISTINCT email_encrypted)
FROM users_stream
WHERE status = 'ACTIVE';

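This example shows a common use case: counting unique, active users. Because the encryption is deterministic, Flink can count the distinct values of the email_encrypted field without decrypting them. The WHERE clause compares status against a plaintext literal, so the query assumes that status is not an encrypted field.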
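
The join case described earlier follows the same pattern. The query below is a minimal sketch, not a reference implementation: the table names (user_activity, user_profiles) and column names (user_id_encrypted, event_type, plan_tier) are hypothetical. It assumes that the user ID field in both streams is deterministically encrypted with the same key, because only then do equal plaintext values produce equal ciphertext.

```sql
-- Hypothetical tables and columns, for illustration only.
-- Assumes user_id_encrypted uses the same deterministic key in both streams.
SELECT
  a.user_id_encrypted,   -- stays encrypted in the result
  a.event_type,          -- non-encrypted field
  p.plan_tier            -- non-encrypted field
FROM user_activity AS a
JOIN user_profiles AS p
  ON a.user_id_encrypted = p.user_id_encrypted;  -- equality on ciphertext
```

The join condition only tests ciphertext for equality, so the user ID in the output is still encrypted and can be read only by a client that holds the decryption key.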

Important limitations and trade-offs

This capability comes with two important considerations you must understand:

  1. Limited aggregation functions: Because Flink never sees the plaintext, operations that depend on a value’s magnitude or ordering do not produce correct results. Aggregation functions like SUM, AVG, MIN, and MAX execute but yield erroneous values.
  2. The deterministic trade-off: Deterministic encryption inherently reveals when two encrypted values are identical. This is a necessary trade-off that enables querying, but anyone who can see the ciphertext can learn which records share a value and how often each value occurs. This makes frequency analysis possible, particularly on fields with few distinct values. You should carefully consider this when deciding which fields to encrypt deterministically.
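
One way to work within the first limitation is to group by the encrypted field and apply numeric aggregations only to non-encrypted fields. The following is a sketch under stated assumptions: the orders table and its columns are hypothetical, customer_id_encrypted is assumed to be deterministically encrypted, and order_total is assumed to be a plaintext numeric field.

```sql
-- Hypothetical table and columns, for illustration only.
SELECT
  customer_id_encrypted,           -- grouping key: equality on ciphertext
  COUNT(*)         AS order_count, -- counting does not depend on the plaintext
  SUM(order_total) AS total_spent  -- valid only because order_total is not encrypted
FROM orders
GROUP BY customer_id_encrypted;
```

If order_total were itself encrypted, the SUM would run into the limitation above. A field that needs arithmetic in Flink has to stay unencrypted, which is part of the decision about which fields to protect.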

How it works

When you use CSFLE with Flink on Confluent Cloud, the security of your data is maintained because decryption happens only when the data is read from a sink (like a database or materialized view) by an authorized client application that holds the decryption keys. Flink processes the data without ever having access to the plaintext. As a result, a compromise within the processing environment exposes only ciphertext, not the sensitive plaintext values.

Powered by Google Tink

The CSFLE implementation uses the open-source Google Tink cryptographic library to perform deterministic encryption with the AES256_SIV algorithm. For more information on Google Tink, see the following: