Move SQL Statements to Production in Confluent Cloud for Apache Flink
Production Flink SQL statements in Confluent Cloud for Apache Flink® need environment isolation, version-controlled definitions, repeatable deployments, validated watermark and idleness configuration, and least-privilege service-account credentials. The following recommendations cover each of these areas, plus testing, workload separation, join and TTL tuning, encryption, and custom statement naming.
Use separate environments for development, staging, and production
Use Terraform or dbt (data build tool) for repeatable deployments
Separate workloads of different priorities into user-created compute pools
Use separate environments for development, staging, and production
Isolate your workloads by using separate Confluent Cloud environments for each stage of your development lifecycle. Each environment has its own Kafka clusters, compute pools, schemas, and access controls, which prevents development activity from affecting production systems.
Use service accounts with environment-specific RBAC roles to enforce the principle of least privilege. For example, developers can have broad access in the development environment but read-only access in production. For more information, see Organize environments.
Store statements in version control
Treat your Flink SQL statements as code by storing them in a Git repository. Use pull requests to review changes before deploying them. Version control gives you a clear change history, enables code review, and provides rollback capability if a deployment causes issues.
For project structure guidance, see Structure your project.
Use Terraform or dbt (data build tool) for repeatable deployments
Avoid deploying Flink SQL statements manually through the Cloud Console in production. Instead, use an infrastructure-as-code tool to make deployments repeatable and auditable:
Terraform: Manage Confluent Cloud resources and Flink SQL statements together in a single configuration. See Deploy a Flink SQL Statement Using CI/CD and Confluent Cloud for Apache Flink.
dbt: Manage Flink SQL transformations as dbt models with built-in testing and documentation. For more information, see Deploy Flink SQL Statements with dbt and Confluent Cloud for Apache Flink.
Both tools integrate with CI/CD systems such as GitHub Actions to automate deployments after you merge changes. For more information, see Automate with CI/CD.
Test statements before production deployment
Validate your Flink SQL statements in a staging environment before promoting to production:
Use the
--dry-runflag with the Confluent CLI or the REST API to validate statement syntax without executing it.Run statements against staging data with representative volumes to verify correctness and performance.
If you use dbt, run
dbt test --select "test_type:unit"to verify model logic before deployment.
For more information, see Test your statements.
Validate your watermark strategy
When moving your Flink SQL statements to production, it’s crucial to validate your watermark strategy. Watermarks in Flink track the progress of event time and provide a way to trigger time-based operations.
Confluent Cloud for Apache Flink provides a default watermark strategy for all tables, whether Flink creates them automatically from a Kafka topic or from a CREATE TABLE statement. The default watermark strategy applies to the $rowtime system column, which maps to the associated timestamp of a Kafka record. Flink calculates watermarks for this default strategy per Kafka partition, using a fixed out-of-orderness tolerance of 180 milliseconds. No minimum record count is required.
Here are some situations when you need to define your own custom watermark strategy:
When the event time needs to depend on data from the payload and not on the timestamp of the Kafka record.
When your data has out-of-orderness that exceeds the 180ms default tolerance.
Validate or disable idleness handling
One critical aspect to consider when moving your Flink SQL statements to production is the handling of idleness in data streams. If no events arrive within a certain time (timeout duration) on a Kafka partition, Flink marks that partition as idle and excludes it from the watermark calculation until a new event comes in. This situation creates a problem: if some partitions continue to receive events while others are idle, the overall watermark computation, which depends on the minimum across all parallel watermarks, can fall inaccurately behind.
To reduce the impact of idle partitions, Confluent Cloud for Apache Flink forwards the latest event time from each partition before marking it as idle, and then excludes idle partitions from the watermark calculation. Confluent Cloud for Apache Flink dynamically adjusts when it marks partitions idle with Confluent’s progressive idleness feature. The idle-time detection starts at 10 seconds but grows linearly with the age of the statement up to a maximum of 5 minutes. Progressive idleness can cause wrong watermarks if Flink marks a partition as idle too quickly, which can cause the system to move ahead too quickly, impacting data processing. At least one partition must continue to receive new data for watermarks to keep advancing.
When you move your Flink SQL statement into production, make sure that you have validated how you want to handle idleness. You can configure or disable this behavior by using the sql.tables.scan.idle-timeout option.
Choose the correct Schema Registry compatibility type
The Confluent Schema Registry plays a pivotal role in ensuring that the schemas of the data flowing through your Kafka topics are consistent, compatible, and evolve in a controlled manner. One of the key decisions in this process is selecting the appropriate schema compatibility type.
Consider using FULL_TRANSITIVE compatibility to ensure that any new schema is fully compatible with all previous versions of the schema. This comprehensive check minimizes the risk of introducing changes that could disrupt data-processing applications that rely on the data. When choosing any of the other compatibility modes, you need to consider the consequences on currently-running statements, especially because a Flink statement is both a producer and a consumer at the same time.
Separate workloads of different priorities into user-created compute pools
Confluent Cloud for Apache Flink automatically uses a default compute pool for your statements. For most use cases, including development, testing, and many production workloads, the default pool provides adequate resource isolation.
However, if you have statements with significantly different latency and availability requirements, consider creating compute pools manually for workload isolation. All statements that use the same compute pool compete for resources. Although the Confluent Cloud Autopilot aims to provide each statement with the resources it needs, this might not always be possible, in particular when the compute pool exhausts its maximum resources.
To avoid resource contention between critical and non-critical workloads, create separate compute pools manually for different use cases. For example, separate ad hoc exploration queries from mission-critical, long-running production queries. Because statements can affect each other, you should share compute pools only between statements with comparable requirements.
Use event-time temporal joins instead of streaming joins
When processing data streams, choosing the right type of join operation is crucial for efficiency and performance. Event-time temporal joins offer significant advantages over regular streaming joins.
Temporal joins are particularly useful when the join condition depends on a time attribute. They enable you to join a primary stream with a historical version of another table, using the state of that table as it existed at the time of the event. This results in more efficient processing, because it avoids the need to keep large amounts of state in memory. Traditional streaming joins involve keeping a stateful representation of all joined records, which can be inefficient and resource-intensive, especially with large datasets or high-velocity streams. Also, event-time temporal joins typically result in insert-only outputs, when your inputs are also insert-only, which means that after Flink processes and joins a record, it doesn’t update or delete the record later. Streaming joins often need to handle updates and deletions.
When moving to production, prefer using temporal joins wherever applicable to ensure your data processing is efficient and performant. Avoid traditional streaming joins unless necessary, because they can lead to increased resource consumption and complexity.
Implement state time-to-live (TTL)
Some stateful operations in Flink require storing state, such as streaming joins and pattern matching. Managing this state effectively is crucial for application performance, resource optimization, and cost reduction. The state time-to-live (TTL) feature enables specifying a minimum time interval for retaining state that doesn’t receive updates. This mechanism ensures that Flink clears state at some time after the idle duration. When moving to production, you should configure the sql.state-ttl setting carefully to balance performance versus correctness of the results.
Use service account API keys for production
You can create API keys for Confluent Cloud with user accounts and service accounts. A service account provides an identity for an application or service that needs to perform programmatic operations within Confluent Cloud. When moving to production, ensure that you use only service account API keys. Avoid user account API keys, except for development and testing. If a user leaves and you delete the user account, Confluent Cloud also deletes all API keys for that user account, and applications might break.
Follow security best practices for encrypted data
When working with Client-Side Field Level Encryption or Client-Side Payload Encryption in production, follow these security best practices to protect your encrypted data and maintain encryption effectiveness.
Restrict schema write permissions
Never grant Flink SQL users write permissions on schemas that contain encrypted fields. Users with schema write permissions can modify encryption rules to write data in plaintext, bypassing the encryption entirely. This can expose sensitive data that should remain encrypted.
Grant DeveloperWrite permission on Key Encryption Keys only to Flink SQL principals that need to produce encrypted data for the first time, because the initial write generates the Data Encryption Key. After the first write, DeveloperRead permission is sufficient for both reading and writing encrypted data. For more information about permissions, see Data access layer.
Configure Data Encryption Key expiry
Set an appropriate expiry time for Data Encryption Keys to ensure regular key rotation. Set a minimum expiry of 1 day for production workloads. Regular key rotation limits the exposure window if a key is compromised.
Configure the encrypt.dek.expiry.days property when creating encryption rules in Schema Registry. Data Encryption Keys refresh automatically every 5 minutes during statement execution. For more information about key rotation, see Protect Sensitive Data Using Client-Side Field Level Encryption on Confluent Cloud.
Monitor performance with encryption
Encryption and decryption operations add processing overhead to your Flink SQL statements. Monitor your statement performance to ensure that encryption does not negatively impact throughput or latency requirements.
Consider these performance factors:
Transparent decryption requires CPU cycles for decryption operations.
Deterministic encryption avoids decryption overhead but limits available operations.
Encrypted fields in state increase memory usage.
For more information about processing encrypted data, see Process Encrypted Data with Confluent Cloud for Apache Flink.
Assign custom names to Flink SQL statements
Custom naming facilitates management, monitoring, and debugging of your streaming applications by providing clear, identifiable references to specific operations or data flows. You can set custom names by using the client.statement-name option.
Review error handling and monitoring best practices
Review these topics: