Move SQL Statements to Production in Confluent Cloud for Apache Flink
When you move your Flink SQL statements to production in Confluent Cloud for Apache Flink®, consider the following recommendations and best practices.
Use separate environments for development, staging, and production
Use Terraform or dbt (data build tool) for repeatable deployments
Separate workloads of different priorities into user-created compute pools
Use separate environments for development, staging, and production
Isolate your workloads by using separate Confluent Cloud environments for each stage of your development lifecycle. Each environment has its own Kafka clusters, compute pools, schemas, and access controls, which prevents development activity from affecting production systems.
Use service accounts with environment-specific RBAC roles to enforce the principle of least privilege. For example, developers may have broad access in the development environment but read-only access in production. For more information, see Organize environments.
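As an illustrative sketch, environment-scoped roles can be granted with the Confluent Terraform provider's confluent_role_binding resource. The principal ID and the data source reference below are placeholders, not values from this document:

```terraform
# Hypothetical sketch: grant a developer service account read-only
# access scoped to the production environment. IDs are placeholders.
resource "confluent_role_binding" "dev_read_only_prod" {
  principal   = "User:${var.developer_service_account_id}"
  role_name   = "DeveloperRead"
  crn_pattern = data.confluent_environment.prod.resource_name
}
```

A parallel role binding in the development environment can grant a broader role, keeping each environment's permissions independent.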
Store statements in version control
Treat your Flink SQL statements as code by storing them in a Git repository. Use pull requests to review changes before they are deployed. Version control gives you a clear change history, enables code review, and provides rollback capability if a deployment causes issues.
For project structure guidance, see Structure your project.
Use Terraform or dbt (data build tool) for repeatable deployments
Avoid deploying Flink SQL statements manually through the Cloud Console in production. Instead, use an infrastructure-as-code tool to make deployments repeatable and auditable:
Terraform: Manage Confluent Cloud resources and Flink SQL statements together in a single configuration. See Deploy a Flink SQL Statement Using CI/CD and Confluent Cloud for Apache Flink.
dbt: Manage Flink SQL transformations as dbt models with built-in testing and documentation. For more information, see Deploy Flink SQL Statements with dbt and Confluent Cloud for Apache Flink.
Both tools integrate with CI/CD systems like GitHub Actions to automate deployments when changes are merged. For more information, see Automate with CI/CD.
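As a sketch of the Terraform approach, the Confluent Terraform provider exposes a confluent_flink_statement resource; the variable names, file path, and statement name below are placeholders:

```terraform
# Hypothetical sketch: declare a Flink SQL statement as code so that
# CI/CD can plan and apply it. All IDs and names are placeholders.
resource "confluent_flink_statement" "orders_enrichment" {
  organization {
    id = var.organization_id
  }
  environment {
    id = var.environment_id        # for example, the production environment
  }
  compute_pool {
    id = var.compute_pool_id
  }
  principal {
    id = var.service_account_id    # run the statement as a service account
  }

  statement      = file("statements/orders_enrichment.sql")
  statement_name = "orders-enrichment"

  rest_endpoint = var.flink_rest_endpoint
  credentials {
    key    = var.flink_api_key
    secret = var.flink_api_secret
  }
}
```

Keeping the SQL in a separate file under version control lets pull-request reviews focus on the query logic while Terraform tracks the deployment state.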
Test statements before production deployment
Validate your Flink SQL statements in a staging environment before promoting to production:
Use the --dry-run flag with the Confluent CLI or the REST API to validate statement syntax without executing it.
Run statements against staging data with representative volumes to verify correctness and performance.
If you use dbt, run dbt test --select "test_type:unit" to verify model logic before deployment.
For more information, see Test your statements.
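As a hedged illustration of a dry run from the Confluent CLI (the statement name, SQL file, and compute pool ID are placeholders, and exact flag names may vary by CLI version):

```shell
# Illustrative only: validate syntax without executing the statement.
confluent flink statement create orders-enrichment-check \
  --sql "$(cat statements/orders_enrichment.sql)" \
  --compute-pool lfcp-xxxxxx \
  --dry-run
```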
Validate your watermark strategy
When moving your Flink SQL statements to production, it’s crucial to validate your watermark strategy. Watermarks in Flink track the progress of event time and provide a way to trigger time-based operations.
Confluent Cloud for Apache Flink provides a default watermark strategy for all tables, whether they’re created automatically from a Kafka topic or from a CREATE TABLE statement. The default watermark strategy is applied on the $rowtime system column, which is mapped to the associated timestamp of a Kafka record. Watermarks for this default strategy are calculated per Kafka partition, using a fixed out-of-orderness tolerance of 180 milliseconds. No minimum record count is required.
Here are some situations when you need to define your own custom watermark strategy:
When the event time needs to be based on data from the payload and not the timestamp of the Kafka record.
When your data has out-of-orderness that exceeds the 180ms default tolerance.
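Both situations can be addressed with standard Flink SQL watermark syntax. In this hypothetical example, the orders table and its order_time payload column are placeholders; the statement switches event time from $rowtime to the payload column and allows up to 10 seconds of out-of-orderness:

```sql
-- Hypothetical table: use the order_time payload column as event time,
-- tolerating up to 10 seconds of out-of-orderness.
ALTER TABLE orders
  MODIFY WATERMARK FOR order_time AS order_time - INTERVAL '10' SECOND;
```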
Validate or disable idleness handling
One critical aspect to consider when moving your Flink SQL statements to production is how idleness in data streams is handled. If some partitions continue to receive events while others do not, the overall watermark, which is computed as the minimum across all parallel watermarks, can be held back inaccurately. To limit this, if no events arrive on a Kafka partition within a certain time (the timeout duration), that partition is marked as idle and stops contributing to the watermark calculation until a new event arrives.
To reduce the impact of idle partitions, Confluent Cloud for Apache Flink forwards the latest event time from each partition before marking it as idle, and then excludes idle partitions from the watermark calculation. With Confluent's progressive idleness feature, the timeout for marking a partition idle is adjusted dynamically: detection starts at 10 seconds and grows linearly with the age of the statement, up to a maximum of 5 minutes. If a partition is marked idle too quickly, progressive idleness can produce incorrect watermarks, causing the system to advance ahead of the data and impacting processing. Keep in mind that at least one partition must continue to receive new data for watermarks to keep advancing.
When you move your Flink SQL statement into production, make sure that you have validated how you want to handle idleness. You can configure or disable this behavior by using the sql.tables.scan.idle-timeout option.
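For example, the option can be set per statement before submission; the values below are placeholders, and this sketch assumes that a value of 0 disables idleness detection:

```sql
-- Placeholder value: use a fixed timeout instead of progressive idleness.
SET 'sql.tables.scan.idle-timeout' = '5 min';

-- Or disable idleness detection entirely (assumed behavior for 0):
SET 'sql.tables.scan.idle-timeout' = '0';
```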
Choose the correct Schema Registry compatibility type
The Confluent Schema Registry plays a pivotal role in ensuring that the schemas of the data flowing through your Kafka topics are consistent, compatible, and evolve in a controlled manner. One of the key decisions in this process is selecting the appropriate schema compatibility type.
Consider using FULL_TRANSITIVE compatibility to ensure that any new schema is fully compatible with all previous versions of the schema. This comprehensive check minimizes the risk of introducing changes that could disrupt data-processing applications relying on the data. If you choose any of the other compatibility modes, consider the consequences for currently running statements, especially because a Flink statement is both a producer and a consumer at the same time.
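As a hedged illustration using the Confluent CLI (the subject name is a placeholder, and flag spellings may vary by CLI version):

```shell
# Illustrative only: set subject-level compatibility so that every new
# schema version is checked against all previous versions.
confluent schema-registry subject update orders-value \
  --compatibility full_transitive
```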
Separate workloads of different priorities into user-created compute pools
Confluent Cloud for Apache Flink automatically uses a default compute pool for your statements. For most use cases, including development, testing, and many production workloads, the default pool provides adequate resource isolation.
However, if you have statements with significantly different latency and availability requirements, consider creating compute pools manually for workload isolation. All statements using the same compute pool compete for resources. Although the Confluent Cloud Autopilot aims to provide each statement with the resources it needs, this may not always be possible, in particular, when the maximum resources of the compute pool are exhausted.
To avoid resource contention between critical and non-critical workloads, create separate compute pools manually for different use cases. For example, separate ad-hoc exploration queries from mission-critical, long-running production queries. Because statements may affect each other, you should share compute pools only between statements with comparable requirements.
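As a sketch, separate pools can be declared with the Confluent Terraform provider's confluent_flink_compute_pool resource; the names, region, and CFU limit below are placeholders:

```terraform
# Hypothetical sketch: a dedicated pool for mission-critical statements,
# isolated from ad-hoc exploration. All values are placeholders.
resource "confluent_flink_compute_pool" "prod_critical" {
  display_name = "prod-critical"
  cloud        = "AWS"
  region       = "us-east-1"
  max_cfu      = 20
  environment {
    id = var.environment_id
  }
}
```

An analogous pool such as "prod-adhoc" with its own max_cfu keeps exploratory queries from consuming resources that critical statements need.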
Use event-time temporal joins instead of streaming joins
When processing data streams, choosing the right type of join operation is crucial for efficiency and performance. Event-time temporal joins offer significant advantages over regular streaming joins.
Temporal joins are particularly useful when the join condition is based on a time attribute. They enable you to join a primary stream with a versioned table, using the state of that table as it existed at the time of the event. This makes processing more efficient, because it avoids keeping large amounts of state in memory. Traditional streaming joins keep a stateful representation of all joined records, which can be inefficient and resource-intensive, especially with large datasets or high-velocity streams. Also, when your inputs are insert-only, event-time temporal joins typically produce insert-only output, which means that once a record is processed and joined, it is not updated or deleted later. Streaming joins, by contrast, often need to handle updates and deletions.
When moving to production, prefer using temporal joins wherever applicable to ensure your data processing is efficient and performant. Avoid traditional streaming joins unless necessary, as they can lead to increased resource consumption and complexity.
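As an illustration with hypothetical tables, the following event-time temporal join enriches each order with the exchange rate that was valid at the order's event time; it assumes exchange_rates is a versioned table with a primary key on currency and an event-time attribute:

```sql
-- Hypothetical tables: join each order against the exchange rate
-- as of the order's own event time, not the latest rate.
SELECT
  o.order_id,
  o.amount * r.rate AS amount_eur
FROM orders o
JOIN exchange_rates FOR SYSTEM_TIME AS OF o.`$rowtime` AS r
  ON o.currency = r.currency;
```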
Implement state time-to-live (TTL)
Some stateful operations in Flink, like streaming joins and pattern matching, require storing state. Managing this state effectively is crucial for application performance, resource optimization, and cost reduction. The state time-to-live (TTL) feature lets you specify a minimum time interval for how long idle state, meaning state that is not updated, is retained. This mechanism ensures that state is cleared at some point after its idle duration has elapsed. When moving to production, configure the sql.state-ttl setting carefully to balance performance against correctness of the results.
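For example, the setting can be applied per statement before submission; the duration below is a placeholder to be tuned for your workload:

```sql
-- Placeholder value: allow state that has not been updated for at
-- least one hour to be cleared.
SET 'sql.state-ttl' = '1 hour';
```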
Use service account API keys for production
API keys for Confluent Cloud can be created with user accounts and service accounts. A service account is intended to provide an identity for an application or service that needs to perform programmatic operations within Confluent Cloud. When moving to production, ensure that only service account API keys are used. Avoid user account API keys, except for development and testing. If a user leaves and a user account is deleted, all API keys created with that user account are deleted, and applications might break.
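As a hedged illustration with the Confluent CLI (the account name, description, and resource IDs are placeholders):

```shell
# Illustrative only: create a service account, then an API key that
# the service account owns, scoped to a specific resource.
confluent iam service-account create "flink-prod" \
  --description "Runs production Flink statements"

confluent api-key create \
  --service-account sa-xxxxxx \
  --resource lkc-xxxxxx
```

Because the key is owned by the service account, it survives the departure of any individual user.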
Assign custom names to Flink SQL statements
Custom names provide clear, identifiable references to specific operations or data flows, which makes it easier to manage, monitor, and debug your streaming applications. Assign a custom name to a statement by setting the client.statement-name option.
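For example, set the option before submitting the statement; the name below is a placeholder following a hypothetical naming convention:

```sql
-- Placeholder name: include the workload and a version for traceability.
SET 'client.statement-name' = 'orders-enrichment-prod-v1';
```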
Review error handling and monitoring best practices
Review these topics: