Monitor and Manage Flink SQL Statements in Confluent Cloud for Apache Flink¶
You start a stream-processing app on Confluent Cloud for Apache Flink® by running a SQL statement. Once a statement is running, you can monitor its progress by using the Confluent Cloud Console. Also, you can set up integrations with monitoring services like Prometheus and Datadog.
View and monitor statements in Cloud Console¶
Cloud Console shows details about your statements on the Flink page.
If you don’t currently have any running statements, run a SQL statement like the INSERT INTO … SELECT example below in the Flink SQL shell or in a workspace.
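For example, a minimal long-running statement might look like the following sketch. The orders and orders_eur table names and their columns are hypothetical; substitute tables that exist in your environment.

```sql
-- Hypothetical example: continuously copy and convert orders into a
-- second table. Both table names and all columns are placeholders.
INSERT INTO orders_eur
SELECT
  order_id,
  customer_id,
  price * 0.92 AS price_eur  -- example conversion rate
FROM orders;
```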
Log in to the Confluent Cloud Console.
Navigate to the Environments page.
Click the tile that has the environment where your Flink compute pools are provisioned.
Click Flink, and in the Flink page, click Flink statements.
The Statements list opens.
You can use the Filter options on the page to identify the statements you want to view.
The following information is available in the Flink statements table to help you monitor your statements.
- Flink Statement Name: The name of the statement. The name is populated automatically when a statement is submitted. You can set the name by using the SET command, as shown in the example at the end of this section.
- Status: Represents what is currently happening with the statement. These are the status values:
  - Pending: The statement has been submitted and Flink is preparing to start running it.
  - Running: Flink is actively running the statement.
  - Completed: The statement has completed all of its work.
  - Deleting: The statement is being deleted.
  - Failed: The statement has encountered an error and is no longer running.
  - Degraded: The statement appears unhealthy, for example, no transactions have been committed for a long time, or the statement has restarted frequently in the recent past.
  - Stopping: The statement is about to be stopped.
  - Stopped: The statement has been stopped and is no longer running.
- Statement Type: The type of SQL operation used in the statement.
- Created: Indicates when the statement started running. If you stop and resume the statement, the Created date shows when the statement was first submitted.
- Messages Behind: The consumer lag of the statement, with an indicator of whether the back pressure is increasing, decreasing, or holding at a stable rate. Ideally, the Messages Behind metric should be as close to zero as possible: a low, close-to-zero consumer lag is the best indicator that your statement is running smoothly and keeping up with all of its inputs, while a growing consumer lag indicates there is a problem.
- Messages in: The count of messages in per minute, which represents the rate at which records are read. A watermark for the messages read is also shown; the watermark displayed in the Flink statements table is the minimum watermark across the sources in the query.
- Messages out: The count of messages out per minute, which represents the rate at which records are written. A watermark for the messages written is also shown; the watermark displayed in the Flink statements table is the minimum watermark across the sinks in the query.
- Account: The name of the user account or service account that the statement runs with.

When you click a statement, a detailed side panel opens. The panel provides information on the statement at a more granular level, showing how messages are read from sources and written to sinks. The watermarks for each individual source and sink table are shown in this panel, along with the statement’s catalog, database, local time zone, and Scaling status.
The SQL Content section shows the code used to generate the statement.
The panel also contains interactive graphs of the statement’s performance over time. There are charts for # Messages behind, Messages in per minute, and Messages out per minute.
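As noted in the table above, you can set a statement’s name with the SET command before submitting the statement. The following is a minimal sketch, assuming the client.statement-name property and the hypothetical tables from the earlier example; verify the property name in your environment.

```sql
-- Assign a recognizable statement name before submitting the statement.
-- 'client.statement-name' is an assumption here; check your environment's
-- supported SET properties. Table names are placeholders.
SET 'client.statement-name' = 'orders-eur-copy';

INSERT INTO orders_eur
SELECT order_id, customer_id, price * 0.92 AS price_eur
FROM orders;
```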
Manage statements in Cloud Console¶
Cloud Console gives you actions to manage your statements on the Flink page.
In the statement list, click the checkbox next to one of your statements to select it.
Click Actions.
A menu opens, showing options for managing the statement’s status. You can select Stop statement, Resume statement, or Delete statement.
Flink metrics integrations¶
Confluent Cloud for Apache Flink supports metrics integrations with services like Prometheus and Datadog.
If you don’t currently have any running statements, run a SQL statement like INSERT INTO … SELECT in the Flink SQL shell or in a workspace.
Log in to the Confluent Cloud Console.
Open the Administration menu and select Metrics to open the Metrics integration page.
In the Explore available metrics section, click the Metric dropdown.
Scroll until you find the Flink compute pool and Flink statement metrics, for example, Messages behind. This list doesn’t include all available metrics. For a full list of available metrics, see Metrics API Reference.
Click the Resource dropdown and select the corresponding compute pool or statement that you want to monitor.
A graph showing the most recent data for your selected Flink metric displays.
Click New integration to export your metrics to a monitoring service. For more information, see Integrate with third-party monitoring.
Error handling and recovery¶
Confluent Cloud for Apache Flink classifies exceptions that occur during the runtime of a statement into two categories: USER and SYSTEM exceptions.

- USER: Exceptions are classified as USER if they fall into the user’s responsibility. Examples include deserialization or arithmetic exceptions. Usually, the root cause is related to the data or the query. USER exceptions are forwarded to the user in Statement.status.statusDetails.
- SYSTEM: Exceptions are classified as SYSTEM if they fall into Confluent’s responsibility. Examples include exceptions during checkpointing or networking. Usually, the root cause is related to the infrastructure.
Furthermore, Confluent Cloud for Apache Flink classifies exceptions as “recoverable” (or “transient”) or “non-recoverable” (or “permanent”). SYSTEM exceptions are always classified as recoverable. Usually, USER exceptions are classified as non-recoverable. For example, a division-by-zero or a deserialization exception can’t be solved by restarting the underlying Flink job, because the same input message is replayed and leads to the same exception again. Some USER exceptions are classified as recoverable, for example, the deletion of a statement’s input or output topic, or the deletion of the access rights to these topics.
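As an illustration, the following sketch, with hypothetical table and column names, fails with a non-recoverable USER exception as soon as a row with quantity equal to zero arrives.

```sql
-- Hypothetical statement that hits an arithmetic (division-by-zero)
-- exception on bad input. Restarting the job replays the same message
-- and raises the same exception again, so the failure is non-recoverable.
INSERT INTO unit_prices
SELECT
  order_id,
  total_price / quantity AS unit_price  -- fails when quantity = 0
FROM orders;
```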
If a non-recoverable exception occurs, the Flink statement moves into the FAILED state, and the underlying Flink job is cancelled. FAILED statements do not consume any CFUs. FAILED statements can be resumed, like STOPPED statements, with exactly-once semantics, but in most cases, some change to the query or data is required so that the statement doesn’t transition immediately into the FAILED state again. For more information on the available options for evolving statements, see Schema and Statement Evolution.
Note
Confluent is actively working on additional options for handling non-recoverable exceptions, like skipping the offending message or sending it to a dead-letter-queue automatically. If you’re interested in providing feedback or feature requests, contact Support or your account manager.
Degraded statements¶
If a recoverable exception occurs, the statement stays in the RUNNING state, and the underlying Flink job is restarted. If the job is restarted repeatedly or is not recovering within a short period of time, the statement moves to the DEGRADED state. DEGRADED statements continue to consume CFUs. If the DEGRADED state is caused by a USER exception, this is shown in Statement.status.statusDetails. If no exception is shown in Statement.status.statusDetails, the DEGRADED state is caused by a SYSTEM exception. In this case, contact Support.
A statement transitions from RUNNING to DEGRADED when either of these conditions is met:
- There is no successful checkpoint within 10 minutes, or
- A failure occurs, and a second failure occurs 1 minute later.
The time periods for a statement to be considered stuck in the PENDING state are:
- 10 minutes on AWS
- 10 minutes on Google Cloud
- 30 minutes on Azure. The time for Azure is longer, because it can take up to 20 minutes to start a new node.
Notifications¶
Confluent Cloud for Apache Flink integrates with Notifications for Confluent Cloud. The following notifications are available for Flink statements. They apply only to DML statements, for example, INSERT INTO, EXECUTE STATEMENT SET (see the sketch after this list), or CREATE TABLE AS.
- Statement failure: This notification is of severity “critical”. It is triggered when a statement transitions from RUNNING to FAILED. A statement transitions to FAILED on exceptions that Confluent classifies as USER, as opposed to SYSTEM.
- Statement degraded: This notification is of severity “warning”. It is triggered when a statement transitions from RUNNING to DEGRADED.
- Statement returned to pending: This notification is of severity “critical”. It is triggered when a statement transitions from RUNNING to PENDING. This may happen if the compute pool doesn’t have enough resources to keep all statements running with their minimum resource requirements.
- Statement stuck in pending: This notification is of severity “info”. It is triggered when a newly submitted statement stays in PENDING for a long time.
- Statement auto-stopped: This notification is of severity “warning”. It is triggered when a statement moves into STOPPED because the compute pool it is using was deleted by a user.
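For reference, the statement-set sketch mentioned above bundles several inserts into a single long-running DML statement, so it is covered by these notifications. The table names are placeholders.

```sql
-- A statement set runs multiple INSERT INTO statements as a single
-- long-running DML statement. All table names are hypothetical.
EXECUTE STATEMENT SET
BEGIN
  INSERT INTO orders_eur SELECT order_id, price * 0.92 FROM orders;
  INSERT INTO orders_gbp SELECT order_id, price * 0.79 FROM orders;
END;
```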
Best practices for alerting¶
Use the Metrics API and Notifications for Confluent Cloud to monitor your compute pools and statements over time. You should monitor and configure alerts for the following conditions:
- Per compute pool
  - Alert on exhausted compute pools by comparing the current CFUs (io.confluent.flink/compute_pool_utilization/current_cfus) to the maximum CFUs of the pool (io.confluent.flink/compute_pool_utilization/cfu_limit).
  - The Statement stuck in pending and Statement returned to pending notifications also indicate compute-pool exhaustion.
- Per statement
  - Alert on statement failures (see Notifications).
  - Alert on statement degradation (see Notifications).
  - Alert on statements returning to pending (see Notifications).
  - Alert on an increase of “Messages Behind”/“Consumer Lag” (metric name: io.confluent.flink/pending_records) over an extended period of time, for example, more than 10 minutes; your mileage may vary. Note that Confluent Cloud for Apache Flink does not appear as a consumer in the regular consumer lag monitoring feature in Confluent Cloud, because it uses the assign() method.
  - (Optional) Alert on an increase of the difference between the output watermark (io.confluent.flink/current_output_watermark_ms) and the input watermark (io.confluent.flink/current_input_watermark_ms). The input watermark corresponds to the time up to which the input data is complete, and the output watermark corresponds to the time up to which the output data is complete. The difference can be considered a measure of the amount of data that’s currently “in flight”. Depending on the logic of the statement, different patterns are expected. For example, for a tumbling event-time window, expect an increasing difference until the window fires, at which point the difference drops to zero and starts increasing again (see the sketch after this list).
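The following sketch of such a tumbling event-time window uses hypothetical table and column names. While an hourly window is open, the output watermark trails the input watermark, so the difference between the two grows and then drops when the window fires.

```sql
-- Hypothetical hourly tumbling window over an `orders` table whose
-- `order_time` column is the event-time attribute. The watermark
-- difference grows for up to an hour, then drops when the window fires.
INSERT INTO hourly_order_counts
SELECT window_start, COUNT(*) AS order_count
FROM TABLE(
  TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR))
GROUP BY window_start, window_end;
```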