Troubleshoot Confluent for Kubernetes

This topic provides general information about troubleshooting your Confluent Platform deployment: the tools available, some of the issues you might encounter, and how to resolve them.

Support bundle

You can create a support bundle to provide Confluent with all of the information required for debugging.

Note

CFK support bundles are not currently supported for Windows.

Confluent for Kubernetes (CFK) aggregates information, such as events, Kubernetes versions, logs, and Confluent API status, into a tar.gz file that you can upload to the Confluent Support site.

The following information is collected in a support bundle:

  • For the Kubernetes cluster (in the top-level directory):
    • The Kubernetes events (kubectl get events)
    • The Kubernetes server version (kubectl version)
  • For each Confluent Platform component (in the /clusters/<component> directory):
    • The config maps generated by CFK for the component
    • The pod logs for each pod in the component service (kubectl logs <pod-name>)
    • The custom resource object for the component (kubectl get <component_cr_type> <component_name>)
    • The StatefulSet object generated by CFK for the component (kubectl get sts <component-sts-name>)
  • For the Confluent for Kubernetes Operator deployment (in the /operator directory):
    • The CFK Operator pod logs (kubectl logs confluent-operator)
    • The deployment object generated for the CFK Operator (kubectl get deployment confluent-operator -oyaml)

To create a support bundle, run the following command using the Confluent plugin:

kubectl confluent support-bundle --namespace <namespace>

To see other flags you can use to customize your support bundle, run:

kubectl confluent support-bundle -h

For a sample support bundle, see Appendix: Sample support bundle contents.
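Once a bundle is extracted, a quick first pass is to see which pod logs contain errors. The following is a minimal sketch, not part of the Confluent plugin: count_bundle_errors is a hypothetical helper, and it assumes the layout shown in the appendix (clusters/<component>/<pod>.log) and that error lines contain the literal string ERROR.

```shell
# Hypothetical helper: count ERROR lines in each pod log of an extracted bundle.
# Assumes the layout from the appendix: <bundle-dir>/clusters/<component>/<pod>.log
count_bundle_errors() {
  bundle_dir="$1"
  for log in "$bundle_dir"/clusters/*/*.log; do
    # Skip the unexpanded glob pattern when no log files exist.
    [ -f "$log" ] || continue
    # grep -c prints the number of matching lines; "|| true" keeps a zero
    # count from being treated as a failure.
    count=$(grep -c 'ERROR' "$log" || true)
    # Print the count followed by the path relative to the bundle directory.
    printf '%s %s\n' "$count" "${log#"$bundle_dir"/}"
  done
}
```

For example, after running tar -xzf on the bundle, count_bundle_errors support-bundle prints one line per pod log, which helps decide which component to inspect first.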

Logs

Logs are sent directly to STDOUT for each pod. Use the command below to view the logs for a pod:

kubectl logs <pod-name> -n <namespace>

Additionally, in component custom resources, you can change the log level to DEBUG using the Configuration overrides feature. For example, add the following in your Kafka CR to get more details in the Kafka logs:

spec:
  configOverrides:
    log4j:
      - log4j.rootLogger=DEBUG, stdout

Metrics

  • JMX metrics are available on port 7203 of each pod.
  • Jolokia (a REST interface for JMX metrics) is available on port 7777 of each pod.
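As an illustration of reading a metric over Jolokia, the sketch below builds a Jolokia read URL for one MBean attribute. jolokia_read_url is a hypothetical helper. It assumes the agent is served over plain HTTP under the Jolokia default /jolokia context path, which this topic does not confirm for CFK pods.

```shell
# Hypothetical helper: build a Jolokia "read" URL for one MBean attribute.
# Assumes plain HTTP and the default /jolokia context path.
jolokia_read_url() {
  host="$1"; port="$2"; mbean="$3"; attribute="$4"
  printf 'http://%s:%s/jolokia/read/%s/%s\n' "$host" "$port" "$mbean" "$attribute"
}
```

One way to use it is to forward the port with kubectl port-forward <pod-name> 7777:7777 -n <namespace>, and then run curl -s "$(jolokia_read_url localhost 7777 java.lang:type=Memory HeapMemoryUsage)" from another terminal.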

Debug

Several types of problems can occur while using Confluent for Kubernetes (CFK):

  • A problem happens while deploying CFK.

  • A problem exists at the infrastructure level.

    Something has gone wrong at the Kubernetes layer.

  • A problem exists at the application level.

    The infrastructure is fine but something has gone wrong with Confluent Platform itself. Typically, this is caused by how Confluent Platform components were configured.

To debug deployment problems, run the Helm install command with the --set debug="true" option to enable verbose output:

helm upgrade --install confluent-operator \
  confluentinc/confluent-for-kubernetes \
  --namespace <namespace> \
  --set debug="true"

Look for Kubernetes issues first, then debug Confluent Platform.

  1. Check for potential Kubernetes errors by entering the following command:

    kubectl get events -n <namespace>
    
  2. To check for a specific resource issue, enter the following command (using the resource type pods as an example):

    kubectl describe pods <pod-name> -n <namespace>
    
  3. If everything looks okay after running the commands above, check the individual pod logs using the following command:

    kubectl logs <pod-name> -n <namespace>
    

    Confluent Platform containers are configured so application logs are printed to STDOUT. The logs can be read directly with this command. If there is anything wrong at the application level, like an invalid configuration, this will be evident in the logs.

    Note

    If a pod has been replaced because it crashed and you want to check the previous pod’s logs, add --previous to the end of the command above.
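The three checks above can be combined into one shell function that runs them in the same order. This is only a convenience sketch: triage_pod and its "previous" argument are hypothetical names, and the function uses only the kubectl commands already shown in the steps.

```shell
# Hypothetical wrapper around the three checks above, run in the same order.
# Pass "previous" as the third argument to add --previous to the logs command.
triage_pod() {
  namespace="$1"; pod="$2"; mode="${3:-current}"
  echo "== events in $namespace =="
  kubectl get events -n "$namespace"
  echo "== describe $pod =="
  kubectl describe pods "$pod" -n "$namespace"
  echo "== logs for $pod ($mode) =="
  if [ "$mode" = "previous" ]; then
    kubectl logs "$pod" -n "$namespace" --previous
  else
    kubectl logs "$pod" -n "$namespace"
  fi
}
```

For example, triage_pod <namespace> <pod-name> previous prints the events, the pod description, and the logs from before the last restart, in that order.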

Problems caused by the datacenter infrastructure, such as virtual machine (VM) firewall rules or DNS configuration, should be resolved by your infrastructure system administrator.

Troubleshoot known issues

This section describes a few of the potential issues you might encounter while deploying Confluent Platform and presents the steps to troubleshoot the issues.

NOTE: The examples in this section assume the default namespace you set in Create a namespace for CFK.

Issue: Confluent Control Center cannot use auto-generated certificates for MDS or Confluent Cloud Schema Registry

When TLS is enabled, and when Confluent Control Center uses a different TLS certificate to communicate with MDS or Confluent Cloud Schema Registry, Control Center cannot use an auto-generated TLS certificate to connect to MDS or Confluent Cloud Schema Registry.

Solution: To encrypt Confluent Control Center traffic to MDS or Confluent Cloud Schema Registry, provide custom TLS certificates:

  1. Add to the Confluent Control Center TLS secret a custom truststore that includes the root CA to trust Confluent Cloud Schema Registry or MDS.

  2. Use the above secret name in the Confluent Control Center custom resource (CR):

    spec:
      dependencies:
        mds:
          tls:
            secretRef: <custom Root CA>
    
  3. Disable auto-generated certs and provide the root CA in the Confluent Control Center custom resource:

    spec:
      tls:
        autoGeneratedCerts: false
        secretRef: <custom Root CA>
    
  4. Deploy or redeploy Confluent Control Center.

Issue: ksqlDB cannot use auto-generated certificates for Confluent Cloud

When ksqlDB communicates with Confluent Cloud, ksqlDB cannot use an auto-generated TLS certificate.

Solution: To encrypt ksqlDB traffic to Confluent Cloud, provide custom TLS certificates:

  1. Provide a custom truststore that includes the Let’s Encrypt Root CA.

  2. Disable auto-generated certs and provide the root CA in the ksqlDB custom resource:

    spec:
      tls:
        autoGeneratedCerts: false
        secretRef: <custom Root CA>
    
  3. Deploy or redeploy ksqlDB.

Issue: Error for running Schema Registry with mismatching replication factor

The Schema Registry deployed with Confluent for Kubernetes uses the default replication factor of 3 for internal topics. When you deploy Schema Registry and Kafka simultaneously and the Kafka brokers have not fully started, Schema Registry creates topics with a replication factor lower than 3, and Schema Registry later starts failing with an error.

For example, if Kafka has one broker up, and Schema Registry is trying to come up, it will create a topic with replication factor of 1. Subsequently, when all 3 brokers come up, Schema Registry will complain that there are topics that were only created with replication factor 1 and will fail to start.

Solution: Delete the Schema Registry deployment and redeploy it once Kafka is fully up. Alternatively, reconfigure the topics to align with the Schema Registry default replication factor of 3.
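To confirm that a topic was created with too few replicas, you can inspect the topic description. The sketch below is a hypothetical helper that extracts the ReplicationFactor value from the summary line printed by kafka-topics --describe. It assumes the standard summary-line format of that tool. How you run kafka-topics (for example, with kubectl exec in a broker pod, against the Schema Registry topic, which is _schemas by default) depends on your listener and security configuration and is not shown here.

```shell
# Hypothetical helper: read `kafka-topics --describe` output on stdin and
# print the ReplicationFactor from the first topic summary line.
topic_replication_factor() {
  sed -n 's/.*ReplicationFactor:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

A result lower than 3 indicates that the topic was created before all brokers were available.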

Issue: Errors with container process and directory permissions when deploying Confluent Platform 6.2.x

You might see a permission error when starting ZooKeeper and other Confluent Platform 6.2.x components with CFK. For example:

Error: failed to start container "zookeeper": Error response from daemon: OCI
runtime create failed: container_linux.go:367: starting container process
caused: chdir to cwd ("/home/appuser") set in config.json failed: permission
denied: unknown

Confluent Platform 6.2.x configures the default user, appuser, to be mapped to UID 1000 with the base directory /home/appuser.

Solution: Set the Pod Security Context to match the new Confluent Platform configuration.

  1. Change the podSecurityContext config for all the Confluent Platform component custom resources (CRs):

    spec:
      podTemplate:
        podSecurityContext:
          fsGroup: 1000
          runAsUser: 1000
          runAsNonRoot: true
    
  2. Apply the CR changes for each component:

    kubectl apply -f <component CR>
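After the pods restart, you may want to confirm that the new setting took effect. The sketch below reads the runAsUser value from a running pod. pod_run_as_user is a hypothetical helper, and it assumes that CFK copies podTemplate.podSecurityContext into the pod-level securityContext of the generated pods.

```shell
# Hypothetical check: print the runAsUser value from a pod's security context.
# Assumes CFK copies podTemplate.podSecurityContext into the pod spec.
pod_run_as_user() {
  namespace="$1"; pod="$2"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.spec.securityContext.runAsUser}'
}
```

With the configuration above, pod_run_as_user <namespace> <pod-name> would be expected to print 1000.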
    

Issue: Unable to delete Kubernetes resources

Kubernetes resources might fail to delete, for example, if the CFK pod is deleted before other resources, if CFK can’t delete resources, or if the namespace is in the terminating state.

Solution: Remove the finalizer using the following command:

kubectl get <resource> --no-headers | \
  awk '{print $1 }' | \
  xargs kubectl patch <resource> -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

Issue: ConfluentRoleBindings stuck in DELETING

The ConfluentRolebindings custom resources (CRs) can be stuck in the DELETING state if the associated Kafka cluster is removed.

Solution: Manually remove the finalizer for those ConfluentRolebindings CRs as shown in the following example:

  1. Check the status of the ConfluentRolebindings custom resources:

    kubectl get cfrb
    

    The output should have the DELETING status:

    NAME                         STATUS     KAFKACLUSTERID           PRINCIPAL        ROLE
    c3-connect-operator-7gffem   DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          SystemAdmin
    c3-ksql-operator-7gffem      DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          ResourceOwner
    c3-operator-7gffem           DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          ClusterAdmin
    c3-sr-operator-7gffem        DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          SystemAdmin
    connect-operator-7gffem-0    DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SystemAdmin
    connect-operator-7gffem-1    DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SystemAdmin
    internal-connect-0           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SecurityAdmin
    internal-connect-1           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     ResourceOwner
    internal-connect-2           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     DeveloperWrite
    
  2. Remove the finalizer of each ConfluentRolebindings CR:

    for rb in $(kubectl get cfrb --no-headers | grep "DELETING" | awk '{print $1}'); \
      do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
      --type=merge; \
      done
    
  3. If you want to remove the finalizer for all the ConfluentRolebindings CRs in a namespace regardless of their status, for example, when the namespace is stuck in the terminating state and you need to clean up all the ConfluentRolebindings, run the following command:

    for rb in $(kubectl get cfrb --no-headers -ojsonpath='{.items[*].metadata.name}'); \
      do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
      --type=merge; \
      done
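The loop in step 2 selects rows with grep "DELETING", which matches the string anywhere on the line. As a slightly stricter variant, the sketch below matches only the STATUS column. deleting_rolebindings is a hypothetical helper, and it assumes the column order shown in the sample output above (NAME first, STATUS second).

```shell
# Hypothetical helper: read `kubectl get cfrb --no-headers` output on stdin
# and print the names whose STATUS column (field 2) is exactly DELETING.
deleting_rolebindings() {
  awk '$2 == "DELETING" { print $1 }'
}
```

It can replace the grep and awk stages of the loop, for example: kubectl get cfrb --no-headers | deleting_rolebindings.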
    

Issue: KafkaTopic is stuck in DELETING/DELETE_REQUESTED state

When you delete a Kafka topic (kubectl delete kafkatopic <topic-name>), CFK uses the Kubernetes finalizer feature to remove the topic from the destination Kafka cluster. If the finalizer fails to delete the topic because of a network issue or because the Kafka cluster is unavailable (for example, it was deleted), the kubectl delete command hangs.

Solution: Patch the kafkatopic with the following command to remove the topic resource:

kubectl patch kafkatopic <topic-name> -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

Delete Confluent Platform component pods

To manually delete Confluent Platform component pods, run the following command for each component.

Warning

This will completely bring down the component. Use caution when deleting a Kafka pod as it will trigger data loss.

kubectl delete pod -l platform.confluent.io/type=<cr-type>

<cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter

If a pod is already in the CrashLoopBackOff state, CFK will not honor the changes until the pod goes back to the running state. In that case, you can use --force --grace-period=0 with the above command.
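Because the delete command takes down every pod that matches the label, it can help to validate the component type and preview the matching pods first. The sketch below is a hypothetical helper that accepts only the component types listed above and prints the corresponding label selector.

```shell
# Hypothetical helper: validate a component type and print its pod label selector.
component_selector() {
  case "$1" in
    kafka|zookeeper|schemaregistry|ksqldb|connect|controlcenter)
      printf 'platform.confluent.io/type=%s\n' "$1" ;;
    *)
      # Reject anything that is not one of the documented types.
      echo "unknown component type: $1" >&2
      return 1 ;;
  esac
}
```

For example, kubectl get pods -l "$(component_selector kafka)" lists the pods that the delete command would remove, without deleting anything.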

Block Kubernetes object reconciliation

CFK includes multiple controllers. It’s a controller’s job to ensure that, for any given object, the actual state of the world (both the cluster state, and potentially external state like running containers for Kubelet or loadbalancers for a cloud provider) matches the desired state in the object. This process is called reconciling.

Solution:

  • To block reconciliation (often needed when upgrading), use the following command to add the annotation:

    kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile=true
    

    <cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter

  • To stop blocking reconciliation, remove the annotation with the following command. If you do not remove it, subsequent changes to the custom resource are ignored:

    kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile-
    
  • To force trigger the reconciliation, add the following annotation:

    kubectl annotate <cr-type> <cluster_name> platform.confluent.io/force-reconcile=true
    

    After the reconcile is triggered, this annotation is disabled automatically. This command is useful in situations where the auto-generated certificate has expired and CFK needs to be notified to create a new certificate.
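The three annotation commands above differ only in the annotation string, so they can be wrapped in one function to reduce typing mistakes during an upgrade. This is a sketch only: reconcile_annotation and its action names (block, unblock, force) are hypothetical, and the annotations are the ones shown above.

```shell
# Hypothetical wrapper for the three annotation commands above.
# action: block | unblock | force
reconcile_annotation() {
  action="$1"; cr_type="$2"; name="$3"
  case "$action" in
    block)   annotation='platform.confluent.io/block-reconcile=true' ;;
    unblock) annotation='platform.confluent.io/block-reconcile-' ;;
    force)   annotation='platform.confluent.io/force-reconcile=true' ;;
    *)
      echo "usage: reconcile_annotation block|unblock|force <cr-type> <name>" >&2
      return 1 ;;
  esac
  kubectl annotate "$cr_type" "$name" "$annotation"
}
```

For example, reconcile_annotation block kafka <cluster_name> before an upgrade, followed by reconcile_annotation unblock kafka <cluster_name> afterward.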

Issue: Kafka listeners cannot share the same TLS secret reference

For a Kafka custom resource, two different listeners, for example, the internal and external listeners, can’t share the same TLS secret reference. If they are configured to share one, the volume mount path is affected and the StatefulSet fails.

Workaround: Use global TLS if you need to share TLS configuration as shown below:

spec:
  tls:
    secretRef: tls-shared-certs
  listeners:
    internal:
      tls:
        enabled: true
    external:
      tls:
        enabled: true

Fixed Version: This limitation is fixed in CFK 2.0.2.

Warning: Operation cannot be fulfilled, object has been modified

You can ignore the following warning because CFK automatically retries applying the changes.

Operation cannot be fulfilled xxxx, the object has been modified please apply
your changes to the latest version and try again.

Solution: In most scenarios, these are benign and will go away. If you continue seeing the same type of Warning repeatedly, create a support ticket for further investigation.

Appendix: Sample support bundle contents

The following is a sample listing of the files generated for a support bundle.

tar -xzf support-bundle-ns-confluent.tar.gz
ls -al -R support-bundle
drwxr-xr-x  8 omar  operator   256B Dec 21 17:03 clusters
-rw-r--r--  1 omar  operator     3B Dec 21 17:02 event.yaml

# The version of Kubernetes being run
-rw-r--r--  1 omar  operator   215B Dec 21 17:02 k8s-version.yaml
drwxr-xr-x  4 omar  operator   128B Dec 21 17:03 operator

support-bundle/clusters:
total 0
drwxr-xr-x  7 omar  operator   224B Dec 21 17:03 connect
drwxr-xr-x  6 omar  operator   192B Dec 21 17:03 controlcenter
drwxr-xr-x  8 omar  operator   256B Dec 21 17:03 kafka
drwxr-xr-x  7 omar  operator   224B Dec 21 17:03 ksqldb
drwxr-xr-x  6 omar  operator   192B Dec 21 17:03 schemaregistry
drwxr-xr-x  8 omar  operator   256B Dec 21 17:03 zookeeper

support-bundle/clusters/connect:
total 2176
# The configmaps generated by CFK for the component (Connect)
-rw-r--r--  1 omar  operator    23K Dec 21 17:02 configmaps.yaml
# Pod logs for each pod in the component service
-rw-r--r--  1 omar  operator   512K Dec 21 17:02 connect-0.log
-rw-r--r--  1 omar  operator   515K Dec 21 17:02 connect-1.log
# The Custom Resource object for the component
-rw-r--r--  1 omar  operator   5.3K Dec 21 17:02 connect.yaml
# The statefulset object generated by CFK for the component
-rw-r--r--  1 omar  operator    24K Dec 21 17:02 statefulset.yaml

support-bundle/clusters/controlcenter:
total 7952
-rw-r--r--  1 omar  operator    22K Dec 21 17:02 configmaps.yaml
-rw-r--r--  1 omar  operator   3.8M Dec 21 17:02 controlcenter-0.log
-rw-r--r--  1 omar  operator   6.5K Dec 21 17:02 controlcenter.yaml
-rw-r--r--  1 omar  operator    25K Dec 21 17:02 statefulset.yaml

support-bundle/clusters/kafka:
total 23056
-rw-r--r--  1 omar  operator    45K Dec 21 17:02 configmaps.yaml
-rw-r--r--  1 omar  operator   4.4M Dec 21 17:02 kafka-0.log
-rw-r--r--  1 omar  operator   3.2M Dec 21 17:02 kafka-1.log
-rw-r--r--  1 omar  operator   3.5M Dec 21 17:02 kafka-2.log
-rw-r--r--  1 omar  operator    12K Dec 21 17:02 kafka.yaml
-rw-r--r--  1 omar  operator    26K Dec 21 17:02 statefulset.yaml

support-bundle/clusters/ksqldb:
total 3248
-rw-r--r--  1 omar  operator    17K Dec 21 17:02 configmaps.yaml
-rw-r--r--  1 omar  operator   925K Dec 21 17:02 ksqldb-0.log
-rw-r--r--  1 omar  operator   639K Dec 21 17:02 ksqldb-1.log
-rw-r--r--  1 omar  operator   5.3K Dec 21 17:02 ksqldb.yaml
-rw-r--r--  1 omar  operator    25K Dec 21 17:02 statefulset.yaml

support-bundle/clusters/schemaregistry:
total 1280
-rw-r--r--  1 omar  operator    17K Dec 21 17:02 configmaps.yaml
-rw-r--r--  1 omar  operator   583K Dec 21 17:02 schemaregistry-0.log
-rw-r--r--  1 omar  operator   5.3K Dec 21 17:02 schemaregistry.yaml
-rw-r--r--  1 omar  operator    24K Dec 21 17:02 statefulset.yaml

support-bundle/clusters/zookeeper:
total 1056
-rw-r--r--  1 omar  operator    17K Dec 21 17:02 configmaps.yaml
-rw-r--r--  1 omar  operator    24K Dec 21 17:02 statefulset.yaml
-rw-r--r--  1 omar  operator   155K Dec 21 17:02 zookeeper-0.log
-rw-r--r--  1 omar  operator   172K Dec 21 17:02 zookeeper-1.log
-rw-r--r--  1 omar  operator   146K Dec 21 17:02 zookeeper-2.log
-rw-r--r--  1 omar  operator   3.5K Dec 21 17:02 zookeeper.yaml

support-bundle/operator:
total 11680
-rw-r--r--  1 omar  operator   5.7M Dec 21 17:02 confluent-operator-6f5bf494b6-ngjjh.log
-rw-r--r--  1 omar  operator   8.9K Dec 21 17:02 deployment.yaml