Troubleshoot Confluent for Kubernetes¶
This topic provides general information about troubleshooting your Confluent Platform deployment, the tools available, and some of the issues you might encounter and how to troubleshoot the issues.
Support bundle¶
Confluent for Kubernetes (CFK) aggregates information, such as events, the Kubernetes version, logs, and Confluent API statuses, in a tar.gz file for you to upload to the Confluent Support site.
You can create a support bundle to provide Confluent all of the required information for debugging.
Use the latest version of the Confluent plugin tool to generate the support bundle so that it captures the logs and configs for the features available in the latest CFK release.
Note
The CFK support bundles are not currently supported for Windows.
The following information is collected in a support bundle:
- For the Kubernetes cluster (in the top-level directory):
  - The Kubernetes events (kubectl get events)
  - The Kubernetes server version (kubectl version)
- For each Confluent Platform cluster component (in the /cluster-resources/component directory):
  - The config maps generated by CFK for the component
  - The pod logs for each pod in the component service (kubectl logs <pod-name>)
  - The custom resource object for the component (kubectl get <component_cr_type> <component_name>)
  - The StatefulSet object generated by CFK for the component (kubectl get sts <component-sts-name>)
- For each Confluent Platform application component (in the /application-resources/component directory):
  - The custom resource object for the component (kubectl get <component_cr_type> <component_name>)
- For the Confluent for Kubernetes Operator deployment (in the /operator directory):
  - The CFK Operator pod logs (kubectl logs confluent-operator)
  - The deployment object generated for the CFK Operator (kubectl get deployment confluent-operator -oyaml)
To create a support bundle, install the Confluent plugin as described in Confluent plugin and run the following command:
kubectl confluent support-bundle --namespace <namespace>
To see other flags you can use to customize your support bundle, run:
kubectl confluent support-bundle -h
For a sample support bundle, see Appendix: Sample support bundle contents.
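Once generated, the bundle can be listed and unpacked with standard tar commands. The sketch below fabricates a minimal stand-in bundle locally (the file names are hypothetical) so the commands are runnable without a cluster; a real bundle is produced by kubectl confluent support-bundle.

```shell
# Create a stand-in bundle with the same top-level layout as a real one
# (hypothetical content, for illustration only).
mkdir -p support-bundle/operator
echo "sample operator log line" > support-bundle/operator/confluent-operator.log
tar -czf support-bundle-ns-confluent.tar.gz support-bundle

# List the contents without extracting, then extract.
tar -tzf support-bundle-ns-confluent.tar.gz
tar -xzf support-bundle-ns-confluent.tar.gz
```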
Logs¶
Logs are sent directly to STDOUT for each pod. Use the command below to view the logs for a pod:
kubectl logs <pod-name> -n <namespace>
Additionally, in component custom resources, you can change the log level to DEBUG using the Configuration overrides feature. For example, add the following in your Kafka CR to get more details in the Kafka logs:
spec:
configOverrides:
log4j:
- log4j.rootLogger=DEBUG, stdout
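For context, the configOverrides stanza sits alongside the rest of the Kafka spec. The following is a hypothetical, minimal excerpt of a Kafka CR (the cluster name and namespace are assumptions, and other required fields are omitted):

```yaml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
  name: kafka          # hypothetical cluster name
  namespace: confluent # hypothetical namespace
spec:
  replicas: 3
  configOverrides:
    log4j:
      - log4j.rootLogger=DEBUG, stdout
```

Apply the change with kubectl apply -f <file>, and CFK reconciles the cluster with the new log level.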
Metrics¶
- JMX metrics are available on port 7203 of each pod.
- Jolokia (a REST interface for JMX metrics) is available on port 7777 of each pod.
Debug¶
There are several types of problems that can go wrong while using Confluent for Kubernetes (CFK):
- A problem happens while deploying CFK.
- A problem exists at the infrastructure level: something has gone wrong at the Kubernetes layer.
- A problem exists at the application level: the infrastructure is fine, but something has gone wrong with Confluent Platform itself. Typically, this is caused by how Confluent Platform components were configured.
To debug deployment problems, run the Helm install command with the --set debug="true" flag to enable verbose output:
helm upgrade --install confluent-operator \
confluentinc/confluent-for-kubernetes \
--namespace <namespace> \
--set debug="true"
Look for Kubernetes issues first, then debug Confluent Platform.
Check for potential Kubernetes errors by entering the following command:
kubectl get events -n <namespace>
To check for a specific resource issue, enter the following command (using pods as the example resource type):
kubectl describe pods <podname> -n <namespace>
If everything looks okay after running the commands above, check the individual pod logs using the following command:
kubectl logs <pod name> -n <namespace>
Confluent Platform containers are configured so application logs are printed to STDOUT. The logs can be read directly with this command. If there is anything wrong at the application level, like an invalid configuration, this will be evident in the logs.
Note
If a pod has been replaced because it crashed and you want to check the previous pod’s logs, add
--previous
to the end of the command above.
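A quick first pass over pod logs is to filter for error-level lines. The sketch below simulates that on an inline sample (the log lines are hypothetical), so it runs without a cluster; against a real pod you would pipe kubectl logs into the same grep.

```shell
# Hypothetical sample of what `kubectl logs <pod-name>` might print.
log='INFO starting broker
ERROR Invalid value for configuration log.retention.ms
INFO shutting down'

# Keep only error-level lines; the same filter works on real pod logs:
#   kubectl logs <pod-name> -n <namespace> | grep -E 'ERROR|FATAL'
errors=$(printf '%s\n' "$log" | grep -E 'ERROR|FATAL')
echo "$errors"
```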
Problems caused by the datacenter infrastructure, such as virtual machine (VM) firewall rules and DNS configuration, should be resolved by your infrastructure system administrator.
Troubleshoot known issues¶
This section describes a few of the potential issues you might encounter while deploying Confluent Platform and presents the steps to troubleshoot the issues.
NOTE: The examples in this section assume the default namespace you set in Create a namespace for CFK.
Issue: An error returns while applying a CRD during an upgrade¶
As part of a CFK upgrade, you need to upgrade Confluent Platform custom resource definitions (CRDs) using the following command:
kubectl apply -f confluent-for-kubernetes/crds/
You might get an error message similar to the following from the command:
The CustomResourceDefinition "kafkas.platform.confluent.io" is invalid:
metadata.annotations: Too long: must have at most 262144 bytes make: ***
[install-crds] Error 1
Solution: Run one of the following commands to apply the CRD:
kubectl apply --server-side=true -f <CRD>
kubectl replace -f <CRD>
Issue: Confluent Control Center cannot use auto-generated certificates for MDS or Confluent Cloud Schema Registry¶
When TLS is enabled, and when Confluent Control Center uses a different TLS certificate to communicate with MDS or Confluent Cloud Schema Registry, Control Center cannot use an auto-generated TLS certificate to connect to MDS or Confluent Cloud Schema Registry.
Solution: To encrypt Confluent Control Center traffic to MDS or Confluent Cloud Schema Registry, provide custom TLS certificates:
Add to the Confluent Control Center TLS secret a custom truststore that includes the root CA to trust Confluent Cloud Schema Registry or MDS.
Use the above secret name in the Confluent Control Center custom resource (CR):
spec:
  dependencies:
    mds:
      tls:
        secretRef: <custom Root CA>
Disable auto-generated certs and provide the root CA in the Confluent Control Center custom resource:
spec:
  tls:
    autoGeneratedCerts: false
    secretRef: <custom Root CA>
Deploy or redeploy Confluent Control Center.
Issue: ksqlDB cannot use auto-generated certificates for Confluent Cloud¶
When ksqlDB communicates with Confluent Cloud, ksqlDB cannot use an auto-generated TLS certificate.
Solution: To encrypt ksqlDB traffic to Confluent Cloud, provide custom TLS certificates:
Provide a custom truststore that includes the Let’s Encrypt Root CA.
Disable auto-generated certs and provide the root CA in the ksqlDB custom resource:
spec:
  tls:
    autoGeneratedCerts: false
    secretRef: <custom Root CA>
Deploy or redeploy ksqlDB.
Issue: Error for running Schema Registry with mismatching replication factor¶
The Schema Registry deployed with Confluent for Kubernetes uses the default replication factor of 3 for internal topics. When deploying Schema Registry and Kafka simultaneously, and when the Kafka brokers have not fully started, Schema Registry creates topics with a replication factor lower than 3, and later Schema Registry starts failing with an error.
For example, if Kafka has one broker up, and Schema Registry is trying to come up, it will create a topic with a replication factor of 1. Subsequently, when all 3 brokers come up, Schema Registry will complain that there are topics that were created with replication factor 1 and will fail to start.
Solution: Delete the Schema Registry deployment and re-deploy once Kafka is fully up. Or reconfigure the topics to align with the Schema Registry default replication factor of 3.
Issue: Errors with container process and directory permissions when deploying Confluent Platform 6.2.x¶
You might see a permission error when starting ZooKeeper and other Confluent Platform 6.2.x components with CFK. For example:
Error: failed to start container "zookeeper": Error response from daemon: OCI
runtime create failed: container_linux.go:367: starting container process
caused: chdir to cwd ("/home/appuser") set in config.json failed: permission
denied: unknown
Confluent Platform 6.2.x configures the default user, appuser, to be mapped to UID 1000 with the base directory /home/appuser.
Solution: Set the Pod Security Context to match the new Confluent Platform configuration.
Change the podSecurityContext config for all the Confluent Platform component custom resources (CRs):
spec:
  podTemplate:
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
      runAsNonRoot: true
Apply the CR changes for each component:
kubectl apply -f <component CR>
Issue: Unable to delete Kubernetes resources¶
Kubernetes resources cannot be deleted in some situations, for example, if the CFK pod is deleted before other resources, if CFK cannot delete the resources, or if the namespace is stuck in the Terminating state.
Solution: Remove the finalizer using the following command:
kubectl get <resource> --no-headers | \
awk '{print $1 }' | \
xargs kubectl patch <resource> -p '{"metadata":{"finalizers":[]}}' \
--type=merge
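The pipeline above first lists the resource names and then patches each one; the selection step is plain text processing. Here is the same awk selection run on a hypothetical two-line sample of kubectl get output, so you can see which names would be patched:

```shell
# Hypothetical output of `kubectl get <resource> --no-headers`.
sample='topic-a   DELETING   demo-kafka
topic-b   CREATED    demo-kafka'

# awk '{print $1}' keeps only the first column (the resource names);
# in the real pipeline these names are fed to `kubectl patch` via xargs.
names=$(printf '%s\n' "$sample" | awk '{print $1}')
echo "$names"
```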
Issue: ConfluentRoleBindings stuck in DELETING¶
The ConfluentRolebinding custom resources (CRs) can be stuck in the DELETING state if the associated Kafka cluster has been removed.
Solution: Manually remove the finalizer for those ConfluentRolebindings CRs as shown in the following example:
Check the status of the ConfluentRolebindings custom resources:
kubectl get cfrb
The output should have the DELETING status:
NAME                        STATUS    KAFKACLUSTERID          PRINCIPAL     ROLE
c3-connect-operator-7gffem  DELETING  8itASw0_S6qDfdl72b7Uyg  User:c3       SystemAdmin
c3-ksql-operator-7gffem     DELETING  8itASw0_S6qDfdl72b7Uyg  User:c3       ResourceOwner
c3-operator-7gffem          DELETING  8itASw0_S6qDfdl72b7Uyg  User:c3       ClusterAdmin
c3-sr-operator-7gffem       DELETING  8itASw0_S6qDfdl72b7Uyg  User:c3       SystemAdmin
connect-operator-7gffem-0   DELETING  8itASw0_S6qDfdl72b7Uyg  User:connect  SystemAdmin
connect-operator-7gffem-1   DELETING  8itASw0_S6qDfdl72b7Uyg  User:connect  SystemAdmin
internal-connect-0          DELETING  8itASw0_S6qDfdl72b7Uyg  User:connect  SecurityAdmin
internal-connect-1          DELETING  8itASw0_S6qDfdl72b7Uyg  User:connect  ResourceOwner
internal-connect-2          DELETING  8itASw0_S6qDfdl72b7Uyg  User:connect  DeveloperWrite
Remove the finalizer of each ConfluentRolebindings CR:
for rb in $(kubectl get cfrb --no-headers | grep "DELETING" | awk '{print $1}'); \
do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
--type=merge; \
done
If you want to remove the finalizer for all the ConfluentRolebinding CRs in a namespace regardless of their status, for example, when the namespace is stuck in the Terminating state and you need to clean up all the ConfluentRolebindings, run the following command:
for rb in $(kubectl get cfrb --no-headers -ojsonpath='{.items[*].metadata.name}'); \
do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
--type=merge; \
done
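Unlike the --no-headers output, the jsonpath expression emits all names on one space-separated line, and the shell for-loop then splits on whitespace. A minimal local sketch of that splitting (the names are hypothetical):

```shell
# Hypothetical output of
#   kubectl get cfrb --no-headers -ojsonpath='{.items[*].metadata.name}'
names='rb-one rb-two rb-three'

# The unquoted expansion splits on spaces, yielding one loop pass per CR;
# in the real loop each pass runs `kubectl patch cfrb $rb ...`.
count=0
for rb in $names; do
  count=$((count + 1))
done
echo "$count"
```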
Issue: KafkaTopic is stuck in DELETING/DELETE_REQUESTED state¶
When you delete a Kafka topic (kubectl delete kafkatopic <topic-name>), CFK uses the Kubernetes finalizer feature to remove the topic from the destination Kafka cluster. If the finalizer fails to delete the topic, because of a network issue or because the Kafka cluster is unavailable or has been deleted, the kubectl delete command will hang.
Solution: Patch the kafkatopic with the following command to remove the topic resource:
kubectl patch kafkatopic <topic-name> -p '{"metadata":{"finalizers":[]}}' \
--type=merge
Delete Confluent Platform component pods¶
To manually delete Confluent Platform component pods, run the following command for each component.
Warning
This operation will completely bring down the component. Use caution when deleting the Kafka pod as it will trigger data loss.
kubectl delete pod -l platform.confluent.io/type=<cr-type>
<cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter.
If a pod is already in the CrashLoopBackOff state, CFK will not honor the changes until the pod goes back to the Running state. You can add --force --grace-period=0 to the above command to force the deletion.
Block Kubernetes object reconciliation¶
CFK includes multiple controllers. It’s a controller’s job to ensure that, for any given object, the actual state of the world (both the cluster state, and potentially external state like running containers for Kubelet or loadbalancers for a cloud provider) matches the desired state in the object. This process is called reconciling.
Solution:
To block reconciliation (often needed when upgrading), use the following command to add the annotation:
kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile=true
<cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter.
Remove this annotation to stop blocking reconciliation (the trailing - in the command below is kubectl's syntax for removing an annotation). If you do not remove it, subsequent changes to the custom resource are ignored:
kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile-
To force trigger the reconciliation, add the following annotation:
kubectl annotate <cluster-type> <cluster_name> platform.confluent.io/force-reconcile=true
After the reconcile is triggered, this annotation is disabled automatically. This command is useful in situations where an auto-generated certificate has expired and CFK needs to be notified to create a new certificate.
Warning: Operation cannot be fulfilled, object has been modified¶
You can ignore the following warning, as CFK will automatically retry to apply the changes.
Operation cannot be fulfilled xxxx, the object has been modified please apply
your changes to the latest version and try again.
Solution: In most scenarios, these warnings are benign and will go away. If you continue seeing the same type of warning repeatedly, create a support ticket for further investigation.
Issue: I have a StorageClass that does not have the reclaimPolicy set to retain¶
The StorageClass (SC) of the PersistentVolumes that CFK uses must be configured with reclaimPolicy: Retain. If the StorageClass and its PersistentVolumes have already been created, the StorageClass cannot be changed. In this case, you have to patch the PersistentVolumes as shown below.
Solution:
List the PersistentVolumes:
kubectl get pv
Check the PersistentVolume names that CFK uses and their reclaim policy in the output.
Patch the PersistentVolumes that do NOT have the reclaim policy set to Retain:
kubectl -n <namespace> patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy": "Retain"}}'
Verify that the PersistentVolumes have the correct reclaim policy, Retain:
kubectl get pv
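The -p argument to kubectl patch is a plain JSON merge-patch document, so it can be sanity-checked locally before it touches the API server. A small sketch (python3 is assumed to be available):

```shell
# The exact patch body used above; parsing it locally confirms it is
# valid JSON and targets the field we expect.
patch='{"spec":{"persistentVolumeReclaimPolicy": "Retain"}}'
policy=$(printf '%s' "$patch" | python3 -c \
  'import json,sys; print(json.load(sys.stdin)["spec"]["persistentVolumeReclaimPolicy"])')
echo "$policy"
```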
Issue: CFK fails to start up with an error about the ClusterLink CR¶
CFK fails to start up with the following error about the ClusterLink CR:
W0819 02:00:08.878157 1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1beta1.ClusterLink: json: cannot unmarshal string into Go struct field ClusterLinkStatus.items.status.mirrorTopics of type v1beta1.MirrorTopicStatus
E0819 02:00:08.878189 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.ClusterLink: failed to list *v1beta1.ClusterLink: json: cannot unmarshal string into Go struct field ClusterLinkStatus.items.status.mirrorTopics of type v1beta1.MirrorTopicStatus
A breaking change was introduced in CFK 2.4.0 in the ClusterLink status. Take the following steps to fix the issue:
Remove the status from the ClusterLink CRD:
kubectl patch crd clusterlinks.platform.confluent.io --type='json' \
  --patch='[{"op": "replace", "path": "/spec/versions/0/schema/openAPIV3Schema/properties/status/properties", "value": ''}]'
Confirm that the status is empty in the ClusterLink CR and that CFK comes up without errors:
kubectl -n <namespace> get clusterlink <name> -oyaml
The output should look similar to the following:
apiVersion: v1
items:
- apiVersion: platform.confluent.io/v1beta1
  kind: ClusterLink
  spec:
    destinationKafkaCluster:
      kafkaRestClassRef:
        name: destination-kafka-rest
        namespace: destination
    mirrorTopics:
    - name: demo-cl
    sourceKafkaCluster:
      bootstrapEndpoint: kafka.origin.svc.cluster.local:9071
      kafkaRestClassRef:
        name: origin-kafka-rest
        namespace: origin
  status: {}
Apply the latest CRDs again:
kubectl apply -f <CFK home>/confluent-for-kubernetes/crds
Alternatively, you can apply only the ClusterLink CRD:
kubectl apply -f <CFK home>/confluent-for-kubernetes/crds/platform.confluent.io_clusterlinks.yaml
Do a force reconcile of all the ClusterLink CRs to update the status:
kubectl -n <namespace> annotate clusterlink <name> \
  platform.confluent.io/force-reconcile=true
Alternatively, you can restart the CFK pod instead of doing the previous step, but that would reconcile all the Confluent Platform resources.
Now, the ClusterLink CR should have the correct status. For example:
mirrorTopics:
  demo-cl:
    replicationFactor: 3
    sourceTopicName: demo-cl
    status: ACTIVE
numMirrorTopics: 1
sourceKafkaClusterID: I7lBtEB6Qxq5-CaE0b202g
state: CREATED
Issue: TLS-enabled ksqlDB fails to start with a crash loop¶
Currently, ksqlDB does not support using different trust stores for the REST client connections and the connections from ksqlDB to Kafka.
The ignoreTrustStoreConfig setting, used for dependencies on such a component that does not decouple HTTPS and TLS properties, does not work if the ksqlDB REST endpoint has TLS enabled and the Kafka dependency uses TLS.
For example, ksqlDB with TLS enabled, auto-generated certificates, and ignoreTrustStoreConfig set to true will fail with a crash loop. You will see the following error in the ksqlDB log:
sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Workaround: Set ignoreTrustStoreConfig: false in the ksqlDB CR and reapply the CR to restart ksqlDB.
Appendix: Sample support bundle contents¶
The following is a sample list of logs generated for a support bundle.
tar -xzf support-bundle-ns-confluent.tar.gz
ls -al -R support-bundle
support-bundle:
total 32
drwxr-xr-x 9 omar operator 4096 Oct 3 22:54 application-resources
drwxr-xr-x 9 omar operator 4096 Oct 3 22:54 cluster-resources
-rw-r--r-- 1 omar operator 316 Oct 3 20:26 event.txt
-rw-r--r-- 1 omar operator 2279 Oct 3 20:26 event.yaml
-rw-r--r-- 1 omar operator 215 Oct 3 20:26 k8s-version.yaml
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 operator
support-bundle/application-resources:
total 36
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 clusterlink
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 confluentrolebinding
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 connector
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 kafkarestclass
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 kafkatopic
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 schema
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 schemaexporter
support-bundle/application-resources/clusterlink:
total 12
-rw-r--r-- 1 omar operator 2909 Oct 3 20:26 clusterlink.yaml
support-bundle/application-resources/confluentrolebinding:
total 16
-rw-r--r-- 1 omar operator 4602 Oct 3 20:26 rolebinding.yaml
support-bundle/application-resources/connector:
total 16
-rw-r--r-- 1 omar operator 5722 Oct 3 20:26 connector.yaml
support-bundle/application-resources/kafkarestclass:
total 12
-rw-r--r-- 1 omar operator 1919 Oct 3 20:26 kafkarestclass.yaml
support-bundle/application-resources/kafkatopic:
total 12
-rw-r--r-- 1 omar operator 2354 Oct 3 20:26 kafkatopic.yaml
support-bundle/application-resources/schema:
total 12
-rw-r--r-- 1 omar operator 1754 Oct 3 20:26 schema.yaml
support-bundle/application-resources/schemaexporter:
total 12
-rw-r--r-- 1 omar operator 2359 Oct 3 20:26 schema-exporter.yaml
support-bundle/cluster-resources:
total 36
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 connect
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 controlcenter
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 kafka
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 kafkarestproxy
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 ksqldb
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 schemaregistry
drwxr-xr-x 2 omar operator 4096 Oct 3 22:54 zookeeper
support-bundle/cluster-resources/connect:
total 1264
-rw-r--r-- 1 omar operator 23557 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 678410 Oct 3 20:26 connect-0.log
-rw-r--r-- 1 omar operator 548167 Oct 3 20:26 connect-1.log
-rw-r--r-- 1 omar operator 7008 Oct 3 20:26 connect.yaml
-rw-r--r-- 1 omar operator 24311 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/controlcenter:
total 3800
-rw-r--r-- 1 omar operator 21845 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 3821198 Oct 3 20:26 controlcenter-0.log
-rw-r--r-- 1 omar operator 7397 Oct 3 20:26 controlcenter.yaml
-rw-r--r-- 1 omar operator 25448 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/kafka:
total 11012
-rw-r--r-- 1 omar operator 48563 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 1520143 Oct 3 20:26 kafka-0.log
-rw-r--r-- 1 omar operator 2676213 Oct 3 20:26 kafka-1.log
-rw-r--r-- 1 omar operator 6970036 Oct 3 20:26 kafka-2.log
-rw-r--r-- 1 omar operator 13336 Oct 3 20:26 kafka.yaml
-rw-r--r-- 1 omar operator 27204 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/kafkarestproxy:
total 2452
-rw-r--r-- 1 omar operator 17272 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 2447700 Oct 3 20:26 kafkarestproxy-0.log
-rw-r--r-- 1 omar operator 6851 Oct 3 20:26 kafkarestproxy.yaml
-rw-r--r-- 1 omar operator 24267 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/ksqldb:
total 820
-rw-r--r-- 1 omar operator 17491 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 342481 Oct 3 20:26 ksqldb-0.log
-rw-r--r-- 1 omar operator 429878 Oct 3 20:26 ksqldb-1.log
-rw-r--r-- 1 omar operator 6712 Oct 3 20:26 ksqldb.yaml
-rw-r--r-- 1 omar operator 24991 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/schemaregistry:
total 4688
-rw-r--r-- 1 omar operator 17291 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 4732891 Oct 3 20:26 schemaregistry-0.log
-rw-r--r-- 1 omar operator 6718 Oct 3 20:26 schemaregistry.yaml
-rw-r--r-- 1 omar operator 24610 Oct 3 20:26 statefulset.yaml
support-bundle/cluster-resources/zookeeper:
total 504
-rw-r--r-- 1 omar operator 18094 Oct 3 20:26 configmaps.yaml
-rw-r--r-- 1 omar operator 25011 Oct 3 20:26 statefulset.yaml
-rw-r--r-- 1 omar operator 147017 Oct 3 20:26 zookeeper-0.log
-rw-r--r-- 1 omar operator 156360 Oct 3 20:26 zookeeper-1.log
-rw-r--r-- 1 omar operator 139330 Oct 3 20:26 zookeeper-2.log
-rw-r--r-- 1 omar operator 4105 Oct 3 20:26 zookeeper.yaml
support-bundle/operator:
total 1924
-rw-r--r-- 1 omar operator 1947953 Oct 3 20:26 confluent-operator-589cccd45c-kgbct.log
-rw-r--r-- 1 omar operator 8763 Oct 3 20:26 deployment.yaml