Troubleshoot Confluent for Kubernetes¶

This topic describes how to troubleshoot your Confluent Platform deployment: the tools available, some of the issues you might encounter, and how to resolve them.

Support bundle¶

Confluent for Kubernetes (CFK) aggregates diagnostic information, such as Kubernetes events, the Kubernetes version, logs, and the status of Confluent APIs, into a tar.gz file that you can upload to the Confluent Support site.

You can create a support bundle to provide Confluent with all of the required information for debugging.

Use the latest version of the Confluent plugin tool to generate a support bundle to capture the logs and configs for the new features that are available in the new release of CFK.

The CFK support bundles are not currently supported on Microsoft Windows.

The following information is collected in a support bundle:

For the Kubernetes cluster (in the top-level directory):
- The Kubernetes events (kubectl get events)
- The Kubernetes server version (kubectl version)
For each Confluent Platform cluster component (in the /cluster-resources/component directory):
- The config maps generated by CFK for the component
- The pod logs for each pod in the component service (kubectl logs <pod-names>)
- The custom resource object for the component (kubectl get <component_cr_type> <component_name>)
- The StatefulSet object generated by CFK for the component (kubectl get sts <component-sts-name>)
For each Confluent Platform application component (in the /application-resources/component directory):
- The custom resource object for the component (kubectl get <component_cr_type> <component_name>)
For the Confluent for Kubernetes Operator deployment (in the /operator directory):
- The CFK Operator pod logs (kubectl logs confluent-operator)
- The deployment object generated for the CFK Operator (kubectl get deployment confluent-operator -oyaml)

For a sample support bundle, see Sample support bundle contents.

Create a support bundle¶

To create a support bundle:

Install the Confluent plugin as described in Confluent plugin.
Run the following command to create a support bundle:
```
kubectl confluent support-bundle --namespace <namespace>
```
When you have large logs with debug/trace logging enabled, you can capture the log files with the follow-logs-duration option. Specify the duration (in seconds) to follow the pod logs:
```
kubectl confluent support-bundle --namespace <namespace> --follow-logs-duration <duration in seconds to follow the logs>
```

To see other flags you can use to customize your support bundle, run:

kubectl confluent support-bundle -h

Sample support bundle contents¶

Extract the contents of the CFK support bundle using the following command:

tar -xzf support-bundle-ns-confluent.tar.gz

The following is a sample list of the files extracted into the support-bundle-ns-confluent subdirectory of the directory where you ran the tar command:

├── support-bundle-ns-confluent
│   ├── application-resources
│   │   ├── clusterlink
│   │   │   └── democlusterlink.yaml
│   │   ├── confluentrolebinding
│   │   │   ├── internal-connect-0.yaml
│   │   │   ├── internal-connect-1.yaml
│   │   │   ├── internal-connect-2.yaml
│   │   │   ├── internal-controlcenter-0.yaml
│   │   │   ├── internal-kafkarestproxy-0.yaml
│   │   │   ├── internal-kafkarestproxy-1.yaml
│   │   │   ├── internal-ksqldb-0.yaml
│   │   │   ├── internal-ksqldb-1.yaml
│   │   │   ├── internal-ksqldb-2.yaml
│   │   │   ├── internal-schemaregistry-0.yaml
│   │   │   └── internal-schemaregistry-1.yaml
│   │   ├── connector
│   │   │   └── democonnector.yaml
│   │   ├── kafkarestclass
│   │   │   └── default.yaml
│   │   ├── kafkatopic
│   │   │   └── demotopic.yaml
│   │   └── kraftmigrationjob
│   │       └── kraft-migration.yaml
│   ├── cluster-resources
│   │   ├── connect
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── connect.yaml
│   │   │   ├── pdb
│   │   │   │   └── pdb-connect.yaml
│   │   │   ├── pods
│   │   │   │   ├── pod-connect-0.yaml
│   │   │   │   └── pod-connect-1.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   ├── controlcenter
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── controlcenter.yaml
│   │   │   ├── pods
│   │   │   │   └── pod-controlcenter-0.yaml
│   │   │   ├── pv
│   │   │   │   └── pv-controlcenter-0.yaml
│   │   │   ├── pvc
│   │   │   │   └── pvc-controlcenter-0.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   ├── kafka
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── kafka.yaml
│   │   │   ├── logs
│   │   │   │   ├── kafka-0.log
│   │   │   │   ├── kafka-1.log
│   │   │   │   └── kafka-2.log
│   │   │   ├── pdb
│   │   │   │   └── pdb-kafka.yaml
│   │   │   ├── pods
│   │   │   │   ├── pod-kafka-0.yaml
│   │   │   │   ├── pod-kafka-1.yaml
│   │   │   │   └── pod-kafka-2.yaml
│   │   │   ├── pv
│   │   │   │   ├── pv-kafka-0.yaml
│   │   │   │   ├── pv-kafka-1.yaml
│   │   │   │   └── pv-kafka-2.yaml
│   │   │   ├── pvc
│   │   │   │   ├── pvc-kafka-0.yaml
│   │   │   │   ├── pvc-kafka-1.yaml
│   │   │   │   └── pvc-kafka-2.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   ├── kafkarestproxy
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── kafkarestproxy.yaml
│   │   │   ├── pods
│   │   │   │   └── pod-kafkarestproxy-0.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   ├── ksqldb
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── ksqldb.yaml
│   │   │   ├── pdb
│   │   │   │   └── pdb-ksqldb.yaml
│   │   │   ├── pods
│   │   │   │   ├── pod-ksqldb-0.yaml
│   │   │   │   └── pod-ksqldb-1.yaml
│   │   │   ├── pv
│   │   │   │   ├── pv-ksqldb-0.yaml
│   │   │   │   └── pv-ksqldb-1.yaml
│   │   │   ├── pvc
│   │   │   │   ├── pvc-ksqldb-0.yaml
│   │   │   │   └── pvc-ksqldb-1.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   ├── schemaregistry
│   │   │   ├── cm
│   │   │   │   └── configmaps.yaml
│   │   │   ├── pdb
│   │   │   │   └── pdb-schemaregistry.yaml
│   │   │   ├── pods
│   │   │   │   └── pod-schemaregistry-0.yaml
│   │   │   ├── schemaregistry.yaml
│   │   │   ├── sts
│   │   │   │   └── statefulset.yaml
│   │   │   └── svc
│   │   │       └── services.yaml
│   │   └── zookeeper
│   │       ├── cm
│   │       │   └── configmaps.yaml
│   │       ├── logs
│   │       │   ├── zookeeper-0.log
│   │       │   └── zookeeper-1.log
│   │       ├── pdb
│   │       │   └── pdb-zookeeper.yaml
│   │       ├── pods
│   │       │   ├── pod-zookeeper-0.yaml
│   │       │   ├── pod-zookeeper-1.yaml
│   │       │   └── pod-zookeeper-2.yaml
│   │       ├── pv
│   │       │   ├── pv-zookeeper-0.yaml
│   │       │   └── pv-zookeeper-1.yaml
│   │       ├── pvc
│   │       │   ├── pvc-zookeeper-0.yaml
│   │       │   ├── pvc-zookeeper-1.yaml
│   │       │   └── pvc-zookeeper-2.yaml
│   │       ├── sts
│   │       │   └── statefulset.yaml
│   │       ├── svc
│   │       │   └── services.yaml
│   │       └── zookeeper.yaml
│   ├── event.txt
│   ├── event.yaml
│   ├── k8s-version.yaml
│   ├── misc
│   │   ├── kubectl-all-confluent.txt
│   │   ├── kubectl-controller-revisions-confluent.txt
│   │   └── kubectl-endpoints-confluent.txt
│   └── operator
│       ├── deployment.yaml
│       └── logs
│           └── confluent-operator-6f587554c7-abcc7.log
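Before extracting a large bundle, it can help to list what it contains. The following is a minimal sketch; the helper name list_bundle_logs is hypothetical and is not part of the Confluent plugin:

```shell
# List the pod log files inside a support bundle without extracting it.
# list_bundle_logs is an illustrative helper, not part of the Confluent plugin.
list_bundle_logs() {
  # -t lists archive members, -z reads gzip, -f names the archive file.
  tar -tzf "$1" | grep '\.log$' | sort
}
```

For example, list_bundle_logs support-bundle-ns-confluent.tar.gz prints one archive path per collected log file.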

Heap dump¶

When troubleshooting memory-related issues in CFK, such as out-of-memory errors caused by a memory leak, you may need to generate heap dumps.

To capture the heap dumps, you can install the confluent kubectl plugins, and use the heapdump option:

kubectl confluent cluster <cluster_name> heapdump \
   --pod-names <pod names> \
   --out-dir <output directory> \
   --namespace <namespace>

The command takes the following optional flags:

pod-names: The default is all pods in the cluster.
out-dir: The default value is ./.
namespace: The namespace where the Confluent Platform pod/cluster is deployed.

The heap dump output is placed in:

<out-dir>/heapdump-<cluster-name>-all-ns-<namespace>.tar.gz

Example plugin commands:

To generate the heap dump for all Kafka cluster pods:

kubectl confluent cluster kafka heapdump \
  --out-dir $OUT_DIR \
  --namespace $NAMESPACE

To generate the heap dump for the kafka-0 Kafka pod in the Kafka cluster:

kubectl confluent cluster kafka heapdump \
  --pod-names kafka-0 \
  --out-dir $OUT_DIR \
  --namespace $NAMESPACE

For an alternate way to generate the heap dump using jcmd, see the Confluent Support Knowledgebase article (requires login).

Thread dump¶

If troubleshooting issues in CFK, such as slowness, performance, high CPU usage, or hanging issues, you can take thread dumps over time at intervals to analyze the pattern and thread progress.

To capture and debug the thread dumps, you can install the confluent kubectl plugins, and run the threaddump plugin command:

kubectl confluent cluster <cluster_name> threaddump \
   --pod-names <pod names> \
   --interval-seconds <seconds> \
   --run-count <run count> \
   --out-dir <output directory> \
   --namespace <namespace>

The command takes the following optional flags:

pod-names: The default is all pods in the cluster.
interval-seconds: The supported range is from 60 to 300 seconds. The default value is 60.
run-count: The supported range is from 1 to 10. The default value is 1.
out-dir: The default value is ./.
namespace: The namespace where the Confluent Platform pod/cluster is deployed.

The outputs are placed in:

<out-dir>/threaddump-<cluster-name>-all-ns-<namespace>.tar.gz

The following are example usages of the CFK dump plugins.

Thread dump of the CFK operator pod in the confluent namespace:

kubectl confluent operator threaddump --namespace confluent

Thread dump of all pods in a Kafka cluster in operator namespace:
```
kubectl confluent cluster kafka threaddump
```

Thread dump of a specific pod in a Kafka cluster, taken 5 times at 60-second intervals:

kubectl confluent cluster kafka threaddump \
   --pod-names kafka-0 \
   --interval-seconds 60 \
   --run-count 5 \
   --out-dir $OUT_DIR \
   --namespace $NAMESPACE

Thread dump of all Connect pods:

kubectl confluent cluster connect threaddump --namespace confluent

Thread dump of a specific pod in the Connect cluster, taken 3 times at 60-second intervals and written to a custom directory:

kubectl confluent cluster connect threaddump \
  --namespace confluent \
  --out-dir /tmp \
  --interval-seconds 60 \
  --run-count 3 \
  --pod-names connect-0

Thread dump of all Connect pods, taken 3 times at 60-second intervals and written to a custom directory:

kubectl confluent cluster connect threaddump \
  --namespace confluent \
  --out-dir /tmp \
  --interval-seconds 60 \
  --run-count 3

For an alternate way to generate the thread dump using jstack, see the Confluent Support Knowledgebase article (requires login).

Logs¶

Use the command below to view the logs for a currently running pod.

Logs are sent directly to STDOUT for each pod.

kubectl logs <pod-names> -n <namespace>

Additionally, in component custom resources, you can change the log level to DEBUG using the Configuration overrides feature. For example, add the following in your Kafka CR to get more details in the Kafka logs:

spec:
  configOverrides:
    log4j:
      - log4j.rootLogger=DEBUG, stdout

View logs of crashed pods¶

If a container crashed and was replaced, you can retrieve the logs from its previous instance using the --previous (-p) flag to investigate the root cause of the pod crash.

kubectl logs --previous <pod-name>  --namespace <namespace>
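To find which pods have restarted, and are therefore worth inspecting with --previous, you can filter the RESTARTS column of kubectl get pods. The following is a minimal sketch: restarted_pods is a hypothetical helper name, and it assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print the names of pods whose RESTARTS count is greater than zero.
# Reads `kubectl get pods --no-headers` output on stdin.
restarted_pods() {
  awk '$4 + 0 > 0 { print $1 }'
}

# Example usage against a live cluster (namespace is a placeholder):
#   kubectl get pods -n <namespace> --no-headers | restarted_pods |
#     while read -r pod; do
#       kubectl logs --previous "$pod" -n <namespace>
#     done
```

The filter reads text on stdin, so you can also run it against a saved copy of the pod listing.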

Metrics¶

JMX metrics are available on port 7203 of each pod.
Jolokia (a REST interface for JMX metrics) is available on port 7777 of each pod.

Debug¶

Several types of problems can occur while using Confluent for Kubernetes (CFK):

- A problem happens while deploying CFK.
- A problem exists at the infrastructure level.
  Something has gone wrong at the Kubernetes layer.
- A problem exists at the application level.
  The infrastructure is fine but something has gone wrong with Confluent Platform itself. Typically, this is caused by how Confluent Platform components were configured.

To debug deployment problems, run the Helm install command with --set debug="true" to enable verbose output:

helm upgrade --install confluent-operator \
  confluentinc/confluent-for-kubernetes \
  --namespace <namespace> \
  --set debug="true"

Look for Kubernetes issues first, then debug Confluent Platform.

Check for potential Kubernetes errors by entering the following command:
```
kubectl get events -n <namespace>
```
To check for a specific resource issue, enter the following command (this example uses the pods resource type):
```
kubectl describe pods <podname> -n <namespace>
```
If everything looks okay after running the commands above, check the individual pod logs using the following command:
```
kubectl logs <pod name> -n <namespace>
```
Confluent Platform containers are configured so application logs are printed to STDOUT. The logs can be read directly with this command. If there is anything wrong at the application level, such as an invalid configuration, this will be evident in the logs.

If a pod has been replaced because it crashed and you want to check the previous pod’s logs, add --previous to the end of the command above.
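The checks above can be chained into a small triage helper. This is a sketch, not a supported tool: the function name triage_pod is hypothetical, and --field-selector type=Warning is a standard kubectl option used here to narrow the event list to warnings.

```shell
# Run the Kubernetes-first triage steps for one pod, in the order described above.
# Usage: triage_pod <namespace> <pod-name>
triage_pod() {
  ns=$1
  pod=$2
  echo "== Warning events in namespace $ns =="
  kubectl get events -n "$ns" --field-selector type=Warning
  echo "== Description of pod $pod =="
  kubectl describe pods "$pod" -n "$ns"
  echo "== Current logs of pod $pod =="
  kubectl logs "$pod" -n "$ns"
}
```

For example, triage_pod confluent kafka-0 prints the three sections one after another, so you can redirect the output to a single file for review.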

Problems caused by the datacenter infrastructure, such as virtual machine (VM) firewall rules or DNS configuration, should be resolved by your infrastructure system administrator.

Troubleshoot known issues¶

This section describes a few of the potential issues you might encounter while using CFK and presents the steps to troubleshoot the issues.

Note that the examples in this section assume the default namespace you set in Create a namespace for CFK.

Issue: CFK Helm charts are not found¶

When the local cache of the CFK charts is out of date, the helm commands that install or upgrade CFK return a “not found” error.

Solution: Update the latest information about the CFK charts from the Helm repositories:

helm repo update

Issue: Unable to install Kafka on Kubernetes 1.23 or higher with AWS EBS¶

The EBS CSI driver is required to run CFK and Kafka on Kubernetes 1.23 or higher.

Solution: Enable and set up the EBS-CSI add-on as described in Managing the Amazon EBS CSI driver as an Amazon EKS add-on.

Issue: An error returns while applying a CRD during an upgrade¶

As part of a CFK upgrade, you need to upgrade the Confluent Platform custom resource definitions (CRDs) using the following command:

kubectl apply -f confluent-for-kubernetes/crds/

You might get an error message similar to the following from the command:

The CustomResourceDefinition "kafkas.platform.confluent.io" is invalid:
metadata.annotations: Too long: must have at most 262144 bytes make: ***
[install-crds] Error 1

Solution: Run the following command to apply the CRD using server-side apply:

kubectl apply --server-side=true -f <CRD>

If running kubectl apply with the --server-side=true flag returns an error similar to the below:

Apply failed with 1 conflict: conflict with "helm" using
apiextensions.k8s.io/v1: .spec.versions Please review the fields above--they
currently have other managers.

Run kubectl apply with an additional flag, --force-conflicts:

kubectl apply --server-side=true --force-conflicts -f <CRD>

Issue: Confluent Control Center cannot use auto-generated certificates for MDS or Confluent Cloud Schema Registry¶

When TLS is enabled, and when Confluent Control Center uses a different TLS certificate to communicate with MDS or Confluent Cloud Schema Registry, Control Center cannot use an auto-generated TLS certificate to connect to MDS or Confluent Cloud Schema Registry.

Solution: To encrypt Confluent Control Center traffic to MDS or Confluent Cloud Schema Registry, provide custom TLS certificates:

Add to the Confluent Control Center TLS secret a custom truststore that includes the root CA to trust Confluent Cloud Schema Registry or MDS.

Use the above secret name in the Confluent Control Center custom resource (CR):

spec:
  dependencies:
    mds:
      tls:
        secretRef: <custom Root CA>

Disable auto-generated certs and provide the RootCA in Confluent Control Center custom resource:
```
spec:
  tls:
    autoGeneratedCerts: false
    secretRef: <custom Root CA>
```
Deploy or redeploy Confluent Control Center.

Issue: ksqlDB cannot use auto-generated certificates for Confluent Cloud¶

When ksqlDB communicates with Confluent Cloud, ksqlDB cannot use an auto-generated TLS certificate.

Solution: To encrypt ksqlDB traffic to Confluent Cloud, provide custom TLS certificates:

Provide a custom truststore that includes the Let’s Encrypt Root CA.

Disable auto-generated certs and provide the RootCA in ksqlDB custom resource:

spec:
  tls:
    autoGeneratedCerts: false
    secretRef: <custom Root CA>

Deploy or redeploy ksqlDB.

Issue: Error for running Schema Registry with mismatching replication factor¶

The Schema Registry deployed with Confluent for Kubernetes uses the default replication factor of 3 for internal topics. When you deploy Schema Registry and Kafka simultaneously and the Kafka brokers have not fully started, Schema Registry creates topics with a replication factor lower than 3, and later starts failing with an error.

For example, if Kafka has one broker up, and Schema Registry is trying to come up, it will create a topic with replication factor of 1. Subsequently, when all 3 brokers come up, Schema Registry will complain that there are topics that were only created with replication factor 1 and will fail to start.

Solution: Delete the Schema Registry deployment and redeploy it once Kafka is fully up. Alternatively, reconfigure the topics to align with the Schema Registry default replication factor of 3.
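One way to avoid this race is to deploy Schema Registry only after all Kafka broker pods report ready. The following is a minimal sketch that assumes three brokers and the platform.confluent.io/type label used elsewhere in this topic. count_ready_pods is a hypothetical helper, and pod readiness is used here only as a proxy for the brokers being fully up.

```shell
# Count pods that are Running with all containers ready (for example 1/1).
# Reads `kubectl get pods --no-headers` output on stdin.
count_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ } END { print n + 0 }'
}

# Example usage against a live cluster (namespace and CR file are placeholders):
#   ready=$(kubectl get pods -n <namespace> --no-headers \
#     -l platform.confluent.io/type=kafka | count_ready_pods)
#   [ "$ready" -eq 3 ] && kubectl apply -f <schemaregistry CR>
```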

Issue: Errors with container process and directory permissions when deploying Confluent Platform 6.2.x¶

You might see a permission error when starting ZooKeeper and other Confluent Platform 6.2.x components with CFK. For example:

Error: failed to start container "zookeeper": Error response from daemon: OCI
runtime create failed: container_linux.go:367: starting container process
caused: chdir to cwd ("/home/appuser") set in config.json failed: permission
denied: unknown

Confluent Platform 6.2.x maps the default user, appuser, to UID 1000, with the base directory /home/appuser.

Solution: Set the Pod Security Context to match the new Confluent Platform configuration.

Change the podSecurityContext config for all the Confluent Platform component custom resources (CRs):

spec:
  podTemplate:
    podSecurityContext:
      fsGroup: 1000
      runAsUser: 1000
      runAsNonRoot: true

Apply the CR changes for each component:
```
kubectl apply -f <component CR>
```

Issue: Unable to delete Kubernetes resources¶

Kubernetes resources can become impossible to delete, for example, if the CFK pod is deleted before other resources, if CFK can’t delete the resources, or if the namespace is stuck in the Terminating state.

Solution: Remove the finalizer using the following command:

kubectl get <resource> --no-headers | \
  awk '{print $1 }' | \
  xargs kubectl patch <resource> -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

Issue: ConfluentRoleBindings stuck in DELETING¶

The ConfluentRolebindings custom resources (CRs) can be stuck in the DELETING state if the associated Kafka cluster is removed.

Solution: Manually remove the finalizer for those ConfluentRolebindings CRs as shown in the following example:

Check the status of the ConfluentRolebindings custom resources:

kubectl get cfrb

The output should have the DELETING status:

NAME                         STATUS     KAFKACLUSTERID           PRINCIPAL        ROLE
c3-connect-operator-7gffem   DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          SystemAdmin
c3-ksql-operator-7gffem      DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          ResourceOwner
c3-operator-7gffem           DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          ClusterAdmin
c3-sr-operator-7gffem        DELETING   8itASw0_S6qDfdl72b7Uyg   User:c3          SystemAdmin
connect-operator-7gffem-0    DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SystemAdmin
connect-operator-7gffem-1    DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SystemAdmin
internal-connect-0           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     SecurityAdmin
internal-connect-1           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     ResourceOwner
internal-connect-2           DELETING   8itASw0_S6qDfdl72b7Uyg   User:connect     DeveloperWrite

Remove the finalizer of each ConfluentRolebindings CR:

for rb in $(kubectl get cfrb --no-headers | grep "DELETING" | awk '{print $1}'); \
  do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
  --type=merge; \
  done

To remove the finalizer for all the ConfluentRolebindings CRs in a namespace regardless of their status, for example, when the namespace is stuck in the Terminating state and you need to clean up all the ConfluentRolebindings, run the following command:
```
for rb in $(kubectl get cfrb --no-headers -ojsonpath='{.items[*].metadata.name}'); \
  do kubectl patch cfrb $rb -p '{"metadata":{"finalizers":[]}}' \
  --type=merge; \
  done
```

Issue: KafkaTopic is stuck in DELETING/DELETE_REQUESTED state¶

When you delete a KafkaTopic resource (kubectl delete kafkatopic <topic-name>), CFK uses the Kubernetes finalizer feature to remove the topic from the destination Kafka cluster. If the finalizer cannot delete the topic, for example, because of a network issue or because the Kafka cluster is unavailable or has been deleted, the kubectl delete command hangs.

Solution: Patch the KafkaTopic resource with the following command to remove the topic resource:

kubectl patch kafkatopic <topic-name> -p '{"metadata":{"finalizers":[]}}' \
  --type=merge

Delete Confluent Platform component pods¶

To manually delete Confluent Platform component pods, run the following command for each component.

Warning

This operation will completely bring down the component. Use caution when deleting the Kafka pod as it will trigger data loss.

kubectl delete pod -l platform.confluent.io/type=<cr-type>

<cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter

If a pod is already in the CrashLoopBackOff state, CFK does not honor the changes until the pod returns to the Running state. In that case, you can add --force --grace-period=0 to the above command.

Block Kubernetes object reconciliation¶

CFK includes multiple controllers. It’s a controller’s job to ensure that, for any given object, the actual state of the world (both the cluster state, and potentially external state like running containers for Kubelet or loadbalancers for a cloud provider) matches the desired state in the object. This process is called reconciling.

Solution:

To block reconciliation (often needed when upgrading), use the following command to add the annotation:
```
kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile=true
```
<cr-type> is one of the following: kafka, zookeeper, schemaregistry, ksqldb, connect, controlcenter
To stop blocking reconciliation, remove the annotation. If you do not remove it, subsequent changes to the custom resource are ignored:
```
kubectl annotate <cr-type> <cluster_name> platform.confluent.io/block-reconcile-
```
To force trigger the reconciliation, add the following annotation:
```
kubectl annotate <cr-type> <cluster_name> platform.confluent.io/force-reconcile=true
```
After the reconcile is triggered, this annotation is disabled automatically. This command is useful when an auto-generated certificate has expired and CFK needs to be notified to create a new certificate.
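Because a forgotten block-reconcile annotation causes later CR changes to be ignored, it can be convenient to pair the block and unblock steps above in one helper. This is a sketch under stated assumptions: with_reconcile_blocked is a hypothetical function, and --overwrite is the standard kubectl annotate option that lets the command succeed when the annotation already exists.

```shell
# Block reconciliation for one CR, run a command, then always unblock.
# Usage: with_reconcile_blocked <cr-type> <cluster_name> <command> [args...]
with_reconcile_blocked() {
  cr_type=$1
  cr_name=$2
  shift 2
  kubectl annotate "$cr_type" "$cr_name" --overwrite \
    platform.confluent.io/block-reconcile=true || return 1
  # Run the caller's command and remember its exit status.
  "$@"
  status=$?
  # The trailing dash removes the annotation so CFK resumes reconciling.
  kubectl annotate "$cr_type" "$cr_name" platform.confluent.io/block-reconcile-
  return "$status"
}
```

The helper returns the exit status of the wrapped command, so a failed step is still reported after the annotation has been removed.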

Warning: Operation cannot be fulfilled, object has been modified¶

You can ignore the following warning, as CFK automatically retries applying the changes.

Operation cannot be fulfilled xxxx, the object has been modified please apply
your changes to the latest version and try again.

Solution: In most scenarios, these are benign and will go away. If you continue seeing the same type of Warning repeatedly, create a support ticket for further investigation.

Issue: I have a StorageClass that does not have the reclaimPolicy set to retain¶

The StorageClass (SC) of the PersistentVolumes that CFK uses must be configured with reclaimPolicy: Retain. If the StorageClass and its PersistentVolumes have already been created, the StorageClass cannot be changed. In this case, you have to patch the PersistentVolumes as shown below.

Solution:

List the PersistentVolumes:
```
kubectl get pv
```
Check the PersistentVolume names that CFK uses and their reclaim policy in the output.
Change the PersistentVolumes that do NOT have the reclaim policy set to Retain. Patch the PersistentVolumes and set their reclaim policy to Retain.
```
kubectl -n <namespace> patch pv <pv-name> \
  -p '{"spec":{"persistentVolumeReclaimPolicy": "Retain"}}'
```
Verify that the PersistentVolume has the correct reclaim policy, Retain.
```
kubectl get pv
```
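To pick out the PersistentVolumes that still need patching, you can print only the name and reclaim policy and filter on the policy. The following is a minimal sketch: non_retain_pvs is a hypothetical helper, and the custom-columns output format is a standard kubectl option. The listing covers every PersistentVolume in the cluster, so you still need to select the ones that CFK uses.

```shell
# Print the names of PersistentVolumes whose reclaim policy is not Retain.
# Reads two whitespace-separated columns on stdin: <pv-name> <reclaim-policy>.
non_retain_pvs() {
  awk '$2 != "Retain" { print $1 }'
}

# Example usage against a live cluster:
#   kubectl get pv --no-headers \
#     -o custom-columns=NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy |
#     non_retain_pvs
```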

Issue: CFK fails to start up with an error about the ClusterLink CR¶

CFK fails to start up with the following error about the ClusterLink CR:

W0819 02:00:08.878157       1 reflector.go:324] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: failed to list *v1beta1.ClusterLink: json: cannot unmarshal string into Go struct field ClusterLinkStatus.items.status.mirrorTopics of type v1beta1.MirrorTopicStatus
E0819 02:00:08.878189       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.24.2/tools/cache/reflector.go:167: Failed to watch *v1beta1.ClusterLink: failed to list *v1beta1.ClusterLink: json: cannot unmarshal string into Go struct field ClusterLinkStatus.items.status.mirrorTopics of type v1beta1.MirrorTopicStatus

A breaking change was introduced in CFK 2.4.0 in the ClusterLink status. Take the following steps to fix the issue:

Remove the status from ClusterLink CRD:

kubectl patch crd clusterlinks.platform.confluent.io --type='json' \
  --patch='[{"op": "replace", "path": "/spec/versions/0/schema/openAPIV3Schema/properties/status/properties", "value": {}}]'

Confirm that the status is empty in the ClusterLink CR and that CFK comes up without errors:

kubectl -n <namespace> get clusterlink <name> -oyaml

The output should look similar to the following:

apiVersion: v1
items:
- apiVersion: platform.confluent.io/v1beta1
  kind: ClusterLink
  spec:
    destinationKafkaCluster:
      kafkaRestClassRef:
        name: destination-kafka-rest
        namespace: destination
    mirrorTopics:
    - name: demo-cl
    sourceKafkaCluster:
      bootstrapEndpoint: kafka.origin.svc.cluster.local:9071
      kafkaRestClassRef:
        name: origin-kafka-rest
        namespace: origin
  status: {}

Apply the latest CRDs again.

kubectl apply -f <CFK home>/confluent-for-kubernetes/crds

Alternatively, you can apply only the ClusterLink CRD:

kubectl apply -f <CFK home>/confluent-for-kubernetes/crds/platform.confluent.io_clusterlinks.yaml

Do a force reconcile of all the ClusterLink CRs to update the status:
```
kubectl -n <namespace> annotate clusterlink <name> \
  platform.confluent.io/force-reconcile=true
```
Alternatively, you can restart the CFK pod instead of doing the previous step, but that would reconcile all the Confluent Platform resources.

Now, the ClusterLink CR should have the correct status. For example:

mirrorTopics:
  demo-cl:
    replicationFactor: 3
    sourceTopicName: demo-cl
    status: ACTIVE
numMirrorTopics: 1
sourceKafkaClusterID: I7lBtEB6Qxq5-CaE0b202g
state: CREATED

Issue: TLS-enabled ksqlDB fails to start with a crash loop¶

Currently, ksqlDB does not support using different trust stores for the REST client connections and the connections from ksqlDB to Kafka.

For a component such as ksqlDB that does not decouple the HTTPS and TLS properties, setting ignoreTrustStoreConfig on its dependencies does not work if the ksqlDB REST endpoint has TLS enabled and the Kafka dependency uses TLS.

For example, a ksqlDB deployment with TLS enabled, auto-generated certificates, and ignoreTrustStoreConfig set to true fails with a crash loop. You will see the following error in the ksqlDB log:

sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

Workaround: Set ignoreTrustStoreConfig: false in the ksqlDB CR and reapply the CR to restart ksqlDB.

Issue: Unable to connect to identity provider with self-signed certificates when using OAuth/OIDC authentication for Kafka¶

The current set of SSL properties in Kafka listeners does not let you connect to your identity provider with self-signed certificates. You will see the following error in the Kafka pod:

[2024-04-17 13:33:53,712] ERROR Exiting Kafka due to fatal exception during startup. (kafka.Kafka$)
org.apache.kafka.common.KafkaException: org.apache.kafka.common.KafkaException:
The OAuth validator configuration encountered an error when initializing the VerificationKeyResolver

Workaround: Use one of the following options.

Specify a custom trust store through JVM arguments in the spec.configOverrides.jvm section of the Kafka custom resource:

kind: Kafka
spec:
  configOverrides:
    jvm:
      - "-Djavax.net.ssl.trustStoreType=JKS"
      - "-Djavax.net.ssl.trustStore=/mnt/jvmtruststore/truststore.jks"
      - "-Djavax.net.ssl.trustStorePassword=mystorepassword"

Add the trust store to the Kafka listeners as JAAS config, using the parameter unsecuredLoginStringClaim_sub="thePrincipalName".

kind: Kafka
spec:
  configOverrides:
    server:
      - listener.name.controller.oauthbearer.sasl.jaas.config=org.apache.kafka.common.security.oauthbearer.OAuthBearerLoginModule required clientId="${file:/mnt/secrets/oauth-jass/oauth.txt:clientId}" clientSecret="${file:/mnt/secrets/oauth-jass/oauth.txt:clientSecret}" refresh_ms="3000" ssl.truststore.location="/mnt/sslcerts/truststore.jks" ssl.truststore.password="mystorepassword" unsecuredLoginStringClaim_sub="thePrincipalName";

Alternatively, you can use a secret to pass the JAAS config passthrough information as described in Create server-side SASL/PLAIN credentials using JAAS config pass-through for Kafka.

Issue: Kafka cluster ID mismatch between Kafka/KRaft CR status and the cluster¶

After a configuration change, the cluster ID shown in the Kafka and KRaft CR status: and in the CR spec: do not match.

Workaround:

The recommended workaround is to upgrade to CFK release 2.11.0, 2.10.1, or 2.9.5, which fixes the issue automatically.

If the upgrade is not possible, take the following steps to patch the status objects of KRaft and Kafka.

Verify if you have mismatching cluster IDs.
1. Fetch the clusterID value from the Kafka status:
```
kubectl get kafka <kafka-cr-name> -n <namespace> -oyaml | grep "clusterID"
```
2. Fetch the correct cluster ID from the Kafka pod. For example:
  1. Exec into one of the Kafka pods.
  2. Fetch the cluster ID from meta.properties:
    cat /mnt/data/data0/logs/meta.properties
  3. Retrieve the cluster.id value from the output:
    #Wed Mar 05 12:55:23 GMT 2025
    node.id=1
    directory.id=pMP24m7YPMhy_F0e9ag_Wg
    version=1
    cluster.id=f66a6843-54f1-4af8-b3Q

If the cluster ID from the Kafka status is different from the cluster ID in meta.properties, use the correct cluster ID in meta.properties to set the cluster ID in the status.
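The comparison can be scripted by extracting the cluster.id line from meta.properties. The following is a minimal sketch: cluster_id_from_props is a hypothetical helper, and the file path is the one shown in the steps above.

```shell
# Print the value of cluster.id from meta.properties content read on stdin.
cluster_id_from_props() {
  sed -n 's/^cluster\.id=//p'
}

# Example usage against a live cluster (names are placeholders):
#   actual=$(kubectl exec <kafka-pod-name> -n <namespace> -- \
#     cat /mnt/data/data0/logs/meta.properties | cluster_id_from_props)
#   kubectl get kafka <kafka-cr-name> -n <namespace> -oyaml | grep "clusterID"
#   echo "cluster.id on disk: $actual"
```

If the two values differ, the value read from meta.properties is the one to use in the status patch.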

Apply the block reconcile annotation:

kubectl annotate kraftcontroller <kraftcontroller-cr-name> \
  -n <namespace> platform.confluent.io/block-reconcile=true

kubectl annotate kafka <kafka-cr-name> \
  -n <namespace> platform.confluent.io/block-reconcile=true

Patch the status field:

kubectl patch kraftcontroller <kraftcontroller-cr-name> \
  -n <namespace> --type=merge --subresource status \
  --patch 'status: {clusterID: <Correct-cluster-id-here>}'

kubectl patch kafka  <kafka-cr-name> \
  -n <namespace> --type=merge --subresource status \
  --patch 'status: {clusterID: <Correct-cluster-id-here>}'

Remove the block reconcile:

kubectl annotate kraftcontroller <kraftcontroller-cr-name>  \
  -n <namespace> platform.confluent.io/block-reconcile-

kubectl annotate kafka  <kafka-cr-name> \
  -n <namespace> platform.confluent.io/block-reconcile-

Fixed Versions: This issue is fixed in CFK 2.11.0, 2.10.1, and 2.9.5.

Issue: Kafka brokers fail to start when configuring OAuth/OIDC authentication¶

OAuth listeners require unsecuredLoginStringClaim_sub in the SASL JAAS configuration to communicate with the token endpoint. CFK automatically creates the setting only for the token listener.