Troubleshoot Key Policy Issues for Self-Managed Encryption Keys

This guide helps you identify, diagnose, and resolve key policy issues that can affect your Confluent Cloud clusters using self-managed encryption keys (BYOK). Key policy misconfigurations can cause cluster creation failures, service disruptions, or access issues.

Common Key Policy Issues

The following table describes common key policy issues and their symptoms:

Issue: Cluster creation fails
Symptoms: Error message during cluster creation: “encryption key is not valid or not authorized”
Typical causes: Missing or incorrect Confluent permissions in key policy

Issue: Cluster stops responding
Symptoms: Cluster becomes unavailable; producers and consumers fail
Typical causes: Key policy was modified to remove Confluent access

Issue: Key rotation fails
Symptoms: Automatic key rotation does not work; error messages appear in cloud provider logs
Typical causes: Insufficient permissions for key management operations

Issue: Access denied errors
Symptoms: “AccessDenied” or “UnauthorizedOperation” errors in cluster logs
Typical causes: Expired credentials, incorrect ARNs, or revoked permissions

Diagnose Key Policy Issues

Follow these steps to diagnose key policy issues:

Step 1: Check cluster status

  1. Log in to the Confluent Cloud Console at https://confluent.cloud/.

  2. Navigate to your cluster and check the cluster status and health indicators.

  3. Look for any error messages or warnings related to encryption or key access.
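
If you prefer the command line, the same status information is available from the Confluent CLI; a minimal sketch, assuming the CLI is installed and authenticated, and using placeholder environment and cluster IDs:

# List clusters in the environment, then show details and status for the affected cluster.
confluent kafka cluster list --environment <env-id>
confluent kafka cluster describe <cluster-id> --environment <env-id>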

Step 2: Review cloud provider logs

Check your cloud provider’s logs for key-related errors:

Use AWS CloudTrail to review recent KMS API events:

# macOS/BSD (BSD date):
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=kms.amazonaws.com \
  --start-time "$(date -v-1H +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date +%Y-%m-%dT%H:%M:%S)"

# Linux (GNU date):
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=kms.amazonaws.com \
  --start-time "$(date -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date +%Y-%m-%dT%H:%M:%S)"

Look for AccessDenied, InvalidKeyUsage, or KMSAccessDenied errors. For details, see Logging AWS KMS API calls with AWS CloudTrail.
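
To surface only the failed calls, you can filter the same output for events that carry an error code; a sketch that assumes the jq tool is available (GNU date syntax shown):

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=kms.amazonaws.com \
  --start-time "$(date -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date +%Y-%m-%dT%H:%M:%S)" \
  --query 'Events[].CloudTrailEvent' --output json \
  | jq -r '.[] | fromjson | select(.errorCode != null)
           | "\(.eventTime) \(.eventName) \(.errorCode)"'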

Use Azure Activity Log to review Key Vault access attempts:

# Linux (GNU date). On macOS/BSD, use: --start-time $(date -v -1H -u +%Y-%m-%dT%H:%M:%SZ)
az monitor activity-log list \
  --resource-group <your-resource-group> \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --query "[?contains(resourceId, 'Microsoft.KeyVault')]"

Look for Forbidden, Unauthorized, or KeyVaultAccessDenied errors. For detailed monitoring guidance, see Monitoring and alerting for Azure Key Vault.
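
If the listing is long, you can narrow it to operations that did not succeed; a sketch that assumes activity log entries expose a status.value field (GNU date syntax shown):

az monitor activity-log list \
  --resource-group <your-resource-group> \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --query "[?contains(resourceId, 'Microsoft.KeyVault') && status.value == 'Failed']"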

Use Google Cloud Logging to review KMS access attempts:

gcloud logging read \
  "logName=projects/<project>/logs/cloudaudit.googleapis.com%2Factivity AND \
   protoPayload.serviceName=cloudkms.googleapis.com AND \
   protoPayload.resourceName=projects/<project>/locations/<location>/keyRings/<keyring>/cryptoKeys/<key>" \
  --freshness=1h

Look for PERMISSION_DENIED, INVALID_ARGUMENT, or FAILED_PRECONDITION errors. For detailed logging guidance, see Viewing Cloud KMS logs.
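
To focus on denied requests only, you can filter on the audit entry's status code; a sketch that assumes PERMISSION_DENIED is recorded as gRPC status code 7 in the audit payload:

gcloud logging read \
  "protoPayload.serviceName=cloudkms.googleapis.com AND \
   protoPayload.status.code=7" \
  --freshness=1h --limit=20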

Step 3: Validate key policy configuration

Use the validation checklist in Best Practices for Using Self-Managed Encryption Keys to verify your key policy configuration is correct for your cloud provider.
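
Before working through the checklist, it can help to confirm the key itself is healthy. For example, on AWS, a minimal sketch using a placeholder key ID:

# Confirm the key exists, is enabled, and is a customer-managed key.
aws kms describe-key --key-id <your-key-id> \
  --query 'KeyMetadata.{KeyState:KeyState,Enabled:Enabled,KeyManager:KeyManager}'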

Resolve Key Policy Issues

Fix missing or incorrect permissions

If your cluster creation fails or stops working due to missing permissions:

On AWS, review and update your KMS key policy to include the required Confluent permissions. You can view the current policy using the AWS CLI:

aws kms get-key-policy \
  --key-id <your-key-id> \
  --policy-name default

Ensure both required permission statements are present:

{
  "Sid": "Allow Confluent account(s) to use the key",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::<confluent-account-id>:role/<confluent-role>"
    ]
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*"
},
{
  "Sid": "Allow Confluent account(s) to attach persistent resources",
  "Effect": "Allow",
  "Principal": {
    "AWS": [
      "arn:aws:iam::<confluent-account-id>:role/<confluent-role>"
    ]
  },
  "Action": [
    "kms:CreateGrant",
    "kms:ListGrants",
    "kms:RevokeGrant"
  ],
  "Resource": "*"
}

If statements are missing or ARNs are incorrect, update your key policy using the AWS CLI:

aws kms put-key-policy \
  --key-id <your-key-id> \
  --policy-name default \
  --policy file://updated-policy.json

For detailed steps on modifying key policies, see Changing a key policy in the AWS documentation.

On Azure, verify the Confluent service principal has the required role assignments using the Azure CLI:

az role assignment list \
  --assignee <confluent-service-principal-id> \
  --resource-group <your-resource-group>

The service principal should have these roles:

  • Key Vault Crypto Service Encryption User

  • Key Vault Reader

If roles are missing, add them using:

az role assignment create \
  --assignee <confluent-service-principal-id> \
  --role "Key Vault Crypto Service Encryption User" \
  --scope <key-vault-resource-id>

For detailed steps on managing Key Vault access, see Provide access to Key Vault keys, certificates, and secrets with an Azure role-based access control.

On Google Cloud, verify the Confluent Google Group has the required permissions using the Google Cloud CLI:

gcloud kms keys get-iam-policy <key-name> \
  --location=<location> \
  --keyring=<keyring>

If the Confluent Google Group ID is missing, add it with the custom role:

gcloud kms keys add-iam-policy-binding <key-name> \
  --location=<location> \
  --keyring=<keyring> \
  --member="group:<confluent-google-group-id>" \
  --role="projects/<project>/roles/<custom-role-name>"

The custom role should include:

  • cloudkms.cryptoKeyVersions.useToDecrypt

  • cloudkms.cryptoKeyVersions.useToEncrypt

  • cloudkms.cryptoKeys.get

For detailed steps on managing KMS permissions, see Using IAM with Cloud KMS.
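
To confirm the custom role actually includes these permissions, you can describe it with the Google Cloud CLI; a minimal sketch using placeholder project and role names:

gcloud iam roles describe <custom-role-name> \
  --project=<project> \
  --format="value(includedPermissions)"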

Recover from key policy lockout

If you accidentally removed Confluent’s access and your cluster is no longer working:

Warning

Key policy lockouts can cause immediate cluster unavailability. Act quickly to restore access and minimize service disruption.

  1. Immediately restore the previous working key policy from your backup (for AWS, see the sketch after this list).

  2. If you don’t have a backup, recreate the required permissions using the examples in the previous section.

  3. Monitor cluster health after restoring permissions. It may take several minutes for the cluster to recover.

  4. Contact Confluent Support if the cluster doesn’t recover within 30 minutes of restoring permissions.
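
On AWS, keeping an exported copy of the working policy makes recovery a single call; a minimal sketch, with the backup file name as a placeholder:

# Save the current (working) key policy before any change:
aws kms get-key-policy --key-id <your-key-id> --policy-name default \
  --query Policy --output text > key-policy-backup.json

# Restore the saved policy after a lockout:
aws kms put-key-policy --key-id <your-key-id> --policy-name default \
  --policy file://key-policy-backup.json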

Fix expired or rotated credentials

If Confluent’s access credentials have expired or been rotated:

For AWS:

  1. Check if the IAM role referenced in your key policy still exists.

  2. Verify the role’s trust policy allows Confluent to assume it.

  3. If the role was deleted, contact Confluent Support for the current role information and update your key policy.
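
For steps 1 and 2, assuming the key policy references a role in your own account (the role name is a placeholder), you can retrieve the role and its trust policy in one call:

aws iam get-role --role-name <confluent-role> \
  --query 'Role.{Arn:Arn,TrustPolicy:AssumeRolePolicyDocument}'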

For Azure:

  1. Verify the Confluent service principal still exists in your Azure AD.

  2. Check if the service principal’s credentials have expired.

  3. If expired, contact Confluent Support for updated service principal information.
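
For steps 1 and 2, a minimal sketch using the service principal ID as a placeholder:

# Confirm the service principal still exists, then list its credentials and their expiry dates.
az ad sp show --id <confluent-service-principal-id>
az ad sp credential list --id <confluent-service-principal-id>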

For Google Cloud:

  1. Verify the Google Group ID is still valid and active.

  2. Check if the group membership or permissions have changed.

  3. If the group ID has changed, contact Confluent Support for the updated information.

Prevent Future Key Policy Issues

To prevent key policy issues in the future:

  • Implement change management processes for key policy modifications.

  • Use infrastructure as code (Terraform, CloudFormation, ARM templates) to manage key policies consistently.

  • Set up monitoring and alerting for key access failures (see the AWS sketch after this list).

  • Regularly review and audit key policies and permissions.

  • Test disaster recovery procedures in non-production environments.

  • Keep documentation updated with current Confluent account and role information.
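
As one way to implement the monitoring and alerting point above on AWS, you can turn denied KMS calls into a CloudWatch alarm; a sketch that assumes your CloudTrail trail delivers events to a CloudWatch Logs group, with the log group, metric namespace, and SNS topic as placeholders:

# Turn CloudTrail KMS access-denied events into a custom CloudWatch metric...
aws logs put-metric-filter \
  --log-group-name <cloudtrail-log-group> \
  --filter-name kms-access-denied \
  --filter-pattern '{ ($.eventSource = "kms.amazonaws.com") && ($.errorCode = "*AccessDenied*") }' \
  --metric-transformations metricName=KmsAccessDenied,metricNamespace=<namespace>,metricValue=1

# ...and raise an alarm when any such event occurs in a 5-minute window.
aws cloudwatch put-metric-alarm \
  --alarm-name kms-access-denied \
  --namespace <namespace> \
  --metric-name KmsAccessDenied \
  --statistic Sum --period 300 --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --alarm-actions <sns-topic-arn>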

Get Additional Help

If you continue to experience key policy issues:

  1. Collect diagnostic information:

    • Cluster ID and environment details

    • Error messages from the Confluent Cloud Console

    • Cloud provider error logs

    • Current key policy configuration

  2. Contact Confluent Support with the diagnostic information.

  3. For urgent production issues, use the emergency contact procedures provided in your support agreement.

Related documentation: