Tableflow Quick Start with Delta Lake Tables in Confluent Cloud

Confluent Tableflow enables you to expose Apache Kafka® topics as Delta Lake tables.

This quick start guides you through the steps to get up and running with Confluent Tableflow for Delta Lake tables.

Note

Tableflow for Delta Lake tables is available for preview.

A Preview feature is a Confluent Cloud component that is being introduced to gain early feedback from developers. Preview features can be used for evaluation and non-production testing purposes or to provide feedback to Confluent. The warranty, SLA, and Support Services provisions of your agreement with Confluent do not apply to Preview features. Confluent may discontinue providing preview releases of the Preview features at any time in Confluent’s sole discretion.

In this quick start, you perform the following steps:

  • Step 1: Create a topic and publish data
  • Step 2: Configure your S3 bucket and provider integration
  • Step 3: Enable Tableflow on your topic
  • Step 4: Query Delta Lake tables from Databricks

Prerequisites

  • DeveloperWrite access on all schema subjects
  • CloudClusterAdmin access on your Kafka cluster
  • Assigner access on all provider integrations
  • Access to a Databricks workspace

For more information, see Grant Role-Based Access for Tableflow in Confluent Cloud.

Step 1: Create a topic and publish data

In this step, you create a stock-trades topic by using the Confluent Cloud Console. Click Add topic, provide the topic name, and create it with the default settings. You can skip defining a data contract.

Publish data to the stock-trades topic by using the Datagen Source connector with the Stock Trades data set. When you configure the Datagen connector, click Additional configuration and proceed through the provisioning workflow. When you reach the Configuration step, under Select output record value format, select Avro. Click Continue and keep the default settings. For more information, see Datagen Source Connector Quick Start.
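
If you prefer to configure the connector programmatically rather than through the Console wizard, the following sketch shows a fully managed Datagen Source configuration that roughly matches the choices above. It is illustrative only: the connector name and the API key placeholders are hypothetical, and the exact property set may differ in your environment.

    {
      "name": "DatagenSourceStockTrades",
      "connector.class": "DatagenSource",
      "kafka.auth.mode": "KAFKA_API_KEY",
      "kafka.api.key": "<kafka-api-key>",
      "kafka.api.secret": "<kafka-api-secret>",
      "kafka.topic": "stock-trades",
      "output.data.format": "AVRO",
      "quickstart": "STOCK_TRADES",
      "tasks.max": "1"
    }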

Step 2: Configure your S3 bucket and provider integration

Before you materialize your Kafka topic as a table, you must configure the storage bucket where the materialized tables are stored.

To access your S3 bucket and write materialized data into it, you must create a Confluent Cloud provider integration.

  1. In the AWS Management Console, create an S3 bucket in your preferred AWS account. For this guide, name the bucket tableflow-quickstart-storage.

  2. In your Confluent Cloud environment, navigate to the Provider Integrations tab to create a provider integration and grant Confluent Cloud access to your S3 bucket.

  3. Click Add Integration.

  4. In the Configure role in AWS section, select the New Role option.

  5. Select Tableflow S3 Bucket and copy the IAM Policy template.

    Screenshot of AWS permission policy assignment for a Confluent Tableflow S3 bucket
  6. In the AWS console, navigate to IAM.

  7. In the Access Management section, click Policies, and in the Policies page, click Create Policy.

    As a best practice, you should create a designated IAM policy that grants Confluent Cloud access to your S3 location.

  8. Paste the IAM policy template you obtained previously. Update it with the name of your S3 bucket, for example, tableflow-quickstart-storage, and create a new AWS IAM policy. A sketch of what the resulting policy typically looks like appears after this procedure.

    Screenshot of the AWS permission policy editor for a Confluent Tableflow S3 bucket
  9. Navigate to AWS IAM Roles and click Create Role.

  10. For the Trusted entity type, select Custom trust policy.

  11. From the Tableflow UI in Cloud Console, copy the contents of the Trust-policy.json file and paste them into the custom trust policy editor in the AWS console.

  12. Attach the permission policy you created previously and save your new IAM role, for example, tableflow-quickstart-role.

  13. Copy the role ARN, for example, arn:aws:iam::<xxx>:role/tableflow-quickstart-role.

  14. In the Cloud Console, locate the Map the role in Confluent section in the Provider Integration page. In the AWS ARN section, paste the ARN you copied previously and click Continue.

    Screenshot showing how to map the ARN of an AWS IAM role in Confluent provider integration
  15. After creating the provider integration, update the trust policy of the AWS IAM role (tableflow-quickstart-role) with the trust policy displayed in Cloud Console. A sketch of the trust policy’s general shape also appears after this procedure.

    Screenshot showing the trust policy for a Confluent IAM role
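
For reference, the following sketches show the general shape of the two policies used in this procedure. They are illustrative only: the authoritative templates are the ones displayed in Cloud Console, and the Confluent principal and external ID shown here are placeholders for the values from your own provider integration.

A permissions policy scoped to the tableflow-quickstart-storage bucket (step 8) typically grants bucket-level listing and object-level read/write access:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "TableflowBucketAccess",
          "Effect": "Allow",
          "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
          "Resource": "arn:aws:s3:::tableflow-quickstart-storage"
        },
        {
          "Sid": "TableflowObjectAccess",
          "Effect": "Allow",
          "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
          "Resource": "arn:aws:s3:::tableflow-quickstart-storage/*"
        }
      ]
    }

The trust policy (steps 11 and 15) allows a Confluent-owned principal to assume your role, scoped by an external ID:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "<Confluent principal ARN from Cloud Console>"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<external ID from Cloud Console>"
            }
          }
        }
      ]
    }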

Step 3: Enable Tableflow on your topic

With the provider integration configured, you can enable Tableflow on your Kafka topic to materialize it as a table in the storage bucket that you created in Step 2.

  1. Navigate to your stock-trades topic and click Enable Tableflow.

  2. In the Enable Tableflow dialog, select Delta as the table format.

    Enable Tableflow dialog in Confluent Cloud Console
  3. Click Configure custom storage.

  4. In the Choose where to store your Tableflow data section, click Store in your bucket.

  5. In the Provider integration dropdown, select the provider integration that you created in Step 2. Provide the name of the storage bucket that you created, which in this guide is tableflow-quickstart-storage.

  6. Click Continue to review the configuration and launch Tableflow.

    Materializing a newly created topic as a table can take a few minutes.

  7. After enabling Tableflow, in the Monitor section, copy the storage location of the table. You use this storage location in Step 4 to create an external table in Databricks.

    Tableflow storage location in Cloud Console

For low-throughput topics in which Kafka segments have not filled, Tableflow optimistically tries to publish data every 15 minutes. This is best effort and not guaranteed.

Step 4: Query Delta Lake tables from Databricks

In Databricks, you can query the Delta Lake tables materialized by Tableflow as external tables.

Tableflow currently doesn’t support catalog integration with Databricks Unity Catalog, so you must consume Delta Lake tables as storage-backed external tables.

Create an External Location in Databricks

To create an external table in Databricks, you must first create an External Location.

  1. Log in to the Databricks workspace that you use to query Delta Lake tables.

  2. Click Catalog to open Catalog Explorer.

  3. On the Quick access page, click External data, and on the External locations tab, click Create external location.

  4. In the Create a new external location dialog, select Manual and click Next.

  5. Provide a name for the external location and the URL of the S3 bucket where your Delta Lake tables are stored, and create a new storage credential that Databricks uses to access the bucket.

    Screenshot of the new external location dialog in Databricks

Create and query an external Delta Lake table

After you create the external location, you can create and query external tables that reference it.

  1. In the Databricks workspace, navigate to the SQL Editor and select the catalog that you want to use.

  2. Obtain the S3 storage path that you copied in Step 3 and run a CREATE TABLE query to create a corresponding external Delta Lake table. Replace the example LOCATION URI with your table’s storage location.

    CREATE TABLE my_uc.default.ext_stockquotes
       USING DELTA
       LOCATION 's3://tableflow-quickstart-storage/1001010/11101100/ab1234c5-12a3-45bc-a67d-ab1c2345d6ef/a-12345/<cluster-id>/v1/123a456d-1a23-4567-abc0-a12b3c45678c/';
    
  3. Run the following query to see the records in your Delta Lake table.

    SELECT * FROM my_uc.default.ext_stockquotes;
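
Because the external table is stored in Delta format, you can also inspect its transaction history to confirm that Tableflow is publishing new snapshots. The query below assumes the my_uc.default.ext_stockquotes table from the previous step.

    DESCRIBE HISTORY my_uc.default.ext_stockquotes;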