Storage with Tableflow in Confluent Cloud

Apache Iceberg™ tables managed by Tableflow are stored in Confluent Managed Storage or in a custom storage provider, which is referred to as “external object storage”, “Bring Your Own Storage”, or “Bring Your Own Bucket”.

  • Amazon S3 is the only custom storage provider supported.

Confluent Managed Storage

Tableflow can store your Iceberg tables in Confluent Managed Storage, Confluent’s “batteries included” storage option for Tableflow. No additional configuration is required to use Confluent Managed Storage with Tableflow. Access to Confluent Managed Storage and your Tableflow-enabled Iceberg tables is controlled by your Confluent Cloud Access Controls.

To access tables stored in Confluent Managed Storage, you must use a query engine that supports the Iceberg REST API and vended credentials. Apache Spark® and Trino are query engines that work with Confluent Managed Storage.
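As a minimal sketch of what "Iceberg REST API and vended credentials" means for a Spark client, the helper below assembles the catalog settings that Apache Iceberg's Spark integration reads. The property keys are standard Iceberg catalog properties; the catalog name, the endpoint URL, and the `key:secret` credential format are assumptions made for illustration, so check the actual endpoint and credential format for your environment before using it.

```python
def rest_catalog_conf(catalog, uri, api_key, api_secret):
    """Build Spark settings for an Iceberg REST catalog with vended credentials.

    `catalog`, `uri`, `api_key`, and `api_secret` are placeholders supplied
    by the caller; nothing here is specific to a real endpoint.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Register the catalog with Iceberg's Spark catalog implementation.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Talk to the catalog over the Iceberg REST API.
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        # Assumed credential format: "<key>:<secret>".
        f"{prefix}.credential": f"{api_key}:{api_secret}",
        # Ask the catalog to vend temporary storage credentials.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        # Read data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    conf = rest_catalog_conf(
        "tableflow", "https://example.invalid/iceberg", "KEY", "SECRET"
    )
    for key, value in conf.items():
        print(f"{key}={value}")
```

Each key/value pair can be passed to a Spark session builder with `.config(key, value)`, or written to `spark-defaults.conf`.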

Tableflow tables that use Confluent Managed Storage are not compatible with external catalogs, like AWS Glue.

Amazon S3

Tableflow requires a provider integration, configured in the environment that contains the topic you’re enabling Tableflow on. Multiple Tableflow-enabled topics can use the same S3 bucket and provider integration. Tableflow supports only S3 Standard Storage Class buckets.

To access tables stored in Amazon S3, you must use an environment or resource that is authenticated with an AWS IAM role that has GetObject permissions for the objects stored in the S3 bucket where your table is stored.
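The read permission described above could be expressed as an IAM policy document along these lines. This is a sketch that grants only the `s3:GetObject` action named in the text; the bucket name is a hypothetical placeholder, and your query engine or organization may call for a different or narrower policy.

```python
import json


def read_only_policy(bucket):
    """Return an IAM policy document allowing GetObject on every object in `bucket`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                # Object-level ARN: the trailing /* matches all keys in the bucket.
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # "my-tableflow-bucket" is a placeholder name, not a real bucket.
    print(json.dumps(read_only_policy("my-tableflow-bucket"), indent=2))
```

The printed JSON can be attached to the IAM role that your query environment assumes.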

Important

You should start with an empty bucket when you first enable Tableflow. Existing objects in the bucket may cause Tableflow to fail to start or may be lost entirely during initialization.
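One way to confirm that a bucket is empty before enabling Tableflow is to ask S3 for at most one object and check whether any came back. The sketch below assumes a boto3-style S3 client is passed in; the function name is hypothetical and not part of any Confluent tooling.

```python
def bucket_is_empty(s3_client, bucket):
    """Return True when `bucket` contains no objects.

    `s3_client` is expected to behave like boto3's S3 client, whose
    list_objects_v2 response reports the number of returned keys in KeyCount.
    """
    # MaxKeys=1 keeps the request cheap: one object is enough to prove non-emptiness.
    response = s3_client.list_objects_v2(Bucket=bucket, MaxKeys=1)
    return response.get("KeyCount", 0) == 0
```

With boto3 installed and AWS credentials configured, a call might look like `bucket_is_empty(boto3.client("s3"), "my-tableflow-bucket")`, where the bucket name is a placeholder.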

Snapshot retention

Snapshot retention manages the metadata that lets you query a previous state of your table, also known as “time-travel queries”. Tableflow creates a snapshot every time it commits a change to your table. This includes every time Tableflow adds data to or updates data in your table, and every time it performs maintenance tasks, like compaction.

Tableflow always maintains a minimum number of snapshots, but you can configure how long additional snapshots should be retained before they are expired by setting the retention_ms configuration. You can set this value to infinity or a specific length of time. When a snapshot is past its expiration time, Tableflow asynchronously removes the snapshot from the table, as well as any data files that are no longer necessary. This operation doesn’t remove data that is still in use by your table, regardless of when that data was added to the table.
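The retention rule can be illustrated with a small model. This is not Tableflow's implementation: the `Snapshot` type, the `expired_snapshots` function, and the `min_to_keep` default of 1 are all assumptions made for the example (the actual minimum is not stated here), and `None` stands in for an infinite `retention_ms`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at_ms: int  # commit time in epoch milliseconds


def expired_snapshots(snapshots, now_ms, retention_ms, min_to_keep=1):
    """Return the snapshots eligible for expiry, oldest first.

    The newest `min_to_keep` snapshots are always retained. Of the rest,
    a snapshot expires once its age exceeds `retention_ms`.
    """
    if retention_ms is None:  # None models an infinite retention period
        return []
    newest_first = sorted(snapshots, key=lambda s: s.committed_at_ms, reverse=True)
    candidates = newest_first[min_to_keep:]  # skip the always-kept snapshots
    expired = [s for s in candidates if now_ms - s.committed_at_ms > retention_ms]
    return sorted(expired, key=lambda s: s.committed_at_ms)


if __name__ == "__main__":
    snaps = [Snapshot(1, 0), Snapshot(2, 1_000), Snapshot(3, 5_000)]
    for snap in expired_snapshots(snaps, now_ms=6_000, retention_ms=2_000):
        print(snap)
```

In this model, only snapshot metadata is selected for removal. Deleting data files that no remaining snapshot references is a separate step, which matches the statement above that data still in use by the table is never removed.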