Tiered Storage¶
Starting with Confluent Platform 6.0.0, Tiered Storage is fully supported (after a preview in previous releases). Tiered Storage makes storing huge volumes of data in Kafka manageable by reducing operational burden and cost. The fundamental idea is to separate the concerns of data storage from the concerns of data processing, allowing each to scale independently. With Tiered Storage, you can send warm data to cost-effective object storage, and scale brokers only when you need more compute resources.
Important
- Prior to Confluent Platform 6.0.1, Tiered Storage cannot be disabled once it is enabled. In Confluent Platform 6.0.1 and later, you can disable Tiered Storage.
- The same bucket must be used across all brokers within a Tiered Storage enabled cluster. This applies to all supported platforms.
Enabling Tiered Storage on a Broker¶
To enable Tiered Storage on a cluster running Confluent Platform 6.0.0 or later, specify the configurations for your cloud provider as described below.
After you update these configurations, restart the brokers enabled for Tiered Storage. This restart can be done in a rolling fashion.
Tip
You can also begin this procedure from Confluent Control Center, as described in Configure Tiered Storage.
AWS¶
Tiered Storage requires a bucket in Amazon S3 that can be written to and read from, and it requires the credentials to access the bucket.
To enable Tiered Storage on Amazon Web Services (AWS) with Amazon Simple Storage Service (S3 buckets):
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Quick Start for Apache Kafka using Confluent Platform (Local), uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on AWS with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For Amazon S3, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data.
For example, a bucket named
tiered-storage-test-aws
located in theus-west-2
region would have these properties:confluent.tier.s3.bucket=tiered-storage-test-aws confluent.tier.s3.region=us-west-2
The brokers need AWS credentials to connect to the S3 bucket. You can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_AWS_CREDENTIALS_FILE>
Replace
<PATH>
with the file path of the file that contains your AWS credentials.This field is hidden from the server log files.
Environment Variables - Specify AWS credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
The S3 bucket should allow the broker to perform the following actions. These operations are required by the broker to properly enable and use Tiered Storage.
s3:DeleteObject s3:GetObject s3:PutObject s3:GetBucketLocation
GCS¶
To enable Tiered Storage on Google Cloud Platform (GCP) with Google Cloud Storage (GCS):
To enable Tiered Storage, add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=GCS confluent.tier.gcs.bucket=<BUCKET_NAME> confluent.tier.gcs.region=<REGION> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Quick Start for Apache Kafka using Confluent Platform (Local), uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on GCS with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service a broker connects to. For Google Cloud Storage, set this toGCS
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data.
For example, a bucket named
tiered-storage-test-gcs
located in theus-central1
region would have these properties:confluent.tier.gcs.bucket=tiered-storage-test-gcs confluent.tier.gcs.region=us-central1
The brokers need GCS credentials to connect to the GCS bucket. You can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.gcs.cred.file.path=<PATH_TO_GCS_CREDENTIALS_FILE>
This field is hidden from the server log files.
Environment Variables - Specify GCS credentials with this local environment variable:
export GOOGLE_APPLICATION_CREDENTIALS=<PATH_TO_GCS_CREDENTIALS_FILE>
If
server.properties
does not contain the property with the path to the credentials file, the broker will use the above environment variable to connect to the GCS bucket.
See the GCS documentation for more information.
The GCS bucket should allow the broker to perform the following actions. These operations are required by the broker to properly enable and use Tiered Storage.
storage.buckets.get storage.objects.get storage.objects.list storage.objects.create storage.objects.delete storage.objects.update
- Troubleshooting Certificates
If the brokers fail to start due to Tiered Storage errors such as inability to access buckets and security certificate issues, make sure that you have the needed Google CA certificate(s). To troubleshoot:
Go to Google Trust Services repository, scroll down to the section Download CA certificates, and click Expand all.
Choose a certificate suitable for your cluster (for example, GlobalSign R4) that is currently valid (not yet expired), click the Action drop-down next to it, and download the Certificate (PEM) file to all the brokers in the cluster.
Import the certificate by running the following command:
keytool -import -trustcacerts -keystore /var/ssl/private/kafka_broker.truststore.jks -alias root -file <certificate.pem file>
Pure Storage FlashBlade¶
To enable Tiered Storage on Pure Storage FlashBlade through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> confluent.tier.s3.aws.endpoint.override=<FLASHBLADE ENDPOINT> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Quick Start for Apache Kafka using Confluent Platform (Local), uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on Pure Storage FlashBlade with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For Pure Storage FlashBlade, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. Set the region to any valid AWS region; these are all equivalent in Flashblade. The region cannot be null or empty. For example, a bucket namedtiered-storage-test-aws
located in theus-west-2
region would have these properties:confluent.tier.s3.bucket=tiered-storage-test-aws confluent.tier.s3.region=us-west-2
ENDPOINT OVERRIDE
refers to the Pure Storage FlashBlade connection point.
The brokers need credentials generated by Pure Storage CLI to connect to the FlashBlade S3 Bucket. You can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_AWS_CREDENTIALS_FILE>
Replace
<PATH_TO_AWS_CREDENTIALS_FILE>
with the file path of the file that contains your AWS credentials.This field is hidden from the server log files.
Environment Variables - Specify Pure Storage FlashBlade credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<YOUR_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<YOUR_SECRET_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
Nutanix Objects¶
To enable Tiered Storage on Nutanix Objects through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> confluent.tier.s3.aws.endpoint.override=<NUTANIX OBJECTS ENDPOINT> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Apache Kafka Quick Start, uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on Nutanix Objects with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For Nutanix Objects, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. Set the region to any valid AWS region. The region cannot be null or empty. For example, a bucket namedtiered-storage-test-confluent
located in theus-east-1
region would have these properties:confluent.tier.s3.bucket=tiered-storage-test-confluent confluent.tier.s3.region=us-east-1
ENDPOINT OVERRIDE
refers to the Nutanix Objects endpoint fully qualified domain name. Ifntnx-obj
is the object store name andnutanix.com
is the domain name, then endpoint override should be as follows:confluent.tier.s3.aws.endpoint.override=http://ntnx-obj.nutanix.com
The brokers need credentials generated from the Objects page of Nutanix prism central to connect to the Nutanix Objects S3 Bucket. Users can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_AWS_CREDENTIALS_FILE>
Replace
<PATH_TO_AWS_CREDENTIALS_FILE>
with the file path of the file that contains your Nutanix credentials.This field is hidden from the server log files.
Environment Variables - Specify Nutanix Objects credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<NUTANIX_OBJECTS_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<NUTANIX_OBJECTS_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
The Nutanix Objects bucket should allow the broker to perform the following actions. These operations are required by the broker to properly enable and use Tiered Storage. These operations can be allowed by providing read/write access to the user on the bucket.
storage.buckets.get storage.objects.get storage.objects.list storage.objects.create storage.objects.delete storage.objects.update
NetApp Object Storage¶
To enable Tiered Storage on NetApp Object Storage through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> confluent.tier.s3.aws.endpoint.override=<NETAPP OBJECT STORAGE ENDPOINT> confluent.tier.s3.force.path.style.access=<BOOLEAN> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Apache Kafka Quick Start, uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on NetApp Object Storage with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For NetApp Object Storage, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. Set the region to any valid AWS region. The region cannot be null or empty. For example, a bucket namedtiered-storage-test-confluent
located in theus-east-1
region would have these properties:confluent.tier.s3.bucket=tiered-storage-test-confluent confluent.tier.s3.region=us-east-1
ENDPOINT OVERRIDE
refers to the NetApp Object Storage endpoint.Confluent.tier.s3.force.path.style.access
configures the client to use path-style access for all requests. This flag is not enabled by default. The default behavior is to detect which access style to use based on the configured endpoint and the bucket being accessed. Setting this flag will result in path-style access being forced for all requests. NetApp supports both virtual host style and path style access. Setting this parameter to true enables path style access.
The brokers need credentials generated by NetApp Object Storage to connect to the NetApp Object Storage S3 Bucket. Users can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_AWS_CREDENTIALS_FILE>
Replace
<PATH_TO_AWS_CREDENTIALS_FILE>
with the file path of the file that contains your NetApp Object Storage credentials.This field is hidden from the server log files.
Environment Variables - Specify NetApp Object Storage credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<NETAPP_OBJECT_STORAGE_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<NETAPP_OBJECT_STORAGE_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
The NetApp Object Storage bucket should allow the broker to perform the following actions. These operations are required by the broker to properly enable and use Tiered Storage. These operations can be allowed by providing read/write access to the user on the bucket.
storage.buckets.get storage.objects.get storage.objects.list storage.objects.create storage.objects.delete storage.objects.update
Dell EMC ECS¶
To enable Tiered Storage on Dell EMC ECS through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> confluent.tier.s3.aws.endpoint.override=<DELL EMC ECS ENDPOINT> confluent.tier.s3.force.path.style.access=<BOOLEAN> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Apache Kafka Quick Start, uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on Dell EMC ECS with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For Dell EMC ECS, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. Set the region to any valid AWS region. The region cannot be null or empty. For example, a bucket namedtiered-storage-test-confluent
located in theus-east-1
region would have these properties:confluent.tier.s3.bucket=tiered-storage-test-confluent confluent.tier.s3.region=us-east-1
ENDPOINT OVERRIDE
refers to the Dell EMC ECS endpoint.Confluent.tier.s3.force.path.style.access
configures the client to use path-style access for all requests. This flag is not enabled by default. The default behavior is to detect which access style to use based on the configured endpoint and the bucket being accessed. Setting this flag will result in path-style access being forced for all requests.
The brokers need credentials generated by Dell EMC ECS to connect to the ECS Bucket. Users can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_AWS_CREDENTIALS_FILE>
Replace
<PATH_TO_AWS_CREDENTIALS_FILE>
with the file path of the file that contains your Dell EMC ECS credentials.This field is hidden from the server log files.
Environment Variables - Specify Dell EMC ECS credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<DELL_EMC_ECS_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<DELL_EMC_ECS_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
The Dell EMC ECS bucket should allow the broker to perform the following actions. These operations are required by the broker to properly enable and use Tiered Storage. These operations can be allowed by providing read/write access to the user on the bucket.
storage.buckets.get storage.objects.get storage.objects.list storage.objects.create storage.objects.delete storage.objects.update
MinIO¶
To enable Tiered Storage on MinIO through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=<REGION> confluent.tier.s3.aws.endpoint.override=<MINIO_ENDPOINT> confluent.tier.s3.force.path.style.access=true # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Apache Kafka Quick Start, uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on MinIO with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. When set totrue
, this causes all existing, non-compacted topics to have this configuration set totrue
as well. Only topics explicitly set tofalse
do not use tiered storage. See Known Limitations with compacted topics. It is not required to setconfluent.tier.enable=true
to enable Tiered Storage.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For MinIO Object Storage, set this toS3
as shown above.BUCKET_NAME
andREGION
are the S3 bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. MinIO defaults to us-east-1 for REGION unless explicitly started with a different region value. Replace REGION with the appropriate value based on your MinIO configuration. You can retrieve this value using the mc commandline tool:mc admin info –json ALIAS | jq ".info.region"
Create the bucket before configuring Tiered Storage. See mc mb for details on creating a bucket.
confluent.tier.s3.aws.endpoint.override=http://minio.example.net confluent.tier.s3.force.path.style.access=true
ENDPOINT OVERRIDE
refers to the MinIO Object Storage endpoint fully qualified domain name. MinIO recommends using a load balancer configured for round-robin selection of all hosts in the MinIO deployment. If minio is the load balancer hostname and example.com is the domain name, then the endpoint override should be as follows.confluent.tier.s3.aws.endpoint.override=http://minio.example.net
confluent.tier.s3.force.path.style.access
configures the client to use path-style access for all requests. This flag is not enabled by default. The default behavior is to detect which access style to use based on the configured endpoint and the bucket being accessed. Setting this flag will result in path-style access being forced for all requests. MinIO deployments typically support path-style access only. There are no advantages to using other access styles.
The brokers need credentials generated by MinIO to connect to the MinIO Bucket. Users can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_MINIO_CREDENTIALS_FILE>
Replace
<PATH_TO_MINIO_CREDENTIALS_FILE>
with the file path of the file that contains your MinIO Object Storage credentials The user must have read/write/list permissions on the BUCKET_NAME bucket.This field is hidden from the server log files.
Environment Variables - Specify MinIO credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<MINIO_OBJECT_STORAGE_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<MINIO_OBJECT_STORAGE_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the S3 bucket.
The MinIO credentials must allow the broker to perform the following actions. These operations are required by the broker to properly enable ad use tiered storage. These operations can be allowed by providing read/write access to the user on the bucket.
{ “PolicyName”: “ConfluentTieredStorage”, “Policy”: { “Version”: “2012-10-17”, “Statement”: [ { “Effect”: “Allow”, “Action”: [ “s3:DeleteObject”, “s3:GetObject”, “s3:PutObject”, “s3:GetBucketLocation” ], “Resource”: [ “arn:aws:s3:::BUCKET_NAME”, “arn:aws:s3:::BUCKET_NAME/*” ] } ] } }
Use mc admin policy set to assign this policy to the user credentials.
Hitachi Content Platform Object Storage¶
The following assumes that a bucket has previously been created on the Hitachi Content Platform (HCP) with versioning enabled. To enable Tiered Storage on HCP Object Storage through the Amazon S3 API:
Add the following properties in your
server.properties
file:confluent.tier.feature=true confluent.tier.enable=true confluent.tier.backend=S3 confluent.tier.s3.bucket=<BUCKET_NAME> confluent.tier.s3.region=us-east-1 confluent.tier.s3.aws.endpoint.override=<HCP Object Storage ENDPOINT> confluent.tier.s3.force.path.style.access=<BOOLEAN> # confluent.tier.metadata.replication.factor=1
Tip
The Tiered Storage internal topic defaults to a replication factor of
3
. If you useconfluent local services start
to run a single broker cluster such as that described in Apache Kafka Quick Start, uncomment the last line shown above to override the default replication factor for the Tiered Storage topic.Adding the above properties enables the Tiered Storage components on HCP Object Storage with default parameters on all of the possible configurations.
confluent.tier.feature
enables Tiered Storage for a broker. Setting this totrue
allows a broker to utilize Tiered Storage.confluent.tier.enable
sets the default value for created topics. Setting this totrue
causes all non-compacted topics to be tiered. See Known Limitations with compacted topics.confluent.tier.backend
refers to the cloud storage service to which a broker will connect. For HCP Object Storage, set this toS3
as shown above.BUCKET_NAME
andREGION
are the HCP Object Storage bucket name and its region, respectively. A broker interacts with this bucket for writing and reading tiered data. Because HCP is local the region is ignored, however it is a required configuration; useus-east-1
. For example, a bucket namedtiered-storage-test-hitachi
would have these properties:confluent.tier.s3.bucket=tiered-storage-test-hitachi confluent.tier.s3.region=us-east-1
ENDPOINT OVERRIDE
refers to the HCP Object Storage endpoint. This is the fully qualified domain name of the HCP tenant containing the bucket. For example:confluent.tier.s3.aws.endpoint.override=<TENANT>.hcp.example.com
Confluent.tier.s3.force.path.style.access
configures the client to use path-style access for all requests. This flag is not enabled by default. HCP supports both virtual host style and path-style access. The default behavior is to detect which access style to use based on the configured endpoint and the bucket being accessed. Setting this flag totrue
will result in path-style access being forced for all requests. One reason you may choose to enable this setting is to avoid certificate validation errors when the SSL certificate on the HCP includes a SAN for the tenant domain, but not for the bucket domain.
The brokers need credentials generated by HCP Object Storage to connect to the HCP Object Storage S3 Bucket. Users can set these through
server.properties
or through environment variables. Either method is sufficient. The brokers prioritize using the credentials supplied throughserver.properties
. If the brokers do not find credentials inserver.properties
, they use environment variables instead.Server Properties - Add the following property to your
server.properties
file:confluent.tier.s3.cred.file.path=<PATH_TO_HCP_CREDENTIALS_FILE>
Replace
<PATH_TO_HCP_CREDENTIALS_FILE>
with the file path of the file that contains HCP Object Storage credentials.This field is hidden from the server log files. The content of the <PATH_TO_HCP_CREDENTIALS_FILE> is as follows:
accessKey=xxxxxxxx secretKey=xxxxxxxxxxxxxx
Environment Variables - Specify HCP Object Storage credentials with these environment variables:
export AWS_ACCESS_KEY_ID=<HCP_ACCESS_KEY_ID> export AWS_SECRET_ACCESS_KEY=<HCP_ACCESS_KEY>
If
server.properties
does not contain the two properties for credentials, the broker will use the above environment variables to connect to the HCP Object Storage S3 bucket.
Creating a Topic with Tiered Storage¶
You can create a topic using the Kafka CLI command kafka-topics
(located in
$CONFLUENT_HOME/bin
). The command is used in the same way as previous
versions, with added support for topic configurations related to Tiered Storage.
kafka-topics --bootstrap-server localhost:9092 \
--create --topic trades \
--partitions 6 \
--replication-factor 3 \
--config confluent.tier.enable=true \
--config confluent.tier.local.hotset.ms=3600000 \
--config retention.ms=604800000
confluent.tier.local.hotset.ms
- controls the maximum time non-active segments are retained on broker-local storage before being discarded to free up space. Segments deleted from local disks will exist in object storage and remain available according to the retention policy. If set to -1, no time limit is applied.retention.ms
works similarly in tiered topics to non-tiered topics, but will expire segments from both object storage and local storage according to the retention policy.
Tip
- You can also configure topics from Control Center. To do so, log on to Control Center and navigate to a topic (<environment> -> <cluster> -> <topic>). To learn more, see Tiered Storage in the Control Center User Guide. For a local deployment, Control Center is available at http://localhost:9021/ in your web browser.
- For more on using Kafka commands, see Kafka Commands Primer in Kafka Basics on Confluent Platform.
Sending Test Messages to Experiment with Data Storage¶
You can use the topic you created in the previous section as a test case, or create
a new one. To speed up the rate at which data is transferred to storage for the purpose
of this example, update from the default on the segment.bytes
setting on the
topic, as shown below. You can do this by updating configurations on an existing topic
from the Control Center expert mode settings on the topic, or create a new topic, as shown.
kafka-topics --bootstrap-server localhost:9092 \
--create --topic hot-topic \
--partitions 6 \
--replication-factor 3 \
--config confluent.tier.enable=true \
--config confluent.tier.local.hotset.ms=3600000 \
--config retention.ms=604800000 \
--config segment.bytes=10485760
Once you have Tiered Storage configured, you can send test data to one or more topics, and run a consumer to read the messages. For example, use the following command to produce messages:
kafka-producer-perf-test \
--producer-props bootstrap.servers=localhost:9092 \
--topic hot-topic \
--record-size 1000 \
--throughput 1000 \
--num-records 3600000
Let this run for 5 or 10 minutes, and then run a consumer, for example:
kafka-console-consumer --bootstrap-server localhost:9092 --from-beginning --topic hot-topic
After 10 or 20 minutes, check the Control Center UI, and you should see the data saved off to your storage container. For a local deployment, Control Center is available at http://localhost:9021/ in your web browser.
To learn more about using Control Center to configure and manage Tiered Storage, see Tiered Storage.
Best Practices and Recommendations¶
Tuning¶
To improve the performance of Tiered Storage, you can increase TierFetcherNumThreads
and TierArchiverNumThreads
.
As a general guideline, you want to increase TierFetcherNumThreads
to match the number of physical CPU cores and
TierArchiverNumThreads
to half the number of CPU cores. For example, if you have a machine with 8 physical cores, set
TierFetcherNumThreads
= 8 and TierArchiverNumThreads
= 4.
Time Interval for Topic Deletes¶
When a topic is deleted, the deletion of the log segment files in object storage
does not immediately begin. There is a time interval for which the deletion of
those files takes place. The default value for this time interval is 3 hours.
You can modify the configuration, confluent.tier.topic.delete.check.interval.ms
,
to change the value of this interval. It’s important to keep this in mind when deleting
a topic or a cluster. Once a topic or cluster is deleted, it’s OK to manually delete the
objects in the respective bucket.
Log Segment Sizes¶
We recommended decreasing the segment size configuration, log.segment.bytes
,
for topics with tiering enabled from the default size of 1GB. The archiver waits
for a log segment to close before attempting to upload the segment to object
storage. Using a smaller segment size, such as 100MB, allows segments to close
at a more frequent rate. Also, smaller segments sizes help with page cache
behavior, improving the performance of the archiver.
ACLs on Tiered Storage Internal Topics¶
A recommended best practice for on-premises deployments is to enable an ACL authorizer on the internal topics used for Tiered Storage. Set ACL rules to limit access on this data to the broker user only. This secures the internal topics, and prevents unauthorized access to Tiered Storage data and metadata.
For example, the following command sets ACLs on the internal topic for Tiered
Storage, _confluent-tier-state
. (Currently, there is only a single internal
topic related to Tiered Storage.) The example creates an ACL that provides the principal kafka
permission for all operations on the internal topic.
kafka-acls --bootstrap-server localhost:9092 --command-config adminclient-configs.conf \
--add --allow-principal User:<kafka> --operation All --topic "_confluent-tier-state"
Tip
In practice, replace the User:<kafka>
with the actual broker principal in your deployment.
To sets ACLs on any Tiered Storage internal topic (if others are added in future
releases), specify the topic name with _confluent-tier
as the prefix. For example,
the following command sets ACLs on any internal topic related to Tiered Storage.
kafka-acls --bootstrap-server localhost:9092 --command-config adminclient-configs.conf \
--add --allow-principal User:<kafka> --operation All --topic "_confluent-tier-" --resource-pattern-type PREFIXED
Sizing Brokers with Tiered Storage¶
With Tiered Storage enabled, confluent.tier.local.hotset.ms
controls how long
segments are retained on broker-local storage before being discarded to free up
space. While this setting is ultimately a business decision, there are some
additional practices that will help with sizing. If you find that consumers are
lagging while fetching data from object storage, it’s usually a good idea to
increase confluent.tier.local.hotset.ms
. When planning disk sizes on your
brokers, it’s also important to consider leaving headroom to accommodate for
potential issues communicating with object storage. While cloud object storage
outages are extremely rare, they can happen and Tiered Storage will continue to
store segments at the broker until it can successfully tier them to object
storage.
Tier Archiver Metrics¶
The archiver is a component of Tiered Storage that is responsible for uploading non-active segments to cloud storage.
kafka.tier.tasks.archive:type=TierArchiver,name=BytesPerSec
- Rate of bytes per second that is being uploaded by the archiver to cloud storage.
kafka.tier.tasks.archive:type=TierArchiver,name=TotalLag
- Number of bytes in non-active segments not yet uploaded by the archiver to cloud storage. As the archiver steadily uploads to cloud storage, the total lag will decrease towards 0.
kafka.tier.tasks.archive:type=TierArchiver,name=RetriesPerSec
- Number of times the archiver has reattempted to upload a non-active segment to cloud storage.
kafka.tier.tasks:type=TierTasks,name=NumPartitionsInError
- Number of partitions that the archiver was unable to upload that are in an error state. The partitions in this state are skipped by the archiver and uploaded to cloud storage.
Tier Fetcher Metrics¶
The fetcher is a component of Tiered Storage that is responsible for retrieving data from cloud storage.
kafka.server:type=TierFetcher
- Rate of bytes per second that the fetcher is retrieving data from cloud storage.
Example Performance Test¶
We have recorded the throughput performance of a Kafka cluster with the Tiered Storage feature activated as a reference for expected metrics. The high-level details and configurations of the cluster and environment are listed.
Cluster Configurations:
- 6 brokers
- r5.xlarge AWS instances
- GP2 disks, 2 TB
- 104857600 (100 MB) segment.bytes
- a single topic with 24 partitions and replication factor 3
Producer and Consumer Details:
- 6 producers each with a target rate of 20 MB/s (120 MB/s total)
- 12 total consumers
- 6 consumers read records from the end of the topic with a target rate of 20 MB/s
- 6 consumers read records from the beginning of the topic from S3 through the fetcher with an uncapped rate
Values may differ based on the configurations of producers, consumers, brokers, and topics. For example, using a log segment size of 1 GB instead of 100 MB may result in lower archiver throughput.
Throughput Metric | Value (MB/s) |
---|---|
Producer (cluster total) | 115 |
Producer (average across brokers) | 19 |
Consumer (cluster total) | 623 |
Archiver (average across brokers) | 19 |
Fetcher (average across brokers) | 104 |
Supported Platforms and Features¶
- AWS with Amazon S3, GCP with Google Cloud Storage (GCS), Pure Storage FlashBlade, Nutanix Objects, NetApp Object Storage, Dell EMC ECS, and MinIO through the Amazon S3 API are supported with Tiered Storage.
- Starting with Confluent Platform 6.0.0, the Kubernetes Confluent Operator supports Tiered Storage, as described in Tiered Storage under Configure Storage with Confluent Operator.
Known Limitations¶
- While Tiered Storage uses the Amazon S3 API, non-certified object stores are not supported. Currently, only Amazon S3, Google GCS, Pure Storage FlashBlade, Nutanix Objects, NetApp Object Storage, Dell EMC ECS, and MinIO are supported.
- Compacted topics are not yet supported by Tiered Storage.
- Currently, Azure Blob Storage is not supported by Tiered Storage.
- JBOD (just a bunch of disks) is not supported because Tiered Storage does not currently support multiple log directories, an inherent requirement of JBOD.
Disabling Tiered Storage¶
Starting in Confluent Platform 6.0.1, you can disable Tiered Storage by setting
confluent.tier.enable=false
. This will disable additional tiering. New and
existing data not yet offloaded to object storage will not be offloaded.
Previously tiered (offloaded) data will remain in object storage (tiered).
If you disable Tiered Storage, make sure that confluent.tier.feature
remains enabled (confluent.tier.feature=true
) so that
tiering related components like the tier fetcher remain enabled and previously tiered data can be fetched.
Suggested Reading¶
- Configure and Manage Tiered Storage from Confluent Control Center
- Kafka Commands Primer in Kafka Basics on Confluent Platform
- Kafka Topic Configurations Reference
- Kafka Broker and Controller Configurations Reference
- Blog post: Property Based Testing Confluent Server Storage for Fun and Safety
- Blog post: Infinite Storage in Confluent Platform
- Blog post: Streaming Machine Learning with Tiered Storage and Without a Data Lake
- Blog post: Cloud-Like Flexibility and Infinite Storage with Confluent Tiered Storage and FlashBlade from Pure Storage
- Blog post: Simplify Kafka at Scale with Confluent Tiered Storage