This quick start uses the Dataproc connector to export data produced by the Avro
console producer to HDFS in a Dataproc managed cluster.
Load the Connector
This quick start assumes that security is not configured for HDFS and Hive
metastore. To make the necessary security configurations, see
Secure HDFS and Hive Metastore.
First, start all the necessary services using the Confluent CLI.
If not already in your PATH, add Confluent’s
bin directory by running:
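export PATH=<path-to-confluent>/bin:$PATH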
Make sure to run the connector somewhere with network access to the Dataproc
cluster, such as a Google Compute Engine VM on the same subnet.
The command syntax for the Confluent CLI development commands changed in 5.3.0.
These commands have been moved to
confluent local. For example, the syntax for
confluent start is now
confluent local services start. For more information, see confluent local.
confluent local services start
Next, start the Avro console producer to import a few records to Kafka:
./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_dataproc \
--property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
Then, in the console producer, type three records such as:
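{"f1": "value1"}
{"f1": "value2"}
{"f1": "value3"}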
The three records entered are published to the Kafka topic
test_dataproc in Avro format.
Before starting the connector, make sure that the configurations in
etc/gcp-dataproc-sink-quickstart.properties are properly set for your Dataproc
deployment. For example, replace $home with your home directory and
YOUR-CLUSTER-NAME with the name of your Dataproc cluster.
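For reference, the quick start properties file is roughly shaped like the sketch
below. The property names shown (for example connector.class, gcp.dataproc.cluster,
and gcp.dataproc.credentials.path) are assumptions drawn from the connector's
configuration reference, so verify them against the file installed under etc/:
# Sketch only: confirm the exact keys in your installed quick start file
name=dataproc-sink
connector.class=io.confluent.connect.gcp.dataproc.DataprocSinkConnector
tasks.max=1
topics=test_dataproc
gcp.dataproc.projectId=YOUR-PROJECT
gcp.dataproc.region=YOUR-REGION
gcp.dataproc.cluster=YOUR-CLUSTER-NAME
gcp.dataproc.credentials.path=$home/credentials.json
flush.size=3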
Then, start the connector by loading its configuration with the following command.
You must include a double dash (
--) between the topic name and your flag. For more information,
see this post.
confluent local services connect connector load dataproc-sink --config etc/gcp-dataproc-sink-quickstart.properties
To check that the connector started successfully, view the Connect worker’s log by running:
confluent local services connect log
Towards the end of the log you should see that the connector starts, logs a few
messages, and then exports data from Kafka to HDFS. After the connector finishes
ingesting data to HDFS, check that the data is available in HDFS. From the HDFS
namenode in Dataproc:
hadoop fs -ls /topics/test_dataproc/partition=0
You should see a file with the name
/topics/test_dataproc/partition=0/test_dataproc+0+0000000000+0000000002.avro.
The file name is encoded as topic+kafkaPartition+startOffset+endOffset.format.
You can use
avro-tools-1.9.1.jar (available in Apache mirrors)
to extract the content of the file. Run
avro-tools directly on Hadoop as:
hadoop jar avro-tools-1.9.1.jar tojson \
hdfs://<namenode>/topics/test_dataproc/partition=0/test_dataproc+0+0000000000+0000000002.avro
where “<namenode>” is the HDFS Namenode hostname. Usually, the Namenode
hostname is your cluster name with an “-m” suffix appended.
Or, if you experience issues, first copy the avro file from HDFS to the local
filesystem and try again with Java:
hadoop fs -copyToLocal /topics/test_dataproc/partition=0/test_dataproc+0+0000000000+0000000002.avro \
/tmp/test_dataproc+0+0000000000+0000000002.avro
java -jar avro-tools-1.9.1.jar tojson /tmp/test_dataproc+0+0000000000+0000000002.avro
You should see output similar to the following, one JSON record per line:
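{"f1":"value1"}
{"f1":"value2"}
{"f1":"value3"}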
Finally, stop the Connect worker as well as all the rest of Confluent Platform by running:
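confluent local services stop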
or stop all the services and additionally wipe out any data generated during
this quick start by running:
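confluent local destroy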
If you want to run the quick start with Hive integration,
you need to add the following configurations to
etc/gcp-dataproc-sink-quickstart.properties before starting the connector:
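A minimal sketch of those settings, assuming the connector uses the same Hive
options as Confluent's HDFS-based sink connectors (verify the exact property names
against the connector's configuration reference; the metastore host is typically
your Dataproc master node):
hive.integration=true
hive.metastore.uris=thrift://<metastore-host>:9083
schema.compatibility=BACKWARD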
After the connector finishes ingesting data to HDFS, you can use Hive to
check the data from your namenode:
hive> SELECT * FROM test_dataproc;
If you leave hive.metastore.uris empty, an embedded Hive metastore
will be created in the directory where the connector is started. You need to start
Hive from that same directory to query the data.