Configuring a non-VPC peering environment

When Confluent Cloud is set up with public endpoints in a non-VPC peering environment, connector requests originate from a public IP endpoint at the Confluent Cloud VPC where the Dataproc connector is running. However, the Dataproc cluster VPC does not provide a public IP address endpoint. Even if each Dataproc node has a public IP address configured, the VPC itself does not, so the Hadoop daemons return private IP addresses and private hostnames to the Confluent Cloud connector.

Private IP response to Confluent Cloud


After you complete the following procedure:

  • The Dataproc connector can successfully establish connectivity to the GCP Dataproc cluster master node (HDFS NameNode).
  • The GCP Dataproc cluster can respond over public IP to the Confluent Cloud VPC and Dataproc connector.
  • All Dataproc nodes (HDFS NameNode and DataNodes) in the cluster retain the use of their private IP addresses.

This procedure assumes you are starting with a new Dataproc cluster and a new Confluent Cloud cluster.

Prerequisites
  • Authorization to update GCP instances (Dataproc nodes) and configure DNS record sets for your GCP project account.
  • The gcloud CLI must be installed and configured to manage your GCP Dataproc cluster.
  • Access to a running Dataproc cluster in GCP.
  • The Dataproc cluster must have the Cloud Resource Manager API enabled.
  • The Dataproc cluster VPC must have the following ports open (IP ranges: 0.0.0.0/0) for Confluent Cloud connector ingress:
    • tcp:8020
    • tcp:9000
    • tcp:9083
    • tcp:9864-9867
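
The ingress ports listed above could be opened with a single VPC firewall rule. The following is a minimal sketch, not a definitive setup: the rule name (allow-ccloud-dataproc) and the network name (default) are hypothetical placeholders. The script only prints the gcloud compute firewall-rules create command so that it can be reviewed before you run it.

```shell
# Ports required for Confluent Cloud connector ingress (from the list above).
PORTS="8020 9000 9083 9864-9867"

# Join the ports into the comma-separated tcp:<port> form used by --allow.
build_allow_list() (
  out=""
  for p in "$@"; do
    out="${out:+$out,}tcp:$p"
  done
  printf '%s\n' "$out"
)

# Print the firewall-rule command; nothing is executed against GCP.
print_firewall_command() (
  rule_name="$1"; network="$2"
  shift 2
  printf 'gcloud compute firewall-rules create %s --network=%s --direction=INGRESS --allow=%s --source-ranges=0.0.0.0/0\n' \
    "$rule_name" "$network" "$(build_allow_list "$@")"
)

# Rule name and network are assumptions; substitute your own values.
print_firewall_command allow-ccloud-dataproc default $PORTS
```

Printing the command rather than running it keeps the sketch safe to try. Consider narrowing the source range if your security policy does not allow 0.0.0.0/0.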

Step 1: Add or create record sets in Cloud DNS

To create a configuration in a non-VPC peered environment, you first need to add or create record sets in the GCP Cloud DNS service. Create the following zones:

  • public zone: Contains record sets corresponding to the external IP addresses of each Dataproc cluster node.
  • private zone #1: Contains record sets corresponding to the internal IP addresses of each Dataproc cluster node.
  • private zone #2: This is a managed reverse lookup zone. It contains the reverse internal IP addresses (in 10.in-addr.arpa. format) for each Dataproc cluster node.
Public DNS record set example

GCP Cloud DNS console

You can create DNS zones and record sets using the gcloud CLI or by using the GCP Cloud DNS console.

  1. Get the instance names, external IP addresses, and internal IP addresses for each of your Dataproc nodes.

    gcloud compute instances list --project=<my-gcp-project> --zones <region-zone> --filter "<my-cluster-ID>"
    

    For example:

    gcloud compute instances list --project=ccloud-lab-47372 --zones us-central1-c --filter "cluster-fa79"
    
    NAME              ZONE           MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP     STATUS
    cluster-fa79-m    us-central1-c  n1-standard-4               10.128.0.6   34.67.10.174    RUNNING
    cluster-fa79-w-0  us-central1-c  n1-standard-4               10.128.0.2   34.72.119.108   RUNNING
    cluster-fa79-w-1  us-central1-c  n1-standard-4               10.128.0.3   104.154.209.27  RUNNING
    
  2. In a public Cloud DNS zone, create an A record set for each instance name and its external IP address, using the gcloud CLI or the Cloud DNS console. After you create the DNS zone and record sets, view the records in the console or list them using the following gcloud command.

    gcloud dns record-sets list --zone=<public-dns-zone> --project=<gcp-project-ID>
    

    For example:

    gcloud dns record-sets list --zone=ccloud-dataproc-public --project=ccloud-lab-47372
    NAME                                       TYPE  TTL    DATA
    ccloud.dataproc.lab.net.                   NS    21600  ns-cloud-b1.googledomains.com.,ns-cloud-b2.googledomains.com.,ns-cloud-b3.googledomains.com.,ns-cloud-b4.googledomains.com.
    ccloud.dataproc.lab.net.                   SOA   21600  ns-cloud-b1.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
    cluster-fa79-m.ccloud.dataproc.lab.net.    A     300    34.67.10.174
    cluster-fa79-w-0.ccloud.dataproc.lab.net.  A     300    34.72.119.108
    cluster-fa79-w-1.ccloud.dataproc.lab.net.  A     300    104.154.209.27
    
  3. In a private Cloud DNS zone, create an A record set for each instance name and its internal IP address, using the gcloud CLI or the Cloud DNS console. After you create the DNS zone and record sets, view the records in the console or list them using the following gcloud command.

    gcloud dns record-sets list --zone=<private-dns-zone> --project=<gcp-project-ID>
    

    For example:

    gcloud dns record-sets list --zone=ccloud-dataproc-private --project=ccloud-lab-47372
    NAME                                       TYPE  TTL    DATA
    ccloud.dataproc.lab.net.                   NS    21600  ns-gcp-private.googledomains.com.
    ccloud.dataproc.lab.net.                   SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
    cluster-fa79-m.ccloud.dataproc.lab.net.    A     300    10.128.0.6
    cluster-fa79-w-0.ccloud.dataproc.lab.net.  A     300    10.128.0.2
    cluster-fa79-w-1.ccloud.dataproc.lab.net.  A     300    10.128.0.3
    
  4. In a second private Cloud DNS zone (the managed reverse lookup zone), create a PTR record set that maps each node's reverse lookup address (under 10.in-addr.arpa.) to its instance name, using the gcloud CLI or the Cloud DNS console. After you create the DNS zone and record sets, view the records in the console or list them using the following gcloud command.

    gcloud dns record-sets list --zone=<private-reverse-dns-zone> --project=<gcp-project-ID>
    

    For example:

    gcloud dns record-sets list --zone=ccloud-dataproc-private-reverse --project=ccloud-lab-47372
    NAME                      TYPE  TTL    DATA
    10.in-addr.arpa.          NS    21600  ns-gcp-private.googledomains.com.
    10.in-addr.arpa.          SOA   21600  ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300
    6.0.128.10.in-addr.arpa.  PTR   300    cluster-fa79-m.ccloud.dataproc.lab.net.
    2.0.128.10.in-addr.arpa.  PTR   300    cluster-fa79-w-0.ccloud.dataproc.lab.net.
    3.0.128.10.in-addr.arpa.  PTR   300    cluster-fa79-w-1.ccloud.dataproc.lab.net.
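
The record sets listed in the steps above follow a regular pattern: one public A record, one private A record, and one PTR record whose name is the internal IP address with its octets reversed. The following sketch derives the PTR name and prints the corresponding gcloud dns record-sets create commands for one node. It assumes the three managed zones already exist (for example, created with gcloud dns managed-zones create) and reuses the zone names and domain from the examples; the function names are illustrative only.

```shell
# Zone names and domain are taken from the examples above; replace with your own.
PUBLIC_ZONE="ccloud-dataproc-public"
PRIVATE_ZONE="ccloud-dataproc-private"
REVERSE_ZONE="ccloud-dataproc-private-reverse"
DNS_DOMAIN="ccloud.dataproc.lab.net."

# Reverse the four octets of an IPv4 address to form its PTR record name.
ptr_name() (
  IFS=.
  set -- $1
  printf '%s.%s.%s.%s.in-addr.arpa.\n' "$4" "$3" "$2" "$1"
)

# Print (not run) the A and PTR record-set commands for one node.
# Arguments: instance name, internal IP address, external IP address.
print_record_commands() (
  fqdn="$1.$DNS_DOMAIN"; internal_ip="$2"; external_ip="$3"
  printf 'gcloud dns record-sets create %s --zone=%s --type=A --ttl=300 --rrdatas=%s\n' \
    "$fqdn" "$PUBLIC_ZONE" "$external_ip"
  printf 'gcloud dns record-sets create %s --zone=%s --type=A --ttl=300 --rrdatas=%s\n' \
    "$fqdn" "$PRIVATE_ZONE" "$internal_ip"
  printf 'gcloud dns record-sets create %s --zone=%s --type=PTR --ttl=300 --rrdatas=%s\n' \
    "$(ptr_name "$internal_ip")" "$REVERSE_ZONE" "$fqdn"
)

print_record_commands cluster-fa79-m 10.128.0.6 34.67.10.174
```

Call print_record_commands once per node with the values returned by gcloud compute instances list, then review the printed commands before running them.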
    

Step 2: (Optional) Create permanent custom hostnames

Note

GCP creates a default hostname for each Dataproc instance in the cluster. You can use the default GCP hostnames instead of creating custom hostnames. However, you may want to create custom hostnames that correspond to your network plan or specific cloud application.

Complete the following steps to set custom hostnames for each Dataproc cluster node. You store the hostname on the nodes using the gcloud CLI and the GCP metadata service (see Storing and retrieving instance metadata).

  1. Add a hostname to the Dataproc master node.

    gcloud compute instances add-metadata <master-instance-name> \
    --metadata hostname=<master-node-hostname> --zone <region-zone>
    

    For example:

    gcloud compute instances add-metadata cluster-fa79-m \
    --metadata hostname=master.cluster1.ccloud.net --zone us-central1-c
    
  2. Verify that the master node hostname is configured.

    gcloud compute instances describe <master-instance-name> --format='value[](metadata.items.hostname)' \
    --project=<my-gcp-project> --zone <region-zone>
    

    For example:

    gcloud compute instances describe cluster-fa79-m --format='value[](metadata.items.hostname)' \
    --project=ccloud-lab-47372 --zone us-central1-c
    master.cluster1.ccloud.net
    
  3. Add a hostname for each Dataproc worker node. Complete this step for all worker nodes.

    gcloud compute instances add-metadata <worker-instance-name> \
    --metadata hostname=<worker-node-hostname> --zone <region-zone>
    

    For example:

    gcloud compute instances add-metadata cluster-fa79-w-0 \
    --metadata hostname=worker0.cluster1.ccloud.net --zone us-central1-c
    
  4. Verify that the worker hostname is configured.

    gcloud compute instances describe <worker-instance-name> --format='value[](metadata.items.hostname)' \
    --project=<my-gcp-project> --zone <region-zone>
    

    For example:

    gcloud compute instances describe cluster-fa79-w-0 --format='value[](metadata.items.hostname)' \
    --project=ccloud-lab-47372 --zone us-central1-c
    worker0.cluster1.ccloud.net
    
  5. At this point, the hostnames would be lost if the nodes restarted. Make the master hostname persist across restarts.

    gcloud compute instances add-metadata <master-instance-name> \
    --metadata startup-script="sudo -s hostnamectl set-hostname <master-node-hostname>" \
    --zone <region-zone>
    

    For example:

    gcloud compute instances add-metadata cluster-fa79-m \
    --metadata startup-script="sudo -s hostnamectl set-hostname master.cluster1.ccloud.net" \
    --zone us-central1-c
    Updated [https://www.googleapis.com/compute/v1/projects/ccloud-lab-47372/zones/us-central1-c/instances/cluster-fa79-m].
    
  6. Verify that the master node startup script is configured.

    gcloud compute instances describe <master-instance-name> --format='value[](metadata.items.startup-script)' \
    --project=<my-gcp-project> --zone <region-zone>
    

    For example:

    gcloud compute instances describe cluster-fa79-m --format='value[](metadata.items.startup-script)' \
    --project=ccloud-lab-47372 --zone us-central1-c
    sudo -s hostnamectl set-hostname master.cluster1.ccloud.net
    
  7. Make the worker hostnames persist on restart. Complete this step for all worker nodes.

    gcloud compute instances add-metadata <worker-instance-name> \
    --metadata startup-script="sudo -s hostnamectl set-hostname <worker-node-hostname>" \
    --zone <region-zone>
    

    For example:

    gcloud compute instances add-metadata cluster-fa79-w-0 \
    --metadata startup-script="sudo -s hostnamectl set-hostname worker0.cluster1.ccloud.net" \
    --zone us-central1-c
    Updated [https://www.googleapis.com/compute/v1/projects/ccloud-lab-47372/zones/us-central1-c/instances/cluster-fa79-w-0].
    
  8. Verify that the worker node startup script is configured. Complete this step for all worker nodes.

    gcloud compute instances describe <worker-instance-name> --format='value[](metadata.items.startup-script)' \
    --project=<my-gcp-project> --zone <region-zone>
    

    For example:

    gcloud compute instances describe cluster-fa79-w-0 --format='value[](metadata.items.startup-script)' \
    --project=ccloud-lab-47372 --zone us-central1-c
    sudo -s hostnamectl set-hostname worker0.cluster1.ccloud.net
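
Because the same two add-metadata commands are repeated for every node, the steps above can be scripted. The following is a minimal sketch that prints the commands for each node instead of running them. The instance-to-hostname pairs come from the examples in this step (worker1 is assumed to map to cluster-fa79-w-1), and the zone is the one shown in the example instance listing.

```shell
ZONE="us-central1-c"   # zone shown in the example instance listing

# Print the two add-metadata commands for one node: the hostname key and the
# startup script that reapplies the hostname after a restart.
print_hostname_commands() (
  instance="$1"; node_hostname="$2"
  printf 'gcloud compute instances add-metadata %s --metadata hostname=%s --zone %s\n' \
    "$instance" "$node_hostname" "$ZONE"
  printf 'gcloud compute instances add-metadata %s --metadata startup-script="sudo -s hostnamectl set-hostname %s" --zone %s\n' \
    "$instance" "$node_hostname" "$ZONE"
)

# instance-name=hostname pairs; adjust to match your cluster.
for pair in \
  cluster-fa79-m=master.cluster1.ccloud.net \
  cluster-fa79-w-0=worker0.cluster1.ccloud.net \
  cluster-fa79-w-1=worker1.cluster1.ccloud.net
do
  # ${pair%%=*} is the text before "=", ${pair#*=} is the text after it.
  print_hostname_commands "${pair%%=*}" "${pair#*=}"
done
```

After running the printed commands, the describe commands shown in the verification steps can confirm that both metadata keys are set on each node.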
    

Step 3: Verify external and internal IP mapping

Complete the following steps to verify that the external and internal IP mappings are configured properly.

  1. Open a new terminal session and use nslookup to get the external address mappings. Use the hostname of each node, and repeat this step for every node in the cluster.

    nslookup <cluster-node-hostname>
    

    For example:

    nslookup master.cluster1.ccloud.net
    Server:     192.168.86.1
    Address:    192.168.86.1#53
    
    Non-authoritative answer:
    Name:    master.cluster1.ccloud.net
    Address: 208.91.197.26
    
  2. (Optional) Use ping to verify reachability to each node. Use the <cluster-node-hostname>.

    For example:

    ping master.cluster1.ccloud.net
    PING master.cluster1.ccloud.net (208.91.197.26): 56 data bytes
    64 bytes from 208.91.197.26: icmp_seq=0 ttl=240 time=58.091 ms
    64 bytes from 208.91.197.26: icmp_seq=1 ttl=240 time=57.666 ms
    64 bytes from 208.91.197.26: icmp_seq=2 ttl=240 time=59.568 ms
    
  3. Launch an SSH terminal session on one of the worker nodes. The example below shows the gcloud CLI command you can use.

    gcloud beta compute ssh --zone "<region-zone>" "<cluster-node-hostname>" --project "<my-gcp-project>"
    

    For example:

    gcloud beta compute ssh --zone "us-central1-c" "worker0.cluster1.ccloud.net" --project "ccloud-lab-47372"
    
    
    Updating project ssh metadata...
    
    Updated [https://www.googleapis.com/compute/beta/projects/ccloud-lab-47372].
    Updating project ssh metadata...done.
    Waiting for SSH key to propagate.
    Warning: Permanently added [] to the list of known hosts.
    
    ... omitted
    
  4. On the Dataproc worker node, use nslookup to get the internal address mappings for the master node. Use the hostname for each node. Complete this step for all worker nodes.

    nslookup master.cluster1.ccloud.net
    Server:     192.168.86.1
    Address:    192.168.86.1#53
    Non-authoritative answer:
    Name: master.cluster1.ccloud.net
    Address: 10.128.0.6
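
The point of this verification is that the same hostname resolves to a public address from outside the VPC and to a private address from inside it. As a small aid, the following sketch labels an IPv4 address as private or public by checking it against the RFC 1918 ranges. The function names are illustrative, and the sketch does not perform any DNS lookups itself; pass it the addresses that nslookup returns.

```shell
# Return success (0) when the IPv4 address is in an RFC 1918 private range:
# 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Label an address so lookup results can be compared against expectations.
classify_ip() {
  if is_private_ipv4 "$1"; then
    echo "private"
  else
    echo "public"
  fi
}

classify_ip 10.128.0.6     # internal address of the example master node
classify_ip 34.67.10.174   # external address of the example master node
```

A lookup from a worker node is expected to classify as private, and a lookup from outside the VPC as public. Any other combination suggests that a record set is in the wrong zone.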
    

Step 4: Make core-site.xml and hdfs-site.xml modifications

Note

If you are using the default GCP hostnames, you do not have to complete all of the steps in this procedure. However, make sure to verify everything is set up properly at each step and make sure to add the public DNS name on each worker node in the step where this is requested.

Complete the following steps to modify the core-site.xml and hdfs-site.xml configuration files to use the new hostnames.

  1. Edit the /etc/hadoop/conf/core-site.xml on the master node and all worker nodes. Update the configuration to refer to the master hostname. The following uses the example master hostname created earlier.

    ... omitted
    
    <property>
      <name>fs.default.name</name>
      <value>hdfs://master.cluster1.ccloud.net</value>
      <description>The old FileSystem used by FsShell.</description>
    </property>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://master.cluster1.ccloud.net</value>
      <description>
        The name of the default file system. A URI whose scheme and authority
        determine the FileSystem implementation. The uri's scheme determines
        the config property (fs.SCHEME.impl) naming the FileSystem
        implementation class. The uri's authority is used to determine the
        host, port, etc. for a filesystem.
      </description>
    </property>
    
    ... omitted
    
  2. Edit the /etc/hadoop/conf/hdfs-site.xml on the master node and all worker nodes. Update the configuration to refer to the master hostname. The following uses the example master hostname created earlier.

    ... omitted
    
    <property>
      <name>dfs.namenode.rpc-address</name>
      <value>master.cluster1.ccloud.net:8020</value>
      <description>
        RPC address that handles all clients requests. If empty then we'll get
        the value from fs.default.name. The value of this property will take the
        form of hdfs://nn-host1:rpc-port.
      </description>
    </property>
    
    ... omitted
    
    <property>
      <name>dfs.namenode.servicerpc-address</name>
      <value>master.cluster1.ccloud.net:8051</value>
      <final>false</final>
      <source>Dataproc Cluster Properties</source>
    </property>
    
    ... omitted
    
    <property>
      <name>dfs.namenode.lifeline.rpc-address</name>
      <value>master.cluster1.ccloud.net:8050</value>
      <final>false</final>
      <source>Dataproc Cluster Properties</source>
    </property>
    
    ... omitted
    
  3. At the end of the hdfs-site.xml file on each worker node, before the closing </configuration> tag, add a <property> section that contains the public DNS name of that node. Repeat this on every worker node, using each node's own name. This step is required even if you use the default GCP hostnames.

    ... end of file
    
    <property>
      <name>dfs.datanode.hostname</name>
      <value>cluster-fa79-w-0.ccloud.dataproc.lab.net</value>
      <description>
         obscure property
      </description>
    </property>
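
Because the dfs.datanode.hostname value differs on every worker node, it can help to generate the property block rather than type it by hand. The following sketch prints the block for a given public DNS name. The function name and the description text are illustrative, and the output still has to be placed inside the <configuration> element of hdfs-site.xml manually.

```shell
# Print a <property> block that sets dfs.datanode.hostname to the given
# public DNS name. The description text is illustrative only.
datanode_hostname_property() {
  cat <<EOF
<property>
  <name>dfs.datanode.hostname</name>
  <value>$1</value>
  <description>Public DNS name reported by this DataNode.</description>
</property>
EOF
}

# Example for the first worker node, using its public DNS record from Step 1.
datanode_hostname_property cluster-fa79-w-0.ccloud.dataproc.lab.net
```

After editing the file, running hdfs getconf -confKey dfs.datanode.hostname on the worker node should, assuming the standard Hadoop CLI is available there, echo the value back.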
    

Step 5: Make additional configuration modifications

Note

If you are using the default GCP hostnames, you do not have to complete all of the steps in this procedure. However, make sure to verify everything is set up properly at each step.

Complete the following steps to make additional configuration changes to the nodes_include configuration file and to /etc/hosts on each node. You do not have to add these lines if you are using the default GCP hostnames.

  1. Edit the /etc/hadoop/conf/nodes_include on the master node. Add all worker node hostnames. The example below shows the worker hostnames created earlier.

    ... omitted
    
    worker0.cluster1.ccloud.net
    worker1.cluster1.ccloud.net
    
  2. Launch an SSH terminal session on the master node. Add the master hostname and internal IP address to /etc/hosts. The additional line is highlighted in the example below.

    127.0.0.1   localhost
    ::1         localhost ip6-localhost ip6-loopback
    ff02::1     ip6-allnodes
    ff02::2     ip6-allrouters
    10.128.0.6 master.cluster1.ccloud.net  # <-- add this line
    10.128.0.6 cluster-fa79-m.c.ccloud.dataproc.lab.net.internal cluster-fa79-m  # Added by Google
    169.254.169.254 metadata.google.internal  # Added by Google
    
  3. Launch an SSH terminal session on a worker node. Add the worker hostname and internal IP address to /etc/hosts. The additional line is highlighted in each example below. Complete this step for all worker nodes.

    127.0.0.1 localhost
    ::1               localhost ip6-localhost ip6-loopback
    ff02::1           ip6-allnodes
    ff02::2           ip6-allrouters
    10.128.0.2 worker0.cluster1.ccloud.net  # <-- add this line
    10.128.0.2 cluster-fa79-w-0.c.ccloud.dataproc.lab.net.internal cluster-fa79-w-0  # Added by Google
    169.254.169.254 metadata.google.internal  # Added by Google
    
    127.0.0.1 localhost
    ::1               localhost ip6-localhost ip6-loopback
    ff02::1           ip6-allnodes
    ff02::2           ip6-allrouters
    10.128.0.3 worker1.cluster1.ccloud.net  # <-- add this line
    10.128.0.3 cluster-fa79-w-1.c.ccloud.dataproc.lab.net.internal cluster-fa79-w-1  # Added by Google
    169.254.169.254 metadata.google.internal  # Added by Google
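
The /etc/hosts edits above can be applied repeatably with a small helper. The following sketch inserts the new line ahead of the first existing line for the same IP address, matching the layout in the examples, and does nothing if the entry is already present. The function name is hypothetical, and the demonstration runs against a temporary copy rather than the real /etc/hosts, which requires root privileges to modify.

```shell
# Insert "<ip> <hostname>" ahead of the first existing line for that IP,
# or append it if the IP is not listed. No change if the pair exists.
add_hosts_entry() (
  hosts_file="$1"; ip="$2"; name="$3"
  # Exact field comparison avoids treating dots as regex wildcards.
  if awk -v ip="$ip" -v name="$name" \
      '$1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
       END { exit !found }' "$hosts_file"; then
    exit 0
  fi
  tmp="$(mktemp)"
  awk -v ip="$ip" -v name="$name" \
      '!inserted && $1 == ip { print ip " " name; inserted = 1 }
       { print }
       END { if (!inserted) print ip " " name }' "$hosts_file" > "$tmp"
  cat "$tmp" > "$hosts_file"   # overwrite in place to keep owner and mode
  rm -f "$tmp"
)

# Demonstration on a temporary file that mimics the master node's hosts file.
demo="$(mktemp)"
printf '%s\n' \
  '127.0.0.1   localhost' \
  '10.128.0.6 cluster-fa79-m.c.ccloud.dataproc.lab.net.internal cluster-fa79-m  # Added by Google' \
  > "$demo"
add_hosts_entry "$demo" 10.128.0.6 master.cluster1.ccloud.net
cat "$demo"
rm -f "$demo"
```

Running the helper a second time with the same arguments leaves the file unchanged, so it is safe to include in a provisioning script.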
    

Step 6: Configure the Dataproc connector

Complete the Dataproc connector configuration steps, and set the gcp.dataproc.use.datanode.hostname configuration property to true. The example below shows this property added to the connector configuration. The property defaults to false if it is not set. Note that for HA deployments, the gcp.dataproc.namenode property supports a comma-separated list of namenodes.

{
  "connector.class": "DataprocSink",
  "name": "dataproc-test",
  "kafka.api.key": "<my-kafka-api-key>",
  "kafka.api.secret": "<my-kafka-api-secret>",
  "topics": "<topic-name>",
  "input.data.format": "AVRO",
  "gcp.dataproc.credentials.json": "<credentials-json-file-contents>",
  "gcp.dataproc.projectId": "<my-dataproc-project-ID>",
  "gcp.dataproc.cluster": "<my-dataproc-cluster-name>",
  "gcp.dataproc.namenode": "<public-IP-address or FQDN>",
  "gcp.dataproc.use.datanode.hostname": "true",
  "logs.dir": "<HDFS-logs-directory>",
  "output.data.format": "AVRO",
  "flush.size": "1000",
  "time.interval": "HOURLY",
  "tasks.max": "1"
}
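
Because gcp.dataproc.use.datanode.hostname defaults to false, it is easy to submit a configuration that omits it. The following sketch is a simple pre-submission check on a saved configuration file. It is a text pattern match rather than a full JSON parse, the function name is hypothetical, and the demonstration file contains only a few of the properties shown above.

```shell
# Succeed only if the file sets gcp.dataproc.use.datanode.hostname to "true".
# This is a pattern match on the text, not a JSON validator.
uses_datanode_hostname() {
  grep -Eq '"gcp\.dataproc\.use\.datanode\.hostname"[[:space:]]*:[[:space:]]*"true"' "$1"
}

# Demonstration with a trimmed-down configuration in a temporary file.
config="$(mktemp)"
cat > "$config" <<'EOF'
{
  "gcp.dataproc.namenode": "<public-IP-address or FQDN>",
  "gcp.dataproc.use.datanode.hostname": "true",
  "tasks.max": "1"
}
EOF

if uses_datanode_hostname "$config"; then
  echo "datanode hostname resolution is enabled"
else
  echo "gcp.dataproc.use.datanode.hostname is missing or not set to true"
fi
rm -f "$config"
```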

After the configuration settings have been completed, the Dataproc cluster VPC nodes respond over a public IP endpoint to the Confluent Cloud cluster and managed Dataproc connector as shown below.

Public IP address response to Confluent Cloud
