GitHub Source Connector for Confluent Platform

The Kafka Connect GitHub Source connector writes metadata from GitHub to Apache Kafka® topics, either by detecting changes in real time or by consuming the history. The connector polls data through the GitHub APIs, converts it into Kafka records, and then pushes the records into a Kafka topic. Each record from GitHub is converted into exactly one Kafka record.

Features

The GitHub Source connector offers the following features:

At least once delivery

This connector guarantees that records are delivered at least once to the Kafka topic. If the connector restarts, there may be some duplicate records in the Kafka topic.
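
Because duplicates are possible, a downstream consumer may want to deduplicate. The following is a minimal sketch, not part of the connector. It assumes records have already been decoded into Python dictionaries shaped like the sample output in the Quick Start, and that the pair of `type` and `id` identifies a record uniquely (an assumption, not something the connector documents). The function names are hypothetical.

```python
def record_key(record):
    """Build a dedup key from the record's resource type and id.

    Assumes the Avro-JSON shape shown in the Quick Start sample output,
    where both fields are wrapped as {"string": ...}.
    """
    return (record["type"]["string"], record["id"]["string"])


def drop_duplicates(records):
    """Yield each record once, keeping the first occurrence of every key."""
    seen = set()  # grows without bound; fine for a sketch, not for a long-running consumer
    for record in records:
        key = record_key(record)
        if key in seen:
            continue
        seen.add(key)
        yield record
```

An in-memory set is only suitable for short runs; a long-running consumer would need a bounded or persistent store.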

Multiple tasks

The GitHub Source connector supports running only one task; one repository is covered by one task.

API rate limit awareness

The connector stops fetching records from GitHub when the API rate limit is exceeded. Once the API rate limit resets, the connector will resume fetching records.
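
The pause-and-resume behavior can be illustrated with the rate limit headers that the GitHub REST API returns (`X-RateLimit-Remaining` and `X-RateLimit-Reset`, the latter in UTC epoch seconds). This is a sketch of the general idea only; the connector's internal logic is not shown in this document, and the function name is hypothetical.

```python
def seconds_until_resume(headers, now_epoch):
    """Return how many seconds to pause before sending the next request.

    headers:   response headers from a GitHub API call, as a dict of strings.
    now_epoch: current time in UTC epoch seconds.
    """
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    reset_epoch = int(headers.get("X-RateLimit-Reset", "0"))
    if remaining > 0:
        return 0  # budget left, no need to wait
    # Budget exhausted: wait until the reset time, never a negative duration.
    return max(0, reset_epoch - now_epoch)
```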

Supports HTTPS proxy

The connector can connect to GitHub using an HTTPS proxy server. To configure the proxy, you can set http.proxy.host, http.proxy.port, http.proxy.user and http.proxy.password in the configuration file. The connector has been tested with HTTPS proxy with basic authentication.
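
As an illustration, the four proxy properties can be added to an existing connector configuration before it is submitted. The helper below is a hypothetical sketch; the host name and port in the usage note are placeholders, not defaults.

```python
def with_proxy(config, host, port, user=None, password=None):
    """Return a copy of a connector config with the proxy properties set."""
    updated = dict(config)
    updated["http.proxy.host"] = host
    updated["http.proxy.port"] = str(port)  # Connect configs are string-valued
    # Credentials are only added when supplied (basic authentication).
    if user is not None:
        updated["http.proxy.user"] = user
    if password is not None:
        updated["http.proxy.password"] = password
    return updated
```

For example, `with_proxy(config, "proxy.example.com", 3128, "alice", "secret")` returns a new dictionary and leaves `config` unchanged.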

Limitations

  • For resources that do not support fetching records by datetime, new records are fetched at an interval specified by the request.interval.ms configuration. Records for these resources might be duplicated every time the connector restarts.
  • The connector cannot detect the deletion of data on GitHub.
  • In the case of connector restarts, the Kafka topic might end up having records that are out of order.
  • GitHub has a defined API request limit. This limit is 5,000 requests per hour. Once this rate limit is exceeded, the connector waits until the API request limit resets.
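
To get a rough feel for the request budget, the arithmetic can be sketched as follows. This is a simplification: it assumes every poll costs exactly one API request per polled (repository, resource) pair and ignores pagination, so real usage may be higher. The function names and the `streams` parameter are hypothetical.

```python
MS_PER_HOUR = 60 * 60 * 1000


def min_interval_ms(hourly_limit=5000):
    """Smallest average spacing between requests that stays within the limit."""
    return MS_PER_HOUR / hourly_limit


def requests_per_hour(interval_ms, streams=1):
    """Estimated requests per hour when each stream polls once per interval."""
    return streams * (MS_PER_HOUR // interval_ms)
```

With the 5,000 requests per hour limit, `min_interval_ms()` evaluates to 720.0, that is, one request every 0.72 seconds on average.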

GitHub Resources

The GitHub connector supports fetching records from the following resources. Each resource has its own record schema.

  • assignees: Available assignees for the specified repositories.
  • collaborators: Collaborators for the specified repositories.
  • issues: Issues in all GitHub states.
  • comments: Issue comments.
  • commits: Base branch commits only.
  • pull_requests: Pull requests in all GitHub states.
  • releases: Releases for the specified repositories.
  • reviews: Reviews on pull requests. Reviews can only be fetched with pull requests.
  • review_comments: Review comments on pull requests.
  • stargazers: Stargazers for the specified repositories.

License

You can use this connector for a 30-day trial period without a license key.

After 30 days, you must purchase a connector subscription, which includes Confluent enterprise license keys along with enterprise-level support for Confluent Platform and your connectors. If you are a subscriber, you can contact Confluent Support at support@confluent.io for more information.

See Confluent Platform license for license properties and Confluent License Properties for information about the license topic.

Configuration Properties

For a complete list of configuration properties for this connector, see Configuration Reference for GitHub Source Connector for Confluent Platform.

For an example of how to get Kafka Connect connected to Confluent Cloud, see Connect Self-Managed Kafka Connect to Confluent Cloud.

Install the GitHub Source Connector

You can install this connector by using the confluent connect plugin install command, or by manually downloading the ZIP file.

Prerequisites

  • You must install the connector on every machine where Connect will run.

  • Kafka Broker: Confluent Platform 3.3.0 or later.

  • Connect: Confluent Platform 4.1.0 or later.

  • Java 1.8.

  • No additional setup is required on your GitHub account for this connector to work, other than an access token with repository and user privileges. For more details, see Creating a personal access token for the command line.

  • An installation of the latest connector version.

    To install the latest connector version, navigate to your Confluent Platform installation directory and run the following command:

    confluent connect plugin install confluentinc/kafka-connect-github:latest
    

    You can install a specific version by replacing latest with a version number as shown in the following example:

    confluent connect plugin install confluentinc/kafka-connect-github:2.1.1
    

Install the connector manually

Download and extract the ZIP file for your connector and then follow the manual connector installation instructions.

Quick Start

In this quick start, you configure the GitHub Source connector to fetch the GitHub users who have starred the Apache Kafka repository since 2019-01-01 and write them to a Kafka topic called github-stargazers.

Start Confluent

Start the Confluent services using the following Confluent CLI command:

confluent local services start

Important

Do not use the Confluent CLI in production environments.

Properties-based example

Create a file called github-source-quickstart.properties with the following properties:

name=MyGithubConnector
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
tasks.max=1
connector.class=io.confluent.connect.github.GithubSourceConnector
github.service.url=https://api.github.com
github.access.token=<ACCESS-TOKEN>
github.repositories=apache/kafka
github.resources=stargazers
github.since=2019-01-01
topic.name.pattern=github-${resourceName}
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
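
With this configuration, the topic.name.pattern value github-${resourceName} and the stargazers resource produce the topic github-stargazers named in the quick start. The substitution can be illustrated with Python's standard `string.Template`, which happens to use the same `${...}` placeholder syntax. This only mimics the outcome; it is not the connector's own code, and `resolve_topic` is a hypothetical name.

```python
from string import Template


def resolve_topic(pattern, resource_name):
    """Fill the ${resourceName} placeholder of a topic name pattern."""
    return Template(pattern).substitute(resourceName=resource_name)


print(resolve_topic("github-${resourceName}", "stargazers"))  # prints "github-stargazers"
```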

Next, load the Source connector.

Caution

You must include a double dash (--) between the topic name and your flag. For more information, see this post.

confluent local services connect connector load MyGithubConnector --config github-source-quickstart.properties

Your output should resemble the following:

{
    "name": "MyGithubConnector",
    "config": {
        "connector.class": "io.confluent.connect.github.GithubSourceConnector",
        "tasks.max": "1",
        "confluent.topic.bootstrap.servers":"localhost:9092",
        "confluent.topic.replication.factor":"1",
        "github.service.url":"https://api.github.com",
        "github.repositories":"apache/kafka",
        "github.resources":"stargazers",
        "github.since":"2019-01-01",
        "github.access.token":"<Your-Github-Access-Token>",
        "topic.name.pattern":"github-${resourceName}",
        "key.converter":"io.confluent.connect.avro.AvroConverter",
        "key.converter.schema.registry.url":"http://localhost:8081",
        "value.converter":"io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url":"http://localhost:8081"
    },
    "tasks": [],
    "type": null
}

Enter the following command to confirm that the connector is in a RUNNING state:

confluent local services connect connector status MyGithubConnector

The output should resemble:

{
   "name":"MyGithubConnector",
   "connector":
   {
      "state":"RUNNING",
      "worker_id":"127.0.1.1:8083"
   },
   "tasks":
   [
      {
         "id":0,
         "state":"RUNNING",
         "worker_id":"127.0.1.1:8083"
      }
   ],
   "type":"source"
}

REST-based example

Use this example with distributed workers. Write the following JSON to config.json, configure all of the required values, and then post the configuration to one of the distributed Connect workers with the curl command shown after the example. For more information, see the Kafka Connect REST API documentation.

{
   "name" : "MyGithubConnector",
   "config" :
   {
      "connector.class" : "io.confluent.connect.github.GithubSourceConnector",
      "confluent.topic.bootstrap.servers": "localhost:9092",
      "confluent.topic.replication.factor": "1",
      "tasks.max" : "1",
      "github.service.url":"https://api.github.com",
      "github.access.token":"< Github-Access-Token >",
      "github.repositories":"apache/kafka",
      "github.resources":"stargazers",
      "github.since":"2019-01-01",
      "topic.name.pattern":"github-${resourceName}",
      "key.converter":"io.confluent.connect.avro.AvroConverter",
      "key.converter.schema.registry.url":"http://localhost:8081",
      "value.converter":"io.confluent.connect.avro.AvroConverter",
      "value.converter.schema.registry.url":"http://localhost:8081"
   }
}

Note

For staging or production use:

  • Change the confluent.topic.bootstrap.servers property to include your broker address(es).
  • Change confluent.topic.replication.factor to 3.
  • Change http://localhost:8083/ to the endpoint of one of your Connect worker(s).

Use curl to post a configuration to one of the Connect workers.

curl -sS -X POST -H 'Content-Type: application/json' --data @config.json http://localhost:8083/connectors
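
The same request can be issued from a script. The sketch below mirrors the curl command using only the Python standard library; the function names are hypothetical, and the base URL assumes the local worker endpoint used in this quick start.

```python
import json
import urllib.request


def build_create_request(connector, base_url="http://localhost:8083"):
    """Build the POST /connectors request for a {"name": ..., "config": ...} payload."""
    body = json.dumps(connector).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/connectors",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_connector(connector, base_url="http://localhost:8083"):
    """Send the request to a Connect worker and return the decoded JSON response."""
    with urllib.request.urlopen(build_create_request(connector, base_url)) as response:
        return json.load(response)
```

To submit the file from this example, load it with `json.load(open("config.json"))` and pass the result to `create_connector`.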

Confirm that the connector is in a RUNNING state by running the following command:

curl http://localhost:8083/connectors/MyGithubConnector/status

The output should resemble the example below:

{
   "name":"MyGithubConnector",
   "connector":{
      "state":"RUNNING",
      "worker_id":"127.0.1.1:8083"
   },
   "tasks":[
      {
         "id":0,
         "state":"RUNNING",
         "worker_id":"127.0.1.1:8083"
      }
   ],
   "type":"source"
}
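
When the check is automated, the status document can be inspected programmatically. The following sketch takes the JSON shown above, already parsed into a dictionary, and reports whether the connector and all of its tasks are RUNNING. The function name is hypothetical, and treating an empty task list as "not yet running" is a design choice of this sketch.

```python
def is_fully_running(status):
    """Return True when the connector and every task report RUNNING."""
    tasks = status.get("tasks", [])
    if not tasks:
        return False  # no tasks started yet, for example right after loading
    states = [status["connector"]["state"]]
    states.extend(task["state"] for task in tasks)
    return all(state == "RUNNING" for state in states)
```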

Enter the following command to consume records written by the connector to the Kafka topic:

./kafka-avro-console-consumer --bootstrap-server localhost:9092 --topic github-stargazers --from-beginning

The output should resemble the example below:

{
    "type": {
      "string": "STARGAZERS"
    },
    "createdAt": null,
    "data": {
      "data": {
        "login": {
          "string": "User.Name"
        },
        "id": {
          "int": 1234
        },
        "node_id": {
          "string": "MDQ6VXNlcjM0OTE3MTE="
        },
        "avatar_url": {
          "string": "https://avatars2.githubusercontent.com/u/1234?v=4"
        },
        "gravatar_id": {
          "string": ""
        },
        "url": {
          "string": "https://api.github.com/users/User.Name"
        },
        "html_url": {
          "string": "https://github.com/User.Name"
        },
        "followers_url": {
          "string": "https://api.github.com/users/User.Name/followers"
        },
        "following_url": {
          "string": "https://api.github.com/users/User.Name/following{/other_user}"
        },
        "gists_url": {
          "string": "https://api.github.com/users/User.Name/gists{/gist_id}"
        },
        "starred_url": {
          "string": "https://api.github.com/users/User.Name/starred{/owner}{/repo}"
        },
        "subscriptions_url": {
          "string": "https://api.github.com/users/User.Name/subscriptions"
        },
        "organizations_url": {
          "string": "https://api.github.com/users/User.Name/orgs"
        },
        "repos_url": {
          "string": "https://api.github.com/users/User.Name/repos"
        },
        "events_url": {
          "string": "https://api.github.com/users/User.Name/events{/privacy}"
        },
        "received_events_url": {
          "string": "https://api.github.com/users/User.Name/received_events"
        },
        "type": {
          "string": "User"
        },
        "site_admin": {
          "boolean": false
        }
      }
    },
    "id": {
      "string": "1234"
    }
  }
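
In this output, nullable fields appear wrapped in a single-key object naming the Avro type, such as {"string": "User.Name"}. For downstream processing it can be convenient to strip these wrappers. The sketch below is a heuristic, not part of the connector: it treats any single-key object whose key is an Avro primitive type name as a wrapper, so a genuine record with exactly one field named, say, `string` would be misread. The names are hypothetical.

```python
AVRO_BRANCHES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def unwrap(value):
    """Recursively replace {"<avro type>": x} wrappers with x."""
    if isinstance(value, dict):
        if len(value) == 1:
            ((branch, inner),) = value.items()
            if branch in AVRO_BRANCHES:
                return unwrap(inner)
        # Ordinary record: keep the keys, unwrap each field value.
        return {key: unwrap(item) for key, item in value.items()}
    if isinstance(value, list):
        return [unwrap(item) for item in value]
    return value
```

Applied to the record above, `unwrap(record)["data"]["data"]["login"]` yields the plain string "User.Name".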

Clean up resources

To clean up the resources, complete the following steps:

  1. Delete the connector:

    confluent local services connect connector unload MyGithubConnector
    
  2. Stop Confluent Platform:

    confluent local services stop