SFTP Source Connector for Confluent Cloud

The fully-managed SFTP Source connector for Confluent Cloud watches an SFTP directory and reads data from new files as they are written to the directory. Each file is parsed according to the input.file.parser.format configuration property, which accepts one of the following values (also selectable in the UI):

  • BINARY
  • CSV
  • JSON (the default)
  • SCHEMALESS_JSON

After the connector reads a file, it places the file in the finished.path directory or, if the file contains errors, in the error.path directory.

Note

This is a Quick Start for the fully-managed cloud connector. If you are installing the connector locally for Confluent Platform, see SFTP Source Connector for Confluent Platform.

Features

The SFTP Source connector supports the following features:

  • At least once delivery: The connector guarantees that records are delivered at least once to the Kafka topic, provided that the parsed file row is valid.
  • Supports one task: The connector supports running one task per connector instance.
  • Supported output data formats: The connector supports Avro, JSON Schema (JSON_SR), Protobuf, JSON (schemaless), Bytes, and String output record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf).

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.

Limitations

Be sure to review the following information.

Quick Start

Use this quick start to get up and running with the Confluent Cloud SFTP Source connector. The quick start provides the basics of selecting the connector and configuring it to get data from an SFTP host.

Prerequisites
  • Authorized access to a Confluent Cloud cluster on Amazon Web Services (AWS), Microsoft Azure (Azure), or Google Cloud.
  • The Confluent CLI installed and configured for the cluster. See Install the Confluent CLI.
  • Access to an SFTP host.
  • Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON_SR (JSON Schema), or Protobuf).
  • At least one Kafka topic must exist in your Confluent Cloud cluster to receive the data before you create the source connector.

Using the Confluent Cloud Console

Step 1: Launch your Confluent Cloud cluster

See the Quick Start for Confluent Cloud for installation instructions.

Step 2: Add a connector

In the left navigation menu, click Connectors. If you already have connectors in your cluster, click + Add connector.

Step 3: Select your connector

Click the SFTP Source connector card.

Note

  • Make sure you have all your prerequisites completed.
  • An asterisk ( * ) designates a required entry.
  • The steps provide information about how to use the required configuration properties. See Configuration Properties for other configuration property values and descriptions.

Step 4: Enter the connector details

At the Add SFTP Source Connector screen, complete the following:

Select the topic you want to send data to from the Topics list. To create a new topic, click +Add new topic.

Step 5: Check for records

Verify that records are being produced in the Kafka topic.

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.

Tip

When you launch a connector, a Dead Letter Queue topic is automatically created. See Confluent Cloud Dead Letter Queue for details.

Using the Confluent CLI

To set up and run the connector using the Confluent CLI, complete the following steps.

Note

Make sure you have all your prerequisites completed.

Step 1: List the available connectors

Enter the following command to list available connectors:

confluent connect plugin list

Step 2: List the connector configuration properties

Enter the following command to show the connector configuration properties:

confluent connect plugin describe <connector-plugin-name>

The command output shows the required and optional configuration properties.

Step 3: Create the connector configuration file

Create a JSON file that contains the connector configuration properties. The following example shows the required connector properties.

{
  "connector.class": "SftpSource",
  "name": "SftpSourceConnector_0",
  "kafka.api.key": "****************",
  "kafka.api.secret": "*********************************",
  "kafka.topic": "orders",
  "output.data.format": "JSON",
  "input.file.parser.format": "CSV",
  "schema.generation.enable": "true",
  "sftp.host": "192.168.1.231",
  "sftp.username": "connect-user",
  "sftp.password": "****************",
  "input.path": "/path/to/data",
  "finished.path": "/path/to/finished",
  "error.path": "/path/to/error",
  "input.file.pattern": "csv-sftp-source.csv",
  "tasks.max": "1"
}
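Hand-edited JSON is prone to small syntax errors such as trailing commas or stray characters in property names. The following is a minimal sketch, not part of the connector or the Confluent CLI, that generates the configuration file from a Python dictionary so that the output is always valid JSON. The helper name write_config is hypothetical, and the credential values are placeholders.

```python
import json
from pathlib import Path

# Placeholder values: replace the credentials, host, and paths with your own.
config = {
    "connector.class": "SftpSource",
    "name": "SftpSourceConnector_0",
    "kafka.api.key": "<kafka-api-key>",
    "kafka.api.secret": "<kafka-api-secret>",
    "kafka.topic": "orders",
    "output.data.format": "JSON",
    "input.file.parser.format": "CSV",
    "schema.generation.enable": "true",
    "sftp.host": "192.168.1.231",
    "sftp.username": "connect-user",
    "sftp.password": "<sftp-password>",
    "input.path": "/path/to/data",
    "finished.path": "/path/to/finished",
    "error.path": "/path/to/error",
    "input.file.pattern": "csv-sftp-source.csv",
    "tasks.max": "1",
}


def write_config(path, properties):
    """Serialize the properties as JSON, write them to path, and return the text."""
    text = json.dumps(properties, indent=2)
    Path(path).write_text(text + "\n", encoding="utf-8")
    return text
```

Calling write_config("sftp-source-config.json", config) produces a file that can be passed to the CLI with the --config-file option.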

Note the following property definitions:

  • "connector.class": Identifies the connector plugin name.
  • "name": Sets a name for your new connector.
  • "kafka.auth.mode": Identifies the connector authentication mode you want to use. There are two options: SERVICE_ACCOUNT or KAFKA_API_KEY (the default). To use an API key and secret, specify the configuration properties kafka.api.key and kafka.api.secret, as shown in the example configuration (above). To use a service account, specify the Resource ID in the property kafka.service.account.id=<service-account-resource-ID>. To list the available service account resource IDs, use the following command:

    confluent iam service-account list
    

    For example:

    confluent iam service-account list
    
       Id     | Resource ID |       Name        |    Description
    +---------+-------------+-------------------+-------------------
       123456 | sa-l1r23m   | sa-1              | Service account 1
       789101 | sa-l4d56p   | sa-2              | Service account 2
    
  • "kafka.topic": Enter the topic name or a comma-separated list of topic names.

  • "output.data.format": The connector supports Avro, JSON Schema (JSON_SR), Protobuf, JSON (schemaless), Bytes, and String output Kafka record value formats. Schema Registry must be enabled to use a Schema Registry-based format (for example, Avro, JSON Schema, or Protobuf). See Schema Registry Enabled Environments for additional information.

    Note

    Note the following relationship between output.data.format and the input.file.parser.format property.

    • If you use BINARY for input.file.parser.format, you must use BYTES for output.data.format.
    • If you use SCHEMALESS_JSON for input.file.parser.format, you must use STRING for output.data.format.
    • If you leave input.file.parser.format set to JSON (the default) or set it to CSV, you can use any format for output.data.format.
  • "input.file.parser.format": The parser format used to parse fetched files from the SFTP directory. Defaults to JSON. Options are BINARY, CSV, JSON, SCHEMALESS_JSON.

    Important

    • If you use JSON (the default) or CSV as the input.file.parser.format, then you must add the property schema.generation.enable and set it to true. If you set this property to false, you must provide a key.schema and a value.schema.

    • key.schema and value.schema properties require the actual schema, not the schema ID. To generate the schema, use the tool available here.

      Schema example:
      key.schema={\"name\" : \"com.example.users.UserKey\",\"type\" : \"STRUCT\",\"isOptional\" : false,
        \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false}}}
      value.schema={\"name\" : \"com.example.users.User\",\"type\" : \"STRUCT\",\"isOptional\" : false,
        \"fieldSchemas\" : {\"id\" : {\"type\" : \"INT64\",\"isOptional\" : false},
        \"first_name\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"last_name\" : {\"type\" : \"STRING\",
        \"isOptional\" : true},\"email\" : {\"type\" : \"STRING\",\"isOptional\" : true},
        \"gender\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"ip_address\" : {\"type\" : \"STRING\",
        \"isOptional\" : true},\"last_login\" : {\"type\" : \"STRING\",\"isOptional\" : true},
        \"account_balance\" : {\"name\" : \"org.apache.kafka.connect.data.Decimal\",\"type\" : \"BYTES\",
        \"version\" : 1,\"parameters\" : {\"scale\" : \"2\"},\"isOptional\" : true},
        \"country\" : {\"type\" : \"STRING\",\"isOptional\" : true},\"favorite_color\" : {\"type\" : \"STRING\",
        \"isOptional\" : true}}}
      
  • "sftp.host": Enter the host address for the SFTP server. For example, 192.168.1.231. The port defaults to 22. To change the port, add the property "sftp.port".

  • "sftp.username": Enter the user name that the connector will use to connect to the host. The "sftp.password" property is not required if a PEM file is used for key based authentication to the host.

  • "input.path": Add the SFTP directory from which the connector reads files that will be processed. This directory must exist and be writable by the connector.

  • "finished.path": Add the SFTP directory where the connector places files that are successfully processed. This directory must exist and be writable by the connector.

  • "error.path": Add the SFTP directory where the connector places files in which there are errors. This directory must exist and be writable by the connector.

  • "input.file.pattern": Add a regular expression to check input file names against. The expression must match the entire filename, which is the equivalent of Matcher.matches(). Using .* accepts all files in the directory.

  • "tasks.max": The connector supports running one task per connector.
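The relationship between input.file.parser.format and output.data.format described in the property definitions can be checked before the configuration is submitted. The following is a minimal sketch of such a check. The function name check_formats and the lookup table are hypothetical and only restate the rules listed in this section. They are not part of the connector.

```python
# Required output.data.format for each input.file.parser.format.
# None means that any output format is allowed.
REQUIRED_OUTPUT_FORMAT = {
    "BINARY": "BYTES",
    "SCHEMALESS_JSON": "STRING",
    "JSON": None,
    "CSV": None,
}


def check_formats(config):
    """Return a list of problems found in the format settings (empty if none)."""
    parser = config.get("input.file.parser.format", "JSON")  # JSON is the default
    output = config.get("output.data.format")
    if parser not in REQUIRED_OUTPUT_FORMAT:
        return [f"unknown input.file.parser.format: {parser}"]
    required = REQUIRED_OUTPUT_FORMAT[parser]
    if required is not None and output != required:
        return [f"{parser} requires output.data.format={required}, got {output}"]
    return []
```

For example, check_formats({"input.file.parser.format": "BINARY", "output.data.format": "AVRO"}) reports one problem, while the CSV and JSON combination used in the example configuration reports none.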

Single Message Transforms: See the Single Message Transforms (SMT) documentation for details about adding SMTs using the CLI.

See Configuration Properties for all property values and descriptions.

Step 4: Load the properties file and create the connector

Enter the following command to load the configuration and start the connector:

confluent connect cluster create --config-file <file-name>.json

For example:

confluent connect cluster create --config-file sftp-source-config.json

Example output:

Created connector SftpSourceConnector_0 lcc-do6vzd

Step 5: Check the connector status

Enter the following command to check the connector status:

confluent connect cluster list

Example output:

ID           |             Name            | Status  | Type   | Trace
+------------+-----------------------------+---------+--------+-------+
lcc-do6vzd   | SftpSourceConnector_0       | RUNNING | source |       |

Step 6: Check the Kafka topic

Verify that records are being produced to the Kafka topic.

For more information and examples to use with the Confluent Cloud API for Connect, see the Confluent Cloud API for Managed and Custom Connectors section.

Tip

When you launch a connector, a Dead Letter Queue topic is automatically created. See Confluent Cloud Dead Letter Queue for details.

Configuration Properties

Use the following configuration properties with the fully-managed connector. For self-managed connector property definitions and other details, see the connector docs in Self-managed connectors for Confluent Platform.

How should we connect to your data?

name

Sets a name for your connector.

  • Type: string
  • Valid Values: A string at most 64 characters long
  • Importance: high

Kafka Cluster credentials

kafka.auth.mode

Kafka Authentication mode. It can be one of KAFKA_API_KEY or SERVICE_ACCOUNT. It defaults to KAFKA_API_KEY mode.

  • Type: string
  • Default: KAFKA_API_KEY
  • Valid Values: KAFKA_API_KEY, SERVICE_ACCOUNT
  • Importance: high
kafka.api.key

Kafka API Key. Required when kafka.auth.mode==KAFKA_API_KEY.

  • Type: password
  • Importance: high
kafka.service.account.id

The Service Account that will be used to generate the API keys to communicate with Kafka Cluster.

  • Type: string
  • Importance: high
kafka.api.secret

Secret associated with Kafka API key. Required when kafka.auth.mode==KAFKA_API_KEY.

  • Type: password
  • Importance: high
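Which credential properties are needed depends on kafka.auth.mode. The following is a minimal sketch of that dependency, based only on the descriptions above. The function name missing_auth_properties is hypothetical, and the check is an illustration rather than the validation that Confluent Cloud performs.

```python
def missing_auth_properties(config):
    """Return the credential properties that the chosen auth mode still needs."""
    mode = config.get("kafka.auth.mode", "KAFKA_API_KEY")  # documented default
    if mode == "KAFKA_API_KEY":
        required = ["kafka.api.key", "kafka.api.secret"]
    elif mode == "SERVICE_ACCOUNT":
        required = ["kafka.service.account.id"]
    else:
        raise ValueError(f"unsupported kafka.auth.mode: {mode}")
    # A property counts as missing if it is absent or empty.
    return [name for name in required if not config.get(name)]
```

An empty list means that the credentials required for the selected mode are present.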

Which topic do you want to send data to?

kafka.topic

Identifies the topic name to write the data to.

  • Type: string
  • Importance: high

Schema Config

schema.context.name

Add a schema context name. A schema context represents an independent scope in Schema Registry. It is a separate sub-schema tied to topics in different Kafka clusters that share the same Schema Registry instance. If not used, the connector uses the default schema configured for Schema Registry in your Confluent Cloud environment.

  • Type: string
  • Default: default
  • Importance: medium

Output messages

output.data.format

Sets the output message format. Valid entries are AVRO, JSON_SR, PROTOBUF, JSON, STRING, or BYTES. Note that you must have Confluent Cloud Schema Registry configured if you use a schema-based message format such as AVRO, JSON_SR, or PROTOBUF.

  • Type: string
  • Importance: high

Input file parser format

input.file.parser.format

The parser used to parse files fetched from the SFTP directory. Options are BINARY, CSV, JSON, and SCHEMALESS_JSON.

  • Type: string
  • Default: JSON
  • Importance: high

SFTP Details

sftp.host

Host address of the SFTP server.

  • Type: string
  • Importance: high
sftp.port

Port number of the SFTP server.

  • Type: int
  • Default: 22
  • Importance: medium
sftp.username

Username for the SFTP connection.

  • Type: string
  • Importance: high
sftp.password

Password for the SFTP connection (not required if a PEM file is used for key-based authentication; see tls.pemfile).

  • Type: password
  • Importance: high
tls.pemfile

PEM file to be used for authentication via TLS.

  • Type: password
  • Importance: high
tls.passphrase

Passphrase that will be used to decrypt the private key if the given private key is encrypted.

  • Type: password
  • Importance: high

SFTP directory

input.path

The SFTP directory to read files that will be processed. This directory must exist and be writable by the user running Kafka Connect.

  • Type: string
  • Importance: high
finished.path

The SFTP directory to place files that have been successfully processed. This directory must exist and be writable by the user running Kafka Connect.

  • Type: string
  • Importance: high
error.path

The SFTP directory to place files in which there are error(s). This directory must exist and be writable by the user running Kafka Connect.

  • Type: string
  • Importance: high

File System

cleanup.policy

Determines how the connector cleans up files that have been successfully processed. NONE leaves the files in place, which could cause them to be reprocessed if the connector is restarted. DELETE removes the file from the filesystem. MOVE moves the file to the finished directory.

  • Type: string
  • Default: MOVE
  • Importance: medium
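The three cleanup.policy values can be pictured with a local-filesystem analogy. The following is a minimal sketch that mimics the described behavior with local files. The function clean_up is hypothetical and does not reflect how the connector performs these operations over SFTP.

```python
import shutil
from pathlib import Path


def clean_up(processed_file, policy, finished_dir=None):
    """Apply a cleanup policy to a processed file; return its resulting path or None."""
    source = Path(processed_file)
    if policy == "NONE":
        # The file stays in the input directory and may be picked up again.
        return source
    if policy == "DELETE":
        source.unlink()
        return None
    if policy == "MOVE":
        if finished_dir is None:
            raise ValueError("MOVE requires a finished directory")
        target = Path(finished_dir) / source.name
        shutil.move(str(source), str(target))
        return target
    raise ValueError(f"unsupported cleanup.policy: {policy}")
```

With MOVE (the default), the file leaves the input directory, so it is not read again on restart.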
input.file.pattern

Regular expression to check input file names against. This expression must match the entire filename. The equivalent of Matcher.matches().

  • Type: string
  • Importance: high
behavior.on.error

Determines whether the task halts when it encounters an error or continues to the next file.

  • Type: string
  • Default: FAIL
  • Importance: high
file.minimum.age.ms

The amount of time in milliseconds after the file was last written to before the file can be processed. With the default of 0, the connector processes all files irrespective of age.

  • Type: long
  • Default: 0
  • Importance: low
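The age check can be expressed as a simple comparison of timestamps. The following is a minimal sketch of the rule as described above. The function name is_ready is hypothetical, and whether the connector treats a file that is exactly at the minimum age as ready is an assumption.

```python
def is_ready(last_modified_ms, now_ms, minimum_age_ms=0):
    """Return True if the file has been untouched for at least minimum_age_ms."""
    return now_ms - last_modified_ms >= minimum_age_ms
```

A nonzero value is useful when files are uploaded slowly, because it keeps the connector from reading a file that is still being written.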

Connection details

batch.size

The number of records that should be returned with each batch.

  • Type: int
  • Default: 1000
  • Importance: low
empty.poll.wait.ms

The amount of time to wait if a poll returns an empty list of records.

  • Type: long
  • Default: 250
  • Importance: low

Schema

key.schema

The schema for the key written to Kafka.

  • Type: string
  • Importance: high
value.schema

The schema for the value written to Kafka.

  • Type: string
  • Importance: high
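Because key.schema and value.schema hold a schema as a JSON string inside the JSON configuration file, the inner quotes must be escaped. The following is a minimal sketch that builds both values with json.dumps so that the escaping is handled automatically. The helper struct_schema is hypothetical, and the structure (name, type, isOptional, fieldSchemas) follows the schema example shown in the CLI quick start.

```python
import json


def struct_schema(name, fields):
    """Build a STRUCT schema from a mapping of field name to (type, is_optional)."""
    return {
        "name": name,
        "type": "STRUCT",
        "isOptional": False,
        "fieldSchemas": {
            field: {"type": type_name, "isOptional": optional}
            for field, (type_name, optional) in fields.items()
        },
    }


key_schema = struct_schema("com.example.users.UserKey", {"id": ("INT64", False)})
value_schema = struct_schema(
    "com.example.users.User",
    {"id": ("INT64", False), "first_name": ("STRING", True), "email": ("STRING", True)},
)

# Each schema is stored as a string, so it is serialized once here and
# escaped a second time when the whole configuration is serialized.
config_fragment = {
    "key.schema": json.dumps(key_schema),
    "value.schema": json.dumps(value_schema),
}
print(json.dumps(config_fragment, indent=2))
```

The printed fragment can be merged into the connector configuration file alongside the other properties.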

Schema Generation

schema.generation.enabled

Flag to determine if schemas should be dynamically generated. If set to true, key.schema and value.schema can be omitted, but schema.generation.key.name and schema.generation.value.name must be set.

  • Type: boolean
  • Importance: medium
schema.generation.key.fields

The field(s) to use to build a key schema. This is only used during schema generation.

  • Type: list
  • Importance: medium
schema.generation.key.name

The name of the generated key schema.

  • Type: string
  • Importance: medium
schema.generation.value.name

The name of the generated value schema.

  • Type: string
  • Importance: medium

Timestamps

timestamp.mode

Determines how the connector will set the timestamp for the ConnectRecord. If set to FIELD then the timestamp will be read from a field in the value. This field cannot be optional and must be a Timestamp. Specify the field in timestamp.field. If set to FILE_TIME then the last modified time of the file will be used. If set to PROCESS_TIME the time the record is read will be used.

  • Type: string
  • Importance: medium
timestamp.field

The field in the value schema that will contain the parsed timestamp for the record. This field cannot be marked as optional and must be a Timestamp (https://kafka.apache.org/0102/javadoc/org/apache/kafka/connect/data/Schema.html).

  • Type: string
  • Importance: medium
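The three timestamp.mode values differ only in where the record timestamp comes from. The following is a minimal sketch of that selection, written from the descriptions above. The function record_timestamp and its parameters are hypothetical, and timestamps are represented as epoch milliseconds for simplicity.

```python
def record_timestamp(mode, value, file_modified_ms, read_time_ms, field=None):
    """Pick the record timestamp according to timestamp.mode."""
    if mode == "FIELD":
        # The field named in timestamp.field must be present and non-null.
        if field is None or value.get(field) is None:
            raise ValueError("FIELD mode needs a non-optional timestamp field")
        return value[field]
    if mode == "FILE_TIME":
        return file_modified_ms
    if mode == "PROCESS_TIME":
        return read_time_ms
    raise ValueError(f"unsupported timestamp.mode: {mode}")
```

For example, with FIELD mode and timestamp.field set to last_login, the timestamp is taken from the last_login value of each record.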
parser.timestamp.timezone

The timezone that all of the dates will be parsed with.

  • Type: string
  • Importance: low
parser.timestamp.date.formats

The date formats that are expected in the file. This is a list of strings that will be used to parse the date fields in order. The most accurate date format should be the first in the list. For more information, see the Java SimpleDateFormat documentation: https://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html

  • Type: list
  • Importance: low

CSV Parsing

csv.skip.lines

Number of lines to skip in the beginning of the file.

  • Type: int
  • Default: 0
  • Importance: low
csv.separator.char

The character that separates each field, specified as an integer. Typically in a CSV this is a , (44) character. A TSV would use a tab (9) character. If csv.separator.char is set to null (0), the RFC 4180 parser is used. This is the equivalent of csv.rfc.4180.parser.enabled = true.

  • Type: int
  • Default: 44
  • Importance: low
csv.quote.char

The character that is used to quote a field, specified as an integer. Typically in a CSV this is a " (34) character. Quoting is typically needed when the csv.separator.char character appears within the data.

  • Type: int
  • Default: 34
  • Importance: low
csv.escape.char

The character, specified as an integer, to use when a special character is encountered. The default escape character is typically a \ (92) character.

  • Type: int
  • Default: 92
  • Importance: low
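The separator, quote, and escape properties take the integer code of a character rather than the character itself. The following is a minimal sketch that derives those integers with Python's built-in ord function. The helper char_code and the tab-separated settings are illustrative assumptions.

```python
def char_code(character):
    """Return the integer code of a single character for use in a csv.* property."""
    if len(character) != 1:
        raise ValueError("expected exactly one character")
    return ord(character)


# Hypothetical settings for a tab-separated file with the default quote and escape.
csv_settings = {
    "csv.separator.char": str(char_code("\t")),
    "csv.quote.char": str(char_code('"')),
    "csv.escape.char": str(char_code("\\")),
}
print(csv_settings)
```

The values are written as strings here to match the style of the example configuration file, where all property values are quoted.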
csv.strict.quotes

Sets the strict quotes setting - if true, characters outside the quotes are ignored.

  • Type: string
  • Default: false
  • Importance: low
csv.ignore.leading.whitespace

Sets the ignore leading whitespace setting - if true, white space in front of a quote in a field is ignored.

  • Type: string
  • Importance: low
csv.ignore.quotations

Sets the ignore quotations mode - if true, quotations are ignored.

  • Type: string
  • Default: false
  • Importance: low
csv.keep.carriage.return

Flag to determine if the carriage return at the end of the line should be maintained.

  • Type: string
  • Default: false
  • Importance: low
csv.null.field.indicator

Indicator to determine how the CSV Reader can determine if a field is null. Valid values are EMPTY_SEPARATORS, EMPTY_QUOTES, BOTH, NEITHER. For more information see http://opencsv.sourceforge.net/apidocs/com/opencsv/enums/CSVReaderNullFieldIndicator.html.

  • Type: string
  • Default: NEITHER
  • Importance: low
csv.first.row.as.header

Flag to indicate if the first row of data contains the header of the file. If true, the position of the columns is determined by the first row of the CSV. The column position is inferred from the position of the schema supplied in value.schema. If set to true, the number of columns must be greater than or equal to the number of fields in the schema.

  • Type: string
  • Importance: medium
csv.file.charset

Character set to read the file with.

  • Type: string
  • Default: UTF-8
  • Importance: low
ui.csv.pre.validate.file.enabled

Flag to enable validating the integrity of all records in the CSV file before processing any of its records. For example, if any of the records have a linefeed within an unquoted field, which would incorrectly break the record at that point, then the entire file is considered erroneous and no records from that file are processed. The failed file is moved to the configured error path. Important: If the number of records in a file is larger than the configured batch size, then portions of the file may be retrieved from the SFTP server by the connector more than once.

  • Type: string
  • Default: NO
  • Valid Values: NO, YES
  • Importance: low

Number of tasks for this connector

tasks.max

Maximum number of tasks for the connector.

  • Type: int
  • Valid Values: [1,…,1]
  • Importance: high

Next Steps

For an example that shows fully-managed Confluent Cloud connectors in action with Confluent Cloud ksqlDB, see the Cloud ETL Demo. This example also shows how to use Confluent CLI to manage your resources in Confluent Cloud.