Connectors → SFTP

About Secure File Transfer Protocol (SFTP)

Secure File Transfer Protocol (SFTP) is a file transfer protocol that leverages a set of utilities that provide secure access to a remote computer to deliver secure communications. It is considered by many to be the optimal method for secure file transfer. It leverages SSH Secure Socket Shell or Secure Shell (SSH) and is frequently also referred to as “Secure Shell File Transfer Protocol”.

SFTP Connector Updates

This section is to explore the updates in the newer versions of the SFTP connector available on the Incorta connectors marketplace.

In order to get the newer version of the connector, please update the connector using the marketplace.

VersionUpdates
2.0.1.8Fixed an issue with versions from 2.0.1.0 to 2.0.1.7 of the SFTP connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times
2.2.1.0Requires upgrading the cluster to release 2024.7.x or 2024.1.7.
  ●  Support reading PGP-encrypted files
2.2.2.0Requires upgrading the cluster to release 2024.7.x or 2024.1.7.
  ●  Support SSH keys for user authentication
Recommendation

Keep your connector up-to-date with the latest connector version released to get all introduced fixes and enhancements.

About the SFTP connector

The SFTP connector allows you to connect to a remote computer or server where the files you need to load are stored. Incorta is able to load the following file types from an SFTP server:

  • Text (.csv, .tsv, .tab, .txt)
  • Excel (.xlsx)

The SFTP connector supports the following Incorta-specific functionality:

FeatureSupported
Chunking
Data Agent
Encryption at Ingest
Incremental Load
Multi-Source
OAuth
Performance Optimized
Remote
Single-Source
Spark Extraction
Webhook Callbacks

Steps to use the SFTP connector

Here are the high level steps to use the SFTP connector:

Create an external data source

Here are the steps to create an external data source with the SFTP connector:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Data.
  • In the Action bar, select + NewAdd Data Source.
  • In the Choose a Data Source dialog, in Data lake, select Data Lake - SFTP.
  • In the New Data Source dialog, specify the applicable connector properties.
  • To test, select Test Connection.
  • Select Ok to save your changes.

SFTP connector properties

Here are the properties for the SFTP connector:

PropertyControlDescription
Name Your Data Sourcetext boxEnter a name for your data source
Host Nametext boxEnter the address of your SFTP server.
Porttext boxEnter the port number of the SFTP server. The default port number is: 22.
Directorytext boxEnter the SFTP server path of the files you want to load. For example: sftp://root/folder/subfolder/data
Authentication Typedropdown listThis option is available starting 2024.7.x and 2024.1.7 with version 2.2.2.0.
Select the authentication type:
  ●  Basic: via a username and password
  ●  SSH via a username and an SSH key
Usernametext boxEnter the username for the SFTP account used to access the data source.
Passwordtext boxEnter the password for the SFTP account used to access the data source.
This option is required when you select basic authentication.
SSH KeybuttonSelect the button to upload the SSH key used to authenticate the user.
This option is required when you select SSH authentication.
PGP Encrypted FilestoggleThis option is available starting 2024.7.x and 2024.1.7 with version 2.2.1.0.
Enable this option if data files are encrypted using Pretty Good Privacy (PGP)
PGP Private KeybuttonAvailable when you enable PGP Encrypted Files.
Select this button to upload the PGP private key.
PGP Private Key Passphrasetext boxAvailable when you enable PGP Encrypted Files.
Optionally, enter the passphrase that PGP uses to encrypt your private key.
Extra Optionstext boxEnter supported extra options in the form of key=value.

Create a physical schema with the Schema Wizard

Here are the steps to create a SFTP physical schema with the Schema Wizard:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + NewSchema Wizard
  • In (1) Choose a Source, specify the following:
    • For Enter a name, enter the physical schema name.
    • For Select a Datasource, select the SFTP external data source.
    • Optionally enter a description.
  • Select Next.
  • In (2) Manage Tables, in the Data Panel, navigate the directory tree as necessary to select the SFTP files. You can either check the Select All checkbox or select individual sheets.
  • Select Next.
  • In (3) Finalize, select Create Schema.

Create a physical schema with the Schema Designer

Here are the steps to create a SFTP physical schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + NewCreate Schema.
  • In Name, specify the physical schema name, and select Save.
  • In Start adding tables to your schema, select Data Lake.
  • In the Data Source dialog, specify the SFTP table data source properties.
  • Select Add.
  • In the Table Editor, in the Table Summary section, enter the table name.
  • To save your changes, select Done in the Action bar.

SFTP table data source properties

For a physical schema table in Incorta, you can define the SFTP specific data source properties as follows:

PropertyControlDescription
Typedrop down listDefault is Data Lake.
Data Sourcedrop down listSelect the SFTP external data source.
RemotetoggleEnable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized property affects data accessibility.
File Typedrop down listSelect a file type option:
  ●  Text (.csv, .tsv, .tab, .txt)
  ●   Excel (.xlsx).
IncrementaltoggleEnables incremental loading for the physical schema table.
Has Header?toggleThis property appears when Remote is disabled and the File Type is Text.
Enable this property to indicate the data source has a header row.
Rows to Skiptext boxThis property appears when Remote is disabled and the File Type is Text.
Enter the number of rows to skip from the top of the file.
Wildcard UniontoggleEnable this property to get incremental data file updates from an existing directory.
File Pathtext boxThis property appears when Wildcard Union is disabled.
Enter the path to the data file, relative to the root directory configured in the data source.
Worksheettext boxThis property appears when Wildcard Union is disabled and the File Type is Excel.
Select the data file worksheet of interest.
Update Filetext boxThis property appears when Incremental is enabled and Wildcard Union is disabled.
Enter the path to the update file, relative to the root directory configured in the data source.
Update Worksheettext boxThis property appears when the File Type is Excel, Incremental is enabled, and Wildcard Union is disabled.
Select the update file worksheet of interest.
Incremental Extract Usingdrop down listThis property appears when Incremental and Wildcard Union are enabled.
Select an incremental load method.
Timestamp format in file namedrop down listThis property appears when the Timestamp in File Name option is selected for the Incremental Extract Using property.
Select the timestamp format that appears in the file name.
Directory Pathtext boxThis property appears when Wildcard Union is enabled.
Enter the path to the directory, relative to the root directory configured in the data source. To use the root directory, enter ./ or .
Apply Include Pattern ondrop down listThis property appears when Wildcard Union is enabled.
Select one of the following:

  ●  Filename - apply pattern on all file names in the selected directory path
  ●  File Relative Path - apply pattern on relative path in the selected directory path
Includetext boxThis property appears when Wildcard Union is enabled.
So that only those files matching the prefix are loaded, enter a prefix to compare against:

  ●  The names of the files in a directory if Apply Include Pattern has a value of File Name. For example, entering sales* .csv will load only those files that start with the word sales and end with .csv.
  ●  The relative path in a directory if Apply Include Pattern has a value of File Relative Path. For example, entering sales will load those files in the sales directory.
ExcludetoggleThis property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded.
Include Sub-DirectoriestoggleThis property appears when Wildcard Union is enabled.
Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include Prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded.
Include Filename as a ColumntoggleThis property appears when Wildcard Union is enabled.
Enable this property to add the file name as the first column in the physical schema table.
Date Formatdrop down listThis property appears when the File Type is Text.
Select the text file date format.
Timestamp Formatdrop down listThis property appears when the File Type is Text.
Select the text file timestamp format.
Character Setdrop down listThis property appears when the File Type is Text. Select the text file character set.
Separatordrop down listThis property appears when the File Type is Text.
Select the text file separator.
CallbacktoggleEnable this property on to expose the Callback URL field
Callback URLtext boxThis property appears when the Callback toggle is enabled.
Specify the URL.

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table PropertiesData Source
Properties
ParquetDDMMemorySQLiMaterialized
View
Analytics
Performance Optimized = OffRemote = OnNoNoNoYesYesNo
Performance Optimized = OffRemote = OffYesYesNoYesYesNo, unless created by a materialized view
or Notebook
Performance Optimized = OnRemote = OffYesYesYesYesYesYes

Incremental Extract Methods

There are two extract methods, Last Successful Extract Time and Timestamp in File Name.

Last Successful Extract Time

The Last Successful Extract Time option instructs the Loader Service to extract data from the table source using the time of the previous successful extraction as the starting time for next extraction. Here is an example use case for the Last Successful Extract Time option:

  • /path/to/sales is a directory that contains the following files:
    • /path/to/sales/sales_california.csv
    • /path/to/sales/sales_newyork.csv
    • /path/to/sales/sales_illinois.csv

When you perform an initial full load, the Loader Service extracts all the files from the directory. After the successful load, a new file is added to the directory: sales_ohio.csv. As the new file has a last modified timestamp that is more recent than the timestamp of the previous successful extraction, the next incremental load will only extract the sales_ohio.csv file.

Timestamp in File Name

The Timestamp in File Name option instructs the Loader Service to extract data using the file name. The file name must contain a specific timestamp format. Here is an example use case for the Timestamp in File Name option:

  • /path/to/sales is a directory that contains the following files:
    • /path/to/sales/sales_2021-04-01.csv
    • /path/to/sales/sales_2021-04-02.csv
    • /path/to/sales/sales_2021-04-03.csv

When you perform an initial full load, the Loader Service extracts all the files from the directory. After the successful load, a new file is added to the directory: sales_2021-04-04.csv. As the new file has a timestamp name value that is greater than the already extracted timestamp name values, the next incremental load will only extract the sales_2021-04-04.csv file.

Timestamp Formats in File Name
  • yyyy-MM-dd
  • dd.MM.yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • yyyy-MM-dd HH.mm.ss
  • Unix Epoch (seconds)
  • Unix Epoch (milliseconds)
Text File Date Format
  • yyyy-MM-dd
  • dd/MM/yyyy
  • dd.MM.yyyy
  • dd/MMM/yyyy
  • dd-MMM-yy
  • dd-MMM-yyyy
  • MM/dd/yyyy
  • yyyy/MM/dd
  • Unix Epoch (seconds
  • Unix Epoch (milliseconds)
  • Other
Text File Timestamp Format
  • yyyy-MM-dd HH:mm:ss
  • yyyy-MM-dd HH.mm.ss
  • yyyy-MM-dd HH:mm:ss.SSS
  • dd/MM/yyyy HH:mm:ss
  • dd/MM/yyyy HH.mm.ss
  • dd/MM/yyyy HH:mm:ss.SSS
  • Unix Epoch (seconds
  • Unix Epoch (milliseconds)
  • Other
Text File Character Set
  • US-ASCII
  • ISO-8859-1
  • UTF-8
  • UTF-16BE
  • UTF-16LE
  • UTF-16
Text File Separator
  • Comma
  • Tab
  • Other

View the physical schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of physical schemas, select the SFTP physical schema.
  • In the Schema Designer, in the Action bar, select Diagram.

Load the physical schema

Here are the steps to perform a Full Load of the SFTP physical schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of physical schemas, select the SFTP physical schema.
  • In the Schema Designer, in the Action bar, select LoadFull Load.
  • To review the load status, in Last Load Status, select the date.

Explore the physical schema

With the full load of the SFTP physical schema completed, you can use the Analyzer to explore the physical schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the physical schema, follow these steps:

  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the List view, select the SFTP physical schema.
  • In the Schema Designer, in the Action bar, select Explore Data.