Connectors → Amazon Web Services (AWS) S3

About Amazon Web Services (AWS) S3

Amazon Simple Storage Service, known as Amazon S3 or AWS S3, is an object storage service available in the Amazon Web Services cloud. Using the AWS Console, the AWS CLI, or a similar tool, users and applications upload files to and download files from AWS S3.

AWS S3 stores files in a bucket using object storage. A bucket is a cloud resource that is similar to a directory or folder. In a bucket, you store objects. An object is a file, which can be structured, semi-structured, or unstructured, together with descriptive metadata about that file.

About the AWS S3 connector

The AWS S3 connector enables Incorta to access files stored in an S3 bucket. Incorta is able to load the following file types from an S3 bucket:

Text (.csv, .tsv, .tab, .txt)
Excel (.xlsx)
Parquet (.parquet)
Optimized Row Columnar (.orc)

The AWS S3 connector also supports the use of Remote tables. Remote tables enable you to access large CSV, Parquet, and ORC files without loading them into Incorta memory. Loading full copies of such files wastes disk space and memory when only a small portion of the data is needed.

The AWS S3 connector supports the following Incorta-specific functionality:

Feature	Supported
Chunking	✔
Data Agent
Encryption at Ingest
Incremental Load	✔
Multi-Source	✔
OAuth
Performance Optimized	✔
Remote	✔
Single-Source	✔
Spark Extraction
Webhook Callbacks	✔

Steps to connect AWS S3 and Incorta

When you connect AWS S3 and Incorta, you authenticate to the desired bucket. Depending on the provided credentials, a bucket’s access permissions or policies may require changes to allow access. To familiarize yourself with access management for buckets and objects, see Identity and Access Management in Amazon S3.

While Amazon S3 allows anonymous authentication to buckets and their objects, the S3 connector requires specific user credentials. These credentials are the Access Key ID and Secret Access Key of either the AWS account root user or an IAM user account. The EC2 instance profile is not supported.

The root user for your organization's AWS account can create an IAM user and grant this user the necessary access rights to the AWS S3 bucket. The same administrator can also generate access keys for the IAM user. There are two parts of a key: the Access Key ID (AKIAIOSFODNN7EXAMPLE) and the Secret Access Key (wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY). You need both parts to access an AWS S3 bucket using this connector.
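As an illustration of the access rights mentioned above, here is a minimal sketch of an IAM policy document built in Python. It assumes the IAM user only needs to list the bucket and read its objects; the exact permissions the connector requires are not stated here, so treat the action list as an assumption. The function name `read_only_bucket_policy` and the bucket name `example-bucket` are hypothetical.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that allows listing and reading one bucket.

    Assumption: list and read access is enough for the connector.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    print(json.dumps(read_only_bucket_policy("example-bucket"), indent=2))
```

The printed JSON can be pasted into the policy editor in the AWS console and attached to the IAM user whose access keys you enter in the connector.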

To connect AWS S3 and Incorta, here are the high-level steps, tools, and procedures:

Create an external data source
Create a schema with the Schema Wizard
or, Create a schema with the Schema Designer
Load the schema
Explore the schema

Create an external data source

Here are the steps to create an external data source with the AWS S3 connector:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Data.
In the Action bar, select + New → Add Data Source.
In the Choose a Data Source dialog, in Data lake, select Data Lake - AWS S3.
In the New Data Source dialog, specify the applicable connector properties.
To test, select Test Connection.
Select Ok to save your changes.

AWS S3 connector properties

Here are the properties for the AWS S3 connector:

Property	Control	Description
Data Source Name	text box	Enter the name of the data source
Access Key ID	text box	Enter the Access Key ID required to access the data
Secret Access Key	text box	Enter the Secret Access Key required to access the data
Bucket	text box	Enter the `s3a` URI of the bucket. The default format is: `s3a://<bucket_name>/path/to/root/directory`
Maximum Concurrent Connections	text box	The maximum number of simultaneous connections to S3. The default is 256.
Region	drop down list	Select the region of the S3 bucket. This property is required for Incorta version 4.9.1 and later.

Note: An s3 URI requires the s3a prefix

For the bucket, you must specify the URI in the s3a format. When you copy the S3 URI for a bucket from the AWS console, the copied value uses the s3:// prefix, and you must change that prefix to s3a://. For example, s3://content.incorta.com/nyc-taxi/yellow_tripdata/200901-201412/ becomes s3a://content.incorta.com/nyc-taxi/yellow_tripdata/200901-201412/
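The prefix change described in this note can be sketched as a small helper. The function name `to_s3a_uri` is hypothetical and is not part of Incorta or any AWS tool; it only rewrites the scheme and leaves the rest of the URI untouched.

```python
def to_s3a_uri(uri: str) -> str:
    """Return the URI with an s3a:// prefix, converting from s3:// if needed."""
    if uri.startswith("s3a://"):
        # Already in the format the connector expects.
        return uri
    if uri.startswith("s3://"):
        # Swap only the scheme; bucket name and path stay as they are.
        return "s3a://" + uri[len("s3://"):]
    raise ValueError(f"Not an S3 URI: {uri!r}")


if __name__ == "__main__":
    copied = "s3://content.incorta.com/nyc-taxi/yellow_tripdata/200901-201412/"
    print(to_s3a_uri(copied))
```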

Create a schema with the Schema Wizard

Here are the steps to create an AWS S3 schema with the Schema Wizard:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the Action bar, select + New → Schema Wizard.
In (1) Choose a Source, specify the following:
- For Enter a name, enter the schema name.
- For Select a Datasource, select the AWS S3 external data source.
- Optionally create a description.
In the Schema Wizard footer, select Next.
In (2) Manage Tables, in the Data Panel, navigate the directory tree as necessary to select the AWS S3 files. You can either check the Select All checkbox or select individual files.
In the Schema Wizard footer, select Next.
In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create an AWS S3 schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the Action bar, select + New → Create Schema.
In Name, specify the schema name, and select Save.
In Start adding tables to your schema, select Data Lake.
In the Data Source dialog, specify the AWS S3 table data source properties.
Select Add.
In the Table Editor, in the Table Summary section, enter the table name.
To save your changes, select Done in the Action bar.

AWS S3 table data source properties

For a schema table in Incorta, you can define AWS S3-specific data source properties as follows:

Property	Control	Description
Type	drop down list	Default is Data Lake
Data Source	drop down list	Select the AWS S3 external data source
Remote	toggle	Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized property affects data accessibility.
File Type	drop down list	Select a file type option: ● Text (csv, tsv, tab, txt) ● Excel (xlsx) - not an option with Remote enabled ● Parquet ● ORC
Incremental	toggle	Enables incremental loading for the schema table
Has Header?	toggle	This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row.
Rows to Skip	text box	This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file.
Wildcard Union	toggle	Enable this property to get incremental data file updates from an existing directory
File Path	text box	This property appears when Wildcard Union is disabled. Enter the path to the data file, relative to the root directory configured in the data source.
Worksheet	text box	This property appears when Wildcard Union is disabled and the File Type is Excel. Select the data file worksheet of interest.
Update File	text box	This property appears when Incremental is enabled and Wildcard Union is disabled. Enter the path to the update file, relative to the root directory configured in the data source.
Update Worksheet	text box	This property appears when the File Type is Excel, Incremental is enabled, and Wildcard Union is disabled. Select the update file worksheet of interest.
Incremental Extract Using	drop down list	This property appears when Incremental and Wildcard Union are enabled. Select an incremental load method.
Timestamp format in file name	drop down list	This property appears when the Timestamp in File Name option is selected for the Incremental Extract Using property. Select the timestamp format that appears in the file name.
Directory Path	text box	This property appears when Wildcard Union is enabled. Enter the path to the directory, relative to the root directory configured in the data source. To use the root directory, enter `./` or `.`
Apply Include Pattern on	drop down list	This property appears when Wildcard Union is enabled. Select either: ● File Name - apply pattern on all file names in the selected directory path ● File Relative Path - apply pattern on relative path in the selected directory path
Include	text box	This property appears when Wildcard Union is enabled. So that only those files matching the prefix are loaded, enter a prefix to compare against: ● The names of the files in a directory if Apply Include Pattern on has a value of File Name. For example, entering `sales*.parquet` will load only those files that start with the word `sales` and end with `.parquet`. ● The relative path in a directory if Apply Include Pattern on has a value of File Relative Path. For example, entering `sales` will load those files in the `sales` directory.
Include Sub-Directories	toggle	This property appears when Wildcard Union is enabled. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded.
Include Filename as a Column	toggle	This property appears when Wildcard Union is enabled. Enable this property to add the file name as the first column in the schema table.
Date Format	drop down list	This property appears when the File Type is Text. Select the text file date format.
Timestamp Format	drop down list	This property appears when the File Type is Text. Select the text file timestamp format.
Character Set	drop down list	This property appears when the File Type is Text. Select the text file character set.
Separator	drop down list	This property appears when the File Type is Text. Select the text file separator.
Enable Chunking	toggle	This property appears when the File Type is Text. Turn this property on to process the text file in chunks.
Callback	toggle	Enable this property to expose the Callback URL field
Callback URL	text box	This property appears when the Callback toggle is enabled. Specify the URL.
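To preview which files an Include value might select, the following sketch models the two Apply Include Pattern on options from the table above. It assumes the entry is treated as a prefix in which `*` matches any run of characters; the connector's exact matching rules are not spelled out here, so this is an approximation for checking your own directory listings. The function name `preview_include` and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def preview_include(relative_paths, include, apply_on="File Name"):
    """Return the paths an Include entry would select under the stated assumption.

    apply_on is either "File Name" or "File Relative Path".
    """
    selected = []
    for path in relative_paths:
        # Compare against the bare file name or the whole relative path.
        target = basename(path) if apply_on == "File Name" else path
        # Assumption: prefix semantics, so anything may follow the entry.
        if fnmatchcase(target, include + "*"):
            selected.append(path)
    return selected


if __name__ == "__main__":
    paths = [
        "sales/sales_california.parquet",
        "sales/illinois.parquet",
        "returns/sales_returns.csv",
    ]
    print(preview_include(paths, "sales*.parquet", "File Name"))
    print(preview_include(paths, "sales", "File Relative Path"))
```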

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table Properties	Data Source Properties	Parquet	DDM	Memory	SQLi	MV/Notebooks	Analytics
Performance Optimized = Off	Remote = On	No	No	No	Yes	Yes	No
Performance Optimized = Off	Remote = Off	Yes	Yes	No	Yes	Yes	No, unless populated via MV/Notebook
Performance Optimized = On	Remote = Off	Yes	Yes	Yes	Yes	Yes	Yes

Incremental Extract Methods

Last Successful Extract Time: This option will load data from the time the last successful extract occurred.
Here is an example use case of Last Successful Extract Time:
- A directory containing all the sales data is located at /path/to/sales
- The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/illinois.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file since its last modified timestamp will be more recent than that of the files extracted in the previous full load.
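The selection rule in this example can be sketched as follows. This is an illustration of the idea, not Incorta's implementation: it assumes each file's last modified timestamp is compared with the time of the last successful extract. The function name and the timestamps are hypothetical.

```python
from datetime import datetime, timezone


def files_for_incremental_load(last_modified_by_path, last_successful_extract):
    """Return paths modified after the last successful extract, sorted by name."""
    return sorted(
        path
        for path, modified in last_modified_by_path.items()
        if modified > last_successful_extract
    )


if __name__ == "__main__":
    # Hypothetical timestamps: the full load ran on 2 April, Ohio arrived later.
    full_load_time = datetime(2020, 4, 2, tzinfo=timezone.utc)
    listing = {
        "/path/to/sales/sales_california.parquet": datetime(2020, 4, 1, tzinfo=timezone.utc),
        "/path/to/sales/sales_newyork.parquet": datetime(2020, 4, 1, tzinfo=timezone.utc),
        "/path/to/sales/sales_ohio.parquet": datetime(2020, 4, 3, tzinfo=timezone.utc),
    }
    print(files_for_incremental_load(listing, full_load_time))
```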
Timestamp in File Name: This option will load data from the time specified in the file name.
Here is an example use case of Timestamp in File Name:
- A directory containing all the sales data is located at /path/to/sales
- The directory receives a new file on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file since the timestamp in the file name is more recent than that of the files extracted in the previous full load.
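A rough sketch of the same idea for file names: it assumes the yyyy-MM-dd format and that a file qualifies when the date in its name is later than the latest date already loaded. How the connector locates the timestamp inside the name is not described here, so the regular expression is an assumption, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the timestamp appears somewhere in the name as yyyy-MM-dd.
DATE_IN_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")


def timestamp_from_name(path):
    """Return the date embedded in the file name, or None if there is none."""
    match = DATE_IN_NAME.search(path)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%d")


def files_newer_than(paths, last_loaded):
    """Keep the files whose embedded date is later than last_loaded."""
    newer = []
    for path in paths:
        stamp = timestamp_from_name(path)
        if stamp is not None and stamp > last_loaded:
            newer.append(path)
    return newer


if __name__ == "__main__":
    paths = [
        "/path/to/sales/sales_2020-04-01.parquet",
        "/path/to/sales/sales_2020-04-02.parquet",
        "/path/to/sales/sales_2020-04-03.parquet",
        "/path/to/sales/sales_2020-04-04.parquet",
    ]
    # The previous full load covered files up to 3 April 2020.
    print(files_newer_than(paths, datetime(2020, 4, 3)))
```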

Timestamp Formats in File Name

yyyy-MM-dd
dd.MM.yyyy
dd-MMM-yy
dd-MMM-yyyy
yyyy-MM-dd HH.mm.ss
Unix Epoch (seconds)
Unix Epoch (milliseconds)

Text File Date Format

yyyy-MM-dd
dd/MM/yyyy
dd.MM.yyyy
dd/MMM/yyyy
dd-MMM-yy
dd-MMM-yyyy
MM/dd/yyyy
yyyy/MM/dd
Unix Epoch (seconds)
Unix Epoch (milliseconds)
Other

Text File Timestamp Format

yyyy-MM-dd HH:mm:ss
yyyy-MM-dd HH.mm.ss
yyyy-MM-dd HH:mm:ss.SSS
dd/MM/yyyy HH:mm:ss
dd/MM/yyyy HH.mm.ss
dd/MM/yyyy HH:mm:ss.SSS
Unix Epoch (seconds)
Unix Epoch (milliseconds)
Other
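Before choosing a date or timestamp format, it can help to check a sample value from the file against the pattern. The sketch below assumes the patterns listed above follow Java-style date pattern letters (an assumption, since this page does not say so) and translates them to Python `strptime` directives. It covers only the pattern entries, not Unix Epoch or Other, and the function name `to_strptime` is hypothetical.

```python
import re
from datetime import datetime

# Pattern letters used in the lists above, mapped to strptime directives.
_TOKENS = {
    "yyyy": "%Y", "yy": "%y",
    "MMM": "%b", "MM": "%m",   # %b month names depend on the current locale
    "dd": "%d",
    "HH": "%H", "mm": "%M", "ss": "%S",
    "SSS": "%f",               # %f accepts one to six fractional digits
}
# Longer tokens come first so that yyyy is not read as yy + yy.
_TOKEN_RE = re.compile("yyyy|yy|MMM|MM|dd|HH|mm|ss|SSS")


def to_strptime(pattern: str) -> str:
    """Translate a pattern such as dd/MM/yyyy into a strptime format string."""
    return _TOKEN_RE.sub(lambda m: _TOKENS[m.group()], pattern)


if __name__ == "__main__":
    fmt = to_strptime("yyyy-MM-dd HH:mm:ss.SSS")
    print(fmt)
    print(datetime.strptime("2020-04-01 13:05:09.250", fmt))
```

If `strptime` raises a `ValueError` for a sample value, the selected format probably does not match the file.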

Text File Character Set

US-ASCII
ISO-8859-1
UTF-8
UTF-16BE
UTF-16LE
UTF-16

Text File Separator

Comma
Tab
Other

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the list of schemas, select the AWS S3 schema.
In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the AWS S3 schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the list of schemas, select the AWS S3 schema.
In the Schema Designer, in the Action bar, select Load → Load Now → Full.
To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the AWS S3 schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps: