Connectors → Google Cloud Storage

About Google Cloud Storage

Cloud Storage is a service for storing your data in Google Cloud. You store data as objects, and an object can be a file of any format. Cloud Storage treats each object as an immutable piece of data. You store objects in containers called buckets. Each bucket is associated with a project, and projects exist under an organization. Cloud Storage can store and retrieve any amount of data for companies of all sizes.

Google Cloud Storage Connector Updates

This section lists the updates in newer versions of the Google Cloud Storage connector available on the Incorta connectors marketplace.

To get a newer version of the connector, update the connector using the marketplace.

Version	Updates
2.0.1.8	Fixed an issue with versions from 2.0.1.0 to 2.0.1.7 of the Google Cloud Storage connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times
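To check whether an installed connector falls in the affected range listed above, a dotted version string can be compared numerically. This is a minimal sketch: the helper names `parse_version` and `is_affected` are hypothetical, and how you obtain the installed version string is left to you.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.0.1.8' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Range taken from the updates table: 2.0.1.0 through 2.0.1.7 are affected.
FIRST_AFFECTED = parse_version("2.0.1.0")
LAST_AFFECTED = parse_version("2.0.1.7")


def is_affected(installed: str) -> bool:
    """Return True when the installed version lies in the affected range."""
    return FIRST_AFFECTED <= parse_version(installed) <= LAST_AFFECTED


print(is_affected("2.0.1.3"))  # True
print(is_affected("2.0.1.8"))  # False
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where `"2.0.1.10"` would sort before `"2.0.1.8"`.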

Recommendation

Keep your connector up to date with the latest released version to get all fixes and enhancements.

About the Google Cloud Storage Connector

With the Google Cloud Storage (GCS) connector, you can create a data source that reads files of any format from Cloud Storage.

You can access all folders and files that you have in your bucket.

The GCS connector supports the following Incorta-specific functionality:

Feature	Supported
Chunking
Data Agent
Encryption at Ingest
Incremental Load	✔
Multi-Source	✔
OAuth	✔
Performance Optimized	✔
Remote	✔
Single-Source	✔
Spark Extraction
Webhook Callbacks	✔

The GCS connector allows two types of authentication:

OAUTH
Service Account

OAUTH Requirements

The System Administrator who manages your organization’s GCS accounts and your Incorta Cluster creates your project on Google Cloud Platform and creates your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:

Find your account key in APIs & Services > Credentials
Find your secret key in APIs & Services > Credentials
Enable GCS JSON API
Make sure you have a bucket that contains your data

Service Account Requirements

The System Administrator who manages your organization’s GCS accounts and your Incorta Cluster creates your project on Google Cloud Platform and creates your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:

In IAM & Admin > Service Accounts, choose your project and generate your service account JSON key
In the downloaded service account JSON file, find your private key ID, private key, and client email
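The three values can be read from the downloaded key file with a few lines of code. The sketch below assumes the standard Google service account key layout, where the fields are named `client_email`, `private_key_id`, and `private_key`. The helper name `connector_values` and the sample contents are hypothetical placeholders, not real credentials.

```python
import json

# Connector property name -> field name in the service account key file.
FIELDS = {
    "Account Client EMail": "client_email",
    "Account Private Key ID": "private_key_id",
    "Account Private Key": "private_key",
}


def connector_values(key_json: str) -> dict:
    """Extract the values the connector asks for from key file contents."""
    key = json.loads(key_json)
    missing = [field for field in FIELDS.values() if field not in key]
    if missing:
        raise KeyError("key file is missing: " + ", ".join(missing))
    return {prop: key[field] for prop, field in FIELDS.items()}


# Placeholder content only; a real key file holds your actual credentials.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key_id": "0123abcd",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "loader@my-project.iam.gserviceaccount.com",
})

values = connector_values(sample)
print(values["Account Client EMail"])
```

For a real file, pass the text read from disk, for example `connector_values(open("key.json").read())`, and treat the private key as a secret.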

Steps to connect GCS and Incorta

To connect GCS and Incorta, follow these high-level steps:

Create an external data source
Create a schema with the Schema Wizard
or, Create a schema with the Schema Designer
Load the schema
Explore the schema

Create an external data source

Here are the steps to create an external data source with the GCS connector:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Data.
In the Action bar, select + New → Add Data Source.
In the Choose a Data Source dialog, in Data lake, select Data lake-GCS.
In the New Data Source dialog, specify the applicable connector properties.
To test, select Test Connection.
Select Ok to save your changes.

GCS connector properties

Here are the properties for the GCS connector:

Property	Control	Description
Name Your Data Source	text box	Enter a name for your data source
Project ID	text box	Enter your project ID. You can find it in your GCS Settings.
Authentication Type	text box	Select your authentication type. The options are: ● OAUTH 2.0 ● Service Account To learn more, refer to OAUTH Requirements and Service Account Requirements
Google OAuth2 Client ID	text box	Choose OAUTH 2.0 Authentication type to configure this property. Enter your GCS account key. You can find it in your Google Cloud Platform > APIs & Services > Credentials.
Google OAuth2 Client Secret	text box	Choose OAUTH 2.0 Authentication type to configure this property. Enter your GCS secret key. You can find it in your Google Cloud Platform > APIs & Services > Credentials.
Authorize	button/link	Click this link to allow GCS access
Account Client EMail	text box	Choose Service Account Authentication type to configure this property. Enter your Service Account Client Email. Copy it from the service account JSON file.
Account Private Key ID	text box	Choose Service Account Authentication type to configure this property. Enter your service account private key id. Copy it from the service account JSON file.
Account Private Key	text box	Choose Service Account Authentication type to configure this property. Enter your service account private key. Copy it from the service account JSON file.
Directory	text box	Enter your GCS bucket name and path to your target folder. Example: `gs://bucket-name/path/to/root/directory`
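The Directory value combines a bucket name and a root folder, and table-level paths such as File Path are given relative to that root. The sketch below splits a `gs://` value into its two parts and joins a relative path onto it. The function names are hypothetical, and the joining rule is an assumption about how relative paths resolve, not a description of the connector's internals.

```python
from urllib.parse import urlsplit


def split_gcs_directory(directory: str):
    """Split 'gs://bucket/path' into (bucket, root path without slashes at the ends)."""
    parts = urlsplit(directory)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"expected gs://bucket-name/path, got {directory!r}")
    return parts.netloc, parts.path.strip("/")


def resolve(directory: str, relative_path: str) -> str:
    """Join a path that is relative to the data source root directory."""
    bucket, root = split_gcs_directory(directory)
    pieces = [p for p in (root, relative_path.strip("/")) if p]
    return f"gs://{bucket}/" + "/".join(pieces)


root = "gs://bucket-name/path/to/root/directory"
print(split_gcs_directory(root))
# ('bucket-name', 'path/to/root/directory')
print(resolve(root, "sales/Q1.csv"))
# gs://bucket-name/path/to/root/directory/sales/Q1.csv
```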

Create a schema with the Schema Wizard

Here are the steps to create a GCS schema with the Schema Wizard:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the Schema Manager, in the Action bar, select + New → Schema Wizard.
In (1) Choose a Source, specify the following:
- For Enter a name, enter the schema name.
- For Select a Datasource, select the GCS source.
- Optionally create a description.
In the Schema Wizard footer, select Next.
In (2) Manage Tables, in the Data Panel, first select the name of the Data Source, and then check the Select All checkbox.
In the Schema Wizard footer, select Next.
In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create a GCS schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the Action bar, select + New → Create Schema.
In Name, specify the schema name, optionally create a description, and select Save.
In Start adding tables to your schema, select Data Lake.
In the Data Source dialog, specify the GCS table data source properties.
Select Add.
In the Table Editor, in the Table Summary section, enter the table name.
To save your changes, select Done in the Action bar.

GCS table data source properties

For a schema table in Incorta, you can define the following GCS-specific data source properties:

Property	Control	Description
Type	drop down list	Default is Data Lake
Data Source	drop down list	Select the GCS external data source
Remote	toggle	Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized option affects data accessibility.
File Type	drop down list	Select a file type option: ● Text (csv, tsv, tab, txt) ● Excel (xlsx) - not an option with Remote enabled ● Parquet ● ORC
Incremental	toggle	Enables incremental loading for the schema table
Has header?	toggle	This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row.
Rows to Skip	text box	This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file.
Wildcard Union	toggle	Enable this property to get incremental data file updates from an existing directory
Incremental Extract Using	drop down list	Enable Incremental and Wildcard Union to configure this property. Choose the incremental extract method.
Directory Path	text box	Enable Wildcard Union to configure this property. Enter the path to the directory, relative to the root directory configured in the data source. For example: `sales/branches`
Apply Include Pattern on	drop down list	This property appears when Wildcard Union is enabled. Select either: ● File Name - apply pattern on all file names in the selected directory path ● File Relative Path - apply pattern on relative path in the selected directory path
Include	text box	This property appears when Wildcard Union is enabled. To include only certain files in the load process, enter a prefix to compare against: ● The names of the files in a directory if Apply Include Pattern on has a value of File Name. For example, entering `sales*.parquet` will load only files that start with the word sales and end with .parquet. ● The relative path in a directory if Apply Include Pattern on has a value of File Relative Path. For example, entering sales will load files in the sales directory.
Exclude	text box	This property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded.
Include Sub-Directories	toggle	Enable Wildcard Union to configure this property. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include Prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded.
Include Filename as a Column	toggle	Enable Wildcard Union to configure this property. Enable this property to add the file name as the first column in the schema table.
File Path	text box	Enter the path to the data file, relative to the root directory configured in the data source. For example: `sales/Q1.csv`
Update File	text box	Enter the path to the update file, relative to the root directory configured in the data source. For example: `sales/Q1_updates.csv`
Filename column	text box	Enable Include Filename as a Column to configure this property. Enter the filename column.
Date Format	drop down list	This property appears when the File Type is Text. Choose the date format from the available options.
Timestamp Format	drop down list	This property appears when the File Type is Text. Choose the timestamp format from the available options.
Character Set	drop down list	This property appears when the File Type is Text. Choose the character set from the available options.
Separator	drop down list	This property appears when the File Type is Text. Choose the separator from the available options.
Enable Chunking	toggle	This property appears when the File Type is Text. Enable this property to process the text file in chunks, which reduces the extract time.
Chunk Size (MB)	text box	Enter the chunk size in megabytes
Callback	toggle	Enable this property to expose the Callback URL field
Callback URL	text box	This property appears when the Callback toggle is enabled. Specify the URL.
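To preview which files an Include or Exclude value would select before running a load, the matching can be approximated locally. This is a rough sketch under stated assumptions: it treats a value containing wildcards as a glob pattern and a value without wildcards as a plain prefix, and it applies Exclude to the same target as Include. The connector's actual matching rules may differ, and `select_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def matches(pattern: str, value: str) -> bool:
    """Assumption: no wildcard means prefix match, otherwise glob match."""
    if not any(ch in pattern for ch in "*?["):
        return value.startswith(pattern)
    return fnmatchcase(value, pattern)


def select_files(relative_paths, include="", exclude="", apply_on="File Name"):
    """Return the paths that pass the Include filter and miss the Exclude filter."""
    selected = []
    for path in relative_paths:
        # File Name compares against the last path segment only.
        target = basename(path) if apply_on == "File Name" else path
        if include and not matches(include, target):
            continue
        if exclude and matches(exclude, target):
            continue
        selected.append(path)
    return selected


# Hypothetical listing, relative to the configured Directory Path.
files = ["sales/q1.parquet", "sales/q2.parquet", "returns/q1.parquet", "sales/notes.txt"]

print(select_files(files, include="q1*.parquet"))
# ['sales/q1.parquet', 'returns/q1.parquet']
print(select_files(files, include="sales", apply_on="File Relative Path"))
# ['sales/q1.parquet', 'sales/q2.parquet', 'sales/notes.txt']
```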

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table Properties	Data Source Properties	Parquet	DDM	Memory	SQLi	MV/ Notebooks	Analytics
Performance Optimized = Off	Remote = On	No	No	No	Yes	Yes	No
Performance Optimized = Off	Remote = Off	Yes	Yes	No	Yes	Yes	No, unless populated via MV/Notebook
Performance Optimized = On	Remote = Off	Yes	Yes	Yes	Yes	Yes	Yes
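The table can also be expressed as a small lookup, which is handy when auditing many tables at once. The sketch below only restates the three rows above; the names `Access`, `ACCESS`, and `access_for` are hypothetical, and any combination the table does not list is rejected rather than guessed.

```python
from typing import NamedTuple


class Access(NamedTuple):
    parquet: bool
    ddm: bool
    memory: bool
    sqli: bool
    mv_notebooks: bool
    analytics: str  # kept as text because one row carries a condition


# Keys are (performance_optimized, remote), values mirror the table rows.
ACCESS = {
    (False, True): Access(False, False, False, True, True, "No"),
    (False, False): Access(True, True, False, True, True,
                           "No, unless populated via MV/Notebook"),
    (True, False): Access(True, True, True, True, True, "Yes"),
}


def access_for(performance_optimized: bool, remote: bool) -> Access:
    """Look up the access methods for one combination of settings."""
    try:
        return ACCESS[(performance_optimized, remote)]
    except KeyError:
        raise ValueError("combination is not listed in the table") from None


row = access_for(performance_optimized=False, remote=True)
print(row.sqli, row.memory, row.analytics)  # True False No
```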

Incremental Extract Methods

Last Successful Extract Time: This option will load data from the time the last successful extract occurred.
Here is a use case of Last Successful Extract Time:
- A directory containing all the sales data is located at /path/to/sales
- The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/illinois.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file since its last modified timestamp will be more recent than that of the files extracted in the previous full load.
Timestamp in File Name: This option will load data from the time specified in the file name.
Here is an example use case of Timestamp in File Name:
- A directory containing all the sales data is located at /path/to/sales
- The directory receives a new file on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet
When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file since the timestamp in the file name is more recent than that of the files extracted in the previous full load.
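The two methods can be illustrated with a short simulation of the file selection. This is a sketch of the idea only, not the connector's implementation: the function names are hypothetical, the last-modified timestamps are invented for the example, and the `YYYY-MM-DD` date pattern is an assumption based on the file names shown above.

```python
import re
from datetime import date, datetime, timezone


def new_by_modified_time(listing, last_extract):
    """Last Successful Extract Time: pick files modified after the last extract."""
    return sorted(path for path, modified in listing.items() if modified > last_extract)


DATE_IN_NAME = re.compile(r"(\d{4}-\d{2}-\d{2})")


def new_by_name_timestamp(paths, last_loaded):
    """Timestamp in File Name: pick files whose embedded date is newer."""
    picked = []
    for path in paths:
        found = DATE_IN_NAME.search(path)
        if found and datetime.strptime(found.group(1), "%Y-%m-%d").date() > last_loaded:
            picked.append(path)
    return sorted(picked)


utc = timezone.utc
# Hypothetical last-modified times; the full load ran on 2020-04-03 at 12:00 UTC.
listing = {
    "/path/to/sales/sales_california.parquet": datetime(2020, 4, 1, tzinfo=utc),
    "/path/to/sales/sales_newyork.parquet": datetime(2020, 4, 2, tzinfo=utc),
    "/path/to/sales/sales_ohio.parquet": datetime(2020, 4, 4, tzinfo=utc),
}
print(new_by_modified_time(listing, datetime(2020, 4, 3, 12, 0, tzinfo=utc)))
# ['/path/to/sales/sales_ohio.parquet']

daily = [f"/path/to/sales/sales_2020-04-0{day}.parquet" for day in range(1, 5)]
print(new_by_name_timestamp(daily, date(2020, 4, 3)))
# ['/path/to/sales/sales_2020-04-04.parquet']
```

The first method depends on storage metadata, so re-uploading an old file makes it look new. The second depends only on the file name, so files without a recognizable date are never picked up in this sketch.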

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the list of schemas, select the GCS schema.
In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the GCS schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform.
In the Navigation bar, select Schema.
In the list of schemas, select the GCS schema.
In the Schema Designer, in the Action bar, select Load → Load Now → Full.
To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the GCS schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps: