About Google Cloud Storage

Cloud Storage is a service for storing your data in Google Cloud. The data can be a file of any format. Cloud Storage treats each object as an immutable piece of data, and you store objects in containers called buckets. Buckets are associated with a project, and projects exist under an organization. Organizations of any size can store and retrieve any amount of data.

Google Cloud Storage Connector Updates

This section describes the updates in newer versions of the Google Cloud Storage connector available on the Incorta connectors marketplace.

To get the newer version of the connector, update the connector from the marketplace.

| Version | Updates |
|---|---|
| 2.0.1.8 | Fixed an issue with versions from 2.0.1.0 to 2.0.1.7 of the Google Cloud Storage connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times. |

Recommendation

Keep your connector up to date with the latest released version to get all fixes and enhancements.
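
To check whether an installed connector version falls in the affected range listed in the update notes, a comparison along the following lines may help. This is a minimal sketch: the function names are hypothetical, and it assumes connector versions are always four dot-separated integers.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.0.1.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# Affected range taken from the update notes: 2.0.1.0 through 2.0.1.7.
FIRST_AFFECTED = parse_version("2.0.1.0")
LAST_AFFECTED = parse_version("2.0.1.7")

def is_affected(installed):
    """Return True if the installed version lies inside the affected range."""
    return FIRST_AFFECTED <= parse_version(installed) <= LAST_AFFECTED

print(is_affected("2.0.1.5"))  # True
print(is_affected("2.0.1.8"))  # False
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.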

About the Google Cloud Storage Connector

With the Google Cloud Storage (GCS) connector, you can create a data source for Cloud Storage files of any format.

You can access all folders and files that you have in your bucket.

The GCS connector supports the following Incorta-specific functionality:

Feature | Supported
Chunking
Data Agent
Encryption at Ingest
Incremental Load
Multi-Source
OAuth
Performance Optimized
Remote
Single-Source
Spark Extraction
Webhook Callbacks

The GCS connector allows two types of authentication:

  • OAUTH
  • Service Account

OAUTH Requirements

The System Administrator who manages your organization’s GCS accounts and your Incorta Cluster creates your project on Google Cloud Platform and creates your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:

  • Find your account key in APIs & Services > Credentials
  • Find your secret key in APIs & Services > Credentials
  • Enable the GCS JSON API
  • Make sure you have a bucket that contains your data

Service Account Requirements

The System Administrator who manages your organization’s GCS accounts and your Incorta Cluster creates your project on Google Cloud Platform and creates your credentials as a web application. To use the GCS connector, do the following in Google Cloud Platform:

  • In IAM & Admin > Service Accounts, choose your project and generate your service account JSON key
  • In the downloaded service account JSON file, find your private key ID, private key, and client email
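
The values named in the last step can be read from the downloaded file with a few lines of Python. The following is a minimal sketch, assuming the file follows the usual Google service account key layout with the fields client_email, private_key_id, and private_key. The function name and the sample content are hypothetical placeholders, not real credentials.

```python
import json

# Fields of the service account key file that the connector properties ask for.
REQUIRED_KEYS = ("client_email", "private_key_id", "private_key")

def extract_connector_values(key_file_text):
    """Return the three connector values from the text of a key file."""
    key_data = json.loads(key_file_text)
    missing = [key for key in REQUIRED_KEYS if key not in key_data]
    if missing:
        raise KeyError(f"service account file is missing: {missing}")
    return {key: key_data[key] for key in REQUIRED_KEYS}

# Placeholder content standing in for a downloaded key file.
sample_key_file = json.dumps({
    "type": "service_account",
    "project_id": "example-project",
    "private_key_id": "0123456789abcdef",
    "private_key": "-----BEGIN PRIVATE KEY-----\nPLACEHOLDER\n-----END PRIVATE KEY-----\n",
    "client_email": "incorta-reader@example-project.iam.gserviceaccount.com",
})

values = extract_connector_values(sample_key_file)
print(values["client_email"])    # value for Account Client EMail
print(values["private_key_id"])  # value for Account Private Key ID
```

For a real file, the text would come from reading the downloaded JSON file instead of the placeholder string.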

Steps to connect GCS and Incorta

Here are the high-level steps, tools, and procedures to connect GCS and Incorta:

Create an external data source

Here are the steps to create an external data source with the GCS connector:

  • Sign in to the Incorta Direct Data Platform™.
  • In the Navigation bar, select Data.
  • In the Action bar, select + New → Add Data Source.
  • In the Choose a Data Source dialog, in Data lake, select Data lake-GCS.
  • In the New Data Source dialog, specify the applicable connector properties.
  • To test, select Test Connection.
  • Select Ok to save your changes.

GCS connector properties

Here are the properties for the GCS connector:

| Property | Control | Description |
|---|---|---|
| Name Your Data Source | text box | Enter a name for your data source |
| Project ID | text box | Enter your project ID. You can find it in your GCS Settings. |
| Authentication Type | text box | Select your authentication type. The options are OAUTH 2.0 and Service Account. To learn more, refer to OAUTH Requirements. |
| Google OAuth2 Client ID | text box | Choose the OAUTH 2.0 authentication type to configure this property. Enter your GCS account key. You can find it in Google Cloud Platform > APIs & Services > Credentials. |
| Google OAuth2 Client Secret | text box | Choose the OAUTH 2.0 authentication type to configure this property. Enter your GCS secret key. You can find it in Google Cloud Platform > APIs & Services > Credentials. |
| Authorize | button/link | Click this link to allow GCS access |
| Account Client EMail | text box | Choose the Service Account authentication type to configure this property. Enter your service account client email. Copy it from the service account JSON file. |
| Account Private Key ID | text box | Choose the Service Account authentication type to configure this property. Enter your service account private key ID. Copy it from the service account JSON file. |
| Account Private Key | text box | Choose the Service Account authentication type to configure this property. Enter your service account private key. Copy it from the service account JSON file. |
| Directory | text box | Enter your GCS bucket name and the path to your target folder. Example: gs://bucket-name/path/to/root/directory |
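
To illustrate how the Directory value is structured, the following minimal sketch splits a gs:// value into its bucket name and folder path. The function name split_gcs_directory is hypothetical and not part of the connector. It only shows how the example value breaks down.

```python
from urllib.parse import urlparse

def split_gcs_directory(directory):
    """Split 'gs://bucket/path' into (bucket name, folder path)."""
    parsed = urlparse(directory)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"expected gs://bucket-name/..., got {directory!r}")
    # The bucket is the host part; the folder path is what follows it.
    return parsed.netloc, parsed.path.strip("/")

bucket, folder = split_gcs_directory("gs://bucket-name/path/to/root/directory")
print(bucket)  # bucket-name
print(folder)  # path/to/root/directory
```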

Create a schema with the Schema Wizard

Here are the steps to create a GCS schema with the Schema Wizard:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the Action bar, select + New → Schema Wizard.
  • In (1) Choose a Source, specify the following:
    • For Enter a name, enter the schema name.
    • For Select a Datasource, select the GCS source.
    • Optionally create a description.
  • In the Schema Wizard footer, select Next.
  • In (2) Manage Tables, in the Data Panel, first select the name of the Data Source, and then check the Select All checkbox.
  • In the Schema Wizard footer, select Next.
  • In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create a GCS schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Create Schema.
  • In Name, specify the schema name, optionally create a description, and select Save.
  • In Start adding tables to your schema, select Data Lake.
  • In the Data Source dialog, specify the GCS table data source properties.
  • Select Add.
  • In the Table Editor, in the Table Summary section, enter the table name.
  • To save your changes, select Done in the Action bar.

GCS table data source properties

For a schema table in Incorta, you can define the following GCS-specific data source properties:

| Property | Control | Description |
|---|---|---|
| Type | drop down list | Default is Data Lake |
| Data Source | drop down list | Select the GCS external data source |
| Remote | toggle | Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized option affects data accessibility. |
| File Type | drop down list | Select a file type option: Text (csv, tsv, tab, txt); Excel (xlsx), which is not an option with Remote enabled; Parquet; ORC |
| Incremental | toggle | Enables incremental loading for the schema table |
| Has header? | toggle | This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row. |
| Rows to Skip | text box | This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file. |
| Wildcard Union | toggle | Enable this property to get incremental data file updates from an existing directory |
| Incremental Extract Using | drop down list | Enable Incremental and Wildcard Union to configure this property. Choose the incremental extract method. |
| Directory Path | text box | Enable Wildcard Union to configure this property. Enter the path to the directory, relative to the root directory configured in the data source. For example: sales/branches |
| Apply Include Pattern on | drop down list | This property appears when Wildcard Union is enabled. Select either File Name (apply the pattern on all file names in the selected directory path) or File Relative Path (apply the pattern on the relative path in the selected directory path). |
| Include | text box | This property appears when Wildcard Union is enabled. To include only certain files in the load process, enter a prefix to compare against one of the following. (1) The names of the files in a directory, if Apply Include Pattern on has a value of File Name. For example, entering sales*.parquet will load only files that start with the word sales and end with .parquet. (2) The relative path in a directory, if Apply Include Pattern on has a value of File Relative Path. For example, entering sales will load files in the sales directory. |
| Exclude | text box | This property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded. |
| Include Sub-Directories | toggle | Enable Wildcard Union to configure this property. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded. |
| Include Filename as a Column | toggle | Enable Wildcard Union to configure this property. Enable this property to add the file name as the first column in the schema table. |
| File Path | text box | Enter the path to the data file, relative to the root directory configured in the data source. For example: sales/Q1.csv |
| Update File | text box | Enter the path to the update file, relative to the root directory configured in the data source. For example: sales/Q1_updates.csv |
| Filename column | text box | Enable Include Filename as a Column to configure this property. Enter the name of the filename column. |
| Date Format | drop down list | This property appears when the File Type is Text. Choose the date format from the available options. |
| Timestamp Format | drop down list | This property appears when the File Type is Text. Choose the timestamp format from the available options. |
| Character Set | drop down list | This property appears when the File Type is Text. Choose the character set from the available options. |
| Separator | drop down list | This property appears when the File Type is Text. Choose the separator from the available options. |
| Enable Chunking | toggle | This property appears when the File Type is Text. Enable this property to process the text file in chunks, which reduces the extract time. |
| Chunk Size (MB) | text box | Enter the chunk size in megabytes |
| Callback | toggle | Enable this property to expose the Callback URL field |
| Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL. |
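
The effect of the Include and Exclude properties in File Name mode can be approximated as follows. This is a minimal sketch under a stated assumption: it treats the pattern as a glob-style wildcard, as Python's fnmatch module does, which may differ from the connector's actual matching rules. The function select_files and the file sales_notes.txt are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename

def select_files(paths, include=None, exclude=None):
    """Keep paths whose file name matches include and does not match exclude."""
    selected = []
    for path in paths:
        name = basename(path)  # File Name mode: compare against the name only
        if include and not fnmatchcase(name, include):
            continue
        if exclude and fnmatchcase(name, exclude):
            continue
        selected.append(path)
    return selected

files = [
    "sales_california.parquet",
    "sales_newyork.parquet",
    "illinois.parquet",
    "sales_notes.txt",
]
print(select_files(files, include="sales*.parquet"))
# ['sales_california.parquet', 'sales_newyork.parquet']
```

The sketch does not cover File Relative Path mode, where the description above indicates that the value is compared against the relative path instead of the file name.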

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

| Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/Notebooks | Analytics |
|---|---|---|---|---|---|---|---|
| Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No |
| Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook |
| Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes |
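
The summary table can also be read as a lookup from the two settings to the available access methods. The following minimal sketch encodes the three listed combinations. The names ACCESS_METHODS and access_methods are hypothetical, and the combination with both options on is left out because the table does not list it.

```python
# (performance_optimized, remote) -> access methods marked "Yes" in the table
ACCESS_METHODS = {
    (False, True): {"SQLi", "MV/Notebooks"},
    # Analytics is "No, unless populated via MV/Notebook" for this row.
    (False, False): {"Parquet", "DDM", "SQLi", "MV/Notebooks"},
    (True, False): {"Parquet", "DDM", "Memory", "SQLi", "MV/Notebooks", "Analytics"},
}

def access_methods(performance_optimized, remote):
    """Return the access methods listed for a combination of settings."""
    try:
        return ACCESS_METHODS[(performance_optimized, remote)]
    except KeyError:
        raise ValueError("combination not listed in the summary table") from None

print(sorted(access_methods(performance_optimized=False, remote=True)))
# ['MV/Notebooks', 'SQLi']
```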

Incremental Extract Methods

  • Last Successful Extract Time: This option will load data from the time the last successful extract occurred.

    Here is a use case of Last Successful Extract Time:

    • A directory containing all the sales data is located at /path/to/sales
    • The directory contains the following files: /path/to/sales/sales_california.parquet, /path/to/sales/sales_newyork.parquet, /path/to/sales/illinois.parquet

    When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_ohio.parquet, the next incremental load will pick up this file since its last modified timestamp will be more recent than that of the files extracted in the previous full load.

  • Timestamp in File Name: This option will load data from the time specified in the file name.

    Here is an example use case of Timestamp in File Name:

    • A directory containing all the sales data is located at /path/to/sales
    • The directory receives a new file on a daily basis: /path/to/sales/sales_2020-04-01.parquet, /path/to/sales/sales_2020-04-02.parquet, /path/to/sales/sales_2020-04-03.parquet

    When you perform a full load, the union of all existing files will be extracted into the same table. After that, if the directory receives a new file, such as /path/to/sales/sales_2020-04-04.parquet, the next incremental load will pick up this file since the timestamp in the file name is more recent than that of the files extracted in the previous full load.
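
The selection logic behind the two incremental extract methods can be sketched as follows. This is an illustration only, not the connector's implementation: the function names, the modification times, and the assumption that file names carry a YYYY-MM-DD date are all hypothetical.

```python
import re
from datetime import date, datetime

def newer_by_modified_time(modified_times, last_extract):
    """Last Successful Extract Time: keep files modified after the last extract."""
    return sorted(
        path for path, modified in modified_times.items() if modified > last_extract
    )

# Assumed file name convention: a YYYY-MM-DD date somewhere in the name.
DATE_IN_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")

def newer_by_file_name(paths, last_loaded):
    """Timestamp in File Name: keep files whose embedded date is after last_loaded."""
    selected = []
    for path in paths:
        match = DATE_IN_NAME.search(path)
        if match and date.fromisoformat(match.group()) > last_loaded:
            selected.append(path)
    return sorted(selected)

# Hypothetical modification times for the files in the first use case.
modified_times = {
    "/path/to/sales/sales_california.parquet": datetime(2020, 4, 1, 8, 0),
    "/path/to/sales/sales_newyork.parquet": datetime(2020, 4, 1, 8, 5),
    "/path/to/sales/illinois.parquet": datetime(2020, 4, 1, 8, 10),
    "/path/to/sales/sales_ohio.parquet": datetime(2020, 4, 2, 9, 0),
}
print(newer_by_modified_time(modified_times, datetime(2020, 4, 1, 12, 0)))
# ['/path/to/sales/sales_ohio.parquet']

daily_files = [
    "/path/to/sales/sales_2020-04-01.parquet",
    "/path/to/sales/sales_2020-04-02.parquet",
    "/path/to/sales/sales_2020-04-03.parquet",
    "/path/to/sales/sales_2020-04-04.parquet",
]
print(newer_by_file_name(daily_files, date(2020, 4, 3)))
# ['/path/to/sales/sales_2020-04-04.parquet']
```

The first method depends on the last modified time of each file. The second depends only on the file name, so a file that is re-uploaded under an old date would not be picked up again in this sketch.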

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the GCS schema.
  • In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the GCS schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the GCS schema.
  • In the Schema Designer, in the Action bar, select Load → Load Now → Full.
  • To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the GCS schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps:

  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the List view, select the GCS schema.
  • In the Schema Designer, in the Action bar, select Explore Data.