References → Data Extraction via Advanced SQLi and Public API

Overview

In previous releases, to materialize business view data as Parquet files, you had to create an Analyzer table on top of the business view, and loading that Analyzer table was a time-consuming process.

By leveraging the Advanced SQL Interface (Advanced SQLi) and the Public API, Incorta has streamlined data extraction from verified business views to different destinations in Parquet format. Supported destinations include Google Cloud Storage (GCS), Azure, AWS S3, and local storage. However, Cloud installations support only GCS.

Note

To send extracted Parquet files to GCS, Azure, or AWS S3 from On-Premises tenants, you need to make sure that the Advanced SQLi has access to the respective cloud storage. For details, refer to References → SparkX Access to Cloud Storage.

This labs feature leverages Spark and the Advanced SQLi for faster and more efficient extraction. Additionally, the Public API facilitates seamless integration with various systems and applications, promoting interoperability and scalability.

Labs Features

Data Extraction via Advanced SQLi and Public API is a labs feature.

An Incorta Labs feature is experimental, and its functionality may produce unexpected results. For this reason, an Incorta Labs feature is not ready for use in a production environment. Incorta Support will investigate issues with an Incorta Labs feature. In a future release, an Incorta Labs feature may be either promoted to a product feature ready for use in a production environment or deprecated without notice.

Prerequisites and requirements

The following are the prerequisites and requirements to use this feature:

  • Make sure Advanced SQLi is enabled and properly configured.
  • Set the required Cluster Management Console (CMC) configurations to specify the valid paths and schemas for data extraction.
  • Create an access token for the admin user to authenticate the API requests.
  • Use the following API endpoints, in this order (a Python sketch follows this list):
    1. POST /{tenant}/extraction/schema to create a schema in the Spark Metastore. The schema must be one of the specified schemas in the CMC.
    2. POST /{tenant}/extraction/table to create a new external table in the Spark Metastore, extract data from the specified business view, and save the Parquet files to the specified path. If the table exists or the path is not empty, the endpoint returns an error; however, you can instruct the endpoint to overwrite the existing table or files.
    3. Optionally, PUT /{tenant}/extraction/table to add data from the same or another view to the path of an existing external table.
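A minimal Python sketch of this sequence, using the requests library, is shown below. The base URL, the payload field names (schemaName, tableName, viewName, path, overwrite), and the bearer-style Authorization header are illustrative assumptions; check the Public API reference for the exact parameters your release expects.

    import requests

    # Illustrative values -- replace with your environment's details.
    BASE_URL = "https://<host>/incorta/api/v2"  # hypothetical base path
    TENANT = "<tenant_name>"
    TOKEN = "<admin-access-token>"              # token created for the admin user

    # Assumption: the access token is sent as a bearer token.
    HEADERS = {"Authorization": f"Bearer {TOKEN}"}

    # 1. Create the external schema in the Spark Metastore.
    #    The name must be one of the schemas specified in the CMC.
    requests.post(
        f"{BASE_URL}/{TENANT}/extraction/schema",
        headers=HEADERS,
        json={"schemaName": "extraction_schema"},  # illustrative field name
    ).raise_for_status()

    # 2. Create the external table and extract the business view to Parquet.
    #    "overwrite" reflects the documented option to replace an existing
    #    table or the files in a non-empty path.
    requests.post(
        f"{BASE_URL}/{TENANT}/extraction/table",
        headers=HEADERS,
        json={
            "schemaName": "extraction_schema",
            "tableName": "sales_extract",
            "viewName": "Sales.SalesBV",  # verified business view
            "path": "gs://analytics-extracts/incorta/sales",  # must be a CMC-allowed path
            "overwrite": False,
        },
    ).raise_for_status()

    # 3. Append data from the same or another view to the existing table's path.
    requests.put(
        f"{BASE_URL}/{TENANT}/extraction/table",
        headers=HEADERS,
        json={
            "schemaName": "extraction_schema",
            "tableName": "sales_extract",
            "viewName": "Sales.SalesReturnsBV",
        },
    ).raise_for_status()

The schema name must be one of the External Schemas for Data Extraction configured in the CMC, and the extraction path must be one of the Target Paths for Data Extraction (see the next section).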

CMC configurations

You must set the following configurations per tenant.

To set the configurations required to extract data from verified business views for a specific tenant, follow these steps:

  1. Sign in to the CMC.
  2. Select Clusters > <cluster_name> > Tenants.
  3. On the Tenants list, for the tenant you want, select Configure.
  4. On the tenant configurations page, select Incorta Labs.
  5. In the Target Paths for Data Extraction box, enter the list of valid paths to save the extracted Parquet files. You can add multiple paths as a comma-separated list.
  6. In the External Schemas for Data Extraction box, enter the schema names that can be used for data extraction from verified business views. You can add multiple schemas as a comma-separated list.
  7. Select Save.
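For example, the two boxes might hold values like the following (illustrative bucket names, paths, and schema names only):

    Target Paths for Data Extraction:     gs://analytics-extracts/incorta, s3://analytics-extracts/incorta
    External Schemas for Data Extraction: extraction_schema, finance_extracts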

Note

Incorta must have read and write access to the specified paths.

For GCS destinations, you can contact Incorta Support to get an external GCS bucket for data extraction. This is available at an additional cost.

Data extraction endpoints

Incorta has introduced the following Public API endpoints for data extraction from verified business views.

  • Create External Schema: Creates a new external schema in the Spark Metastore for data extraction.
  • Extract to External Table: Creates a new external table in the Spark Metastore, extracts data from the specified business view, and saves the Parquet files to the specified path.
  • Append to External Table: Extracts data from the specified business view and adds the Parquet files to the path associated with an existing external table.
  • Delete External Table: Removes the external table definition from the Spark Metastore.
  • Delete External Schema: Removes the external schema definition from the Spark Metastore.
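For cleanup, all external tables in a schema must be deleted before the schema itself (see Limitations and known issues below). A minimal sketch of that order follows; the DELETE paths and query parameters are assumptions modeled on the documented POST and PUT paths, so verify them against the Public API reference.

    import requests

    BASE_URL = "https://<host>/incorta/api/v2"  # hypothetical base path, as in the earlier sketch
    TENANT = "<tenant_name>"
    HEADERS = {"Authorization": "Bearer <admin-access-token>"}  # assumption: bearer token

    # Assumption: the delete endpoints mirror the documented
    # /{tenant}/extraction/... paths and identify the target via query parameters.
    requests.delete(
        f"{BASE_URL}/{TENANT}/extraction/table",
        headers=HEADERS,
        params={"schemaName": "extraction_schema", "tableName": "sales_extract"},
    ).raise_for_status()

    # Only after every table in the schema is deleted can the schema be removed.
    requests.delete(
        f"{BASE_URL}/{TENANT}/extraction/schema",
        headers=HEADERS,
        params={"schemaName": "extraction_schema"},
    ).raise_for_status()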

Note

To access extracted Parquet files, you can create a data source that reads from the path where the Parquet files are stored.
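Outside Incorta, the extracted files are standard Parquet and can be consumed by any Parquet-capable reader. For example, a minimal pandas sketch, assuming pyarrow is installed (plus gcsfs for GCS paths) and using an illustrative path:

    import pandas as pd

    # Reads all Parquet part files under the extraction path.
    df = pd.read_parquet("gs://analytics-extracts/incorta/sales")
    print(df.head())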

Limitations and known issues

  • This feature requires enabling the Advanced SQLi.
  • This feature is available for the Super Admin (and users with the SuperRole when the Super User Mode is enabled).
  • For Cloud installations, extraction is supported only to GCS buckets.
  • Extraction is supported only in Parquet format.
  • You must delete all external tables in a schema before deleting the external schema itself.
  • Deleting external tables or schemas only removes their definitions from the Spark Metastore. You must manually delete the respective Parquet files from the related paths.
  • For now, the POST /{tenant}/extraction/table and PUT /{tenant}/extraction/table endpoints can also accept physical tables; however, this is not recommended as it will be prevented in a future release.
  • Avoid updating the source physical schemas while extracting data from related business views, as these updates might fail or get stuck. In such a case, the Spark Metastore must be synced: in the CMC, go to Clusters > <cluster_name> > Tenants > <tenant_name> > More Options, and select Sync Spark Metastore.