Blueprints → Configure Incorta Intelligent Ingest for Azure Synapse

About Incorta Intelligent Ingest for Azure Synapse

Incorta Intelligent Ingest streamlines the integration between complex data sources and Microsoft Azure Synapse Analytics. Incorta Intelligent Ingest accelerates data mart deployment, and automates data model design and source schema mapping with Incorta blueprints. Incorta blueprints provide a new, faster, more effective option for analyzing and understanding your data, with no data modeling or extract, transform, load (ETL) required.

Following is the high-level Incorta Intelligent Ingest process:

  • Connect Incorta to your data source with a blueprint, and load pre-existing physical schema fact and dimension tables.
  • Copy the physical schema tables to the Parquet that resides on Azure Data Lake Storage (ADLS) Gen 2.
  • Load Synapse data marts from the parquet.
  • Visualize the data with Power BI.

Get Started with Incorta Intelligent Ingest

Following are the steps to get started with Incorta Intelligent Ingest:

Intelligent Ingest Prerequisites

In order to use Incorta Intelligent Ingest, you must have the following:

Enable Incorta Intelligent Ingest

For existing on-premises customers, a Systems Administrator with root access to your Incorta host can enable Intelligent Ingest by setting the intelligent.ingest.enabled property in the Analytics and Loader service.properties files. Here are the steps to update the service.properties files:

  • Sign in to the host running your Incorta nodes.

  • Navigate to your service.properties file located in the following path: <INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/services/<GUID>/incorta/service.properties

    The GUID is located in the services directory for both the Analytics and Loader services. An example of a GUID is c3b87de2-d49e-4099-8005-cd6251a35d69. You will need to update the service.properties file for each GUID.

  • In the service.properties file, to enable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to true:

    intelligent.ingest.enabled = true

    To disable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to false.

Configure Spark for Intelligent Ingest

There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. Here is the path to the configuration file: <INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf

Install Python 3.6 and Dependent Libraries

You will need to install Python version 3.6 for compatibility with the Intelligent Ingest Python package.

Install the Python libraries

Use the following commands to install the required Python libraries:

/usr/bin/python3 -m pip install --upgrade pip
/usr/bin/python3 -m pip install pyodbc
/usr/bin/python3 -m pip install pandas
/usr/bin/python3 -m pip install pyarrow
/usr/bin/python3 -m pip install azure-storage-file-datalake
/usr/bin/python3 -m pip install azure-identity
/usr/bin/python3 -m pip install dataclasses
/usr/bin/python3 -m pip install requests
Note

The SQL Server ODBC driver library, pyodc, may require dependent libraries. For Red Hat, you can install them with the following commands:

sudo yum install epel-release
sudo yum install python3-pip gcc-c++ python3-devel unixODBC-devel
pip3 install --user pyodbc
sudo su
#Red Hat Enterprise Server 7 and Oracle Linux 7
curl https://packages.microsoft.com/config/rhel/7/prod.repo > /etc/yum.repos.d/mssql-release.repo
exit
sudo yum remove unixODBC-utf16 unixODBC-utf16-devel #to avoid conflicts
sudo ACCEPT_EULA=Y yum install -y msodbcsql17
Create new directories for the Synapse load libraries and logs
  • Create the syn directory in <INCORTA_INSTALLATION_PATH>/IncortaNode/:
mkdir <INCORTA_INSTALLATION_PATH>/IncortaNode/syn
  • Create the logs directory in <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/
mkdir <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/logs
Create the Synapse mappings CSV file

Within <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/, create a file named synapse_mappings.csv with the following content:

schema_name,table_name,hash_column
EXAMPLE,EXAMPLE,EXAMPLE
Copy the Python .egg file

Copy incorta_synapse-<VERSION>.egg to <INCORTA_INSTALLATION_PATH>/incorta.ml/lib/

You can obtain the latest version of the .egg file from Incorta support.

Note

Within the Cluster Management Console (CMC), restart Spark and the Incorta services, including the Analytics and Loader services, after you copy the .egg file.

Update the Spark environment properties

Update the Python Path in the CMC for your tenant or default tenant configuration.

Update the Python Path for a Tenant Configuration

Here are the steps to specify the Python Path for a specific tenant:

  • Sign in to the CMC.
  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Tenants.
  • For the given tenant, select Configure.
  • In the left pane, select Advanced.
  • In the right pane, for Python Path, enter python3.
  • Select Save.

Update the Python Path for the Default Tenant Configuration

Here are the steps to specify the Python Path for the Default Tenant Configuration:

  • Sign in to the CMC.
  • In the Navigation bar, select Clusters.
  • In the cluster list, select a Cluster name.
  • In the canvas tabs, select Cluster Configurations.
  • In the panel tabs, select Default Tenant Configurations.
  • In the left pane, select Advanced.
  • In the right pane, for Python Path, enter python3.
  • Select Save.
Update the Spark configuration properties

There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. This file requires several parameter updates. Here is the path to the configuration file: <INCORTA_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf

The required Intelligent Ingest properties for the Spark configuration file are as follows, and are discussed in more detail below:

spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<TENANT_DIRECTORY_PATH>
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
spark.Synapse.mappings.csv=<INCORTA_INSTALLATION_PATH>/syn/synapse_mappings.csv
spark.Synapse.logs.path=<INCORTA_INSTALLATION_PATH>/syn/logs
Update the Spark ADLS parameters

Update the following Spark ADLS parameters:

spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<DIRECTORY_PATH>

You can use the CMC tenant path to identify the property values.

abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<DIRECTORY_PATH>

For example, with a CMC tenant path of

abfs://storage@incortadatalakegen2.dfs.core.windows.net/incorta/Tenants/ebs_demo

the Spark ADLS parameters are

spark.ADLS.accountname=incortadatalakesa
spark.ADLS.filesystem=storage
spark.ADLS.rootdirectory=incorta/Tenants/ebs_demo
Update the Incorta API parameters

To access the Incorta APIs used by Intelligent Ingest, you will need to specify the Incorta host URL, tenant name, user name that has access to the schemas, and the password. Add the following parameters to the configuration file, replacing the example host URL, tenant name, user name, and password with the values for your instance:

#Incorta APIs
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
Add the path to the logs directory
#Log directory path
spark.Synapse.logs.path=<INCORTA_INSTALLATION_PATH>/IncortaNode/syn/logs
Add the path to the Synapse mappings CSV file
spark.Synapse.mappings.csv=<INCORTA_INSTALLATION_PATH>/syn/synapse_mappings.csv
Optionally update the connection string parameters

The following Synapse connection string parameters are optional when you use the latest version of the egg file (2.0.2 or later):

#Synapse connection string parameters
spark.Synapse.driver={ODBC Driver 17 for SQL Server}
spark.Synapse.mars.connection.enabled=yes
spark.Synapse.application.name=IncortaIntelligentIngest
spark.Synapse.targetschemaname=<I_SCHEMA>
spark.Synapse.targettablename=<I_TABLE>
spark.Synapse.copy.auth = principal
  • spark.Synapse.targetschemaname: Overrides the schema name in Synapse
  • spark.Synapse.targettablename: Appends a prefix to the table name when the target schema is overridden For example,
    spark.Synapse.targettablename=<I_SCHEMA>_<I_TABLE>
    sets the destination table as
    targetschemaname.incortaschemaname_tablename
  • spark.Synapse.copy.auth: Controls the ADLS authentication mechanism in the COPY statements. The possible values are:
    • principal(default): Use the service principal that uses the same credentials as in core-site.xml.
    • managed: Use Managed Identity authentication in Synapse and the ADLS storage account.
Optionally change the Synapse table hash distribution key

All tables with keys will use the Hash distribution option. By default, it uses the first key. If you want to change the key, you can add the following to the synapse_mappings.csv file:

<SCHEMA_NAME>,<TABLE_NAME>,<HASH_COLUMN>

Configure Your Azure Synapse Analytics SQL Pool

Perform the following configuration steps in your Azure Synapse Analytics SQL Pool before you load data into Synapse from Incorta:

Note

It is assumed that you already have a Synapse workspace and dedicated SQL Pool from the Azure blade.

Create the Incorta login for your SQL pool for the master database

Here are the steps to create an Incorta login for your SQL pool for the master database. You must replace <PASSWORD> with a secure password.

  • In Synapse Analytics, for your workspace, select NewSQL script.
  • In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select master.
  • In the SQL editor, run the following CREATE LOGIN command, replacing <PASSWORD> with a secure password for the new login.
CREATE LOGIN incorta WITH PASSWORD = '<PASSWORD>';
Important

You will need to provide this username and password when you create an Azure Synapse Data Destination.

Create the Incorta user with grants for the SQL pool database

Here are the steps to create an Incorta user with grants for the selected SQL pool database:

  • In Synapse Analytics, in Develop, select +SQL script.
  • In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select your SQL pool database.
  • In the SQLeditor canvas, run the following SQL commands:
CREATE USER incorta FOR LOGIN incorta;
EXEC sp_addrolemember 'db_ddladmin', 'incorta';
EXEC sp_addrolemember 'db_datawriter', 'incorta';
EXEC sp_addrolemember 'db_datareader', 'incorta';
GRANT ALTER ANY EXTERNAL FILE FORMAT to incorta;
GRANT ADMINISTER DATABASE BULK OPERATIONS to incorta;

Create an Azure Synapse Data Destination

When you create an Azure Synapse Data Destination, you are able to seamlessly ingest physical schema tables from Incorta to your Azure Synapse instance.

Here are the steps:

  • ​​In the Navigation bar, select Data.
  • In the Action bar, select + NewAdd Data Destination.
  • In the Choose a Data Destination dialog, in Schema Destination select Synapse Analytics.
  • In the New Data Destination dialog, specify the following Asure Synapse schema destination properties:
  • To test the connection, select Test Connection.
  • To save, select Ok.

For more information on how to manage Data Destinations, refer to Tools → Data Manager.

Set a Data Destination for a Physical Schema

Set a data destination for physical schemas that will push tables to Azure Synapse. Here are the steps:

  • ​​In the Navigation bar, select Schema.
  • In the Action bar, select Settings (gear icon) → Set Data Destination.
  • In the Set Schema Destination for <SCHEMA_NAME> dialog,
    • Select a Schema Destination or select + New Schema Data Destination to create a new one.
    • Select Done.

A Synapse Analytics icon will appear to the left of the schema name to indicate a data destination has been set for the schema.

Create Materialized Views to Trigger Data Ingest

For each physical schema that will trigger data ingest into Azure Synapse, create a materialized view. Each materialized view should contain the following lines of code for full and incremental load scripts:

Script (full load script)

from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Full")
save (df)

Incremental Script (incremental load script)

from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Incremental")
save (df)
Note

These examples pass the schema name as a parameter in the load script. There is also an option to pass a list of tables in the same schema as a comma separated list of fully qualified names as follows:

df = load_to_synapse([<PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_1>, <PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_2>], "Full")

Schedule Materialized View Load Jobs

You can schedule the materialized views that trigger data ingest into Azure Synapse to run in a specific order. For more information, refer to Tools → Scheduler.

Note

The physical schema should be separate from the Synapse load schemas to support scheduling the load jobs.

Limitations

Intelligent Ingest limitations

  • If you want to perform an incremental table load, the incremental column name must be LAST_UPDATE_DATE .
  • Encrypted columns are not supported and will be ingested into Synapse as a string, as it exists in Parquet.
  • SSO/LDAP users are not supported. The Python script uses only Incorta internal users for authentication.
  • For incremental data loading, deduplication is only supported across multiple load jobs, as the solution is built on an assumption that a single load job does not have duplicates.
  • If you have multiple tenants in Incorta, Synapse will be supported for a single tenant only.
  • Tenant and schema export/import for data destinations are not supported.

Synapse limitations

  • Hashing, the methodology used for deduplication and key definition, supports a single key only. Composite keys are not supported.

  • A client error will be thrown if a date or timestamp column with a value less than 1.1.1970 is ingested.

  • The MV used to trigger ingest into Synapse may be stuck in the Allocating resource phase in Incorta due to one of the following reasons:

    • Multiple jobs are submitted to Synapse and are stuck in queue. Synapse will not terminate these jobs. You can use the following query to check the jobs in queue:

      SELECT * FROM sys.dm_pdw_exec_requests WHERE status='Running' ;
    • A large data volume is ingested while the SQL pool performance is small.