Configure Incorta Intelligent Ingest for Azure Synapse
About Incorta Intelligent Ingest for Azure Synapse
Incorta Intelligent Ingest streamlines the integration between complex data sources and Microsoft Azure Synapse Analytics. It accelerates data mart deployment and automates data model design and source schema mapping with Incorta data applications. Incorta data applications provide a new, faster, more effective option for analyzing and understanding your data, with no data modeling or extract, transform, load (ETL) required.
Following is the high-level Incorta Intelligent Ingest process:
- Connect Incorta to your data source with a data application, and load pre-existing physical schema fact and dimension tables.
- Copy the physical schema tables to Parquet files that reside on Azure Data Lake Storage (ADLS) Gen2.
- Load the Synapse data marts from the Parquet files.
- Visualize the data with Power BI.
Get Started with Incorta Intelligent Ingest
Following are the steps to get started with Incorta Intelligent Ingest:
- Enable Incorta Intelligent Ingest
- Configure Spark for Intelligent Ingest
- Configure Your Azure Synapse Analytics SQL Pool
- Create an Azure Synapse Data Destination
- Set a Data Destination for a Physical Schema
- Create Materialized Views to Trigger Data Ingest
- Schedule Materialized View Load Jobs
In order to use Incorta Intelligent Ingest, you must have the following:
- Incorta On-Premises version 5.1.4
- An Incorta Tenant hosted on ADLS Gen2.
When naming tenants or data destinations, it is best practice to use a generic name, for example, "Incorta" rather than "IncortaDev". Doing so is helpful if you later need to import or export tenants or perform data destination integrations.
Enable Incorta Intelligent Ingest
For existing on-premises customers, a Systems Administrator with root access to your Incorta host can enable Intelligent Ingest by setting the intelligent.ingest.enabled property in the service.properties files of the Analytics and Loader services. Here are the steps to update the service.properties files:
- Sign in to the host running your Incorta nodes.
- Navigate to your service.properties file located in the following path:
<INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/services/<GUID>/incorta/service.properties
The GUID is located in the services directory for both the Analytics and Loader services. An example of a GUID is c3b87de2-d49e-4099-8005-cd6251a35d69. You will need to update the service.properties file for each GUID.
- In the service.properties file, to enable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to true:
intelligent.ingest.enabled = true
To disable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to false.
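Because there is a separate GUID directory for each service, you may prefer to script the update. The following is a minimal Python sketch, assuming a hypothetical installation path; adjust SERVICES_DIR to your environment:

import glob

# Hypothetical services path; replace with your own installation path.
SERVICES_DIR = "/opt/incorta/IncortaNode/services"

# Rewrite service.properties under every service GUID directory.
for path in glob.glob(f"{SERVICES_DIR}/*/incorta/service.properties"):
    with open(path) as f:
        lines = [l for l in f if not l.startswith("intelligent.ingest.enabled")]
    lines.append("intelligent.ingest.enabled = true\n")
    with open(path, "w") as f:
        f.writelines(lines)
    print(f"Enabled Intelligent Ingest in {path}")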
Configure Spark for Intelligent Ingest
There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. Here is the path to the configuration file: <INCORTA_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf
Install Python 3.6 and Dependent Libraries
You will need to install Python version 3.6 for compatibility with the Intelligent Ingest Python package.
Install the Python libraries
Use the following commands to install the required Python libraries:
/usr/bin/python3 -m pip install --upgrade pip
/usr/bin/python3 -m pip install pyodbc
/usr/bin/python3 -m pip install pandas
/usr/bin/python3 -m pip install pyarrow
/usr/bin/python3 -m pip install azure-storage-file-datalake
/usr/bin/python3 -m pip install azure-identity
/usr/bin/python3 -m pip install dataclasses
/usr/bin/python3 -m pip install requests
The SQL Server ODBC driver library, pyodbc, may require dependent libraries. For Red Hat, you can install them with the following commands:
sudo yum install epel-release
sudo yum install python3-pip gcc-c++ python3-devel unixODBC-devel
pip3 install --user pyodbc
Install the Microsoft ODBC driver for SQL Server (msodbcsql17):
sudo su
#Red Hat Enterprise Server 7 and Oracle Linux 7
curl https://packages.microsoft.com/config/rhel/7/prod.repo > /etc/yum.repos.d/mssql-release.repo
exit
sudo yum remove unixODBC-utf16 unixODBC-utf16-devel #to avoid conflicts
sudo ACCEPT_EULA=Y yum install -y msodbcsql17
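To verify that the Python libraries and the ODBC driver installed correctly, you can run a short check such as the following sketch:

import pyodbc

# Confirm each required library is importable.
for module in ("pyodbc", "pandas", "pyarrow",
               "azure.storage.filedatalake", "azure.identity", "requests"):
    __import__(module)
    print(f"{module}: OK")

# Confirm unixODBC can see the SQL Server driver;
# the list should include 'ODBC Driver 17 for SQL Server'.
print(pyodbc.drivers())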
Create new directories for the Synapse load libraries and logs
- Create the syn directory in <INCORTA_INSTALLATION_PATH>/IncortaNode/:
mkdir <INCORTA_INSTALLATION_PATH>/IncortaNode/syn
- Create the logs directory in <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/:
mkdir <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/logs
Create the Synapse mappings CSV file
Within <INCORTA_INSTALLATION_PATH>/IncortaNode/syn/, create a file named synapse_mappings.csv with the following content:
schema_name,table_name,hash_column
EXAMPLE,EXAMPLE,EXAMPLE
Copy the Python .egg file
Copy incorta_synapse-<VERSION>.egg to <INCORTA_INSTALLATION_PATH>/incorta.ml/lib/.
You can obtain the latest version of the .egg file from Incorta support.
Within the Cluster Management Console (CMC), restart Spark and the Incorta services, including the Analytics and Loader services, after you copy the .egg file.
Update the Spark environment properties
Update the Python Path in the CMC for your tenant or default tenant configuration.
Update the Python Path for a Tenant Configuration
Here are the steps to specify the Python Path for a specific tenant:
- Sign in to the CMC.
- In the Navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Tenants.
- For the given tenant, select Configure.
- In the left pane, select Advanced.
- In the right pane, for Python Path, enter python3.
- Select Save.
Update the Python Path for the Default Tenant Configuration
Here are the steps to specify the Python Path for the Default Tenant Configuration:
- Sign in to the CMC.
- In the Navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Cluster Configurations.
- In the panel tabs, select Default Tenant Configurations.
- In the left pane, select Advanced.
- In the right pane, for Python Path, enter python3.
- Select Save.
Update the Spark configuration properties
There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. This file requires several parameter updates. Here is the path to the configuration file: <INCORTA_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf
The required Intelligent Ingest properties for the Spark configuration file are as follows, and are discussed in more detail below:
spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<TENANT_DIRECTORY_PATH>
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
spark.Synapse.mappings.csv=<INCORTA_INSTALLATION_PATH>/IncortaNode/syn/synapse_mappings.csv
spark.Synapse.logs.path=<INCORTA_INSTALLATION_PATH>/IncortaNode/syn/logs
Update the Spark ADLS parameters
Update the following Spark ADLS parameters:
spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<DIRECTORY_PATH>
You can use the CMC tenant path to identify the property values.
abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<DIRECTORY_PATH>
For example, with a CMC tenant path of
abfs://storage@incortadatalakegen2.dfs.core.windows.net/incorta/Tenants/ebs_demo
the Spark ADLS parameters are:
spark.ADLS.accountname=incortadatalakegen2
spark.ADLS.filesystem=storage
spark.ADLS.rootdirectory=incorta/Tenants/ebs_demo
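The mapping follows a fixed pattern: the segment before the @ is the filesystem (container), the first label of the host name is the storage account, and the remainder of the path is the root directory. If helpful, the following sketch derives the three values from the example path above:

from urllib.parse import urlparse

# Example CMC tenant path from above.
tenant_path = "abfs://storage@incortadatalakegen2.dfs.core.windows.net/incorta/Tenants/ebs_demo"

parsed = urlparse(tenant_path)
filesystem = parsed.username             # container name: 'storage'
account = parsed.hostname.split(".")[0]  # storage account: 'incortadatalakegen2'
rootdirectory = parsed.path.lstrip("/")  # 'incorta/Tenants/ebs_demo'
print(account, filesystem, rootdirectory)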
Update the Incorta API parameters
To access the Incorta APIs used by Intelligent Ingest, you will need to specify the Incorta host URL, tenant name, user name that has access to the schemas, and the password. Add the following parameters to the configuration file, replacing the example host URL, tenant name, user name, and password with the values for your instance:
#Incorta APIs
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
Add the path to the logs directory
#Log directory path
spark.Synapse.logs.path=<INCORTA_INSTALLATION_PATH>/IncortaNode/syn/logs
Add the path to the Synapse mappings CSV file
spark.Synapse.mappings.csv=<INCORTA_INSTALLATION_PATH>/IncortaNode/syn/synapse_mappings.csv
Optionally update the connection string parameters
The following Synapse connection string parameters are optional when you use the latest version of the egg file (2.0.2 or later):
#Synapse connection string parameters
spark.Synapse.driver={ODBC Driver 17 for SQL Server}
spark.Synapse.mars.connection.enabled=yes
spark.Synapse.application.name=IncortaIntelligentIngest
spark.Synapse.targetschemaname=<I_SCHEMA>
spark.Synapse.targettablename=<I_TABLE>
spark.Synapse.copy.auth=principal
- spark.Synapse.targetschemaname: Overrides the schema name in Synapse.
- spark.Synapse.targettablename: Appends a prefix to the table name when the target schema is overridden. For example, spark.Synapse.targettablename=<I_SCHEMA>_<I_TABLE> sets the destination table as targetschemaname.incortaschemaname_tablename.
- spark.Synapse.copy.auth: Controls the ADLS authentication mechanism in the COPY statements. The possible values are:
  - principal (default): Use the service principal with the same credentials as in core-site.xml.
  - managed: Use Managed Identity authentication in Synapse and the ADLS storage account.
Optionally change the Synapse table hash distribution key
All tables with keys will use the Hash distribution option, which by default uses the first key column. If you want to change the distribution key for a table, add a row to the synapse_mappings.csv file in the following format:
<SCHEMA_NAME>,<TABLE_NAME>,<HASH_COLUMN>
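For example, to distribute a hypothetical EBS_DEMO.SALES table on its CUSTOMER_ID column, you could append a row as in the following sketch (the path, schema, table, and column names are illustrative only):

import csv

# Hypothetical path and values; replace with your installation path and columns.
mappings_path = "/opt/incorta/IncortaNode/syn/synapse_mappings.csv"
with open(mappings_path, "a", newline="") as f:
    csv.writer(f).writerow(["EBS_DEMO", "SALES", "CUSTOMER_ID"])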
Configure Your Azure Synapse Analytics SQL Pool
Perform the following configuration steps in your Azure Synapse Analytics SQL Pool before you load data into Synapse from Incorta:
- Create the Incorta login for your SQL pool for the master database
- Create the Incorta user with grants for the SQL pool database
These steps assume that you already have a Synapse workspace and a dedicated SQL pool created from the Azure portal.
Create the Incorta login for your SQL pool for the master database
Here are the steps to create an Incorta login for your SQL pool for the master database. You must replace <PASSWORD> with a secure password.
- In Synapse Analytics, for your workspace, select New → SQL script.
- In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select master.
- In the SQL editor, run the following CREATE LOGIN command, replacing <PASSWORD> with a secure password for the new login.
CREATE LOGIN incorta WITH PASSWORD = '<PASSWORD>';
You will need to provide this username and password when you create an Azure Synapse Data Destination.
Create the Incorta user with grants for the SQL pool database
Here are the steps to create an Incorta user with grants for the selected SQL pool database:
- In Synapse Analytics, in Develop, select + → SQL script.
- In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select your SQL pool database.
- In the SQL editor canvas, run the following SQL commands:
CREATE USER incorta FOR LOGIN incorta;
EXEC sp_addrolemember 'db_ddladmin', 'incorta';
EXEC sp_addrolemember 'db_datawriter', 'incorta';
EXEC sp_addrolemember 'db_datareader', 'incorta';
GRANT ALTER ANY EXTERNAL FILE FORMAT to incorta;
GRANT ADMINISTER DATABASE BULK OPERATIONS to incorta;
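Before you configure the data destination, you can optionally confirm that the new login and grants work by connecting with pyodbc. This is a minimal sketch; the endpoint, SQL pool name, and password are placeholders for your own values:

import pyodbc

# Placeholders: your dedicated SQL endpoint, SQL pool name, and the password
# you chose in the CREATE LOGIN step.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<WORKSPACE>.sql.azuresynapse.net,1433;"
    "DATABASE=<SQL_POOL_NAME>;"
    "UID=incorta;PWD=<PASSWORD>"
)
print(conn.cursor().execute("SELECT 1").fetchone())  # expect (1,)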
Create an Azure Synapse Data Destination
When you create an Azure Synapse Data Destination, you are able to seamlessly ingest physical schema tables from Incorta to your Azure Synapse instance.
Here are the steps:
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Destination.
- In the Choose a Data Destination dialog, in Schema Destination select Synapse Analytics.
- In the New Data Destination dialog, specify the following Azure Synapse schema destination properties:
- SQL Pool Endpoint: This is the dedicated SQL endpoint
- Database Name: This is the SQL pool name
- SQL Username: This is the username created in the Create the Incorta login for your SQL pool for the master database section above.
- SQL Password: This is the password created in the Create the Incorta login for your SQL pool for the master database section above.
- Port: The default value is 1433
- To test the connection, select Test Connection.
- To save, select Ok.
For more information on how to manage Data Destinations, refer to Tools → Data Manager.
Set a Data Destination for a Physical Schema
Set a data destination for physical schemas that will push tables to Azure Synapse. Here are the steps:
- In the Navigation bar, select Schema.
- In the Action bar, select Settings (gear icon) → Set Data Destination.
- In the Set Schema Destination for <SCHEMA_NAME> dialog:
- Select a Schema Destination or select + New Schema Data Destination to create a new one.
- Select Done.
A Synapse Analytics icon will appear to the left of the schema name to indicate a data destination has been set for the schema.
Create Materialized Views to Trigger Data Ingest
For each physical schema that will trigger data ingest into Azure Synapse, create a materialized view. Each materialized view should contain the following lines of code for full and incremental load scripts:
Script (full load script)
from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Full")
save(df)
Incremental Script (incremental load script)
from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Incremental")
save(df)
These examples pass the schema name as a parameter in the load script. You can also pass a list of tables in the same schema as a comma-separated list of fully qualified names:
df = load_to_synapse(["<PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_1>", "<PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_2>"], "Full")
Schedule Materialized View Load Jobs
You can schedule the materialized views that trigger data ingest into Azure Synapse to run in a specific order. For more information, refer to Tools → Scheduler.
To support scheduling the load jobs, keep the physical schemas separate from the schemas that contain the Synapse load materialized views.
Limitations
Intelligent Ingest limitations
- If you want to perform an incremental table load, the incremental column name must be LAST_UPDATE_DATE.
- Encrypted columns are not supported and will be ingested into Synapse as strings, as they exist in Parquet.
- SSO/LDAP users are not supported. The Python script uses only Incorta internal users for authentication.
- For incremental data loading, deduplication is only supported across multiple load jobs, as the solution is built on an assumption that a single load job does not have duplicates.
- If you have multiple tenants in Incorta, Synapse ingest is supported for a single tenant only.
- Tenant and schema export/import for data destinations are not supported.
Synapse limitations
Hashing, the methodology used for deduplication and key definition, supports a single key only. Composite keys are not supported.
A client error will be thrown if a date or timestamp column with a value earlier than 1970-01-01 is ingested.
The materialized view used to trigger ingest into Synapse may be stuck in the Allocating resource phase in Incorta due to one of the following reasons:
- Multiple jobs are submitted to Synapse and are stuck in the queue. Synapse will not terminate these jobs. You can use the following query to check the jobs in the queue:
SELECT * FROM sys.dm_pdw_exec_requests WHERE status='Running';
- A large data volume is ingested while the SQL pool performance level is small.
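If you need to monitor the queue repeatedly, you can run the same check from Python over pyodbc; a minimal sketch with placeholder connection values:

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<WORKSPACE>.sql.azuresynapse.net,1433;"
    "DATABASE=<SQL_POOL_NAME>;UID=incorta;PWD=<PASSWORD>"
)
# List the requests that Synapse currently reports as running.
for row in conn.cursor().execute(
        "SELECT request_id, status, command "
        "FROM sys.dm_pdw_exec_requests WHERE status='Running'"):
    print(row.request_id, row.status, row.command)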