Configure Incorta Intelligent Ingest for Azure Synapse
About Incorta Intelligent Ingest for Azure Synapse
Incorta Intelligent Ingest streamlines the integration between complex data sources and Microsoft Azure Synapse Analytics. It accelerates data mart deployment and automates data model design and source schema mapping with Incorta data applications. Incorta data applications provide a faster, more effective option for analyzing and understanding your data, with no data modeling or extract, transform, load (ETL) required.
Following is the high-level Incorta Intelligent Ingest process:
- Connect Incorta to your data source with a data application, and load pre-existing physical schema fact and dimension tables.
- Copy the physical schema tables to Parquet files that reside on Azure Data Lake Storage (ADLS) Gen2.
- Load the Synapse data marts from the Parquet files.
- Visualize the data with Power BI.
Get Started with Incorta Intelligent Ingest
Following are the steps to get started with Incorta Intelligent Ingest:
- Enable Incorta Intelligent Ingest
- Configure Spark for Intelligent Ingest
- Configure Your Azure Synapse Analytics SQL Pool
- Create an Azure Synapse Data Destination
- Set a Data Destination for a Physical Schema
- Create Materialized Views to Trigger Data Ingest
- Schedule Materialized View Load Jobs
To use Incorta Intelligent Ingest, you must have the following:
- Incorta On-Premises version 5.2.1
- An Incorta tenant hosted on ADLS Gen2
When naming tenants or data destinations, it is best practice to use a generic name, for example, "Incorta" rather than "IncortaDev". A generic name is helpful if you later need to import or export tenants or perform data destination integrations.
Enable Incorta Intelligent Ingest
For existing on-premises customers, a Systems Administrator with root access to your Incorta host can enable Intelligent Ingest by setting the intelligent.ingest.enabled property in the Analytics and Loader service.properties files. Here are the steps to update the service.properties files:
- Sign in to the host running your Incorta nodes.
- Navigate to your service.properties file located in the following path:
<INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/services/<GUID>/incorta/service.properties
The GUID is located in the services directory for both the Analytics and Loader services. An example of a GUID is c3b87de2-d49e-4099-8005-cd6251a35d69. You will need to update the service.properties file for each GUID.
- In the service.properties file, to enable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to true:
intelligent.ingest.enabled = true
To disable Incorta Intelligent Ingest, set the intelligent.ingest.enabled property to false.
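Because each service has its own GUID directory, you can apply the change to every service at once. The following is a minimal sketch in Python (it assumes a hypothetical installation root of /opt/incorta; substitute your <INCORTA_NODE_INSTALLATION_PATH>):
# enable_intelligent_ingest.py - set intelligent.ingest.enabled=true in every
# service.properties file under the IncortaNode services directory
# (assumes /opt/incorta as the installation root; adjust as needed)
import pathlib

BASE = pathlib.Path("/opt/incorta/IncortaNode/services")

for props in BASE.glob("*/incorta/service.properties"):
    lines = props.read_text().splitlines()
    # drop any existing setting, then append the enabled flag
    lines = [l for l in lines if not l.startswith("intelligent.ingest.enabled")]
    lines.append("intelligent.ingest.enabled = true")
    props.write_text("\n".join(lines) + "\n")
    print(f"updated {props}")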
Configure Spark for Intelligent Ingest
There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. Here is the path to the configuration file: <INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf
Install Python 3.6 and Dependent Libraries
You will need to install Python version 3.6 for compatibility with the Intelligent Ingest Python package.
Update the package installer for Python
Use the following command to update the package installer for Python:
/usr/bin/python3 -m pip install --upgrade pip --user
Install the SQL Server ODBC driver
The SQL Server ODBC driver library, pyodbc, may require dependent libraries. For Red Hat, you can install them with the following commands:
sudo yum install epel-release
sudo yum install python3-pip gcc-c++ python3-devel unixODBC-devel
pip3 install --user pyodbc
Then install the Microsoft ODBC Driver 17 for SQL Server:
sudo su
#Red Hat Enterprise Server 7 and Oracle Linux 7
curl https://packages.microsoft.com/config/rhel/7/prod.repo > /etc/yum.repos.d/mssql-release.repo
exit
sudo yum remove unixODBC-utf16 unixODBC-utf16-devel #to avoid conflicts
sudo ACCEPT_EULA=Y yum install -y msodbcsql17
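Before moving on, you can confirm that the driver registered with unixODBC. A minimal sketch using pyodbc.drivers(), which lists the ODBC drivers visible on the host:
# run with /usr/bin/python3 - verify the SQL Server ODBC driver is visible to pyodbc
import pyodbc

drivers = pyodbc.drivers()  # names of all registered ODBC drivers
print(drivers)
assert "ODBC Driver 17 for SQL Server" in drivers, "msodbcsql17 is not registered"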
Install the Intelligent Ingest Wheel
Run the following command to install the Intelligent Ingest wheel file:
/usr/bin/python3 -m pip install <INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/Synapse/pkg/incorta_synapse-2.0.5-py3-none-any.whl
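To verify that the wheel installed correctly, a quick sanity check (assuming the package exposes load_to_synapse, the entry point used by the materialized view scripts later in this guide):
# run with /usr/bin/python3 - confirm the Intelligent Ingest package is importable
from incorta_synapse import load_to_synapse
print("incorta_synapse imported successfully")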
Update the Spark environment properties
Update the Python Path in the CMC for your tenant or default tenant configuration.
Update the Python Path for a Tenant Configuration
Here are the steps to specify the Python Path for a specific tenant:
- Sign in to the CMC.
- In the Navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Tenants.
- For the given tenant, select Configure.
- In the left pane, select Advanced.
- In the right pane, for Python Path, enter python3.
- Select Save.
Update the Python Path for the Default Tenant Configuration
Here are the steps to specify the Python Path for the Default Tenant Configuration:
- Sign in to the CMC.
- In the Navigation bar, select Clusters.
- In the cluster list, select a Cluster name.
- In the canvas tabs, select Cluster Configurations.
- In the panel tabs, select Default Tenant Configurations.
- In the left pane, select Advanced.
- In the right pane, for Python Path, enter python3.
- Select Save.
Update the Spark configuration properties
There is a Spark configuration file read by the materialized views that trigger the ingest to Azure Synapse. This file requires several parameter updates. Here is the path to the configuration file: <INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/spark/conf/spark-defaults.conf
The required Intelligent Ingest properties for the Spark configuration file are as follows, and are discussed in more detail below:
spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<TENANT_DIRECTORY_PATH>
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
Update the Spark ADLS parameters
Update the following Spark ADLS parameters:
spark.ADLS.accountname=<STORAGE_ACCOUNT_NAME>
spark.ADLS.filesystem=<CONTAINER_NAME>
spark.ADLS.rootdirectory=<DIRECTORY_PATH>
You can use the CMC tenant path to identify the property values. The tenant path has the following form:
abfs://<CONTAINER_NAME>@<STORAGE_ACCOUNT_NAME>.dfs.core.windows.net/<DIRECTORY_PATH>
For example, with a CMC tenant path of
abfs://storage@incortadatalakegen2.dfs.core.windows.net/incorta/Tenants/ebs_demo
the Spark ADLS parameters are:
spark.ADLS.accountname=incortadatalakegen2
spark.ADLS.filesystem=storage
spark.ADLS.rootdirectory=incorta/Tenants/ebs_demo
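As a cross-check, the mapping from a tenant path to the three properties can be expressed in a few lines of Python (a hypothetical helper for illustration only, not part of the product):
# abfs_to_spark_props.py - derive the Spark ADLS properties from a CMC tenant path
from urllib.parse import urlparse

def abfs_to_spark_props(tenant_path):
    parsed = urlparse(tenant_path)  # netloc is <container>@<account>.dfs.core.windows.net
    container, host = parsed.netloc.split("@", 1)
    account = host.split(".", 1)[0]  # strip .dfs.core.windows.net
    return {
        "spark.ADLS.accountname": account,
        "spark.ADLS.filesystem": container,
        "spark.ADLS.rootdirectory": parsed.path.lstrip("/"),
    }

print(abfs_to_spark_props(
    "abfs://storage@incortadatalakegen2.dfs.core.windows.net/incorta/Tenants/ebs_demo"))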
Update the Incorta API parameters
To access the Incorta APIs used by Intelligent Ingest, you will need to specify the Incorta host URL, tenant name, user name that has access to the schemas, and the password. Add the following parameters to the configuration file, replacing the example host URL, tenant name, user name, and password with the values for your instance:
#Incorta APIs
spark.Incorta.host=http://127.0.0.1:8080/incorta
spark.Incorta.tenantname=<TENANT_NAME>
spark.Incorta.user=<USER_NAME>
spark.Incorta.password=<PASSWORD>
By default, the Intelligent Ingest logs are located in the following directory:
<INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/Synapse/logs
There are several optional configurations for your Incorta API parameters; you can review them in the Optional Incorta API configurations section.
Table level settings
All tables with keys use the Hash distribution option. By default, the first key is used as the distribution column.
All tables are indexed using Clustered columnstore indexes by default.
Optional configurations for table level settings are available in the Optional table level settings section.
Configure Your Azure Synapse Analytics SQL Pool
Perform the following configuration steps in your Azure Synapse Analytics SQL Pool before you load data into Synapse from Incorta:
- Create the Incorta login for your SQL pool for the master database
- Create the Incorta user with grants for the SQL pool database
It is assumed that you already have a Synapse workspace and dedicated SQL Pool from the Azure blade.
Create the Incorta login for your SQL pool for the master database
Here are the steps to create an Incorta login for your SQL pool for the master database. You must replace <PASSWORD> with a secure password.
- In Synapse Analytics, for your workspace, select New → SQL script.
- In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select master.
- In the SQL editor, run the following CREATE LOGIN command, replacing <PASSWORD> with a secure password for the new login.
CREATE LOGIN incorta WITH PASSWORD = '<PASSWORD>';
You will need to provide this username and password when you create an Azure Synapse Data Destination.
Create the Incorta user with grants for the SQL pool database
Here are the steps to create an Incorta user with grants for the selected SQL pool database:
- In Synapse Analytics, in Develop, select + → SQL script.
- In the SQL editor, in the SQL Script tab, in Connect to, first select your dedicated SQL Pool and then, in Use database, select your SQL pool database.
- In the SQL editor canvas, run the following SQL commands:
CREATE USER incorta FOR LOGIN incorta;
EXEC sp_addrolemember 'db_ddladmin', 'incorta';
EXEC sp_addrolemember 'db_datawriter', 'incorta';
EXEC sp_addrolemember 'db_datareader', 'incorta';
GRANT ALTER ANY EXTERNAL FILE FORMAT to incorta;
GRANT ADMINISTER DATABASE BULK OPERATIONS to incorta;
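Optionally, you can verify the login and grants from the Incorta host before configuring the data destination. A minimal connectivity sketch with pyodbc (the endpoint, database, and password placeholders are yours to fill in):
# run with /usr/bin/python3 - verify the incorta login against the dedicated SQL pool
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<SQL_POOL_ENDPOINT>,1433;"  # for example, <WORKSPACE_NAME>.sql.azuresynapse.net
    "DATABASE=<SQL_POOL_DATABASE>;"
    "UID=incorta;PWD=<PASSWORD>"
)
print(conn.cursor().execute("SELECT 1").fetchone())  # (1,) confirms the login works
conn.close()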
Create an Azure Synapse Data Destination
When you create an Azure Synapse Data Destination, you can seamlessly ingest physical schema tables from Incorta into your Azure Synapse instance.
Here are the steps:
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Destination.
- In the Choose a Data Destination dialog, in Schema Destination select Synapse Analytics.
- In the New Data Destination dialog, specify the following Azure Synapse schema destination properties:
- SQL Pool Endpoint: This is the dedicated SQL endpoint
- Database Name: This is the SQL pool name
- SQL Username: This is the username created in the Create the Incorta login for your SQL pool for the master database section above.
- SQL Password: This is the password created in the Create the Incorta login for your SQL pool for the master database section above.
- Port: The default value is 1433
- To test the connection, select Test Connection.
- To save, select Ok.
For more information on how to manage Data Destinations, refer to Tools → Data Manager.
Set a Data Destination for a Physical Schema
Set a data destination for physical schemas that will push tables to Azure Synapse. Here are the steps:
- In the Navigation bar, select Schema.
- In the Action bar, select Settings (gear icon) → Set Data Destination.
- In the Set Schema Destination for <SCHEMA_NAME> dialog,
- Select a Schema Destination or select + New Schema Data Destination to create a new one.
- Select Done.
A Synapse Analytics icon will appear to the left of the schema name to indicate a data destination has been set for the schema.
Create Materialized Views to Trigger Data Ingest
For each physical schema that will trigger data ingest into Azure Synapse, create a materialized view. Each materialized view should contain the following lines of code for full and incremental load scripts:
Script (full load script)
from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Full")
save(df)
Incremental Script (incremental load script)
from incorta_synapse import load_to_synapse
df = load_to_synapse("<PHYSICAL_SCHEMA_NAME>", "Incremental")
save(df)
These examples pass the schema name as a parameter in the load script. There is also an option to pass a list of tables in the same schema as comma-separated, fully qualified names, as follows:
df = load_to_synapse(["<PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_1>", "<PHYSICAL_SCHEMA_NAME>.<TABLE_NAME_2>"], "Full")
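For example, an incremental script that ingests only two tables from a hypothetical SALES schema would look like the following (the schema and table names are illustrative):
from incorta_synapse import load_to_synapse
df = load_to_synapse(["SALES.ORDERS", "SALES.ORDER_LINES"], "Incremental")
save(df)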
Schedule Materialized View Load Jobs
You can schedule the materialized views that trigger data ingest into Azure Synapse to run in a specific order. For more information, refer to Tools → Scheduler.
The physical schema should be separate from the Synapse load schemas to support scheduling the load jobs.
Optional configurations
Throughout the Intelligent Ingest configuration process there are several optional configurations you can add or modify from the default settings.
Optional Incorta API configurations
You can override the default path for the Intelligent Ingest logs directory by adding the following line:
#Log directory path
spark.Synapse.logs.path=<path_of_the_logs_folder>
You can set table level configurations using the Synapse mappings CSV file that is located by default in the following path:
<INCORTA_NODE_INSTALLATION_PATH>/IncortaNode/Synapse/synapse_mappings.csv
You can override the default path of the Synapse mappings CSV file by adding the following line:
spark.Synapse.mappings.csv=<sample_path_including_the_file_name>
You can add the following to override the default Synapse connection string parameters:
#Synapse connection string parameters
spark.Synapse.driver={ODBC Driver 17 for SQL Server}
spark.Synapse.mars.connection.enabled=yes
spark.Synapse.application.name=IncortaIntelligentIngest
You can add the following to override the target schema name and target table name in Synapse:
spark.Synapse.targetschemaname=<I_SCHEMA>
spark.Synapse.targettablename=<I_TABLE>
- spark.Synapse.targetschemaname: Overrides the schema name in Synapse.
- spark.Synapse.targettablename: Appends a prefix to the table name when the target schema is overridden. For example, spark.Synapse.targettablename=<I_SCHEMA>_<I_TABLE> sets the destination table as targetschemaname.incortaschemaname_tablename, so a target schema of DW and an Incorta table SALES.ORDERS would result in DW.SALES_ORDERS.
You can change the way you access the tenant directory using the following line:
spark.Synapse.copy.auth=principal
spark.Synapse.copy.auth: Controls the ADLS authentication mechanism in the COPY statements. The possible values are:
- principal (default): Use the service principal that uses the same credentials as in core-site.xml.
- managed: Use Managed Identity authentication in Synapse and the ADLS storage account.
Optional table level settings
You can change some settings at the table level, which include the table hash distribution key or indexing type of columns.
If you want to change the hash distribution key, you can add the following row to the synapse_mappings.csv file:
<SCHEMA_NAME>,<TABLE_NAME>,<HASH_COLUMN>,false
If you want to change the indexing type of a table to Heap, you can add the following row to the synapse_mappings.csv file:
<SCHEMA_NAME>,<TABLE_NAME>,,true
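For instance, assuming a hypothetical SALES schema, the following rows would distribute the ORDERS table on its ORDER_ID column and switch the LOOKUPS table to Heap indexing:
SALES,ORDERS,ORDER_ID,false
SALES,LOOKUPS,,true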
Limitations
Intelligent Ingest limitations
- If you want to perform an incremental table load, the incremental column name must be LAST_UPDATE_DATE.
- Encrypted columns are not supported and will be ingested into Synapse as strings, exactly as they exist in Parquet.
- SSO/LDAP users are not supported. The Python script uses only Incorta internal users for authentication.
- For incremental data loading, deduplication is only supported across multiple load jobs, as the solution is built on an assumption that a single load job does not have duplicates.
- If you have multiple tenants in Incorta, Synapse will be supported for a single tenant only.
- Tenant and schema export/import for data destinations are not supported.
Synapse limitations
Hashing, the methodology used for deduplication and key definition, supports a single key only. Composite keys are not supported.
A client error will be thrown if a date or timestamp column with a value earlier than January 1, 1970 is ingested.
The MV used to trigger ingest into Synapse may be stuck in the Allocating resource phase in Incorta due to one of the following reasons:
- Multiple jobs are submitted to Synapse and are stuck in the queue. Synapse will not terminate these jobs. You can use the following query to check the jobs in the queue:
SELECT * FROM sys.dm_pdw_exec_requests WHERE status='Running';
- A large data volume is ingested while the SQL pool performance tier is small.