About Microsoft Azure Data Lake Storage Gen2
Microsoft Azure Data Lake Storage Gen2 (ADLS Gen2) is a data lake that stores structured or unstructured data in its raw format. Azure Gen2 is designed for enterprise-scale data storage and big data analytics processing. The service combines the capabilities of two earlier storage services, Azure Blob storage and Azure Data Lake Storage Gen1, so the same data can be accessed as either directory storage or blob storage. This combination provides features such as file system semantics, file-level security, tiered storage, and disaster recovery capabilities.
About the Microsoft Azure Gen2 connector
With the Azure Gen2 connector, you can create a data source for an Azure Gen2 data lake storage source. The Azure Gen2 connector supports the following file extensions:
- Text (.csv, .tsv, .tab, .txt)
- Excel (.xlsx)
- Parquet (.parquet)
- Optimized Row Columnar (.orc)
The Azure Gen2 connector supports the following Incorta-specific functionality:
Feature | Supported |
---|---|
Chunking | ✔ |
Data Agent | |
Encryption at Ingest | |
Incremental Load | ✔ |
Multi-Source | ✔ |
OAuth | |
Performance Optimized | ✔ |
Remote | ✔ |
Single-Source | ✔ |
Spark Extraction | |
Webhook Callbacks | ✔ |
The Azure Gen2 connector requires one of the following authentication configurations:
- Storage Account Key: a 512-bit key with specified storage access.
- Service Principal: role-based access control granting applications specified access.
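As a rough illustration of the 512-bit key size mentioned above, the following Python sketch checks that a key decodes to 64 bytes. It assumes the key is supplied as Base64 text, which is how Azure typically presents storage account keys. The function name and the placeholder key are hypothetical and are not part of Incorta or Azure.

```python
import base64
import binascii


def is_512_bit_key(account_key: str) -> bool:
    """Return True if the Base64 text decodes to exactly 64 bytes (512 bits)."""
    try:
        raw = base64.b64decode(account_key, validate=True)
    except (binascii.Error, ValueError):
        # Not valid Base64, so it cannot be a well-formed key.
        return False
    return len(raw) * 8 == 512


# Hypothetical placeholder built from 64 zero bytes; not a real credential.
sample_key = base64.b64encode(bytes(64)).decode("ascii")

print(is_512_bit_key(sample_key))
print(is_512_bit_key("too-short"))
```

A check like this only confirms the key's shape. It does not confirm that the key grants access to the storage account, which is what Test Connection is for.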
Steps to Connect Azure Gen2 and Incorta
To connect Azure Gen2 and Incorta, follow these high-level steps:
- Create an external data source
- Create a schema with the Schema Wizard
- or, Create a schema with the Schema Designer
- Load the schema
- Explore the schema
Create an external data source
Here are the steps to create an external data source with the Azure Gen2 connector:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Source.
- In the Choose a Data Source dialog, in Data lake, select Data lake - Azure Gen2 data source.
- In the New Data Source dialog, specify the applicable connector properties.
- To test, select Test Connection.
- Select Ok to save your changes.
Azure Gen2 connector Authentication Type options
The Authentication Type determines the connection properties for connecting to your Azure Gen2 data source. When creating an external data source, the following authentication types are available from a drop down list:
Type | Description |
---|---|
Storage Account Key | Select for authentication using a generated storage access key. A storage access key, or account key, can be assigned to only have access to specified storage. |
Service Principal | Select for service principal authentication through role-based access control. Used for creating application specified access. |
Azure Gen2 connector properties for Storage Account Key Authentication
Here are the properties for the Azure Gen2 connector when using Storage Account Key Authentication:
Property | Control | Description |
---|---|---|
Data Source Name | text box | Enter the name of the data source |
Account Key | text box | Enter the 512-bit authorization key. |
Directory | text box | Enter the URI address to connect to the Azure Data Lake Gen2 data source. Use the abfs:// scheme identifier when not using TLS. An abfss:// scheme identifier connects with a TLS connection. |
The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
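For orientation, Azure documents the fully qualified ABFS URI form as `abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>`, where `abfss` is the TLS variant. The Python sketch below assembles and inspects such a URI. The helper names, the file system `sales`, and the account `exampleaccount` are hypothetical, and the sketch does not cover any shorter syntax that may apply to a default file system.

```python
from urllib.parse import urlsplit


def build_abfs_uri(file_system: str, account: str, path: str = "", use_tls: bool = True) -> str:
    """Assemble a fully qualified ABFS URI for the Directory property."""
    scheme = "abfss" if use_tls else "abfs"
    return f"{scheme}://{file_system}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def describe_abfs_uri(uri: str) -> dict:
    """Split an ABFS URI into its parts and report whether it uses TLS."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "tls": parts.scheme == "abfss",
        "file_system": parts.username,  # the text before '@'
        "host": parts.hostname,         # <account_name>.dfs.core.windows.net
        "path": parts.path,
    }


# Hypothetical values for illustration only.
uri = build_abfs_uri("sales", "exampleaccount", "raw/2024")
print(uri)
print(describe_abfs_uri(uri))
```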
Azure Gen2 connector properties for Service Principal Authentication
Here are the properties for the Azure Gen2 connector when using Service Principal Authentication:
Property | Control | Description |
---|---|---|
Data Source Name | text box | Enter the name of the data source |
Client ID | text box | Enter client ID, also known as an application ID, which is created when registering an application. |
Tenant ID | text box | Enter the Tenant ID, also known as a directory ID, which identifies the tenant to use for authentication. |
Client Secret Key | text box | Enter the client secret key. This key is used for the client to prove identity during authentication. |
Directory | text box | Enter the URI address to connect to the Azure Data Lake Gen2 data source. Use the abfs:// scheme identifier when not using TLS. An abfss:// scheme identifier connects with a TLS connection. |
The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
Create a schema with the Schema Wizard
Here are the steps to create an Azure Gen2 schema with the Schema Wizard:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Schema Wizard
- In (1) Choose a Source, specify the following:
- For Enter a name, enter the schema name.
- For Select a Datasource, select the Azure Gen2 external data source.
- Optionally, create a description.
- In the Schema Wizard footer, select Next.
- In (2) Manage Tables, in the Data panel, navigate the directory tree as necessary to select your file.
When navigating the data source in the Data panel, select files at the directory level that matches the table you intend to create. A directory chosen at too high or too low a level may cause data retrieval to fail or give the table an incorrect scope of data.
- In the Schema Wizard footer, select Next.
- In (3) Finalize, in the Schema Wizard footer, select Create Schema.
Create a schema with the Schema Designer
Here are the steps to create an Azure Gen2 schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Create Schema.
- In Name, specify the schema name, and select Save.
- In Start adding tables to your schema, select Data Lake.
- In the Data Source dialog, specify the data source properties.
- Select Add.
- In the Table Editor, in the Table Summary section, enter the table name.
- To save your changes, select Done in the Action bar.
Azure Gen2 data source properties
In the Data Source dialog, you can specify either a single file or a directory. Enable the Wildcard Union property to indicate that the data source is a directory. The data source properties below are grouped by file type and by whether they apply to a single file or a directory.
Common data source properties for all file and directory types
Here are some of the common properties for all file and directory types. These common properties also make up all of the available properties for an ORC (.orc) file or directory:
Property | Control | Description |
---|---|---|
Type | drop down list | Default is File System |
Data Source | drop down list | Select the Azure Gen2 external data source |
File Type | drop down list | Select the Text (.csv, .tsv, .tab, .txt), Excel (.xlsx), Parquet (.parquet), or ORC (.orc) file type. |
Incremental | toggle | Enable this property to support incremental loading. |
Update File | text box / button | With Incremental enabled, enter the relative file path of the text file to update from. When adding to an existing table, the select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add. |
Timestamp format in file name | drop down list | With Incremental enabled, select the timestamp format in the file name. |
Incremental Extract Using | drop down list | With Incremental and Wildcard Union enabled, select the extraction method. |
Wildcard Union | toggle | Enable this property to get data from a directory. |
Directory Path | text box | With Wildcard Union enabled, enter the directory path relative to the root directory specified in the data source. |
Apply Include Pattern On | drop down list | With Wildcard Union enabled, select this property to apply the Include Pattern to a file name or file relative path. |
Include | text box | With Wildcard Union enabled, enter a keyword with a wildcard * symbol to include specific named files within the folder. |
Include Sub-Directories | toggle | With Wildcard Union enabled, enable this property to include files from sub-folders |
Include Filename as a Column | toggle | With Wildcard Union enabled, enable this property to add the filename of the file as a column. You will then need to specify a column name. |
Filename column | text box | With Include Filename as a Column enabled, enter a column name for the filename, such as source_file_name |
Callback | toggle | Enables the Callback URL field |
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL. |
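To make the interaction between Include and Apply Include Pattern On more concrete, the Python sketch below filters a list of relative paths with a `*` wildcard, matching against either the file name or the full relative path. It uses Python's `fnmatch` rules as a stand-in. Incorta's own matching rules may differ, and the function name and file list are hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(relative_paths, include, apply_on="name"):
    """Keep the paths whose file name (or whole relative path) matches the pattern."""
    selected = []
    for rel in relative_paths:
        # Match on the last path component or on the full relative path.
        target = PurePosixPath(rel).name if apply_on == "name" else rel
        if fnmatchcase(target, include):
            selected.append(rel)
    return selected


# Hypothetical directory listing, relative to the data source root.
files = ["SALES/Q1.csv", "SALES/Q2.csv", "SALES/archive/Q1_old.csv", "HR/staff.csv"]

print(select_files(files, "Q*.csv", apply_on="name"))
print(select_files(files, "SALES/Q*.csv", apply_on="path"))
```

In this sketch, matching on the file name also picks up `SALES/archive/Q1_old.csv`, while matching on the relative path restricts the result to files directly under `SALES/`.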
Summary of Data Access Methods Based on Remote and Performance Optimized Settings
Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/ Notebooks | Analytics |
---|---|---|---|---|---|---|---|
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No |
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook |
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes |
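The summary table can also be read as a lookup keyed on the two settings. The Python sketch below re-encodes the three rows exactly as listed. The names `ACCESS_MATRIX` and `access_methods` are hypothetical, and the combination that the table does not list is treated as an error instead of being guessed.

```python
# Rows copied from the summary table; keys are (performance_optimized, remote).
ACCESS_MATRIX = {
    (False, True): {
        "Parquet": "No", "DDM": "No", "Memory": "No",
        "SQLi": "Yes", "MV/Notebooks": "Yes", "Analytics": "No",
    },
    (False, False): {
        "Parquet": "Yes", "DDM": "Yes", "Memory": "No",
        "SQLi": "Yes", "MV/Notebooks": "Yes",
        "Analytics": "No, unless populated via MV/Notebook",
    },
    (True, False): {
        "Parquet": "Yes", "DDM": "Yes", "Memory": "Yes",
        "SQLi": "Yes", "MV/Notebooks": "Yes", "Analytics": "Yes",
    },
}


def access_methods(performance_optimized: bool, remote: bool) -> dict:
    """Return the table row for the given settings, or fail if it is not listed."""
    try:
        return ACCESS_MATRIX[(performance_optimized, remote)]
    except KeyError:
        raise ValueError("combination not listed in the summary table") from None


print(access_methods(performance_optimized=False, remote=True))
```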
Text file or text file directory properties
Here are some of the properties specifically related to selecting a Text (.csv, .tsv, .tab, .txt) file or text file directory:
Property | Control | Description |
---|---|---|
Has Header? | toggle | Select if the first row contains column header values |
Rows to skip | numerical input | Select the number of rows in a file to skip. The default is 0. |
File Path | text box | Enter the relative path to the root directory as specified in the data source. Example: SALES/Q1.csv |
Date Format | drop down list | Select the format for date values in the file. |
Timestamp Format | drop down list | Select the format for timestamp values in the file. |
Character Set | drop down list | Select the character set of the Text file. |
Separator | drop down list | Select the character that separates values in a row. |
Other | text box | This property is available when the Separator is set to Other. Enter one or more characters to specify the column separator or delimiter between values in a row. |
Enable Chunking | toggle | Enable this property for large file sizes |
Chunk Size (MB) | text box | Enter a value in megabytes (MB) to specify the chunk size |
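To give a feel for how Chunk Size (MB) relates to file size, the Python sketch below splits a file of a given size into byte ranges of at most the chosen chunk size. It is an illustration only: the function name is hypothetical, 1 MB is assumed to mean 1024 × 1024 bytes, and the sketch ignores the row-boundary alignment that a real reader would need.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def chunk_ranges(file_size_bytes: int, chunk_size_mb: int) -> list:
    """Return (start, end) byte offsets covering the file; end is exclusive."""
    if chunk_size_mb <= 0:
        raise ValueError("chunk size must be a positive number of megabytes")
    chunk = chunk_size_mb * BYTES_PER_MB
    return [
        (start, min(start + chunk, file_size_bytes))
        for start in range(0, file_size_bytes, chunk)
    ]


# A hypothetical 250 MB text file read in 100 MB chunks yields three ranges.
ranges = chunk_ranges(250 * BYTES_PER_MB, 100)
print(len(ranges), ranges[-1])
```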
Excel file or excel file directory properties
Here are the specific properties for an Excel Workbook (.xlsx) file or Excel directory:
Property | Control | Description |
---|---|---|
Worksheet | drop down list | Select a given worksheet for the Excel Workbook. |
Update file | text box / button | With Incremental enabled, enter the relative file path of the desired update file. When adding to an existing table, the Select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add. |
Update Worksheet | text box | With Incremental enabled and Wildcard Union disabled, select the desired worksheet in the update file. |
This release has limited support for Union Files for Excel Workbook (.xlsx) files. The Loader Service only loads Worksheets with the same name as defined in the table data source properties. For this reason, each Excel Workbook file in the selected folder must have a common Worksheet tab name. You must select this common Worksheet name in the drop down list.
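Before pointing a table at a folder of workbooks, it can help to confirm that a common Worksheet name actually exists. The Python sketch below takes a mapping from workbook file name to its worksheet names and returns the names shared by every workbook. The function name and sample data are hypothetical, and how you obtain the worksheet names (for example, with a spreadsheet library) is left out.

```python
def common_worksheets(workbooks: dict) -> list:
    """Return worksheet names present in every workbook, in first-workbook order."""
    sheet_lists = list(workbooks.values())
    if not sheet_lists:
        return []
    # Intersect the first workbook's sheet names with all the others.
    shared = set(sheet_lists[0]).intersection(*map(set, sheet_lists[1:]))
    return [name for name in sheet_lists[0] if name in shared]


# Hypothetical folder contents: only "Orders" appears in every workbook.
folder = {
    "Q1.xlsx": ["Orders", "Notes"],
    "Q2.xlsx": ["Orders", "Summary"],
    "Q3.xlsx": ["Summary", "Orders"],
}
print(common_worksheets(folder))
```

An empty result means no single Worksheet name could be selected for all of the files in the folder.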
Parquet file or parquet directory properties
Here are the properties specific to a Parquet (.parquet) file or Parquet directory:
Property | Control | Description |
---|---|---|
Read data as partitions | toggle | Enable this property to have data read as parquet partitions. |
View the schema diagram with the Schema Diagram Viewer
Here are the steps to view the schema diagram using the Schema Diagram Viewer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of schemas, select the Azure Gen2 schema.
- In the Schema Designer, in the Action bar, select Diagram.
Load the schema
Here are the steps to perform a Full Load of the Azure Gen2 schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of schemas, select the Azure Gen2 schema.
- In the Schema Designer, in the Action bar, select Load → Load Now → Full.
- To review the load status, in Last Load Status, select the date.
Explore the schema
With the full load of the Azure Gen2 schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.
To open the Analyzer from the schema, follow these steps:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Schema Manager, in the List view, select the Azure Gen2 schema.
- In the Schema Designer, in the Action bar, select Explore Data.