
About Microsoft Azure Data Lake Storage Gen2

Microsoft Azure Data Lake Storage Gen2 (ADLS Gen2) is a data lake that stores structured or unstructured data in its raw format. Azure Gen2 is designed for enterprise-scale data storage and big data analytics processing. The service combines the capabilities of two earlier storage services, Azure Blob storage and Azure Data Lake Storage Gen1, so the same data can be accessed as either directory storage or blob storage. This combination provides features such as file system semantics, file-level security, tiered storage, and disaster recovery capabilities.

Microsoft Azure Gen2 Connector Updates

This section describes the updates in newer versions of the Microsoft Azure Gen2 connector available on the Incorta connectors marketplace.

To get the newer version of the connector, update the connector using the marketplace.

Version | Updates
2.0.1.8 | Fixed an issue with versions 2.0.1.0 to 2.0.1.7 of the Microsoft Azure Gen2 connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times.
Recommendation

Keep your connector up-to-date with the latest connector version released to get all introduced fixes and enhancements.
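
As a rough illustration of the version range in the update table above, the following sketch checks whether an installed connector version falls between 2.0.1.0 and 2.0.1.7. The helper names `parse_version` and `is_affected` are hypothetical and are not part of Incorta; the version strings are the only values taken from this page.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '2.0.1.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Affected range as listed in the update table (fixed in 2.0.1.8).
FIRST_AFFECTED = parse_version("2.0.1.0")
LAST_AFFECTED = parse_version("2.0.1.7")


def is_affected(installed: str) -> bool:
    """Return True if the installed version lies inside the affected range."""
    return FIRST_AFFECTED <= parse_version(installed) <= LAST_AFFECTED
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.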

About the Microsoft Azure Gen2 connector

With the Azure Gen2 connector, you can create a data source for an Azure Gen2 data lake storage source. The Azure Gen2 connector supports the following file extensions:

  • Text (.csv, .tsv, .tab, .txt)
  • Excel (.xlsx)
  • Parquet (.parquet)
  • Optimized Row Columnar (.orc)
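
The extension list above can be expressed as a simple lookup. This is a minimal sketch for pre-checking file names before pointing the connector at them; the `file_type_for` helper is hypothetical, and treating extensions as case-insensitive is an assumption rather than documented connector behavior.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions and file types as listed on this page.
FILE_TYPES = {
    ".csv": "Text",
    ".tsv": "Text",
    ".tab": "Text",
    ".txt": "Text",
    ".xlsx": "Excel",
    ".parquet": "Parquet",
    ".orc": "ORC",
}


def file_type_for(path: str) -> Optional[str]:
    """Return the file type for a path, or None if the extension is not listed."""
    suffix = PurePosixPath(path).suffix.lower()  # assumption: case-insensitive
    return FILE_TYPES.get(suffix)
```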

The Azure Gen2 connector supports the following Incorta specific functionality:

Feature | Supported
Chunking
Data Agent
Encryption at Ingest
Incremental Load
Multi-Source
OAuth
Performance Optimized
Remote
Single-Source
Spark Extraction
Webhook Callbacks

The Azure Gen2 connector requires one of the following authentication configurations:

  • Storage Account Key
  • Service Principal

Steps to Connect Azure Gen2 and Incorta

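To connect Azure Gen2 and Incorta, follow these high-level steps:

  • Create an external data source
  • Create a schema with the Schema Wizard or the Schema Designer
  • Load the schema
  • Explore the schema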

Create an external data source

Here are the steps to create an external data source with the Azure Gen2 connector:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Data.
  • In the Action bar, select + New → Add Data Source.
  • In the Choose a Data Source dialog, in Data lake, select Data lake - Azure Gen2 data source.
  • In the New Data Source dialog, specify the applicable connector properties.
  • To test, select Test Connection.
  • Select Ok to save your changes.

Azure Gen2 connector Authentication Type options

The Authentication Type determines the connection properties for connecting to your Azure Gen2 data source. When creating an external data source, the following authentication types are available from a drop down list:

Type | Description
Storage Account Key | Select for authentication using a generated storage access key. A storage access key, or account key, can be assigned to only have access to specified storage.
Service Principal | Select for service principal authentication through role-based access control. Used for creating application-specific access.

Azure Gen2 connector properties for Storage Account Key Authentication

Here are the properties for the Azure Gen2 connector when using Storage Account Key Authentication:

Property | Control | Description
Data Source Name | text box | Enter the name of the data source.
Account Key | text box | Enter the 512-bit authorization key.
Directory | text box | Enter the URI of the Azure Data Lake Storage Gen2 data source. Use the abfs:// scheme identifier when not using TLS. Use the abfss:// scheme identifier to connect over TLS.
Note

The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
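
The following sketch shows one way to sanity-check the two values entered for Storage Account Key authentication. It assumes the account key is the Base64 text shown in the Azure portal, and it assumes the Directory value follows the URI form used by the Hadoop ABFS driver (`abfs[s]://<file_system>@<account>.dfs.core.windows.net/<path>`), where `abfss` is the TLS variant. The exact syntax Incorta expects, including the default file system case mentioned in the note, is not shown here. Both helper names are hypothetical.

```python
import base64
import binascii


def is_512_bit_key(account_key: str) -> bool:
    """Return True if the Base64 text decodes to exactly 64 bytes (512 bits)."""
    try:
        raw = base64.b64decode(account_key, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 64


def build_directory_uri(file_system: str, account: str,
                        path: str = "", use_tls: bool = True) -> str:
    """Build an ABFS-style URI; abfss is used for TLS, abfs otherwise."""
    scheme = "abfss" if use_tls else "abfs"
    uri = f"{scheme}://{file_system}@{account}.dfs.core.windows.net"
    clean = path.strip("/")  # drop leading and trailing slashes
    return f"{uri}/{clean}" if clean else uri
```

For example, `build_directory_uri("sales", "myaccount", "raw/2024")` produces a TLS URI for a hypothetical `sales` file system in a hypothetical `myaccount` storage account.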

Azure Gen2 connector properties for Service Principal Authentication

Here are the properties for the Azure Gen2 connector when using Service Principal Authentication:

Property | Control | Description
Data Source Name | text box | Enter the name of the data source.
Client ID | text box | Enter the client ID, also known as an application ID, which is created when registering an application.
Tenant ID | text box | Enter the Tenant ID, also known as a directory ID, which identifies the tenant to use for authentication.
Client Secret Key | text box | Enter the client secret key. The client uses this key to prove its identity during authentication.
Directory | text box | Enter the URI of the Azure Data Lake Storage Gen2 data source. Use the abfs:// scheme identifier when not using TLS. Use the abfss:// scheme identifier to connect over TLS.
Note

The URI syntax in the Directory connection property is dependent on whether you are connecting to a default file system.
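
To show how the Client ID, Tenant ID, and Client Secret Key fit together, here is a minimal sketch of the standard Microsoft identity platform client credentials token request that a service principal uses for Azure Storage. How the connector performs authentication internally is not described on this page, so treat this as background only. The function name `build_token_request` is hypothetical, and the sketch only builds the request; it does not send it.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str,
                        client_secret: str) -> Tuple[str, str]:
    """Return the token endpoint URL and the form-encoded request body."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope that requests a token for Azure Storage.
        "scope": "https://storage.azure.com/.default",
    }
    return url, urlencode(form)
```

The service principal also needs a storage role assignment on the account or container through role-based access control before such a token grants access to data.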

Create a schema with the Schema Wizard

Here are the steps to create an Azure Gen2 schema with the Schema Wizard:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Schema Wizard.
  • In (1) Choose a Source, specify the following:
    • For Enter a name, enter the schema name.
    • For Select a Datasource, select the Azure Gen2 external data source.
    • Optionally, create a description.
  • In the Schema Wizard footer, select Next.
  • In (2) Manage Tables, in the Data panel, navigate the directory tree as necessary to select your file.
Note

When navigating to the data source from the Data Panel, select files appropriately for creating a schema. File directories chosen at too high or low a directory level may result in a failure to retrieve data or incorrect scope of data for a table.

  • In the Schema Wizard footer, select Next.
  • In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create an Azure Gen2 schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Action bar, select + New → Create Schema.
  • In Name, specify the schema name, and select Save.
  • In Start adding tables to your schema, select Data Lake.
  • In the Data Source dialog, specify the applicable data source properties.
  • Select Add.
  • In the Table Editor, in the Table Summary section, enter the table name.
  • To save your changes, select Done in the Action bar.

Azure Gen2 data source properties

You can specify either a single file or a directory in the Data Source dialog. Enable the Wildcard Union property to indicate that the data source is a directory. The data source properties below are organized by file type and by whether the source is a single file or a folder.

Common data source properties for all file and directory types

Here are some of the common properties for all file and directory types:

Note

For an ORC (.orc) file or directory, the following common properties are the only properties available.

Property | Control | Description
Type | drop down list | Default is File System.
Data Source | drop down list | Select the Azure Gen2 external data source.
File Type | drop down list | Select the Text (.csv, .tsv, .tab, .txt), Excel (.xlsx), Parquet (.parquet), or ORC (.orc) file type.
Incremental | toggle | Enable this property to support incremental loading.
Update File | text box / button | With Incremental enabled, enter the relative file path of the text file to update from. When adding to an existing table, the Select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add.
Timestamp format in file name | drop down list | With Incremental enabled, select the timestamp format in the file name.
Incremental Extract Using | drop down list | With Incremental and Wildcard Union enabled, select the extraction method.
Wildcard Union | toggle | Enable this property to get data from a directory.
Directory Path | text box | With Wildcard Union enabled, enter the directory path relative to the root directory specified in the data source.
Apply Include Pattern on | drop down list | With Wildcard Union enabled, select whether the Include pattern applies to the file name or to the file relative path.
Include | text box | With Wildcard Union enabled, enter a keyword with a wildcard * symbol to include specific named files within the folder.
Exclude | text box | With Wildcard Union enabled, enter a keyword with a wildcard * symbol to exclude specific named files within the folder.
Include Sub-Directories | toggle | With Wildcard Union enabled, enable this property to include files from sub-folders.
Include Filename as a Column | toggle | With Wildcard Union enabled, enable this property to add the filename of the file as a column. You will then need to specify a column name.
Filename column | text box | With Include Filename as a Column enabled, enter a column name for the filename, such as source_file_name.
Callback | toggle | Enables the Callback URL field.
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL.
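
To make the Wildcard Union properties more concrete, the sketch below approximates how Directory Path, Include, Exclude, Apply Include Pattern on, and Include Sub-Directories could narrow a list of files. It uses Python's `fnmatch` wildcard rules as a stand-in; the connector's actual matching rules are not documented on this page, so the semantics here (case-sensitive matching, Exclude applied to the file name) are assumptions, and `select_files` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Iterable, List, Optional


def select_files(paths: Iterable[str], directory: str, include: str = "*",
                 exclude: Optional[str] = None, apply_on: str = "name",
                 include_subdirs: bool = False) -> List[str]:
    """Return the paths under `directory` that pass the include/exclude filters."""
    root = PurePosixPath(directory)
    selected = []
    for raw in paths:
        path = PurePosixPath(raw)
        try:
            relative = path.relative_to(root)
        except ValueError:
            continue  # file is outside the chosen directory
        if not include_subdirs and len(relative.parts) > 1:
            continue  # file sits in a sub-folder
        # Match the Include pattern on the file name or the relative path.
        target = path.name if apply_on == "name" else str(relative)
        if not fnmatchcase(target, include):
            continue
        if exclude and fnmatchcase(path.name, exclude):
            continue
        selected.append(raw)
    return selected
```

With a Directory Path of `SALES` and an Include value of `Q*.csv`, only the quarterly CSV files directly inside `SALES` are selected.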

Summary of Data Access Methods Based on Remote and Performance Optimized Settings

Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | MV/Notebooks | Analytics
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless populated via MV/Notebook
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes
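
The summary table above can also be read as a lookup keyed by the two settings. The sketch below encodes the three listed rows exactly as shown; the names `ACCESS_MATRIX` and `access_for` are hypothetical, and the combination that the table does not list (Performance Optimized = On with Remote = On) is deliberately rejected rather than guessed.

```python
from typing import Dict, Tuple

COLUMNS = ("Parquet", "DDM", "Memory", "SQLi", "MV/Notebooks", "Analytics")

# Keys are (performance_optimized, remote); values follow the column order above.
ACCESS_MATRIX: Dict[Tuple[bool, bool], Tuple[str, ...]] = {
    (False, True): ("No", "No", "No", "Yes", "Yes", "No"),
    (False, False): ("Yes", "Yes", "No", "Yes", "Yes",
                     "No, unless populated via MV/Notebook"),
    (True, False): ("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
}


def access_for(performance_optimized: bool, remote: bool) -> Dict[str, str]:
    """Return the access methods for a settings combination listed in the table."""
    key = (performance_optimized, remote)
    if key not in ACCESS_MATRIX:
        raise ValueError("combination is not listed in the summary table")
    return dict(zip(COLUMNS, ACCESS_MATRIX[key]))
```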

Text file or text file directory properties

Here are some of the properties specifically related to selecting a Text (.csv, .tsv, .tab, .txt) file or text file directory:

Property | Control | Description
Has Header? | toggle | Enable if the first row contains column header values.
Rows to skip | numerical input | Enter the number of rows in a file to skip. The default is 0.
File Path | text box | Enter the file path relative to the root directory specified in the data source. Example: SALES/Q1.csv
Date Format | drop down list | Select the format for date values in the file.
Timestamp Format | drop down list | Select the format for timestamp values in the file.
Character Set | drop down list | Select the character set of the text file.
Separator | drop down list | Select the character that separates values in a row.
Other | text box | This property is available when the Separator is set to Other. Enter one or more characters to specify the column separator or delimiter between values in a row.
Enable Chunking | toggle | Enable this property for large file sizes.
Chunk Size (MB) | text box | Enter a value in megabytes (MB) to specify the chunk size.
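
As an illustration of how Character Set, Rows to skip, Has Header?, and Separator interact, the sketch below parses a small text file with Python's `csv` module. It is not the connector's parser. It assumes rows are skipped before the header row is read, it handles only single-character separators, and the generated `col1`, `col2` names for headerless files are invented for the example.

```python
import csv
import io
from typing import List, Tuple


def read_text_table(data: bytes, has_header: bool = True, rows_to_skip: int = 0,
                    separator: str = ",",
                    character_set: str = "utf-8") -> Tuple[List[str], List[List[str]]]:
    """Decode the bytes, skip leading rows, and split each row on the separator."""
    text = data.decode(character_set)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=separator)
    rows = list(reader)[rows_to_skip:]  # assumption: skip first, then read header
    if has_header and rows:
        return rows[0], rows[1:]
    width = len(rows[0]) if rows else 0
    # Placeholder column names for files without a header row.
    return [f"col{i + 1}" for i in range(width)], rows
```

For a file whose first line is an export banner, setting Rows to skip to 1 lets the second line act as the header.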

Excel file or excel file directory properties

Here are the specific properties for an Excel Workbook (.xlsx) file or Excel directory:

Property | Control | Description
Worksheet | drop down list | Select a given worksheet for the Excel Workbook.
Update File | text box / button | With Incremental enabled, enter the relative file path of the desired update file. When adding to an existing table, the Select button opens the Add File dialog. The Add File dialog shows the files from your Azure Gen2 data source. Select a single file and select Add.
Update Worksheet | text box | With Incremental enabled and Wildcard Union disabled, select the desired worksheet in the update file.
Important

This release has limited support for Union Files for Excel Workbook (.xlsx) files. The Loader Service only loads Worksheets with the same name as defined in the table data source properties. For this reason, each Excel Workbook file in the selected folder must have a common Worksheet tab name. You must select this common Worksheet name in the drop down list.
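
Because every workbook in the folder must contain the selected worksheet name, it can help to check the folder before loading. The sketch below assumes you have already collected the worksheet names of each workbook, for example with a spreadsheet library of your choice, into a dictionary. The helper name and the case-sensitive comparison are assumptions.

```python
from typing import Dict, List


def workbooks_missing_sheet(sheets_by_workbook: Dict[str, List[str]],
                            worksheet: str) -> List[str]:
    """Return the workbooks that do not contain the given worksheet name."""
    return sorted(
        name
        for name, sheets in sheets_by_workbook.items()
        if worksheet not in sheets  # assumption: names compared case-sensitively
    )
```

An empty result means every workbook in the folder has the common worksheet name.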

Parquet file or parquet directory properties

Here are the properties specific to a parquet (.parquet) file or parquet directory:

Property | Control | Description
Read data as partitions | toggle | Enable this property to have data read as parquet partitions.

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the schema diagram using the Schema Diagram Viewer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the Azure Gen2 schema using the Schema Designer:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the list of schemas, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Load → Load Now → Full.
  • To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the Azure Gen2 schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps:

  • Sign in to the Incorta Direct Data Platform.
  • In the Navigation bar, select Schema.
  • In the Schema Manager, in the List view, select the Azure Gen2 schema.
  • In the Schema Designer, in the Action bar, select Explore Data.