About the Local Files connector
The Local Files connector allows you to connect to a directory that is available to the host of the Analytics Service. The directory is typically a shared mount, but it can also be the path to the data folder for a given tenant in Shared Storage. The Local Files connector supports ingesting one or more files of the following types:
- Text (.csv, .tsv, .tab, .txt)
- Excel (.xlsx)
- Parquet (.parquet)
- Optimized Row Columnar (.orc)
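As a rough illustration of the list above, the following Python sketch maps a file extension to the file type it would fall under. The function name `file_type` and the case-insensitive comparison are assumptions made for this example, not documented connector behavior.

```python
from pathlib import Path

# Extension-to-type mapping taken from the supported file types listed above.
SUPPORTED_TYPES = {
    ".csv": "Text",
    ".tsv": "Text",
    ".tab": "Text",
    ".txt": "Text",
    ".xlsx": "Excel",
    ".parquet": "Parquet",
    ".orc": "ORC",
}


def file_type(path):
    """Return the file type for `path`, or None if the extension is not listed.

    Lower-casing the extension is an assumption of this sketch.
    """
    return SUPPORTED_TYPES.get(Path(path).suffix.lower())
```

A helper like this can be used to pre-check a directory listing before pointing a data source at it.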
Local Files Connector Updates
This section describes the updates in newer versions of the Local Files connector available on the Incorta connectors marketplace.
To get the newer version of the connector, update the connector using the marketplace.
Version | Updates |
---|---|
2.0.1.8 | Fixed an issue with versions from 2.0.1.0 to 2.0.1.7 of the Local Files connector that might have affected users who use Wildcard Union on directories containing a large number of files, resulting in load failures or longer load times |
Keep your connector up-to-date with the latest connector version released to get all introduced fixes and enhancements.
About the Local Files connector
The Local Files connector also supports the use of Remote tables. Remote tables enable you to access large CSV, Parquet, and ORC files without loading them into memory. Duplicating these large files wastes disk space and uses too much memory when only a small portion of this data might be needed.
Remote tables are not supported for Microsoft Excel files.
The Local Files connector supports the following configurations as a data source for a physical schema table:
Feature | Supported |
---|---|
Chunking | ✔ |
Data Agent | ✔ |
Encryption at Ingest | |
Incremental Load | ✔ |
Multi-Source | ✔ |
OAuth | |
Performance Optimized | ✔ |
Remote | ✔ |
Single-Source | ✔ |
Spark Extraction | |
Webhook Callbacks | ✔ |
A Linux Administrator or Systems Administrator for the host with directory access must provide the necessary access to the directory for the incorta Linux user or other Linux user that is running the Analytics Service, Loader Service, or Data Agent process.
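To verify that access has been granted, an administrator could run a quick check as the Linux user that runs the service. The Python sketch below is one way to do that; the function name `check_directory_access` is hypothetical, and the checks are generic POSIX permission checks rather than anything specific to Incorta.

```python
import os


def check_directory_access(directory):
    """Return a list of access problems the current user has with `directory`.

    An empty list means the directory exists and can be read and traversed.
    """
    problems = []
    if not os.path.isdir(directory):
        problems.append("not a directory or does not exist")
        return problems
    if not os.access(directory, os.R_OK):
        problems.append("no read permission")
    if not os.access(directory, os.X_OK):
        problems.append("no execute (traverse) permission")
    return problems
```

Run the check as the same Linux user that runs the Analytics Service, Loader Service, or Data Agent process, since permissions are evaluated for the calling user.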
Steps to use the Local Files connector
Here are the high-level steps to use the Local Files connector:
- Create an external data source
- Create a physical schema with the Schema Wizard
- or, Create a physical schema table with the Schema Designer
- Load the physical schema
- Explore the physical schema
Create an external data source
Here are the steps to create an external data source with the Local Files connector:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Source.
- In the Choose a Data Source dialog, in Data lake, select Data Lake - Local Files.
- In the New Data Source dialog, specify the applicable connector properties.
- To test, select Test Connection.
- Select Ok to save your changes.
Local Files connector properties
Here are the properties for the Local Files connector:
Property | Control | Description |
---|---|---|
Data Source Name | text box | Enter the name of the data source |
Directory | text box | Enter the parent directory path. |
Use Data Agent | toggle | Enable to support the use of a data agent |
Data Agent | dropdown list | Select a data agent that has access to the directory path. |
Create a physical schema with the Schema Wizard
Here are the steps to create a Local Files physical schema with the Schema Wizard:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Schema Wizard.
- In (1) Choose a Source, specify the following:
- For Enter a name, enter the physical schema name.
- For Select a Datasource, select the Local Files external data source.
- Optionally create a description.
- In the Schema Wizard footer, select Next.
- In (2) Manage Tables, in the Data Panel, navigate the directory tree as necessary to select the files. You can either check the Select All checkbox or select individual files.
- In the Schema Wizard footer, select Next.
- In (3) Finalize, in the Schema Wizard footer, select Create Schema.
Create a physical schema with the Schema Designer
Here are the steps to create a Local Files physical schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Create Schema.
- In Name, specify the physical schema name, and select Save.
- In Start adding tables to your schema, select Data Lake.
- In the Data Source dialog, specify the Local Files table data source properties.
- Select Add.
- In the Table Editor, in the Table Summary section, enter the table name.
- To save your changes, select Done in the Action bar.
Local Files table data source properties
For a physical schema table in Incorta, you can define the Local Files specific data source properties as follows:
Property | Control | Description |
---|---|---|
Type | drop down list | Default is Data Lake |
Data Source | drop down list | Select the Local Files external data source |
Remote | toggle | Enable this option to remotely access file data, which means no data is loaded to Incorta. See the Summary of Data Access Methods table for details on how setting this and the Performance Optimized property affects data accessibility. |
File Type | drop down list | Select a file type option: ● Text (csv, tsv, tab, txt) ● Excel (xlsx) - not an option with Remote enabled ● Parquet ● ORC |
Incremental | toggle | Enables incremental loading for the physical schema table |
Has Header? | toggle | This property appears when Remote is disabled and the File Type is Text. Enable this property to indicate the data source has a header row. |
Rows to Skip | text box | This property appears when Remote is disabled and the File Type is Text. Enter the number of rows to skip from the top of the file. |
Wildcard Union | toggle | Enable this property to get incremental data file updates from an existing directory |
File Path | text box | This property appears when Wildcard Union is disabled. Enter the path to the data file, relative to the root directory configured in the data source. |
Worksheet | text box | This property appears when Wildcard Union is disabled and the File Type is Excel. Select the data file worksheet of interest. |
Update File | text box | This property appears when Incremental is enabled and Wildcard Union is disabled. Enter the path to the update file, relative to the root directory configured in the data source. |
Update Worksheet | text box | This property appears when the File Type is Excel, Incremental is enabled, and Wildcard Union is disabled. Select the update file worksheet of interest. |
Incremental Extract Using | drop down list | This property appears when Incremental and Wildcard Union are enabled. Select an incremental load method. |
Timestamp format in file name | drop down list | This property appears when the Timestamp in File Name option is selected for the Incremental Extract Using property. Select the timestamp format that appears in the file name. |
Directory Path | text box | This property appears when Wildcard Union is enabled. Enter the path to the directory, relative to the root directory configured in the data source. To use the root directory, enter ./ or . |
Apply Include Pattern on | drop down list | This property appears when Wildcard Union is enabled. Select either: ● File Name - apply pattern on all file names in the selected directory path ● File Relative Path - apply pattern on relative path in the selected directory path |
Include | text box | This property appears when Wildcard Union is enabled. To include only certain files in the load process, enter a prefix to compare against: ● The names of the files in a directory if Apply Include Pattern on has a value of File Name. For example, entering sales*.parquet will load only those files that start with the word sales and end with .parquet. ● The relative path in a directory if Apply Include Pattern on has a value of File Relative Path. For example, entering sales will load those files in the sales directory. |
Exclude | text box | This property appears when Wildcard Union is enabled. To exclude files from the load process, enter a prefix to compare against. Files that match the prefix will not be loaded. |
Include Sub-Directories | toggle | This property appears when Wildcard Union is enabled. Enable this property to load files within all subdirectories of the directory path hierarchy. If an Include prefix is specified, only files or relative paths in the subdirectories matching the prefix will be loaded. |
Include Filename as a Column | toggle | This property appears when Wildcard Union is enabled. Enable this property to add the file name as the first column in the physical schema table. |
Date Format | drop down list | This property appears when the File Type is Text. Select the text file date format. |
Timestamp Format | drop down list | This property appears when the File Type is Text. Select the text file timestamp format. |
Character Set | drop down list | This property appears when the File Type is Text. Select the text file character set. |
Separator | drop down list | This property appears when the File Type is Text. Select the text file separator. |
Enable Chunking | toggle | This property appears when the File Type is Text. Turn this property on to process the text file in chunks. |
Callback | toggle | Enable this property to expose the Callback URL field |
Callback URL | text box | This property appears when the Callback toggle is enabled. Specify the URL. |
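To make the Include and Exclude rows in the table above more concrete, here is a minimal Python sketch of filtering file names with a wildcard pattern such as sales*.parquet. It covers only the File Name case and assumes shell-style wildcard matching, which is an interpretation of the example in the table and not confirmed connector behavior. The function name `select_files` is hypothetical.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def select_files(relative_paths, include=None, exclude=None):
    """Filter paths by matching patterns against the file name only.

    Assumption: patterns use shell-style wildcards, as in sales*.parquet.
    """
    selected = []
    for rel in relative_paths:
        name = PurePosixPath(rel).name
        # Skip files that do not match the Include pattern, if one is given.
        if include and not fnmatchcase(name, include):
            continue
        # Skip files that match the Exclude pattern, if one is given.
        if exclude and fnmatchcase(name, exclude):
            continue
        selected.append(rel)
    return selected
```

For example, `select_files(["sales_ca.parquet", "sales_ca.csv"], include="sales*.parquet")` keeps only the Parquet file.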
Summary of Data Access Methods Based on Remote and Performance Optimized Settings
Table Properties | Data Source Properties | Parquet | DDM | Memory | SQLi | Materialized View | Analytics |
---|---|---|---|---|---|---|---|
Performance Optimized = Off | Remote = On | No | No | No | Yes | Yes | No |
Performance Optimized = Off | Remote = Off | Yes | Yes | No | Yes | Yes | No, unless created by a materialized view or Notebook |
Performance Optimized = On | Remote = Off | Yes | Yes | Yes | Yes | Yes | Yes |
Incremental Extract Methods
There are two incremental extract methods: Last Successful Extract Time and Timestamp in File Name.
Last Successful Extract Time
The Last Successful Extract Time option instructs the Loader Service to extract data from the table source using the time of the previous successful extraction as the starting time for next extraction. Here is an example use case for the Last Successful Extract Time option:
/path/to/sales is a directory that contains the following files:
- /path/to/sales/sales_california.parquet
- /path/to/sales/sales_newyork.parquet
- /path/to/sales/sales_illinois.parquet
When you perform an initial full load, the Loader Service extracts all the files from the directory. After the successful load, a new file is added to the directory: sales_ohio.parquet. As the new file has a last modified timestamp that is more recent than the timestamp of the previous successful extraction, the next incremental load will only extract the sales_ohio.parquet file.
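The selection rule in this example can be sketched in Python as follows. This only illustrates the idea of comparing last modified times with the previous extraction time; it is not the Loader Service implementation, and the function name `files_modified_since` is hypothetical.

```python
import os


def files_modified_since(directory, last_extract_epoch):
    """Return names of files modified after the last successful extraction.

    `last_extract_epoch` is the previous extraction time in seconds
    since the Unix epoch.
    """
    new_files = []
    for entry in os.scandir(directory):
        # Only regular files with a newer modification time qualify.
        if entry.is_file() and entry.stat().st_mtime > last_extract_epoch:
            new_files.append(entry.name)
    return sorted(new_files)
```

In the scenario above, calling this after the initial load with the time of that load would return only sales_ohio.parquet.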
Timestamp in File Name
The Timestamp in File Name option instructs the Loader Service to extract data using the file name. The file name must contain a specific timestamp format. Here is an example use case for the Timestamp in File Name option:
/path/to/sales is a directory that contains the following files:
- /path/to/sales/sales_2021-04-01.parquet
- /path/to/sales/sales_2021-04-02.parquet
- /path/to/sales/sales_2021-04-03.parquet
When you perform an initial full load, the Loader Service extracts all the files from the directory. After the successful load, a new file is added to the directory: sales_2021-04-04.parquet. As the new file has a timestamp name value that is greater than the already extracted timestamp name values, the next incremental load will only extract the sales_2021-04-04.parquet file.
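A minimal Python sketch of this comparison, assuming file names that embed a yyyy-MM-dd date as in the example, might look like the following. The regular expression and the function names `timestamp_from_name` and `files_newer_than` are assumptions made for illustration; the connector's actual parsing logic is not documented here.

```python
import re
from datetime import datetime

# Matches a yyyy-MM-dd date anywhere in the file name (assumed layout).
DATE_IN_NAME = re.compile(r"\d{4}-\d{2}-\d{2}")


def timestamp_from_name(file_name):
    """Return the date embedded in `file_name`, or None if there is none."""
    match = DATE_IN_NAME.search(file_name)
    if match is None:
        return None
    return datetime.strptime(match.group(), "%Y-%m-%d")


def files_newer_than(file_names, last_loaded):
    """Return the files whose embedded date is later than `last_loaded`."""
    newer = []
    for name in file_names:
        stamp = timestamp_from_name(name)
        if stamp is not None and stamp > last_loaded:
            newer.append(name)
    return newer
```

With the four files from the example and `last_loaded` set to April 3, 2021, only sales_2021-04-04.parquet is selected.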
Timestamp Formats in File Name
- yyyy-MM-dd
- dd.MM.yyyy
- dd-MMM-yy
- dd-MMM-yyyy
- yyyy-MM-dd HH.mm.ss
- Unix Epoch (seconds)
- Unix Epoch (milliseconds)
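The format names above resemble Java-style date patterns. For readers who want to test sample file names outside of Incorta, the sketch below maps each one to what is assumed to be the equivalent Python `strptime` format. The mapping and the function name `parse_file_name_timestamp` are illustrative assumptions; epoch values are interpreted as UTC here.

```python
from datetime import datetime, timezone

# Assumed strptime equivalents for the listed patterns.
STRPTIME_EQUIVALENTS = {
    "yyyy-MM-dd": "%Y-%m-%d",
    "dd.MM.yyyy": "%d.%m.%Y",
    "dd-MMM-yy": "%d-%b-%y",
    "dd-MMM-yyyy": "%d-%b-%Y",
    "yyyy-MM-dd HH.mm.ss": "%Y-%m-%d %H.%M.%S",
}


def parse_file_name_timestamp(text, pattern):
    """Parse `text` using one of the format names from the list above."""
    if pattern == "Unix Epoch (seconds)":
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    if pattern == "Unix Epoch (milliseconds)":
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)
    return datetime.strptime(text, STRPTIME_EQUIVALENTS[pattern])
```

Note that `%b` parses abbreviated month names according to the active locale, so names such as Apr are expected under an English or default locale.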
Text File Date Format
- yyyy-MM-dd
- dd/MM/yyyy
- dd.MM.yyyy
- dd/MMM/yyyy
- dd-MMM-yy
- dd-MMM-yyyy
- MM/dd/yyyy
- yyyy/MM/dd
- Unix Epoch (seconds)
- Unix Epoch (milliseconds)
- Other
Text File Timestamp Format
- yyyy-MM-dd HH:mm:ss
- yyyy-MM-dd HH.mm.ss
- yyyy-MM-dd HH:mm:ss.SSS
- dd/MM/yyyy HH:mm:ss
- dd/MM/yyyy HH.mm.ss
- dd/MM/yyyy HH:mm:ss.SSS
- Unix Epoch (seconds)
- Unix Epoch (milliseconds)
- Other
Text File Character Set
- US-ASCII
- ISO-8859-1
- UTF-8
- UTF-16BE
- UTF-16LE
- UTF-16
Text File Separator
- Comma
- Tab
- Other
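To preview how a text file would be interpreted under a given combination of separator, character set, header, and rows to skip, a small Python sketch such as the one below can help. It uses the standard `csv` module. The function name `read_text_file` is hypothetical, and the order of operations (skip rows first, then read the header) is an assumption of this example, not documented connector behavior.

```python
import csv


def read_text_file(path, separator=",", character_set="UTF-8",
                   has_header=True, rows_to_skip=0):
    """Read a delimited text file and return (header, data_rows).

    Assumption: rows are skipped from the top before the header is read.
    """
    with open(path, newline="", encoding=character_set) as handle:
        # Discard the leading rows that should be skipped.
        for _ in range(rows_to_skip):
            next(handle, None)
        rows = list(csv.reader(handle, delimiter=separator))
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows
```

For a tab-separated file, pass `separator="\t"`; for the Other separator option, pass whichever single character the file uses.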
View the physical schema diagram with the Schema Diagram Viewer
Here are the steps to view the schema diagram using the Schema Diagram Viewer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of physical schemas, select the Local Files physical schema.
- In the Schema Designer, in the Action bar, select Diagram.
Load the physical schema
Here are the steps to perform a Full Load of the Local Files physical schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform.
- In the Navigation bar, select Schema.
- In the list of physical schemas, select the Local Files physical schema.
- In the Schema Designer, in the Action bar, select Load → Load Now → Full.
- To review the load status, in Last Load Status, select the date.
Explore the physical schema
With the full load of the Local Files physical schema completed, you can use the Analyzer to explore the physical schema, create your first insight, and save the insight to a new dashboard.
To open the Analyzer from the physical schema, follow these steps:
- In the Navigation bar, select Schema.
- In the Schema Manager, in the List view, select the Local Files physical schema.
- In the Schema Designer, in the Action bar, select Explore Data.