About XML
Extensible Markup Language (XML) is a text-based markup language for structuring and encoding data in a format that is both human-readable and machine-readable. An XML document consists of elements, attributes, and values. An element is an arbitrary tag such as <products></products>. Between the opening and closing tags of an element, there is a data value or another element, such as <products><product>60-inch LCD Television</product></products>. An attribute is an arbitrary name-value pair that exists within the opening tag of an element, such as sku="abc1235456" in <products><product sku="abc1235456">60-inch LCD Television</product></products>.
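As a minimal sketch of these three concepts, the following Python snippet parses the sample document from the paragraph above with the standard library's xml.etree.ElementTree module and reads the element name, the element value, and the attribute value. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The sample document from the text above.
xml_text = '<products><product sku="abc1235456">60-inch LCD Television</product></products>'

root = ET.fromstring(xml_text)    # the <products> element
product = root.find("product")    # the first <product> child element

print(root.tag)                   # element name
print(product.text)               # element value
print(product.attrib["sku"])      # attribute value
```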
As a data structure and data format, XML is often hierarchical. A hierarchy consists of a grandparent element with sub-elements. From the perspective of the grandparent element, you can refer to the sub-elements as parents, children, and grandchildren. From the perspective of a grandchild element, you can refer to grandparents, parents, and siblings.
To help query XML, there is XPath (XML Path Language). XPath is a query language for selecting one or more elements, element values, or attribute values. Because an element is an arbitrary tag name, XPath refers to an element as a node. You can specify an XPath expression to query various nodes within XML.
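The snippet below is a small illustration of XPath-style node selection using Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The second product and its sku value are hypothetical and added only so that the expressions match more than one node.

```python
import xml.etree.ElementTree as ET

# Sample data: the second <product> is a made-up entry for illustration.
xml_text = """
<products>
  <product sku="abc1235456">60-inch LCD Television</product>
  <product sku="xyz789">Soundbar</product>
</products>
"""
root = ET.fromstring(xml_text)

# Select every <product> node directly under the root and read its value.
names = [node.text for node in root.findall("./product")]

# Select a single node with a predicate on an attribute value.
tv = root.find("./product[@sku='abc1235456']")

# Select matching nodes anywhere in the tree and read an attribute value.
skus = [node.get("sku") for node in root.findall(".//product")]

print(names)
print(tv.text)
print(skus)
```

A path such as ./product that matches a repeating element is the kind of expression that can split a document into rows, one row per matched node.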
About the XML Connector
The XML connector uses the CData JDBC driver for XML. To query an XML resource using a SQL SELECT statement, the driver supports the configuration of various JDBC connection properties, including OAuth authentication. The XML connector also supports XPath query expressions.
As a result, the XML connector allows you to connect to a local or remote XML resource using a Uniform Resource Identifier (URI). Supported locations include service and storage providers such as AWS S3, Azure, Box, Dropbox, FTP, HTTP, HTTPS, Google Cloud Storage, Google Drive, OneDrive, SharePoint, and more.
In addition, the XML connector allows for the configuration of an XML data model: Document, FlattenedDocuments, or Relational.
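To give a rough feel for how the three data model names differ, the sketch below shapes the same nested XML in three ways using plain Python. The sample data, the row layouts, and the key names (id, people_id) are assumptions for illustration and are not the driver's actual output; the authoritative definitions are in Modeling XML Data.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested document: each person has one or more vehicles.
xml_text = """
<people>
  <person><name>Ann</name><vehicle>Car</vehicle><vehicle>Bike</vehicle></person>
  <person><name>Bob</name><vehicle>Truck</vehicle></person>
</people>
"""
people = ET.fromstring(xml_text).findall("person")

# Document-style: one row per person; nested values stay together as one aggregate.
document_rows = [
    {"name": p.findtext("name"), "vehicles": [v.text for v in p.findall("vehicle")]}
    for p in people
]

# Flattened-style: parent and nested values joined into one table,
# one row per (person, vehicle) pair.
flattened_rows = [
    {"name": p.findtext("name"), "vehicle": v.text}
    for p in people
    for v in p.findall("vehicle")
]

# Relational-style: separate parent and child tables linked by a key.
people_table = [{"id": i, "name": p.findtext("name")} for i, p in enumerate(people, start=1)]
vehicles_table = [
    {"people_id": i, "vehicle": v.text}
    for i, p in enumerate(people, start=1)
    for v in p.findall("vehicle")
]

print(len(document_rows), len(flattened_rows), len(people_table), len(vehicles_table))
```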
The XML connector supports the following:
Feature | Supported |
---|---|
Chunking | ✔ |
Data Agent | |
Encryption at Ingest | |
Incremental Load | ✔ |
Multi-Source | ✔ |
OAuth | |
Performance Optimized | ✔ |
Remote | |
Single-Source | ✔ |
Spark Extraction | |
Webhook Callbacks | ✔ |
Deploy the JAR file
The XML connector requires the following JAR file:
cdata.jdbc.xml.jar
The XML connector requires the deployment of a JAR file to the Incorta Node hosts of the Analytics Service and the Loader Service. A systems administrator with root access to the host can deploy the JAR file. A CMC Administrator can restart the Incorta cluster.
The XML connector requires a JAR file that Incorta tests and verifies. The JAR download is only available from Incorta Support and must be purchased from Incorta. The XML connector exposes various properties of the CData JDBC driver for XML for an external data source. The driver documentation is available at CData JDBC Driver for XML.
Here are the steps to copy the JAR file to a standalone Incorta cluster:
- Secure copy the cdata.jdbc.xml.jar file to the host. Here is an example using scp:
    INCORTA_NODE_HOST=100.101.102.103
    cd ~/Downloads
    scp -i ~/.ssh/host_pemkey.pem cdata.jdbc.xml.jar incorta@${INCORTA_NODE_HOST}:/tmp/
- Secure shell into the host:
    ssh -i ~/.ssh/host_pemkey.pem incorta@${INCORTA_NODE_HOST}
- Copy the cdata.jdbc.xml.jar file to the IncortaNode/runtime/lib/ directory in a bash shell:
    sudo su incorta
    INCORTA_INSTALLATION_PATH=/home/incorta/IncortaAnalytics
    cp /tmp/cdata.jdbc.xml.jar $INCORTA_INSTALLATION_PATH/IncortaNode/runtime/lib/cdata.jdbc.xml.jar
Here are the steps to restart the standalone Incorta cluster:
- Sign in to the Cluster Management Console (CMC) as the CMC Administrator.
- In the Navigation bar, select Clusters.
- Select the cluster name in the list.
- In Details, select Restart.
Steps to connect XML and Incorta
To connect XML and Incorta, here are the high-level steps, tools, and procedures:
- Create an external data source
- Create a physical schema with the Schema Wizard
- or, Create a physical schema table with the Schema Designer
- Load the physical schema
- Explore the physical schema
Create an external data source
A Tenant Administrator (Super User), a user that belongs to a group with the SuperRole role, or a user that belongs to a group with the Schema Manager role can create an external data source for a given tenant.
Here are the steps to create an external data source with the XML connector:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Data.
- In the Action bar, select + New → Add Data Source.
- In the Choose a Data Source dialog, select XML.
- In the New Data Source dialog, specify the applicable connector properties.
- To test, select Test Connection.
- Select Ok to save your changes.
XML connector properties
Here are the basic properties for the XML connector:
Property | Control | Description |
---|---|---|
Data Source Name | text box | Required. Enter the name of the external data source. |
URI | text box | Required. The Uniform Resource Identifier (URI) for the XML resource location. |
XPath | text box | Optional. The XPath of an element that repeats at the same height within the XML document (used to split the document into multiple rows). |
Use Connection Pooling | toggle | Optional. Enable to set the related properties for connection pooling. |
Data Model | drop down list | Specifies the data model to use when parsing documents and generating the database metadata. Please refer to Modeling XML Data. |
Service Provider | drop down list | Specifies the local or remote source or service for the XML resource. The Service Provider selection dynamically affects the available properties for configuration. |
Generate Schema Files | drop down list | Required. Indicates when to create an XML connector schema for the selected Data Model option for the XML resource. |
Show Advanced Options | toggle | Optional. Enable to configure the advanced properties. Please refer to Establishing a Connection. |
For the above properties, please refer to the CData JDBC Driver for XML documentation for Establishing a Connection. To learn more about the Data Model property options, please refer to Modeling XML Data.
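For orientation, the sketch below shows how a few of these properties could be assembled into a CData-style JDBC connection string. It assumes the connector passes the URI, XPath, and Data Model values to the driver as the connection properties URI, XPath, and DataModel; the helper function, the file path, and the XPath value are hypothetical.

```python
def build_jdbc_url(uri, xpath=None, data_model=None):
    """Assemble a semicolon-delimited 'jdbc:xml:' connection string (illustrative only)."""
    props = [("URI", uri)]
    if xpath:
        props.append(("XPath", xpath))
    if data_model:
        props.append(("DataModel", data_model))
    return "jdbc:xml:" + "".join(f"{key}={value};" for key, value in props)


# Hypothetical local file and repeating-element path.
url = build_jdbc_url("/tmp/people.xml", xpath="/people/person", data_model="Relational")
print(url)
```

Optional properties are omitted from the string when they are not set, which mirrors how the optional fields in the table above can be left blank.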
Create a physical schema with the Schema Wizard
Here are the steps to create a physical schema with the Schema Wizard:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Schema Wizard.
- In (1) Choose a Source, specify the following:
- For Enter a name, enter the schema name.
- For Select a Datasource, select the XML external data source.
- Optionally create a description.
- In the Schema Wizard footer, select Next.
- In (2) Manage Tables, in the Data panel, first select the name of the Data Source, and then check the Select All checkbox.
- In the Schema Wizard footer, select Next.
- In (3) Finalize, in the Schema Wizard footer, select Create Schema.
Create a physical schema with the Schema Designer
Here are the steps to create a physical schema table using the Schema Designer:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Schema.
- In the Action bar, select + New → Create Schema.
- In Name, specify the schema name, and select Save.
- In Start adding tables to your schema, select XML.
- In the Data Source dialog, specify the XML table data source properties.
- Select Add.
- In the Table Editor, in the Table Summary section, enter the table name.
- To save your changes, select Done in the Action bar.
XML table data source properties
For a physical schema table, you can define the following XML-specific data source properties:
Property | Control | Description |
---|---|---|
Type | drop down list | Default is XML |
Data Source | drop down list | Select the XML external data source |
Incremental | toggle | Enable to configure the incremental load related properties |
Incremental Extract Using | drop down list | Enable Incremental to configure this property. Select between Last Successful Extract Time and Maximum Value of a Column. See Types of Incremental Load. |
Incremental Column | drop down list | Enable Incremental and select Maximum Value of a Column to configure this property. Select the column to be used for Maximum Value of a Column. The Loader will track and use the greatest value or most recent timestamp for each load operation. |
Query | text box | In the Edit Query editor, enter the SQL SELECT query to retrieve data from the XML dataset. |
Update Query | text box | Enable Incremental to configure this property. In the Edit Query editor, enter the SQL SELECT query to retrieve data from the XML dataset. |
Incremental Field Type | drop down list | Enable Incremental to configure this property. Select the format of the table date column: ● Numeric ● Timestamp ● Unix Epoch (seconds) ● Unix Epoch (milliseconds) |
Fetch Size | text box | Used for performance improvement, fetch size defines the number of records that will be retrieved from the database in each batch until all records are retrieved. The default is 5000. |
Chunking Method | drop down list | Chunking methods allow for parallel extraction of large tables. The default is No Chunking. There are two chunking methods: ● By Size of Chunking (Single Table) ● By Date/Timestamp |
Chunk Size | text box | Select By Size of Chunking for the Chunking Method to set this property. Enter the number of records to extract in each chunk in relation to the Fetch Size. The default is 3 times the Fetch Size. |
Order Column | drop down list | Select By Size of Chunking for the Chunking Method to set this property. Select a column in the source table you want to order by before chunking. It's typically an ID column and it must be numeric. |
Upper Bound for Order Column | text box | Optional. Enter the maximum value for the order column. |
Lower Bound for Order Column | text box | Optional. Enter the minimum value for the order column. |
Order Column [Date/Timestamp] | drop down list | Select By Date/Timestamp for the Chunking Method to set this property. Select a column in the source table you want to order by before chunking. It should be a Date/Timestamp column. |
Chunk Period | drop down list | Select the chunk period that will be used in dividing chunks: ● Daily ● Weekly (default) ● Monthly ● Yearly ● Custom |
Callback | toggle | Enable this option to call back on the source data set |
Callback URL | text box | Enable Callback to configure this property. Specify the URL. |
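The arithmetic behind the Fetch Size, Chunk Size, and order column bounds can be sketched as follows. This is only an illustration of how a numeric order column range might be divided for parallel extraction, assuming densely populated values between the bounds; it is not the Loader's actual algorithm, and the function name is hypothetical.

```python
def chunk_ranges(lower, upper, fetch_size=5000, chunk_size=None):
    """Split the inclusive range [lower, upper] of a numeric order column into chunks."""
    if chunk_size is None:
        chunk_size = 3 * fetch_size  # documented default: 3 times the Fetch Size
    ranges = []
    start = lower
    while start <= upper:
        end = min(start + chunk_size - 1, upper)
        ranges.append((start, end))
        start = end + 1
    return ranges


# With the default Fetch Size of 5000, each chunk covers 15000 values.
ranges = chunk_ranges(lower=1, upper=40000)
print(ranges)
```

Each resulting range could then be extracted independently, for example with a WHERE clause that restricts the order column to that range.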
View the physical schema diagram with the Schema Diagram Viewer
Here are the steps to view the physical schema diagram using the Schema Diagram Viewer:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Schema.
- In the list of schemas, select the XML physical schema.
- In the Schema Designer, in the Action bar, select Diagram.
Load the physical schema
Here are the steps to perform a Full Load of the XML physical schema using the Schema Designer:
- Sign in to the Incorta Direct Data Platform™.
- In the Navigation bar, select Schema.
- In the list of schemas, select the XML physical schema.
- In the Schema Designer, in the Action bar, select Load → Load Now → Full.
- To review the load status, in Last Load Status, select the date.
Explore the physical schema
With the full load of the XML physical schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.
To open the Analyzer from the physical schema, follow these steps:
- In the Navigation bar, select Schema.
- In the Schema Manager, in the List view, select the XML physical schema.
- In the Schema Designer, in the Action bar, select Explore Data.
For more information about how to use the Analyzer to create insights, see Analyzer and Visualizations.
Additional Considerations
Types of Incremental Load
You can enable Incremental Load for an XML data source. There are two types of incremental extracts:
Last Successful Extract Time
Fetch updates since the last time the tables were loaded. This is determined by the difference between the current time and the database timestamp.
Maximum Value of a Column
The Incremental Field Type value determines how the incremental load functions for the table. The XML connector supports both timestamp and numeric columns. A timestamp column is of the type date or timestamp. A numeric column is of the type int or long.
Changing the Incremental Field Type requires a full load of the table to ensure data integrity.
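The Maximum Value of a Column strategy can be pictured as tracking a high-water mark: after each load, remember the greatest value seen, and on the next load fetch only rows above it. The sketch below illustrates that idea for a numeric column and a timestamp column. The function names, column names, and sample rows are hypothetical, and this is not the Loader's implementation.

```python
from datetime import datetime


def next_watermark(previous, rows, column):
    """Return the greatest value seen so far in `column`, or `previous` if no rows arrived."""
    values = [row[column] for row in rows if row[column] is not None]
    if previous is not None:
        values.append(previous)
    return max(values) if values else None


def rows_to_load(rows, column, watermark):
    """Keep only rows whose tracked column is strictly greater than the watermark."""
    if watermark is None:
        return list(rows)  # nothing loaded yet: take everything
    return [row for row in rows if row[column] > watermark]


# Numeric column (int or long), for example a hypothetical sequence number.
numeric_rows = [{"seq": 101}, {"seq": 105}, {"seq": 103}]
numeric_mark = next_watermark(None, numeric_rows, "seq")

# Timestamp column, with a previous watermark of 2021-02-01.
ts_rows = [
    {"modified": datetime(2021, 1, 1, 8, 0)},
    {"modified": datetime(2021, 3, 15, 12, 30)},
]
new_rows = rows_to_load(ts_rows, "modified", datetime(2021, 2, 1))
ts_mark = next_watermark(datetime(2021, 2, 1), ts_rows, "modified")

print(numeric_mark, ts_mark, len(new_rows))
```

Because values of different types do not compare meaningfully with one another, a watermark recorded under one field type cannot be reused under another, which is consistent with the full load requirement stated above.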
Incremental Load Example
In this example, the people XML must contain a column of the type Date or Timestamp in order to load the table incrementally with a last successful extract time strategy. In this case, the name of the date column is [personal.modifieddate] and the format of the column is Timestamp.
Here are the data source property values for this example:
- Incremental is enabled
- Query contains:
    SELECT
      [personal.age] AS age,
      [personal.gender] AS gender,
      [personal.name.first] AS name_first,
      [personal.name.last] AS name_last,
      [personal.modifieddate] AS modified_date,
      [source],
      [vehicles]
    FROM [people]
- Update Query contains:
    SELECT
      [personal.age] AS age,
      [personal.gender] AS gender,
      [personal.name.first] AS name_first,
      [personal.name.last] AS name_last,
      [personal.modifieddate] AS modified_date,
      [source],
      [vehicles]
    FROM [people]
    WHERE [personal.modifieddate] > ?
  The ? is a variable in the update query that contains the last physical schema refresh date.
- Incremental Field Type = Timestamp
When running an update query for an incremental load, you can use the ? reference character. The ? character is replaced with the last incremental reference to construct a valid query to the database. The ? reference character is not valid in a standard query.
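To see the effect of the ? reference character in isolation, the sketch below runs a shortened version of the update query against an in-memory SQLite table that stands in for the XML source. SQLite is used only because it accepts the same bracketed column names and ? placeholder syntax; the sample rows and the last extract time are hypothetical, and binding the value as a query parameter merely mimics the substitution described above.

```python
import sqlite3

# In-memory stand-in for the people XML resource.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE [people] ([personal.name.first] TEXT, [personal.modifieddate] TEXT)"
)
conn.executemany(
    "INSERT INTO [people] VALUES (?, ?)",
    [("Ann", "2021-01-01 08:00:00"), ("Bob", "2021-03-15 12:30:00")],
)

# Shortened update query: only rows modified after the reference value qualify.
update_query = (
    "SELECT [personal.name.first] AS name_first, "
    "[personal.modifieddate] AS modified_date "
    "FROM [people] WHERE [personal.modifieddate] > ?"
)

last_extract = "2021-02-01 00:00:00"  # hypothetical last successful extract time
rows = conn.execute(update_query, (last_extract,)).fetchall()
print(rows)
```

Only the row modified after the reference time is returned, which is the behavior an incremental load relies on.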
Valid Query Types
When creating a query for the XML connector, only SELECT statements are valid.