Connectors → Cosmos DB

About Cosmos DB

Azure Cosmos DB is Microsoft's proprietary, fully managed NoSQL database service for modern app development. It is globally distributed and multi-model, so it can manage data on a global scale. It is also schema-agnostic and horizontally scalable.

Azure Cosmos DB offers multiple database APIs, which include the following:

Core (SQL) API
API for MongoDB
Cassandra API
Gremlin API
Table API

These APIs let your applications treat Azure Cosmos DB as if it were one of these other database technologies, without the management and scaling overhead.

Cosmos DB resource model

Internally, Cosmos DB stores items in containers. These two terms refer to different entities depending on the API that you use.

The following table shows the entities these two terms refer to per API:

Entity / API	SQL API	Cassandra API	Azure Cosmos DB API for MongoDB	Gremlin API	Table API
Item	Item	Row	Document	Node or edge	Item
Container	Container	Table	Collection	Graph	Table

Containers are grouped in databases, which are analogous to namespaces above containers. Containers are schema-agnostic, which means that no schema is enforced when adding items.

For more information, refer to Azure Cosmos DB resource model.

About the Cosmos DB Connector

The Cosmos DB connector is available starting with the 5.1.2 release. It uses the CData JDBC driver for Cosmos DB (cdata.jdbc.cosmosdb.jar) to connect to Cosmos DB items and containers and retrieve data.

To let you query a Cosmos DB resource with a SQL SELECT statement, the driver supports the configuration of various JDBC connection properties.

Important

The CData JDBC driver for Cosmos DB mainly supports the Cosmos DB SQL API.

For a comprehensive reference of the connector properties, refer to CData JDBC Driver for Cosmos DB.

The Cosmos DB connector supports the following Incorta-specific functionality:

Feature	Supported
Chunking	✔
Data Agent
Encryption at Ingest
Incremental Load	✔
Multi-Source	✔
OAuth
Performance Optimized	✔
Remote
Single-Source	✔
Spark Extraction
Webhook Callbacks	✔

Note

The Cosmos DB connector supports two types of incremental loads: using a numeric column and using a date or timestamp column. To learn more, see Types of Incremental Load.

Cosmos DB connector requirements

The Cosmos DB connector requires the following:

An Account Endpoint: this is the Cosmos DB Account Uniform Resource Identifier (URI).
An Access Key: this is the primary key of the Cosmos DB account. It allows the connection to the Cosmos DB REST API to have access to the Cosmos DB resources in a particular account.

You can view your account URI and also view and manage your primary keys from within the Microsoft Azure Portal → Cosmos DB account → Settings → Keys.

For details, refer to Azure Cosmos DB Accounts and Secure access to data in Azure Cosmos DB.
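The two requirements above map onto the driver's connection settings. As a minimal sketch, assuming the CData driver accepts a `jdbc:cosmosdb:` URL with `AccountEndpoint` and `AccountKey` properties (confirm the exact form in the CData JDBC Driver for Cosmos DB reference), the values could be combined as follows. The function name and the sample endpoint are hypothetical, and Incorta builds the connection for you from the data source dialog.

```shell
# Builds a JDBC URL from the account endpoint ($1) and access key ($2).
# The URL layout is an assumption about the CData driver, not taken from this page.
build_cosmosdb_jdbc_url() {
  printf 'jdbc:cosmosdb:AccountEndpoint=%s;AccountKey=%s;' "$1" "$2"
}

# Hypothetical values: use the URI and primary key shown under Settings → Keys.
build_cosmosdb_jdbc_url "https://myaccount.documents.azure.com:443/" "REPLACE_WITH_PRIMARY_KEY"
echo
```

Treat the access key as a secret: avoid committing it to scripts or printing it in shared logs.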

The Cosmos DB connector installation

The Cosmos DB connector requires that you deploy the cdata.jdbc.cosmosdb.jar file to the Incorta Node hosts of the Analytics Service and the Loader Service. A system administrator with root access to the host can deploy the JAR file. A CMC Administrator can restart the Incorta cluster.

Here are the steps to copy the JAR file to a standalone Incorta cluster:

Secure copy the cdata.jdbc.cosmosdb.jar file to the host. Here is an example using scp:

INCORTA_NODE_HOST=100.101.102.103
cd ~/Downloads
scp -i ~/.ssh/host_pemkey.pem  cdata.jdbc.cosmosdb.jar incorta@${INCORTA_NODE_HOST}:/tmp/

Secure shell into the host:

ssh -i ~/.ssh/host_pemkey.pem incorta@${INCORTA_NODE_HOST}

Switch to the incorta user, and then copy cdata.jdbc.cosmosdb.jar to the IncortaNode/runtime/lib/ directory in a bash shell:

sudo su incorta
INCORTA_INSTALLATION_PATH=/home/incorta/IncortaAnalytics

cp /tmp/cdata.jdbc.cosmosdb.jar $INCORTA_INSTALLATION_PATH/IncortaNode/runtime/lib/cdata.jdbc.cosmosdb.jar
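Before restarting the cluster, it can help to confirm that the JAR file landed in the expected directory. The following is a small sketch, not part of the official procedure. The function name `jar_is_deployed` is hypothetical, and the default installation path is the one used in the example above.

```shell
# Returns 0 when the driver JAR is present in the given lib directory ($1).
jar_is_deployed() {
  [ -f "$1/cdata.jdbc.cosmosdb.jar" ]
}

# Fall back to the example installation path when the variable is not set.
LIB_DIR="${INCORTA_INSTALLATION_PATH:-/home/incorta/IncortaAnalytics}/IncortaNode/runtime/lib"

if jar_is_deployed "$LIB_DIR"; then
  echo "driver JAR found in $LIB_DIR"
else
  echo "driver JAR missing from $LIB_DIR"
fi
```

Run the check on every Incorta Node host that runs the Analytics Service or the Loader Service.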

Here are the steps to restart the standalone Incorta cluster:

Sign in to the Cluster Management Console (CMC) as the CMC Administrator.
In the Navigation bar, select Clusters.
Select the cluster name in the list.
In Details, select Restart.

Steps to connect a Cosmos DB data source and Incorta

To connect a Cosmos DB data source and Incorta, here are the high-level steps, tools, and procedures:

Create an external data source
Create a schema with the Schema Wizard
or, Create a schema with the Schema Designer
Load the schema
Explore the schema

Create an external data source

A Tenant Administrator (Super User), a user that belongs to a group with the SuperRole role, or a user that belongs to a group with the Schema Manager role can create an external data source for a given tenant.

Here are the steps to create an external data source with the Cosmos DB connector:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Data.
In the Action bar, select + New → Add Data.
In the Choose a Data Source dialog, in Query Service, select Cosmos DB.
In the New Data Source dialog, specify the applicable connector properties.
To test, select Test Connection.
Select Ok to save your changes.

The Cosmos DB connector properties

Here are the properties for the Cosmos DB connector:

Property	Control	Description
Data Source Name	text box	Enter the name of the data source
Account End Point	text box	Enter the Cosmos DB account URI
Account Key	text box	Enter the primary key of the Cosmos DB account
Generate Schema Files	drop down list	Select when schema files should be generated and saved. Available options are: ● Never (default) ● OnUse ● OnStart ● OnCreate
Schema Location	text box	Select a Generate Schema Files option other than Never to enable this property. Enter the path to the directory that you want to use to save these files. The default is `./home/incorta/schema`.
Flatten Objects	toggle	Optionally, enable this property to flatten object properties into a series of columns. Otherwise, one column of type string will represent the entire object with its properties. For details, refer to Automatic Schema Discovery → Flattening Objects.
Flatten Arrays	text box	Optionally, enter the number of columns (elements) you want to return from flattened array values. For details, refer to Automatic Schema Discovery → Flattening Arrays.
Use Connection Pooling	toggle	Optionally, enable this property to set the related properties for connection pooling. For more information, refer to Connection Pooling and Connection String Options.

Create a schema with the Schema Wizard

Here are the steps to create a Cosmos DB schema with the Schema Wizard:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Schema.
In the Action bar, select + New → Schema Wizard.
In (1) Choose a Source, specify the following:
- For Enter a name, enter the physical schema name.
- For Select a Datasource, select the Cosmos DB data source.
- Optionally, enter a description.
In the Schema Wizard footer, select Next.
In (2) Manage Tables, in the Data Panel, first select the name of the Data Source, and then check the Select All checkbox.
In the Schema Wizard footer, select Next.
In (3) Finalize, in the Schema Wizard footer, select Create Schema.

Create a schema with the Schema Designer

Here are the steps to create a Cosmos DB schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Schema.
In the Action bar, select + New → Create Schema.
In the Create Schema dialog, in Name, specify the physical schema name, and then select Save.
In Start adding tables to your schema, select CosmosDB.
In the Data Source dialog, specify the Cosmos DB table data source properties.
Select Add.
In the Table Editor, in the Table Summary section, enter the table name.
To save your changes, in the Action bar, select Done.

The Cosmos DB table data source properties

For a physical schema table in Incorta, you can define the following Cosmos DB specific data source properties:

Property	Control	Description
Type	drop down list	The default is Cosmos DB
Data Source	drop down list	Select the Cosmos DB external data source
Incremental	toggle	Enable this property to configure the incremental load for this physical schema table. See Types of Incremental Load.
Incremental Extract Using	drop down list	Enable Incremental to configure this property. Select between Last Successful Extract Time and Maximum Value of a Column. See Types of Incremental Load.
Incremental Column	drop down list	Enable Incremental and select Maximum Value of a Column to configure this property. Select the column to check its maximum value. The Loader will track and use the greatest value or most recent timestamp for each incremental load operation.
Query	text box	Enter the SQL SELECT query to retrieve data from the Cosmos DB dataset
Update Query	text box	Enable Incremental to configure this property. Enter the SQL SELECT query to use during an incremental load. The query and the update query should have the same structure, that is, the same selected columns.
Incremental Field Type	drop down list	Enable Incremental to configure this property. Select the format of the incremental field. The available options vary according to the incremental type and the selected column, if any.
Fetch Size	text box	For performance improvement, define the number of records that will be retrieved from the dataset in each batch until all records are retrieved. The default is 5000.
Chunking Method	drop down list	Select the chunking method to allow for the parallel extraction of large tables. The default is No Chunking. There are two chunking methods: ● By Size of Chunking (Single Table) ● By Date/Timestamp
Chunk Size	text box	Select By Size of Chunking for the Chunking Method to set this property. Enter the number of records to extract in each chunk in relation to the Fetch Size. The default is 3 times the fetch size.
Order Column	drop down list	Select By Size of Chunking for the Chunking Method to set this property. Select a column in the source table you want to order by before chunking. It is typically an ID column and it must be numeric.
Upper Bound for Order Column	text box	Optionally, enter the maximum value for the order column
Lower Bound for Order Column	text box	Optionally, enter the minimum value for the order column
Order Column [Date/Timestamp]	drop down list	Select By Date/Timestamp for the Chunking Method to set this property. Select a column in the source table you want to order by before chunking. It should be a Date/Timestamp column.
Chunk Period	drop down list	Select the chunk period that will be used in dividing chunks: ● Daily ● Weekly ● Monthly ● Yearly ● Custom
Number of days	text box	Select Custom for the Chunk Period to set this property. Enter the chunking period in days.
Callback	toggle	Enable this property to turn on the post-extraction callback for the data source dataset(s). The callback invokes a callback URL with parameters that contain details about the load job.
Callback URL	text box	Enable Callback to configure this property. Specify the callback URL.
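To make the relationship between Fetch Size, Chunk Size, and the order column bounds concrete, here is an illustrative sketch of how a numeric range could be split into chunks. It only demonstrates the arithmetic under the defaults listed in the table. The function name `print_chunks` and the bounds are hypothetical, and the way Incorta actually partitions the extraction may differ.

```shell
# Prints inclusive "start-end" ranges that cover lower ($1) through upper ($2)
# in steps of the chunk size ($3).
print_chunks() {
  start=$1
  while [ "$start" -le "$2" ]; do
    end=$((start + $3 - 1))
    if [ "$end" -gt "$2" ]; then
      end=$2
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
}

fetch_size=5000                   # default Fetch Size
chunk_size=$((fetch_size * 3))    # default Chunk Size: 3 times the fetch size

# Hypothetical lower and upper bounds for a numeric ID order column.
print_chunks 1 40000 "$chunk_size"
# prints:
# 1-15000
# 15001-30000
# 30001-40000
```

With these values, each chunk of 15000 records would be retrieved in three batches of 5000.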

View the schema diagram with the Schema Diagram Viewer

Here are the steps to view the physical schema diagram using the Schema Diagram Viewer:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Schema.
In the list of schemas, select the Cosmos DB schema.
In the Schema Designer, in the Action bar, select Diagram.

Load the schema

Here are the steps to perform a Full Load of the Cosmos DB schema using the Schema Designer:

Sign in to the Incorta Direct Data Platform™.
In the Navigation bar, select Schema.
In the list of schemas, select the Cosmos DB schema.
In the Schema Designer, in the Action bar, select Load → Load Now → Full.
To review the load status, in Last Load Status, select the date.

Explore the schema

With the full load of the Cosmos DB schema complete, you can use the Analyzer to explore the schema, create your first insight, and save the insight to a new dashboard.

To open the Analyzer from the schema, follow these steps:

In the Navigation bar, select Schema.
In the Schema Manager, in the List view, select the Cosmos DB schema.
In the Schema Designer, in the Action bar, select Explore Data.

For more information about how to use the Analyzer to create insights, see Analyzer and Visualizations.

Additional Considerations

Types of Incremental Load

You can enable incremental load for a Cosmos DB table. There are two types of incremental extracts: Last Successful Extract Time and Maximum Value of a Column.

Last Successful Extract Time

With this type, the Loader Service fetches the updates made since the last time the table was loaded. This is determined by the difference between the current time and the database timestamp.

Maximum Value of a Column

The column-based strategy depends on an extra column called Incremental Column in each table. The Cosmos DB connector supports both timestamp and numeric columns. A timestamp column is of the type date or timestamp. A numeric column is of the type int or long.
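As a rough illustration of the column-based strategy with a numeric column, the sketch below finds the greatest value seen in a previous load and uses it as the lower limit for the next one. The `InvoiceId` column, its values, and the `max_value` helper are hypothetical. The Loader tracks this value for you, so this only shows the idea.

```shell
# Prints the greatest numeric value read from standard input, one value per line.
max_value() {
  sort -n | tail -n 1
}

# Hypothetical InvoiceId values extracted during the previous load.
last_max=$(printf '%s\n' 1001 1007 1003 | max_value)

# The next incremental load would only need rows above that value.
echo "SELECT * FROM invoices WHERE InvoiceId > $last_max"
# prints: SELECT * FROM invoices WHERE InvoiceId > 1007
```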

Important

Changing the incremental load strategy requires a full load to ensure data integrity.

Incremental Load Example

In this example, the invoices table is loaded incrementally with the Last Successful Extract Time strategy. To support this strategy, the table must contain a column of the type Date or Timestamp. In this case, the name of the date column is ModifiedDate and the format of the column is Timestamp.

Here are the data source property values for this example:

Incremental Load: enabled
Query: SELECT * FROM invoices
Update Query: SELECT * FROM invoices WHERE ModifiedDate > ?

Note

? is a variable in the update query that contains the last schema refresh date.

Incremental Field Type: Timestamp

Note

When defining an update query for an incremental load, you are able to use the ? reference character. The ? character will be replaced with the last incremental reference to construct a valid query to the database. The ? reference character is not valid in a standard query.
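The substitution described in the notes above can be pictured with a short sketch. It is only an illustration: the helper name `resolve_update_query`, the sample timestamp, and the quoted literal format are assumptions, and the Loader may supply the value to the driver in a different way.

```shell
# Replaces the first ? in an update query ($1) with a quoted reference value ($2).
resolve_update_query() {
  printf '%s\n' "$1" | sed "s/?/'$2'/"
}

# Hypothetical time of the last successful extract.
resolve_update_query 'SELECT * FROM invoices WHERE ModifiedDate > ?' '2023-01-15 08:30:00'
# prints: SELECT * FROM invoices WHERE ModifiedDate > '2023-01-15 08:30:00'
```

The standard Query stays as SELECT * FROM invoices, without the ? character.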

Valid Query Types

When creating a query for the Cosmos DB connector, only SELECT statements are valid.
