Concepts → Materialized View

About a Materialized View

A materialized view (MV) is an entity object in a physical schema that defines an Apache Spark job. The definition of the job consists of a script in a Spark compatible language. The job typically reads data for existing entity objects in a physical schema using the related Apache Parquet files in shared storage. During a load job of the parent physical schema or the materialized view, the related Spark job saves the resulting dataframe to shared storage in a Parquet file.

Important

In order to both successfully define the script for a materialized view and to load a materialized view, Apache Spark must be running and available in the Incorta cluster.

A materialized view depends on existing entity objects in a physical schema (and in certain cases, can include runtime business views). This dependency defines a materialized view as a type of derived table. As a derived table, a materialized view processes already extracted data and typically functions to transform and enrich data within a given physical schema.

Object data lineage

Starting with 2023.7.0, you can view the data lineage of materialized views in the Data Lineage Viewer v2 that displays a diagram with the entities referenced in the MV (upstream lineage) and entities where the MV is referenced (downstream lineage). Additionally, you can view the data lineage per column or formula column in the MV.

Properties of a Materialized View

The properties of a materialized view are:

Name
Data Source(s)
Columns
Formula Columns
Runtime Security Filters
Parent Joins
Child Joins
Load Filter

Name

A materialized view name must adhere to the following name validation rules:

Must be unique to the physical schema
Must be between 1 and 128 characters in length
Must begin with an alpha character, lower or upper case, (_), or ($)
After the first alpha character, can contain zero or more alphanumeric characters in lower, upper, or mixed case
After the first alpha character, can contain zero or more underscore (_) or Dollar sign ($) characters
Besides underscore (_) and Dollar sign ($), cannot contain special characters, symbols, or spaces

Important

The name for a materialized view is immutable.

Data Source

The data source for a materialized view is Apache Spark. A materialized view only supports a single-source dataset. A schema developer can combine multiple dataframes with the script of a materialized view.

Starting with the 2021.4.2 release, you can define the number of the data frame partitions resulting from loading a materialized view. In the Data Source dialog, you can add the spark.dataframe.partitions property and set the number of the data frame partitions as appropriate. If required, the Loader Service will perform either a coalesce or reparation operation to create the required number of data frame partitions.

Type

The data source type should be set as Materialized View. Changing the data source type will change the type of table created.

Language

Following are the languages available for writing your script:

Spark SQL
Spark Python
Spark Scala
Incorta PostgreSQL

Note

Only when using Incorta PostgreSQL are you able to reference one or more business schema views or aliases.

Note

Starting with the 2021.4.2 release, you can generate a read-only Spark SQL query that is equivalent to your PostgreSQL query, and then copy it by selecting Copy to Clipboard in the Edit Query dialog of a PostgreSQL Materialized View table. Also, after editing a PostgreSQL query, you can select Refresh to show the updated Spark SQL query.

Incremental

A materialized view supports Incremental loads. You can enable incremental consumption of rows from the data source dialog. When enabled, the Incremental Script dialog will become available. A schema developer must define the materialized view script for the incremental load.

Warning

An incremental load script should differ from the normal script. An incremental load job necessitates background compaction of Parquet files. A normal script run as an incremental load job may generate an unnecessary workload and Parquet compaction, causing load performance issues.

For more details, see Additional Considerations → Incremental loads strategies

Script

You write the script for a materialized view using two different editor options:

Edit Query

The Edit Query window will provide syntax highlighting and formatting capabilities. Incorta will validate the script when completing the data source configuration.

Edit in Notebook

The Notebook Editor allows you to iteratively build, validate, and test queries within the editor. You can enable the Notebook Editor from the Cluster Management Console (CMC). The Notebook Editor is an Incorta Labs feature in your tenant configurations.

Note

The Notebook Editor is not available when using Incorta PostgreSQL.

IncortaML

From the Notebook Editor, you can use IncortaML (Incorta Machine Learning) to add machine learning models to your materialized view. You can utilize IncortaML to train models using regression and classification modeling.

Properties

When you create a materialized view, you can define the Spark properties for the load job. A CMC Administrator sets the default properties for a materialized view in the tenant’s configuration.

Starting with the 2022.1.0 release, you can submit Spark jobs to the Spark cluster by setting the cluster deploy mode at the MV level for the spark.submit.deployMode property in the Data Source dialog, or at the server level in Spark Integration configurations in the CMC. If the MV deploy mode is defined, it overrides the server configuration, otherwise, the server deploy mode is applied.

Recommendation

You should only edit the Spark properties of the materialized view to manage specific memory constraints and with an understanding of the Spark configuration.

Following are configurable Spark properties:

Property	Default	Description	Notes
spark.driver.memory	1g	Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in MiB unless otherwise specified (e.g. 1g, 2g). This is the memory allocated for the Spark logical plan and any Spark actions that transfer the data to the driver.
spark.executor.memory	1g	The amount of memory to use per executor process, in MiB unless otherwise specified.
spark.cores.max	No default	The maximum amount of CPU cores to request for the application from the cluster.
spark.executor.cores	The default is 1 with Spark configured using YARN.	The number of cores to use on each executor.	Set equal to spark.cores.max.
spark.sql.shuffle.partitions	200	Configures the number of partitions to use when shuffling data for joins or aggregations.	Initially set the shuffle partition equal to cores.max. As the data increases, you can increase the partitions.
spark.driver.maxResultSize	1g	Limit of the total size of serialized results of all partitions for each Spark action in bytes.	Set to at least 1M or 0 for unlimited. Spark will abort the job if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors.
spark.memory.fraction	0.6	Fraction of heap space used for execution and storage. The lower the value, the more frequently spills and cached data eviction can occur.
spark.serializer	org.apache.spark.serializer. JavaSerializer	Used for serializing objects that will be sent over the network or need to be cached in serialized form.	Do not configure this parameter, if you are using RDDs/DataFrame over custom data types that do not have proper implementation for Kryo serialization (e.g. avoid using this parameter for machine learning Materialized Views if you have RDDs of custom data types "non-primitive")

Columns

Once you create a script, the referenced columns will be visible in the Table Editor. See Concepts → Materialized View Column for information on the column properties.

Referenceable Columns

Following are the possible types of referenceable columns in a materialized view script:

Note

Spark Referenceable refers to the following languages used with Spark: Spark SQL, Spark Python, and Spark Scala.

Table or View Type	Column Type	Spark Referenceable	Incorta PostgreSQL Referenceable	Note
Alias	Data-backed column	No	Yes
Alias	Formula column	No	Yes
Business Schema View	Data-backed column	No	Yes
Business Schema View	Formula column	No	Yes	Session Variables are not referenceable.
Incorta Analyzer Table	Data-backed column	Yes	Yes	Only Incorta Analyzer tables in another physical schema
Incorta Analyzer Table	Formula column	Yes	Yes	Only Incorta Analyzer tables in another physical schema
Incorta SQL Table	Data-backed column	Yes	Yes	Only Incorta SQL tables in another physical schema. The Incorta SQL Table is an Incorta Labs feature and is enabled in the CMC.
Incorta Analyzer View	Data-backed column	No	No
Incorta Analyzer View	Formula column	No	No
Materialized View	Data-backed column	Yes	Yes
Materialized View	Formula column	No	Yes
Physical Schema Table	Data-backed column	Yes	Yes
Physical Schema Table	Formula column	No	Yes

Warning

A materialized view column cannot reference a column in an Incorta Analyzer table or Incorta SQL table that exists in the same physical schema as the materialized view.

In addition, a materialized view column cannot reference an Incorta Analyzer table column or Incorta SQL table column that references a business schema view.

Formula Columns

You create a materialized view formula column using the Formula Builder. Materialized view formula columns have several restrictions related to the materialized view’s join configurations. For more information see Concepts → Materialized View Formula Column.

Note

To add a formula column to a materialized view that you have just created, you must save the materialized view first.

Global Variables

Starting the 2022.1.0 release, you can reference global variables in MVs when you use the Notebook Editor or the Query Builder to add or edit the MV script (or incremental script) using any supported language.

To reference a global variable, precede the variable name with a double dollar sign, for example, $$gvar_name. When referencing string, date, and timestamp global variables, you must use single or double quotes depending upon the MV script language. For Spark Scala, use double quotes, for other languages, use single quotes, for example '$$string_gvar_name'.

Important

The Global variables referenced in MVs are evaluated when validating the MV data source and when loading them. Thus, when you edit the value of a global variable, you must perform a full load of the related MVs.

Runtime Security Filters

As a schema developer, you can apply one or more runtime security filters to restrict row access to a materialized view. Any dependent object such as a runtime business view, dashboard insight, dashboard filter, or internal session variable will automatically apply the runtime security filter. To learn more about a physical schema table runtime security filters, see Concepts → Runtime Security Filter.

Parent Joins

The materialized view is the parent table in a join relationship to another entity object in the same or different physical schema. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

A parent table typically has one key column or composite key columns. In both cases, the key column functions as a primary key. Where there is a key column, the join condition typically contains an equals = join operator.

It is possible to create a join from a child to parent without using a defined key column in the parent table. A table without a key column allows for duplicate values. If the parent table contains duplicate non-key rows, the Loader Service randomly selects a parent table row based on the non-key join. To help simulate a key for the parent table, consider using one or more conditions in the join definition such as > greater than and < less than. Multiple join operators may significantly narrow join values from the parent table so as to create row uniqueness. A join condition with multiple join operators requires internal sorting which may require high calculation costs for the join path.

Recommendation

Whenever possible, create a key column or a composite key column for a parent table. Avoid making a join to a parent table that does not have a key column or composite key columns.

Warning

Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.

Child Joins

The materialized view is the child table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. A child table typically has a column that serves as a foreign key. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

Join inheritance

Materialized views do not inherit join relationships from source entities and will need to have their own join relationships created. The data sources available to a materialized view formula column are directly related to the child joins of the materialized view.

Warning

Load Filter

For a materialized view, you can specify a load filter in the Table Editor. A load filter reduces the number of rows that the Analytics Service loads into memory from shared storage, but does not affect how many rows the Loader Service extracts from the data source of the table.

Additional Considerations

Script referencing a schema entity

A script referencing a column in a physical schema or business schema entity must identify the schema and entity. As an example, a SQL query would identify the schema and entity as SCHEMA.ENTITY.

Incremental load strategies

In incremental load jobs, the Loader Service instructs Spark to fetch only new and updated records. Starting with the 2021.4.3 release, MVs support two different incremental load strategies: the MV last successful transformation time and the maximum value of a column in the MV.

The last successful transformation time is the only strategy available in releases before 2021.4.3. The Loader Service evaluates the ? in the Incremental Script query to the time of the last successful transformation of the MV. In this strategy, some records may be skipped in the next incremental load jobs if the MV’s source object is updated during the transformation time.

The maximum value of a column is the newly supported strategy in MVs. This strategy depends on comparing the maximum available value of a column in the MV with the values of a column in the MV’s source object. You specify this MV column in the MV data source properties. This column can be a timestamp/date or numeric column. The Loader Service resolves the ? in the Incremental Script to the maximum value available in the selected MV column before passing the script to Spark.

Here is an example of an MV script and incremental script using Spark SQL:

Script: SELECT * FROM Sales.invoices
Incremental Script: SELECT * FROM Sales.invoices WHERE Sales.invoices.ModifiedDate > ?

And the same example using Spark Python:

Script:

df = read("Sales.invoices")
save(df)

Incremental Script:

df = read("Sales.invoices")
df = df.filter("ModifiedDate > ?")
save(df)

Query methods in Notebook editor

Starting in 2022.10.0, you can query the physical schema and business schema from the Notebook editor using the following methods.

Method	Parameters	Description
incorta_sql()	String	Using an SQL query as a string, you can create a query and extract data directly from the physical schema.
incorta_sql_pg()	String	Using a PostreSQL query as a string, you can create a query to extract data directly from the physical schema or business schema.

Note

Take special note the incorta_sql_pg() method takes PostreSQL queries and can be used to extract data from both the physical and business schema.

Script restrictions

Schema restrictions

When you create your materialized view using Incorta PostgreSQL, you are able to reference a business schema view or an alias. These options are not available when using the other provided spark languages.

Session variables

A materialized view cannot reference business schema entities that contain session variables. Attempting to do so will result in an error when attempting to validate the script.

Spark jobs concerning a materialized view

When you modify the Spark properties of a materialized view, you may need to see the spark job execution of the materialized view. Any Spark job changes should be made in coordination with your System Administrator.

You can view the Spark job processes at the following URL: http://<server address>:9091

When you validate your script in the query editor, the materialized view will display as one spark job.

While you are running the Notebook Editor in a materialized view, Spark will display the Notebook Editor as a running application. Once you finish using the Notebook Editor and you validate the materialized view script, you will see a second job run when the materialized view is validated.

Concepts → Materialized View

About a Materialized View

Properties of a Materialized View

Name

Data Source

Type

Language

Incremental

Script

Edit Query

Edit in Notebook

IncortaML

Properties

Columns

Referenceable Columns

Formula Columns

Global Variables

Runtime Security Filters

Parent Joins

Child Joins

Join inheritance

Load Filter

Additional Considerations

Script referencing a schema entity

Incremental load strategies

Query methods in Notebook editor

Script restrictions

Schema restrictions

Session variables

Spark jobs concerning a materialized view

Content