Concepts → Materialized View
About a Materialized View
A materialized view (MV) is an entity object in a physical schema that defines an Apache Spark job. The definition of the job consists of a script in a Spark-compatible language. The job typically reads data from existing entity objects in a physical schema using the related Apache Parquet files in shared storage. During a load job of the parent physical schema or of the materialized view itself, the related Spark job saves the resulting dataframe to shared storage as a Parquet file.
To successfully define the script for a materialized view and to load one, Apache Spark must be running and available in the Incorta cluster.
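To illustrate the shape of such a job, here is a minimal sketch of a Spark Python materialized view script. It uses the read and save helpers that Incorta provides to MV scripts (shown again in the incremental load examples later in this document); the Sales.invoices table comes from those examples, and the projected column names other than ModifiedDate are hypothetical:

```python
# Read the Parquet-backed data of an existing physical schema table
# into a Spark dataframe (read() is provided by the Incorta MV runtime).
df = read("Sales.invoices")

# Apply any transformation or enrichment, for example a simple projection.
df = df.select("InvoiceId", "CustomerId", "Amount", "ModifiedDate")

# Persist the resulting dataframe to shared storage as Parquet
# (save() is provided by the Incorta MV runtime).
save(df)
```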
A materialized view depends on existing entity objects in a physical schema (and in certain cases, can include runtime business views). This dependency defines a materialized view as a type of derived table. As a derived table, a materialized view processes already extracted data and typically functions to transform and enrich data within a given physical schema.
Starting with 2023.7.0, you can view the data lineage of materialized views in the Data Lineage Viewer v2, which displays a diagram of the entities the MV references (upstream lineage) and the entities that reference the MV (downstream lineage). Additionally, you can view the data lineage per column or formula column in the MV.
Properties of a Materialized View
The properties of a materialized view are:
- Name
- Data Source(s)
- Columns
- Formula Columns
- Runtime Security Filters
- Parent Joins
- Child Joins
- Load Filter
Name
A materialized view name must adhere to the following name validation rules:
- Must be unique to the physical schema
- Must be between 1 and 128 characters in length
- Must begin with an alpha character, lower or upper case, an underscore (_), or a dollar sign ($)
- After the first alpha character, can contain zero or more alphanumeric characters in lower, upper, or mixed case
- After the first alpha character, can contain zero or more underscore (_) or dollar sign ($) characters
- Besides underscore (_) and dollar sign ($), cannot contain special characters, symbols, or spaces
The name for a materialized view is immutable.
Data Source
The data source for a materialized view is Apache Spark. A materialized view supports only a single source dataset. However, a schema developer can combine multiple dataframes within the script of a materialized view.
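For illustration, a minimal Spark Python sketch that combines two dataframes into the single dataset of the MV; the Sales.customers table and the CustomerId join column are hypothetical:

```python
# Read two existing physical schema tables into dataframes
# (read() is provided by the Incorta MV runtime).
invoices = read("Sales.invoices")
customers = read("Sales.customers")

# Combine the dataframes with a standard Spark join.
combined = invoices.join(customers, on="CustomerId", how="left")

# The single saved dataframe becomes the materialized view's dataset.
save(combined)
```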
Starting with the 2021.4.2 release, you can define the number of dataframe partitions that result from loading a materialized view. In the Data Source dialog, you can add the spark.dataframe.partitions property and set the number of dataframe partitions as appropriate. If required, the Loader Service performs either a coalesce or repartition operation to create the required number of dataframe partitions.
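For context, these two operations correspond to the standard Spark dataframe methods sketched below; the partition count of 8 is only an example:

```python
# coalesce() reduces the number of partitions without a full shuffle;
# it can only decrease the partition count.
df = df.coalesce(8)

# repartition() performs a full shuffle and can increase or decrease
# the partition count.
df = df.repartition(8)
```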
Type
The data source type should be set to Materialized View. Changing the data source type changes the type of table created.
Language
Following are the languages available for writing your script:
- Spark SQL
- Spark Python
- Spark Scala
- Incorta PostgreSQL
Only when using Incorta PostgreSQL are you able to reference one or more business schema views or aliases.
Starting with the 2021.4.2 release, you can generate a read-only Spark SQL query that is equivalent to your PostgreSQL query, and then copy it by selecting Copy to Clipboard in the Edit Query dialog of a PostgreSQL Materialized View table. Also, after editing a PostgreSQL query, you can select Refresh to show the updated Spark SQL query.
Incremental
A materialized view supports incremental loads. You can enable incremental consumption of rows in the Data Source dialog. When enabled, the Incremental Script dialog becomes available. A schema developer must define the materialized view script for the incremental load.
An incremental load script should differ from the normal script: an incremental load job necessitates background compaction of Parquet files, and running the normal script as an incremental load job may generate an unnecessary workload and Parquet compaction, causing load performance issues.
For more details, see Additional Considerations → Incremental load strategies.
Script
You write the script for a materialized view using two different editor options:
Edit Query
The Edit Query window provides syntax highlighting and formatting capabilities. Incorta validates the script when you complete the data source configuration.
Edit in Notebook
The Notebook Editor allows you to iteratively build, validate, and test queries within the editor. The Notebook Editor is an Incorta Labs feature that you enable in the tenant configurations in the Cluster Management Console (CMC).
The Notebook Editor is not available when using Incorta PostgreSQL.
IncortaML
From the Notebook Editor, you can use IncortaML (Incorta Machine Learning) to add machine learning models to your materialized view. You can use IncortaML to train regression and classification models.
Properties
When you create a materialized view, you can define the Spark properties for the load job. A CMC Administrator sets the default properties for a materialized view in the tenant’s configuration.
Starting with the 2022.1.0 release, you can submit Spark jobs to the Spark cluster in cluster deploy mode by setting the spark.submit.deployMode property, either at the MV level in the Data Source dialog or at the server level in the Spark Integration configurations in the CMC. If the MV deploy mode is defined, it overrides the server configuration; otherwise, the server deploy mode applies.
You should only edit the Spark properties of the materialized view to manage specific memory constraints and with an understanding of the Spark configuration.
Following are configurable Spark properties:
Property | Default | Description | Notes |
---|---|---|---|
spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in MiB unless otherwise specified (e.g. 1g, 2g). This is the memory allocated for the Spark logical plan and any Spark actions that transfer the data to the driver. | |
spark.executor.memory | 1g | The amount of memory to use per executor process, in MiB unless otherwise specified. | |
spark.cores.max | No default | The maximum number of CPU cores to request for the application from the cluster. | |
spark.executor.cores | The default is 1 with Spark configured using YARN. | The number of cores to use on each executor. | Set equal to spark.cores.max. |
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. | Initially, set the number of shuffle partitions equal to spark.cores.max. As the data increases, you can increase the partitions. |
spark.driver.maxResultSize | 1g | Limit of the total size of serialized results of all partitions for each Spark action in bytes. | Set to at least 1M or 0 for unlimited. Spark will abort the job if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors. |
spark.memory.fraction | 0.6 | Fraction of heap space used for execution and storage. The lower the value, the more frequently spills and cached data eviction can occur. | |
spark.serializer | org.apache.spark.serializer.JavaSerializer | Used for serializing objects that will be sent over the network or need to be cached in serialized form. | Do not configure this parameter if you are using RDDs/DataFrames over custom data types that do not have a proper implementation for Kryo serialization (for example, avoid using this parameter for machine learning materialized views if you have RDDs of custom, non-primitive data types). |
Columns
Once you create a script, the referenced columns will be visible in the Table Editor. See Concepts → Materialized View Column for information on the column properties.
Referenceable Columns
Following are the possible types of referenceable columns in a materialized view script:
Spark Referenceable refers to the following languages used with Spark: Spark SQL, Spark Python, and Spark Scala.
Table or View Type | Column Type | Spark Referenceable | Incorta PostgreSQL Referenceable | Note |
---|---|---|---|---|
Alias | Data-backed column | No | Yes | |
Alias | Formula column | No | Yes | |
Business Schema View | Data-backed column | No | Yes | |
Business Schema View | Formula column | No | Yes | Session Variables are not referenceable. |
Incorta Analyzer Table | Data-backed column | Yes | Yes | Only Incorta Analyzer tables in another physical schema |
Incorta Analyzer Table | Formula column | Yes | Yes | Only Incorta Analyzer tables in another physical schema |
Incorta SQL Table | Data-backed column | Yes | Yes | Only Incorta SQL tables in another physical schema. The Incorta SQL Table is an Incorta Labs feature and is enabled in the CMC. |
Incorta Analyzer View | Data-backed column | No | No | |
Incorta Analyzer View | Formula column | No | No | |
Materialized View | Data-backed column | Yes | Yes | |
Materialized View | Formula column | No | Yes | |
Physical Schema Table | Data-backed column | Yes | Yes | |
Physical Schema Table | Formula column | No | Yes |
A materialized view column cannot reference a column in an Incorta Analyzer table or Incorta SQL table that exists in the same physical schema as the materialized view.
In addition, a materialized view column cannot reference an Incorta Analyzer table column or Incorta SQL table column that references a business schema view.
Formula Columns
You create a materialized view formula column using the Formula Builder. Materialized view formula columns have several restrictions related to the materialized view’s join configurations. For more information see Concepts → Materialized View Formula Column.
To add a formula column to a materialized view that you have just created, you must save the materialized view first.
Global Variables
Starting with the 2022.1.0 release, you can reference global variables in MVs when you use the Notebook Editor or the Query Builder to add or edit the MV script (or incremental script) using any supported language.
To reference a global variable, precede the variable name with a double dollar sign, for example, $$gvar_name. When referencing string, date, and timestamp global variables, you must use single or double quotes depending on the MV script language: for Spark Scala, use double quotes; for other languages, use single quotes, for example, '$$string_gvar_name'.
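As a minimal Spark Python sketch, assuming a hypothetical string global variable named country_gvar and a hypothetical Country column:

```python
df = read("Sales.invoices")

# Reference the global variable with $$; because this is not Spark Scala,
# the string value is wrapped in single quotes. Incorta evaluates the
# variable when validating the data source and when loading the MV.
df = df.filter("Country = '$$country_gvar'")

save(df)
```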
Global variables referenced in MVs are evaluated when validating the MV data source and when loading the MV. Thus, when you edit the value of a global variable, you must perform a full load of the related MVs.
Runtime Security Filters
As a schema developer, you can apply one or more runtime security filters to restrict row access to a materialized view. Any dependent object, such as a runtime business view, dashboard insight, dashboard filter, or internal session variable, automatically applies the runtime security filter. To learn more about runtime security filters for a physical schema table, see Concepts → Runtime Security Filter.
Parent Joins
The materialized view is the parent table in a join relationship to another entity object in the same or different physical schema. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.
A parent table typically has one key column or composite key columns. In both cases, the key column functions as a primary key. Where there is a key column, the join condition typically contains an equals (=) join operator.
It is possible to create a join from a child to a parent without using a defined key column in the parent table. A table without a key column allows for duplicate values. If the parent table contains duplicate non-key rows, the Loader Service randomly selects a parent table row based on the non-key join. To help simulate a key for the parent table, consider using one or more conditions in the join definition such as greater than (>) and less than (<). Multiple join operators may significantly narrow join values from the parent table so as to create row uniqueness. A join condition with multiple join operators requires internal sorting, which may incur high calculation costs for the join path.
Whenever possible, create a key column or a composite key column for a parent table. Avoid making a join to a parent table that does not have a key column or composite key columns.
Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.
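One way to satisfy this is to cast the join column explicitly in the MV script so that it matches the parent column's data type. A sketch in Spark Python, where the column name and target type (long) are assumptions:

```python
from pyspark.sql.functions import col

df = read("Sales.invoices")

# Cast the child join column to the parent key column's data type
# so the join does not depend on implicit conversion.
df = df.withColumn("CustomerId", col("CustomerId").cast("long"))

save(df)
```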
Child Joins
The materialized view is the child table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. A child table typically has a column that serves as a foreign key. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.
Join inheritance
Materialized views do not inherit join relationships from source entities; you must create join relationships for the materialized view itself. The data sources available to a materialized view formula column are directly related to the child joins of the materialized view.
Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.
Load Filter
For a materialized view, you can specify a load filter in the Table Editor. A load filter reduces the number of rows that the Analytics Service loads into memory from shared storage, but does not affect how many rows the Loader Service extracts from the data source of the table.
Additional Considerations
Script referencing a schema entity
A script referencing a column in a physical schema or business schema entity must identify the schema and the entity. For example, a SQL query identifies the schema and entity as SCHEMA.ENTITY.
Incremental load strategies
In incremental load jobs, the Loader Service instructs Spark to fetch only new and updated records. Starting with the 2021.4.3 release, MVs support two different incremental load strategies: the MV last successful transformation time and the maximum value of a column in the MV.
The last successful transformation time is the only strategy available in releases before 2021.4.3. The Loader Service evaluates the ? in the Incremental Script query to the time of the last successful transformation of the MV. With this strategy, some records may be skipped in subsequent incremental load jobs if the MV's source object is updated during the transformation time.
The maximum value of a column is the newly supported strategy for MVs. This strategy compares the maximum available value of a column in the MV with the values of a column in the MV's source object. You specify this MV column in the MV data source properties. The column can be a timestamp, date, or numeric column. The Loader Service resolves the ? in the Incremental Script to the maximum value available in the selected MV column before passing the script to Spark.
Here is an example of an MV script and incremental script using Spark SQL:
- Script:
SELECT * FROM Sales.invoices
- Incremental Script:
SELECT * FROM Sales.invoices WHERE Sales.invoices.ModifiedDate > ?
And the same example using Spark Python:
- Script:
df = read("Sales.invoices")
save(df)
- Incremental Script:
df = read("Sales.invoices")
df = df.filter("ModifiedDate > ?")
save(df)
Query methods in the Notebook Editor
Starting with 2022.10.0, you can query the physical schema and business schema from the Notebook Editor using the following methods.
Method | Parameters | Description |
---|---|---|
incorta_sql() | String | Using an SQL query as a string, you can create a query and extract data directly from the physical schema. |
incorta_sql_pg() | String | Using a PostgreSQL query as a string, you can create a query to extract data directly from the physical schema or business schema. |
Note that the incorta_sql_pg() method takes PostgreSQL queries and can be used to extract data from both the physical and business schemas.
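A minimal sketch of both methods in a Notebook cell; the Sales.invoices table and the SalesBV.InvoiceSummary business schema view are assumptions for illustration:

```python
# Query a physical schema table with a SQL string.
df = incorta_sql("SELECT * FROM Sales.invoices")

# Query a business schema view with a PostgreSQL string;
# incorta_sql_pg() can reference physical and business schema objects.
df_bv = incorta_sql_pg("SELECT * FROM SalesBV.InvoiceSummary")

# Inspect the results interactively in the notebook.
df.show()
df_bv.show()
```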
Script restrictions
Schema restrictions
When you create your materialized view using Incorta PostgreSQL, you can reference a business schema view or an alias. These options are not available when using the other supported Spark languages.
Session variables
A materialized view cannot reference business schema entities that contain session variables. Attempting to do so results in an error when you validate the script.
Spark jobs concerning a materialized view
When you modify the Spark properties of a materialized view, you may need to review the Spark job execution for the materialized view. Make any Spark job changes in coordination with your System Administrator.
You can view the Spark job processes at the following URL: http://<server address>:9091
When you validate your script in the query editor, the materialized view displays as one Spark job.
While you are running the Notebook Editor for a materialized view, Spark displays the Notebook Editor as a running application. Once you finish using the Notebook Editor and validate the materialized view script, you will see a second Spark job for the validation.