Concepts → Materialized View
About a Materialized View
A materialized view is an entity object in a physical schema that defines an Apache Spark job. The definition of the job consists of a script in a Spark compatible language. The job typically reads data for existing entity objects in a physical schema using the related Apache Parquet files in shared storage. During a load job of the parent physical schema or the materialized view, the related Spark job saves the resulting dataframe to shared storage in a Parquet file.
In order to both successfully define the script for a materialized view and to load a materialized view, Apache Spark must be running and available in the Incorta cluster.
A materialized view depends on existing entity objects in a physical schema (and in certain cases, can include runtime business views). This dependency defines a materialized view as a type of derived table. As a derived table, a materialized view processes already extracted data and typically functions to transform and enrich data within a given physical schema.
Properties of a Materialized View
The properties of a materialized view are:
- Name
- Data Source(s)
- Columns
- Formula Columns
- Runtime Security Filters
- Parent Joins
- Child Joins
- Load Filter
Name
A materialized view name must adhere to the following name validation rules:
- Must be unique to the physical schema
- Must be between 1 and 128 characters in length
- Must begin with an alpha character, lower or upper case, (
_
), or ($
) - After the first alpha character, can contain zero or more alphanumeric characters in lower, upper, or mixed case
- After the first alpha character, can contain zero or more underscore (
_
) or Dollar sign ($
) characters - Besides underscore (
_
) and Dollar sign ($
), cannot contain special characters, symbols, or spaces
The name for a materialized view is immutable.
Data Source
The data source for a materialized view is Apache Spark. A materialized view only supports a single-source dataset. A schema developer can combine multiple dataframes with the script of a materialized view.
Type
The data source type should be set as Materialized View. Changing the data source type will change the type of table created.
Language
Following are the languages available for writing your script:
- Spark SQL
- Spark Python
- Spark Scala
- Spark R
- Incorta PostgreSQL
Only when using Incorta PostgreSQL are you able to reference one or more business schema views or aliases.
Incremental
A materialized view supports Incremental loads. You can enable incremental consumption of rows from the data source dialog. When enabled, the Incremental Script dialog will become available. A schema developer must define the materialized view script for the incremental load.
An incremental load script should differ from the normal script. An incremental load job necessitates background compaction of Parquet files. A normal script run as an incremental load job may generate an unnecessary workload and Parquet compaction, causing load performance issues.
Script
You write the script for a materialized view using two different editor options:
Edit query
The Edit Query window will provide syntax highlighting and formatting capabilities. Incorta will validate the script when completing the data source configuration.
Edit in notebook
The notebook editor allows you to iteratively build, validate, and test queries within the editor. You can enable the notebook editor from the Cluster Management Console (CMC). The notebook editor is an Incorta Labs feature in your tenant configurations.
The notebook editor is not available when using Incorta PostgreSQL.
IncortaML
From the notebook editor, you can use IncortaML (Incorta Machine Learning) to add machine learning models to your materialized view. You can utilize IncortaML to train models using regression and classification modeling.
Properties
When you create a materialized view, you can define the Spark properties for the load job. A CMC Administrator sets the default properties for a materialized view in the tenant’s configuration.
You should only edit the Spark properties of the materialized view to manage specific memory constraints and with an understanding of the Spark configuration.
Following are configurable Spark properties:
Property | Default | Description | Notes |
---|---|---|---|
spark.driver.memory | 1g | Amount of memory to use for the driver process, i.e. where SparkContext is initialized, in MiB unless otherwise specified (e.g. 1g, 2g). This is the memory allocated for the Spark logical plan and any Spark actions that transfer the data to the driver. | |
spark.executor.memory | 1g | The amount of memory to use per executor process, in MiB unless otherwise specified. | |
spark.cores.max | No default | The maximum amount of CPU cores to request for the application from the cluster. | |
spark.executor.cores | The default is 1 with Spark configured using YARN. | The number of cores to use on each executor. | Set equal to spark.cores.max. |
spark.sql.shuffle.partitions | 200 | Configures the number of partitions to use when shuffling data for joins or aggregations. | Initially set the shuffle partition equal to cores.max. As the data increases, you can increase the partitions. |
spark.driver.maxResultSize | 1g | Limit of the total size of serialized results of all partitions for each Spark action in bytes. | Set to at least 1M or 0 for unlimited. Spark will abort the job if the total size is above this limit. Having a high limit may cause out-of-memory errors in the driver (depending on spark.driver.memory and memory overhead of objects in JVM). Setting a proper limit can protect the driver from out-of-memory errors. |
spark.memory.fraction | 0.6 | Fraction of heap space used for execution and storage. The lower the value, the more frequently spills and cached data eviction can occur. | |
spark.serializer | org.apache.spark.serializer. JavaSerializer | Used for serializing objects that will be sent over the network or need to be cached in serialized form. | Do not configure this parameter, if you are using RDDs/DataFrame over custom data types that do not have proper implementation for Kryo serialization (e.g. avoid using this parameter for machine learning Materialized Views if you have RDDs of custom data types "non-primitive") |
Columns
Once you create a script, the referenced columns will be visible in the Table Editor. See Concepts → Materialized View Column for information on the column properties.
Referenceable Columns
Following are the possible types of referenceable columns in a materialized view script:
Spark Referenceable refers to the following languages used with Spark: Spark SQL, Spark Python, Spark Scala, and Spark R.
Table or View Type | Column Type | Spark Referenceable | Incorta PostgreSQL Referenceable | Note |
---|---|---|---|---|
Alias | Data-backed column | No | Yes | |
Alias | Formula column | No | Yes | |
Business Schema View | Data-backed column | No | Yes | |
Business Schema View | Formula column | No | Yes | Session Variables are not referenceable. |
Incorta Analyzer Table | Data-backed column | Yes | Yes | Only Incorta Analyzer tables in another physical schema |
Incorta Analyzer Table | Formula column | Yes | Yes | Only Incorta Analyzer tables in another physical schema |
Incorta SQL Table | Data-backed column | Yes | Yes | Only Incorta SQL tables in another physical schema. The Incorta SQL Table is an Incorta Labs feature and is enabled in the CMC. |
Incorta View | Data-backed column | No | No | |
Incorta View | Formula column | No | No | |
Materialized View | Data-backed column | Yes | Yes | |
Materialized View | Formula column | No | Yes | |
Physical Schema Table | Data-backed column | Yes | Yes | |
Physical Schema Table | Formula column | No | Yes |
Before the 5.1.2 release, a materialized view column cannot reference a column in an Incorta Analyzer table or Incorta SQL table. The 5.1.2 release introduced saving these tables to parquet files rather than snapshot ddm files. This allows Spark to have access to the data of these tables, and accordingly, a materialized view column can reference columns in these Incorta tables. However, a materialized view column cannot reference a column in an Incorta Analyzer table or Incorta SQL table that exists in the same physical schema as the materialized view.
In addition, a materialized view column cannot reference an Incorta Analyzer table column or Incorta SQL table column that references a business schema view.
Formula Columns
You create a materialized view formula column using the Formula Builder. Materialized view formula columns have several restrictions related to the materialized view’s join configurations. For more information see Concepts → Materialized View Formula Column.
To add a formula column to a materialized view that you have just created, you must save the materialized view first.
Runtime Security Filters
As a schema developer, you can apply one or more runtime security filters to restrict row access to a materialized view. Any dependent object such as a runtime business view, dashboard insight, dashboard filter, or internal session variable will automatically apply the runtime security filter. To learn more about a physical schema table runtime security filters, see Concepts → Runtime Security Filter.
Parent Joins
The materialized view is the parent table in a join relationship to another entity object in the same or different physical schema. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.
A parent table typically has one key column or composite key columns. In both cases, the key column functions as a primary key. Where there is a key column, the join condition typically contains an equals =
join operator.
It is possible to create a join from a child to parent without using a defined key column in the parent table. A table without a key column allows for duplicate values. If the parent table contains duplicate non-key rows, the Loader Service randomly selects a parent table row based on the non-key join. To help simulate a key for the parent table, consider using one or more conditions in the join definition such as >
greater than and <
less than. Multiple join operators may significantly narrow join values from the parent table so as to create row uniqueness. A join condition with multiple join operators requires internal sorting which may require high calculation costs for the join path.
Whenever possible, create a key column or a composite key column for a parent table. Avoid making a join to a parent table that does not have a key column or composite key columns.
Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.
Child Joins
The materialized view is the child table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. A child table typically has a column that serves as a foreign key. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.
Join inheritance
Materialized views do not inherit join relationships from source entities and will need to have their own join relationships created. The data sources available to a materialized view formula column are directly related to the child joins of the materialized view.
Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.
Load Filter
For a materialized view, you can specify a load filter in the Table Editor. A load filter reduces the number of rows that the Analytics Service loads into memory from shared storage, but does not affect how many rows the Loader Service extracts from the data source of the table.
Additional Considerations
Script referencing a schema entity
A script referencing a column in a physical schema or business schema entity must identify the schema and entity. As an example, a SQL query would identify the schema and entity as SCHEMA.ENTITY
.
Script restrictions
Schema restrictions
When you create your materialized view using Incorta PostgreSQL, you are able to reference a business schema view or an alias. These options are not available when using the other provided spark languages.
Session variables
A materialized view cannot reference business schema entities that contain session variables. Attempting to do so will result in an error when attempting to validate the script.
Spark jobs concerning a materialized view
When you manipulate the Spark properties of a materialized view, you may need to see the spark job execution of the materialized view. Any Spark job changes should be made in coordination with your System Administrator.
You can view the Spark job processes at the following URL: http://<server address>:9091
When you validate your script in the query editor, the materialized view will display as one spark job.
While you are running the notebook editor in a materialized view, Spark will display the notebook editor as a running application. Once you finish using the notebook editor and you validate the materialized view script, you will see a second job run when the materialized view is validated.