Concepts → Full Load

About a full load

Full load is one of the data load strategies available in Incorta. You can load data for a given physical schema or a given physical schema object, and you can run a load job on demand or schedule it to run later. During a load job, data is loaded either from the source, as a full load or an incremental load, or from staging (Shared Storage).

You can perform a full load for a physical schema or an object. Physical schema full load jobs can be on demand or scheduled, while object full load jobs are available only on demand.

For more information about loading data in Incorta, refer to References → Data Ingestion and Loading.

How to start or schedule a full load job

A Super User tenant administrator, or a user who belongs to a group with the SuperRole or Schema Manager role, can start a load job or create a scheduled job that runs one or more unattended load jobs for the same physical schema. As a schema developer, you can start or schedule a full load job from the Schema Designer. You can also create a scheduled full load job from the Schema Manager or the Scheduler.
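
Load jobs can also be triggered programmatically. The following is a purely hypothetical sketch of such a request; the base URL, endpoint path, payload fields, and header are illustrative placeholders, not Incorta's actual REST API. Refer to the Incorta REST API documentation for the actual operations.

    # Purely hypothetical sketch of starting a full load programmatically.
    # The base URL, endpoint, payload fields, and header are illustrative
    # placeholders only, not Incorta's actual REST API.
    import requests

    BASE_URL = "https://cluster.example.com/incorta/api"  # hypothetical base URL

    def start_full_load(schema_name: str, token: str) -> None:
        response = requests.post(
            f"{BASE_URL}/load",  # hypothetical endpoint
            headers={"Authorization": f"Bearer {token}"},
            json={"schemaName": schema_name, "loadType": "full"},  # hypothetical payload
            timeout=30,
        )
        response.raise_for_status()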

Note

When a full load job starts, the Loader Service, by default, performs a full load for all physical schema tables and MVs. However, for tables and MVs that have incremental load enabled and full load disabled, the Loader Service may throw errors during the first full load job or skip these objects during subsequent full load jobs. Typically, a schema developer performs a full load of an object at least once before enabling the Disable Full Load property.

Warning

Incorta does not recommend running concurrent schema model update jobs and load jobs on the same schema or dependent schemas as this may result in errors or inaccurate data.

Schema updates that require a full load

Some updates that you make to physical schema objects require fully loading data from the source to ensure data consistency.

The following updates require a full load:

  • Adding a new physical schema table or materialized view (MV)
  • Changing the data type of a physical schema table column or materialized view column (see the sketch after this list)
  • Changing the source of a physical schema table or MV, whether by selecting another source file in the Data Source properties dialog or editing the query
  • Adding or changing a key column (changing the column function from key to dimension or measure and vice versa) in a physical schema table or MV
  • Adding a new physical schema table column
  • Adding a new MV column
  • Changing the object type, for example, changing a physical schema table to an Incorta Analyzer table or MV
  • Removing a physical schema table column or an MV column that functions as a key
  • Changing the encryption status of one or more columns in a physical schema table or MV
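
For example, changing a column's data type in an MV query forces a full load because the object's existing parquet files no longer match the new column type. The following minimal sketch illustrates such an edit; the query, table, and column names are assumptions for illustration, not part of any actual schema.

    # Hypothetical MV query edit that changes a column's data type.
    # The table and column names are illustrative assumptions.

    mv_query_before = """
    SELECT ORDER_ID, AMOUNT
    FROM SALES.ORDERS
    """

    # Casting AMOUNT to DECIMAL(18, 2) changes the column's data type, so
    # the next load of this MV must be a full load.
    mv_query_after = """
    SELECT ORDER_ID, CAST(AMOUNT AS DECIMAL(18, 2)) AS AMOUNT
    FROM SALES.ORDERS
    """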

The full load job cycle

During a full load job, the following stages occur:

  • The Loader Service extracts data from the data source for each physical schema table (or for the single specified table in the case of an object load) according to the table's data source properties.
  • The Loader Service creates new source parquet files in the source directory, saving them in a new parquet version directory with an offset subdirectory.
  • If the Data Loading → Enable Always Compact option is enabled in the Cluster Management Console (CMC), the Loader Service also creates a compacted version of the object parquet files in the compacted directory. A data deduplication process precedes the compaction process to mark duplicate data for removal from the compacted parquet files (see the deduplication sketch after this list).
  • For a materialized view, the Loader Service passes the query in the MV script to Spark. Spark reads data from the parquet files of the underlying physical schema objects and creates new parquet files for the MV in a new parquet version directory in the source directory (see the sketch after the following note). A compacted version of the MV parquet files is also created if compaction is enabled.
Note

Spark reads the MV data from the compacted parquet files of the underlying object when the underlying object is a physical schema table or another MV. However, starting with release 5.1.2, a materialized view can reference columns from Incorta SQL tables or Incorta Analyzer tables in other physical schemas. In this case, Spark reads data from the source parquet files of these Incorta tables because they do not have a compacted version.
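
Conceptually, the MV step resembles the following PySpark sketch. The paths, version naming, and query are illustrative assumptions; Incorta's Loader Service and Spark manage the actual directory names, versions, and session internally.

    # Conceptual sketch of materializing an MV during a full load. Paths,
    # version naming, and the query are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mv-full-load-sketch").getOrCreate()

    # Spark reads the underlying object's parquet files (the compacted
    # version when the underlying object is a table or another MV).
    orders = spark.read.parquet("/tenants/demo/compacted/SALES/ORDERS")  # assumed path
    orders.createOrReplaceTempView("ORDERS")

    # The query of the MV script runs against the underlying data.
    mv_df = spark.sql("""
        SELECT CUSTOMER_ID, SUM(AMOUNT) AS TOTAL_AMOUNT
        FROM ORDERS
        GROUP BY CUSTOMER_ID
    """)

    # The result is written as new parquet files in a new parquet version
    # directory under the source directory.
    mv_df.write.mode("overwrite").parquet(
        "/tenants/demo/source/SALES/CUSTOMER_TOTALS/v2"  # assumed version path
    )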

  • For Incorta Analyzer tables and Incorta SQL tables, the Loader Service creates full parquet files in the source directory. Prior to release 5.1.2, the Loader Service would create snapshot DDM files for these tables in the ddm directory (also known as snapshot in older releases).
  • For physical schema tables and MVs with performance optimization enabled, the Loader Service loads data into the Engine memory. The Engine then calculates any formula columns, key columns, or load filters for each object and creates snapshot DDM files, which are saved to the schemas directory within the ddm directory.
  • When one of the physical schema objects is the child table in a join relationship, the Engine creates a new version of the join DDM files and saves them to the joins directory within the ddm directory.
Important

The described behavior and output are applicable starting with release 5.1 where the Loader Service creates a new version of files. For older releases, a full load job deletes all existing parquet, DDM, and compacted files and creates new ones.
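
The deduplication that precedes compaction can be pictured with a PySpark sketch like the following: keep only the latest record per key across the source parquet versions, then write the compacted copy. The paths, key column, and ordering column are assumptions for illustration, not Incorta's actual implementation.

    # Conceptual sketch of deduplication before compaction: keep the most
    # recent record per key across source parquet versions, then write the
    # compacted copy. Paths and column names are illustrative assumptions.
    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("compaction-sketch").getOrCreate()

    # Read every parquet version of the object from the source directory.
    all_versions = spark.read.parquet("/tenants/demo/source/SALES/ORDERS/*")  # assumed layout

    # Rank records per key so that only the latest survives; "_loaded_at"
    # is a hypothetical ordering column, not an actual Incorta column.
    latest_first = Window.partitionBy("ORDER_ID").orderBy(F.col("_loaded_at").desc())

    compacted = (
        all_versions
        .withColumn("_rank", F.row_number().over(latest_first))
        .where(F.col("_rank") == 1)
        .drop("_rank")
    )

    compacted.write.mode("overwrite").parquet("/tenants/demo/compacted/SALES/ORDERS")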

Enforce Primary Key Constraint

The Enforce Primary Key Constraint option is available starting with the 5.1.2 release for physical schema tables and MVs that have a key column, and it is enabled by default. Enable it to enforce the calculation of the primary key index, or disable it to skip this calculation and optimize data load time and performance.

  • When enabled, the Loader Service calculates the primary key index to enforce record uniqueness during a full load job.
  • When disabled, the Loader Service skips this calculation. Disable it only if your dataset already contains unique records (see the sketch below).
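
Conceptually, enforcing the primary key constraint amounts to key-based deduplication at load time, as the following PySpark sketch illustrates. The column names and data are illustrative assumptions, and dropDuplicates only approximates the primary key index calculation that Incorta actually performs.

    # Conceptual illustration of the Enforce Primary Key Constraint option.
    # Column names and data are illustrative assumptions; dropDuplicates
    # only approximates Incorta's primary key index calculation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pk-constraint-sketch").getOrCreate()

    rows = [(1, "a"), (1, "b"), (2, "c")]  # ORDER_ID 1 appears twice
    df = spark.createDataFrame(rows, ["ORDER_ID", "STATUS"])

    # Enabled: the load pays the cost of calculating a key index so that
    # each key maps to a single record (approximated here by dropping
    # duplicate keys).
    unique_df = df.dropDuplicates(["ORDER_ID"])  # 2 rows survive

    # Disabled: the calculation is skipped and all 3 rows are loaded,
    # which is only safe when the source is already unique per key.
    print(df.count(), unique_df.count())  # 3 2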