Concepts → Full Load

About a full load

The full load strategy is one of multiple data load strategies available in Incorta. You can load data for a given physical schema or for a given object, and you can run a load job on demand or on a schedule. During a load job, data can be loaded from the source (full load or incremental load) or from staging (Shared Storage).

You can perform a full load for a physical schema or for an object. Physical schema full load jobs can run on demand or on a schedule, while object full load jobs are available only on demand.

For more information about loading data in Incorta, refer to References → Data Ingestion and Loading.

How to start or schedule a full load job

A Super User tenant administrator or a user who belongs to a group with the SuperRole or Schema Manager role can start a load job or create a scheduled job that runs one or more unattended load jobs for the same physical schema. As a schema developer, you can start or schedule a full load job from the Schema Designer. You can also create a scheduled full load job from the Schema Manager or the Scheduler.

Note

When a full load job starts, the Loader Service, by default, performs a full load for all physical schema tables and materialized views (MVs). However, the Loader Service skips tables and MVs that have incremental load enabled and full load disabled. Typically, a schema developer performs a full load of an object at least once before enabling the Disable Full Load property.

Warning

Incorta does not recommend running concurrent schema model update jobs and load jobs on the same schema or dependent schemas as this may result in errors or inaccurate data.

Schema updates that require a full load

Some updates that you make to physical schema objects require loading data fully from the source to ensure data consistency.

The following updates require a full load:

  • Adding a new physical schema table or MV
  • Changing the data type of a physical schema table column or materialized view column
  • Changing the source of a physical schema table or MV, whether by selecting another source file in the Data Source properties dialog or editing the query
  • Adding or changing a key column (changing the column function from key to dimension or measure and vice versa) in a physical schema table or MV
  • Adding a new physical schema table column
  • Adding a new MV column
  • Changing the object type, for example, changing a physical schema table to an Incorta Analyzer table or MV
  • Removing a physical schema table column or an MV column that functions as a key
  • Changing the encryption status of one or more columns in a physical schema table or MV

The full load job cycle

During a full load job, the following occurs:

  • The Loader Service extracts data from the data source for each physical schema table, or for the single specified table in the case of an object load, according to the table data source properties.

  • The Loader Service writes new source parquet files to the source directory, creating a new parquet version directory with a subdirectory to save these files.

  • When the Table Editor → Enforce Primary Key Constraint option is enabled for an object, primary key index calculations (deduplication) start to mark duplicate records that must be deleted so that only unique data records exist.

  • If the Cluster Management Console (CMC) → Tenant Configurations → Data Loading → Enable Always Compact option is enabled, a compaction job starts to remove duplicate rows and create a compacted version of the object parquet files in the object’s _rewritten directory in the source area. The consumers of compacted parquet files are MVs, SQLi queries on the Spark port, internal and external Notebook services, and the Preview data function.

    Important

    In releases before 5.2, a compaction job both rewrote a compacted version of each parquet file that had duplicates and copied the other extracted parquet files. Copied and rewritten parquet files were saved to the compacted directory under the tenant directory, which might contain multiple compacted versions of the same object. Consumers of compacted parquet files were directed to read data from the latest committed compacted version of the parquet files in this directory.

  • When the Enforce Primary Key Constraint property is disabled for an object, both the deduplication and compaction calculations for this object are skipped.

  • At the end of the compaction job, a group of metadata files is generated in the Delta Lake file format to point to all parquet files (whether extracted or rewritten) that constitute a compacted version. Consumers of the compacted parquet use the Delta Lake metadata files to determine which extracted or compacted parquet file versions to read data from.

  • For an MV, the Loader Service passes the query of the MV Script to Spark. Spark reads data from the parquet files of the underlying physical schema objects and creates new parquet files for the MV in a new parquet version directory in the source directory (see the sketch after this list). A compacted version of the MV parquet files is also created in the object’s _rewritten directory if compaction is enabled.

    Note

    Spark reads the MV data from the compacted parquet files of the underlying object when the underlying object is a physical schema table or another MV. However, starting with the 5.1.2 release, an MV can reference columns from Incorta SQL tables or Incorta Analyzer tables in other physical schemas. In this case, Spark reads data from the source parquet files of these Incorta tables because they do not have a compacted version.

    With the new compaction mechanism introduced in the 5.2 release, a _delta_log directory exists in the object directory of each of these tables. It includes a group of metadata files that compacted parquet consumers (such as Spark) use to determine which parquet files of each object version to read from.

  • For Incorta Analyzer tables and Incorta SQL tables, the Loader Service creates full parquet files in the source directory. Prior to release 5.1.2, the Loader Service would create snapshot DDM files for these tables in the ddm directory (also known as snapshot in older releases).

    Important

    With the introduction of support for key columns in derived tables in release 5.2.11, the Loader Service creates snapshot DDM files for the unique index each time the Analyzer or SQL table's key columns are updated or the schema or table is loaded.

  • For physical schema tables and MVs with performance optimization enabled, the Loader Service loads data into the Engine memory. The Engine then calculates any formula columns, key columns, or load filters for each object and creates snapshot DDM files. These files are saved to the schemas directory within the ddm directory.

  • If there is a join relationship where one of the physical schema objects is the child table, the Engine creates a new version of the join DDM files and saves them to the joins directory within the ddm directory.
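
To make the MV step above more concrete, the following is a minimal PySpark sketch of how a compacted-parquet consumer, such as the Spark job that materializes an MV, might read the underlying objects through their Delta Lake metadata and write new MV parquet files. It is a conceptual illustration only, not Incorta's implementation: the shared storage paths, object names, MV query, and the "v3" output version directory are hypothetical, and the sketch assumes the delta-spark package is available to the Spark session.

```python
from pyspark.sql import SparkSession, functions as F

# Conceptual sketch only: how a compacted-parquet consumer (here, the Spark
# job that materializes an MV) might read underlying objects and produce new
# MV parquet files. All paths, object names, and the "v3" version directory
# are hypothetical; the real layout and commit handling are internal to the
# Incorta Loader Service.
spark = (
    SparkSession.builder
    .appName("mv-step-sketch")
    # Assumes the delta-spark package is on the classpath.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The _delta_log metadata in each object directory identifies which extracted
# or rewritten parquet files form the latest committed compacted version, so
# the consumer does not list parquet files itself.
orders = spark.read.format("delta").load(
    "/shared_storage/demo_tenant/source/SALES/ORDERS")      # hypothetical object directory
customers = spark.read.format("delta").load(
    "/shared_storage/demo_tenant/source/SALES/CUSTOMERS")   # hypothetical object directory

# The MV query itself: an ordinary Spark transformation over the inputs.
mv_result = (
    orders.join(customers, "customer_id")
          .groupBy("customer_region")
          .agg(F.sum("amount").alias("total_amount"))
)

# New parquet files for the MV land in a new parquet version directory;
# "v3" is only an illustrative name.
mv_result.write.mode("overwrite").parquet(
    "/shared_storage/demo_tenant/source/SALES/SALES_BY_REGION_MV/v3")
```

In an actual deployment, the MV query comes from the MV Script, and the Loader Service manages the version directory naming and the commit of the new MV parquet files.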

Important

The described behavior and output are applicable starting with the 5.1 release, in which the Loader Service creates a new version of the files. In older releases, a full load job deletes all existing parquet, DDM, and compacted files and creates new ones.

Enforce Primary Key Constraint

The Enforce Primary Key Constraint option is available starting with the 5.1.2 release for physical schema tables and MVs that have one or more key columns, and it is enabled by default for these objects. You can disable it to skip the calculation of the primary key index and optimize data load time and performance, or enable it to enforce the calculation of the primary key index.

  • When enabled, the Loader Service calculates the primary key index to enforce record uniqueness during a full load job.
  • When disabled, the Loader Service skips this calculation. Disable it only if your dataset has unique records.
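
To illustrate what enforcing the primary key constraint achieves at the data level, the following is a minimal PySpark sketch of key-based deduplication. It is not Incorta's primary key index calculation, which is internal to the Loader Service; the column names and the "latest update wins" tie-breaking rule are assumptions made for this example.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

# Conceptual sketch: keep exactly one record per key value, which is the
# effect that primary key enforcement guarantees. Column names and the
# "latest update wins" rule are assumptions made for this example only.
spark = SparkSession.builder.appName("pk-dedup-sketch").getOrCreate()

rows = [
    (1, "2024-01-01", 100.0),
    (1, "2024-01-02", 120.0),   # duplicate key 1: only one record may survive
    (2, "2024-01-01", 80.0),
]
df = spark.createDataFrame(rows, ["order_id", "updated_at", "amount"])

# Rank the records within each key and keep a single record per order_id.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (
    df.withColumn("rn", F.row_number().over(w))
      .filter(F.col("rn") == 1)
      .drop("rn")
)

deduped.show()   # order_id 1 appears once; order_id 2 is untouched
```

When the option is disabled, no such pass runs during the load job, which is why disabling it is safe only for sources that already guarantee unique key values.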