About a full load
The full load strategy is one of multiple data load strategies available in Incorta. You can load data for a given physical schema or a given object, and you can run a load job on demand or on a schedule. During a load job, data can be loaded from the source (full load or incremental load) or from staging (Shared Storage).
You can perform a full load for a physical schema or an object. Physical schema full load jobs can be run on demand or on a schedule, while object full load jobs are available only on demand.
For more information about loading data in Incorta, refer to References → Data Ingestion and Loading.
How to start or schedule a full load job
A Super User (tenant administrator) or a user who belongs to a group with the SuperRole or Schema Manager role can start a load job or create a scheduled job to run one or more unattended load jobs for the same physical schema. As a schema developer, you can start or schedule a full load job from the Schema Designer. You can also create a full load scheduled job from the Schema Manager or the Scheduler.
When a full load job starts, the Loader Service, by default, performs a full load for all physical schema tables and materialized views (MVs). However, the Loader Service skips tables and MVs that have incremental load enabled and full load disabled. Typically, a schema developer performs a full load of an object at least once before enabling the Disable Full Load property.
Incorta does not recommend running concurrent schema model update jobs and load jobs on the same schema or dependent schemas, as this may result in errors or inaccurate data.
Schema updates that require a full load
Some updates you make to the physical schema objects require loading data fully from source to ensure data consistency.
The following are the updates that require a full load:
- Adding a new physical schema table or MV
- Changing the data type of a physical schema table column or materialized view column
- Changing the source of a physical schema table or MV, whether by selecting another source file in the Data Source properties dialog or editing the query
- Adding or changing a key column (changing the column function from key to dimension or measure and vice versa) in a physical schema table or MV
- Adding a new physical schema table column
- Adding a new MV column
- Changing the object type, for example, changing a physical schema table to an Incorta Analyzer table or MV
- Removing a physical schema table column or an MV column that functions as a key
- Changing the encryption status of one or more columns in a physical schema table or MV
The full load job cycle
A full load job goes through the following stages:
- Extraction
- Transformation / Enrichment (in the case of an MV)
- Load and post-load
- Send to Destination - Only available if you set a data destination for a schema
During a full load job, the following occurs:
The Loader Service extracts data from the data source for each physical schema table or the single specified table according to the table data source properties.
The Loader Service creates new source parquet files in the source directory. It creates a new parquet version directory with a subdirectory to save these files.
When the Table Editor → Enforce Primary Key Constraint option is enabled for an object, primary key index calculations (deduplication) start to mark duplicate records that must be deleted to ensure that only unique data records exist.
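The following PySpark snippet is only a conceptual sketch of what deduplication by key achieves, not Incorta's internal implementation; the parquet path, the ORDER_ID key, and the LOAD_TIME ordering column are hypothetical examples.

```python
# Conceptual sketch: keep one record per key and mark the rest as duplicates.
# Paths and column names are hypothetical, not Incorta internals.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("pk-dedup-sketch").getOrCreate()

df = spark.read.parquet("/tmp/demo/ORDERS")  # hypothetical extracted parquet files

# Order duplicates per key so that only the most recently loaded record survives.
w = Window.partitionBy("ORDER_ID").orderBy(F.col("LOAD_TIME").desc())

deduped = (df.withColumn("rn", F.row_number().over(w))
             .filter(F.col("rn") == 1)
             .drop("rn"))

deduped.show()
```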
If the Cluster Management Console (CMC) → Tenant Configurations → Data Loading → Enable Always Compact option is enabled, a compaction job starts to remove duplicate rows and create a compacted version of the object parquet files in the object's _rewritten directory in the source area. The consumers of compacted parquet files are MVs, SQLi queries on the Spark port, internal and external Notebook services, and the Preview data function.
Important: In releases before 2022.2.0, a compaction job resulted in both rewriting a compacted version of each parquet file that has duplicates and copying the other extracted parquet files. Copied and rewritten parquet files were saved to the compacted directory under the tenant directory. The compacted directory might have multiple versions of compacted files of the same object. Consumers of compacted parquet files were directed to read data from the latest committed compacted version of the parquet files in the compacted directory. Starting with 2023.1.0, the Enable Always Compact option is no longer available in the CMC because creating a compacted version of parquet files during load jobs is no longer optional: the Loader Service starts a compaction job during all load jobs. During loading from staging, the Loader Service initiates a compaction job if it detects issues with the compacted version of an object.
When the Enforce Primary Key Constraint property is disabled for an object, both the deduplication and compaction calculations for this object are skipped.
At the end of the compaction job, a group of metadata files is generated in Delta Lake file formats to point to all parquet files (whether extracted or rewritten) that constitute a compacted version. Consumers of the compacted parquet will use the Delta Lake metadata files to find out which extracted or compacted parquet file versions to read data from.
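As a rough illustration, assuming Delta Lake is configured in the Spark session and the path below stands in for a hypothetical compacted object directory, a consumer can let the Delta Lake transaction log resolve the active parquet files instead of listing the directory itself:

```python
# Sketch: rely on Delta Lake metadata (_delta_log) to resolve which parquet
# files make up the current compacted version. The path is hypothetical and
# the snippet assumes the Delta Lake package is available to Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-metadata-sketch").getOrCreate()

table_path = "/incorta/tenants/demo/source/SALES/ORDERS/_rewritten"  # hypothetical

# Reading through the delta format uses the transaction log to pick the
# committed parquet files rather than scanning the directory contents.
df = spark.read.format("delta").load(table_path)
df.printSchema()
```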
For an MV, the Loader Service passes the query of the MV Script to Spark. Spark reads data from the parquet files of the underlying physical schema objects and creates new parquet files for the MV in a new parquet version directory in the source directory. A compacted version of the MV parquet files is also created in the object's _rewritten directory if compaction is enabled.
Note: Spark reads the MV data from the compacted parquet files of the underlying object when the underlying object is a physical schema table or another MV. However, starting with release 2021.3.2, a materialized view can reference columns from Incorta SQL tables or Incorta Analyzer tables in other physical schemas. In such a case, Spark reads data from the source parquet files of these Incorta tables as they do not have a compacted version. With the new compaction mechanism introduced in the 2022.2.0 release, each of these tables has a _delta_log directory in the object directory that includes a group of metadata files that compacted parquet consumers (such as Spark) use to find out the parquet files of each object version to read from.
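For illustration, a Python MV script commonly takes a shape like the following hedged sketch; the SALES schema, table names, and columns are hypothetical, and the read and save helpers are the ones Incorta's MV runtime exposes to Python MV scripts.

```python
# Hypothetical PySpark MV script; schema, table, and column names are examples.
# read() resolves the underlying object's parquet files (compacted when
# available, as described above) and returns a Spark DataFrame;
# save() writes the MV result back for Incorta to load.
df_orders = read("SALES.ORDERS")
df_customers = read("SALES.CUSTOMERS")

df_mv = (df_orders
         .join(df_customers, "CUSTOMER_ID")
         .groupBy("CUSTOMER_ID", "COUNTRY")
         .sum("ORDER_AMOUNT"))

save(df_mv)
```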
- For Incorta Analyzer tables and Incorta SQL tables, the Loader Service creates full parquet files in the source directory. Prior to release 2021.3.2, the Loader Service would create snapshot DDM files for these tables in the ddm directory (also known as snapshot in older releases).
- For physical schema tables and MVs with performance optimization enabled, the Loader Service loads data to the Engine memory. The Engine then calculates any formula columns, key columns, or load filters for each object and creates snapshot DDM files. These files are saved to the schemas directory that exists in the ddm directory.
- In the case that there is a join relationship where one of the physical schema objects is the child table, the Engine creates a new version of the join DDM files and saves them to the joins directory that exists in the ddm directory.
The described behavior and output apply starting with release 2021.3.1, where the Loader Service creates a new version of files. In older releases, a full load job deletes all existing parquet, DDM, and compacted files and creates new ones.
Enforce Primary Key Constraint
The Enforce Primary Key Constraint option is available starting with release 2021.3.1 for physical schema tables and MVs with a key column. As of the 2022.4.0 release, this option is disabled by default for newly created physical schema tables and MVs with one or more key columns. You can enable it to enforce the calculation of the primary key index, or disable it to skip this calculation and optimize data load time and performance.
- When enabled, the Loader Service calculates the primary key index to enforce record uniqueness during a full load job.
- When disabled, the Loader Service skips this calculation. Disable it only if your dataset has unique records.
In releases before 2023.7.0, when the Enforce Primary Key Constraint option was disabled for physical schema tables or MVs and the selected key columns resulted in duplicate key values, unique index calculations would not fail; instead, the first matching value was returned whenever a single value of the key columns was required.
Starting with release 2023.7.0, the unique index calculation fails in such a case, and the load job finishes with errors. You must either select key columns that ensure row uniqueness and perform a full load, or enable the Enforce Primary Key Constraint option and load the tables from staging to have the unique index correctly calculated.
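Before relying on key columns with the constraint disabled, you may want to verify that they actually identify rows uniquely. The following is a minimal PySpark sketch of such a check, assuming a hypothetical ORDERS parquet extract and an ORDER_ID key; it is not an Incorta feature, just a way to inspect your source data.

```python
# Sketch: check whether the chosen key column(s) identify rows uniquely.
# Table path and key column names are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("key-uniqueness-check").getOrCreate()

df = spark.read.parquet("/tmp/demo/ORDERS")  # hypothetical source extract
key_cols = ["ORDER_ID"]

duplicates = (df.groupBy(*key_cols)
                .count()
                .filter(F.col("count") > 1))

# Any rows returned here mean the key columns do not ensure row uniqueness,
# so the unique index calculation would fail starting with release 2023.7.0.
duplicates.show()
```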