Concepts → Data Purge

Overview

Data Purge is a load type, introduced in release 2024.7.x, that removes unneeded data from physical tables and materialized views (MVs). Purging this data enhances system performance and optimizes disk space usage.

Prepare for a data purge

Configure objects

Before executing a data purge job, you must configure the relevant objects and identify the data to be deleted. Unneeded data can be designated per object through a data retention policy or by defining an exclusion set.

  • A data retention policy defines the conditions data must meet to be retained; otherwise, it is marked for deletion.
  • Alternatively, an exclusion set is a physical schema object that contains the identifiers of the rows designated for deletion. You must specify the mapping columns in the target table and its corresponding exclusion set. (A hypothetical sketch of both options follows this list.)
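To make the two options concrete, here is a minimal sketch in Python. The table name, column names, retention condition, and mapping are assumptions for illustration only; they are not Incorta APIs or configuration syntax.

  # Hypothetical illustration only -- names and the condition are examples,
  # not Incorta APIs or configuration syntax.

  # 1) Data retention policy: a condition that rows must satisfy to be kept.
  #    Here, orders older than 36 months fail the condition and are marked
  #    for deletion.
  retention_condition = "order_date >= add_months(current_date(), -36)"

  # 2) Exclusion set: a physical table listing the identifiers of the rows
  #    to delete, linked to the target table through mapping columns.
  exclusion_set = "SALES.orders_to_purge"       # assumed table name
  mapping_columns = {"order_id": "order_id"}    # target column -> exclusion column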

Configure Spark

The purge process runs via Spark. You can enable dynamic resource allocation for the Spark applications that run data purge and Parquet merge processes, ensuring efficient resource usage based on workload demands.

Spark application configurations can also be customized to meet specific operational requirements. Parameters such as application cores, executor memory, and driver memory can be adjusted in the Cluster Management Console (CMC) > Server Configurations > Spark Integration.
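As a rough sketch, these CMC settings correspond to standard Spark properties like the ones below; the exact property names and values Incorta applies internally may differ, so treat this PySpark snippet as illustrative only.

  # A minimal PySpark sketch of the kinds of properties the CMC settings
  # map to; the values are illustrative, not recommended defaults.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("data-purge-sketch")
      # Dynamic resource allocation scales executors with workload demand.
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "8")
      # Application cores, executor memory, and driver memory, as exposed in
      # CMC > Server Configurations > Spark Integration.
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "8g")
      .config("spark.driver.memory", "4g")
      .getOrCreate()
  )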

Start or schedule a data purge job

A Super User, or any user with the SuperRole or Schema Manager role, can start a data purge job or schedule one or more unattended purge jobs for multiple physical schemas within a load plan.

You can start or schedule a purge job from the Schema Designer. You can also create a data purge scheduled plan from the Schema Manager or the Scheduler.

The data purge load type is available for physical tables and MVs with a data retention policy or an exclusion set configuration.

The data purge cycle

When a data purge job is executed for one or more schemas, all schema objects (excluding aliases) are included in the load job. However, only physical tables and MVs with data purge configurations have their data evaluated for deletion; other objects are simply loaded from staging.

Objects with data retention policies

For each Parquet file, Incorta reads the records and evaluates them against the data retention condition to build the list of records to delete.

  • If records need to be deleted:
    • Incorta sends the data purge script to Spark, including the path to the target table and the list of Parquet files that contain the records marked for deletion.
    • Spark deletes the data that does not meet the retention policy and creates a new version of the Parquet files and incremental segments (conceptually, as sketched after this list).
  • If no records need to be deleted, Incorta does not submit a job to Spark and performs a regular load-from-staging job, in which no new file version is created.
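Conceptually, the Spark step behaves like the following PySpark sketch. This is not Incorta's actual purge script; the paths, column name, and retention condition are assumptions for illustration.

  # Conceptual sketch of a retention-policy purge: keep only the rows that
  # satisfy the retention condition and write them as a new Parquet version.
  # Paths, the column, and the condition are hypothetical.
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("retention-purge-sketch").getOrCreate()

  source_path = "/incorta/parquet/SALES/orders"       # assumed table path
  new_version = "/incorta/parquet/SALES/orders_v2"    # assumed new-version path

  df = spark.read.parquet(source_path)

  # Rows that meet the retention condition are kept; the rest are purged.
  retained = df.filter(F.col("order_date") >= F.add_months(F.current_date(), -36))

  retained.write.mode("overwrite").parquet(new_version)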

Objects with an exclusion set

Incorta sends the data purge script with the following details to Spark:

  • The path to the compacted version of the exclusion set
  • The mapping columns in the target object and the exclusion set

Spark deletes the data shared between the source tables and their exclusion sets, that is, the rows whose mapping columns match, and creates a new version of the Parquet files and incremental segments. Note that Spark creates a new version even if no records were deleted.
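A rough PySpark equivalent of this step is a left anti join on the mapping columns, as sketched below; again, the paths and the mapping column are hypothetical, not Incorta's actual script.

  # Conceptual sketch of an exclusion-set purge: drop the rows whose mapping
  # columns match the exclusion set, then write a new Parquet version.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("exclusion-purge-sketch").getOrCreate()

  target = spark.read.parquet("/incorta/parquet/SALES/orders")           # target table
  exclusion = spark.read.parquet("/incorta/parquet/SALES/orders_purge")  # compacted exclusion set

  # A left anti join keeps only target rows with no match in the exclusion set.
  survivors = target.join(exclusion, on="order_id", how="left_anti")

  # A new version is written even if no rows matched the exclusion set.
  survivors.write.mode("overwrite").parquet("/incorta/parquet/SALES/orders_v2")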

Objects with both data retention and exclusion set configurations

  1. Incorta determines whether data should be deleted per the data retention policy.
  2. Incorta sends the data purge script with the details required for both the data retention condition and the exclusion dataset.
  3. Spark executes the script, deletes the required data, and creates a new version of the Parquet files and segments (a conceptual sketch follows this list). If Spark fails to execute the purge script for either of the two configurations:
    • No data is deleted.
    • No new version is written.
    • The load of the table or MV continues as a staging load.
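Under the same assumptions as the two sketches above, the combined case can be pictured as the retention filter and the exclusion-set anti join applied in a single pass, with the new version written only if the whole script succeeds.

  # Conceptual sketch of the combined purge: retention filter plus
  # exclusion-set anti join in one pass. Paths, the column, and the
  # condition are hypothetical.
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("combined-purge-sketch").getOrCreate()

  target = spark.read.parquet("/incorta/parquet/SALES/orders")
  exclusion = spark.read.parquet("/incorta/parquet/SALES/orders_purge")

  purged = (
      target
      .filter(F.col("order_date") >= F.add_months(F.current_date(), -36))
      .join(exclusion, on="order_id", how="left_anti")
  )

  # If either part of the script fails, no new version is written and the
  # table falls back to a regular load from staging.
  purged.write.mode("overwrite").parquet("/incorta/parquet/SALES/orders_v2")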

Additional considerations

  • If an exclusion set has its own data purge configurations, it can’t be included in the same load job as its target table.
  • The new Parquet files written by the data purge step contain shuffled data; the original row order is not preserved.
  • Only physical tables and MVs can be used as exclusion sets.
  • Only data columns can be used when defining an exclusion set.