Concepts → Data Purge

Overview

Data Purge is a load type, introduced in 2024.7.x, that removes unnecessary data from physical tables or materialized views (MVs). Purging improves system performance and reduces disk space usage.

Prepare for a data purge

Configure objects

Before executing a data purge job, you must configure the relevant objects and identify the data to be deleted. Unneeded data can be designated per object through a data retention policy or by defining an exclusion set or inclusion set (available starting 2026.3.0).

  • A data retention policy defines the conditions that data must meet to be retained; records that do not meet them are marked for deletion.
  • Alternatively, an exclusion or inclusion set is a physical schema object containing the identifiers of rows to delete or to retain, respectively. You must specify the mapping columns between the target table and its corresponding exclusion or inclusion set.
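The two mechanisms above can be contrasted with a small sketch. This is not Incorta's implementation; the table, column, and identifier names are hypothetical, and plain Python lists stand in for physical tables.

```python
# Illustrative sketch (hypothetical data): how a retention policy and an
# exclusion/inclusion set each identify the rows to delete.
from datetime import date

rows = [
    {"order_id": 1, "order_date": date(2019, 5, 1)},
    {"order_id": 2, "order_date": date(2024, 2, 10)},
    {"order_id": 3, "order_date": date(2018, 8, 20)},
]

# Retention policy: rows that satisfy the condition are retained;
# everything else is marked for deletion.
def retained_by_policy(row):
    return row["order_date"] >= date(2020, 1, 1)

to_delete_policy = [r["order_id"] for r in rows if not retained_by_policy(r)]

# Exclusion set: identifiers of rows to delete, mapped on "order_id".
exclusion_ids = {1}
to_delete_exclusion = [r["order_id"] for r in rows if r["order_id"] in exclusion_ids]

# Inclusion set: identifiers of rows to retain; all other rows are deleted.
inclusion_ids = {2, 3}
to_delete_inclusion = [r["order_id"] for r in rows if r["order_id"] not in inclusion_ids]
```

Note that an exclusion set and an inclusion set with complementary identifiers mark the same rows for deletion; the choice is about which list is easier to maintain.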

Configure Spark

The purge process runs via Spark. You can enable dynamic resource allocation for Spark applications running data purge and Parquet merge processes, ensuring efficient resource usage based on workload demands.

Spark application configurations can also be customized, allowing for tailored behavior to meet specific operational requirements. Parameters such as application cores, memory for executors, and driver memory can be adjusted in the Cluster Management Console (CMC) > Server Configurations > Spark Integration.
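As a rough sketch, the adjustable parameters correspond to standard Apache Spark configuration properties such as the following. The property keys are standard Spark settings, but the values shown are arbitrary examples, and the exact names exposed in the CMC may differ.

```python
# Example Spark properties relevant to a purge workload (values are
# illustrative, not recommendations).
purge_spark_conf = {
    # Dynamic resource allocation scales executors with workload demand.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "8",
    # Per-application resource sizing.
    "spark.executor.cores": "2",
    "spark.executor.memory": "4g",
    "spark.driver.memory": "2g",
}
```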

Start or schedule a data purge job

A Super User or any user with the SuperRole or Schema Manager roles can initiate a data purge job or schedule one or more unattended purge jobs for multiple physical schemas within a load plan.

You can start or schedule a purge job from the Schema Designer. You can also create a data purge scheduled plan from the Schema Manager or the Scheduler.


The data purge cycle

When a data purge job is executed for one or more schemas, all schema objects (excluding aliases) are included in the load job. However, only physical tables and MVs with data purge configurations have their data deleted; all other objects are loaded from staging.

Objects with data retention policies

For each Parquet file, Incorta evaluates the records against the data retention condition to build the list of records to delete.

  • If records need to be deleted:
    • Incorta sends the data purge script to Spark, which includes the path to the target table and a list of Parquet files containing the records marked for deletion.
    • Spark deletes data that does not meet the retention policies and creates a new version of the Parquet files and incremental segments.
  • If no records need to be deleted, Incorta does not submit a job to Spark and performs a regular load-from-staging job, in which no new file version is created.
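The decision above can be sketched as follows. The function and variable names are hypothetical; the point is that only Parquet files that actually contain records marked for deletion are listed in the script sent to Spark, and an empty list means no Spark job is submitted at all.

```python
# Illustrative sketch (hypothetical names): the per-object decision made
# before involving Spark.
def plan_purge(parquet_files, retained):
    """Return the Parquet file paths that contain records to delete;
    an empty result means a regular load from staging is enough."""
    files_with_deletions = []
    for path, records in parquet_files.items():
        if any(not retained(r) for r in records):
            files_with_deletions.append(path)
    return files_with_deletions

files = {
    "part-000.parquet": [{"year": 2024}, {"year": 2018}],
    "part-001.parquet": [{"year": 2023}],
}
retained = lambda r: r["year"] >= 2020  # example retention condition

targets = plan_purge(files, retained)
# Only the files in `targets` would be listed in the purge script; if the
# list were empty, no job would be submitted to Spark.
```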

Objects with an exclusion or inclusion dataset

Incorta sends the data purge script with the following details to Spark:

  • The path to the compacted version of the exclusion or inclusion table
  • The mapping columns in the target object and the exclusion or inclusion table

Spark removes the data marked for deletion and creates a new version of the Parquet files and incremental segments. Note that Spark creates a new version even if no records are deleted.
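Conceptually, the delete step is an anti-join between the target object and the exclusion set on the mapping columns. The sketch below uses hypothetical names and a single mapping column; the key behavioral point is that a new version is written unconditionally, even when nothing matches.

```python
# Illustrative sketch (hypothetical names): deleting rows whose mapping
# column value appears in the exclusion set.
def purge_with_exclusion(rows, exclusion_keys, mapping_column):
    kept = [r for r in rows if r[mapping_column] not in exclusion_keys]
    # A new version is written even when no rows matched the exclusion set.
    return {"rows": kept, "new_version_written": True}

rows = [{"id": 1}, {"id": 2}, {"id": 3}]
v = purge_with_exclusion(rows, exclusion_keys={2}, mapping_column="id")
```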

Objects with both data retention and exclusion or inclusion dataset configurations

  1. Incorta determines whether the data should be deleted per the data retention policy.
  2. Incorta sends the data purge script with the details required for both the data retention condition and the exclusion or inclusion dataset.
  3. Spark executes the script, deletes the required data, and creates a new version of the Parquet files and segments. If Spark fails to execute the purge script for either of the two configurations:
    • No data is deleted.
    • No new version is written.
    • The load of the table or MV continues as a staging load.
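The steps above can be sketched as a single all-or-nothing operation. The names are hypothetical; the point is that both filters are applied in one script, and any failure leaves the data untouched with no new version, falling back to a staging load.

```python
# Illustrative sketch (hypothetical names): combined retention +
# exclusion purge with atomic failure semantics.
def combined_purge(rows, retained, exclusion_keys, key):
    try:
        # Both conditions are applied in one pass: a row survives only if
        # it meets the retention condition AND is not in the exclusion set.
        surviving = [r for r in rows
                     if retained(r) and r[key] not in exclusion_keys]
    except Exception:
        # Failure of either part: no deletions, no new version, and the
        # object is loaded from staging instead.
        return {"rows": rows, "new_version": False, "fallback": "staging"}
    return {"rows": surviving, "new_version": True, "fallback": None}

rows = [{"id": 1, "year": 2018}, {"id": 2, "year": 2024}, {"id": 3, "year": 2023}]
result = combined_purge(rows, lambda r: r["year"] >= 2020, {3}, "id")
```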

Additional considerations

  • If an exclusion or inclusion dataset has data purge configurations, it cannot be in the same load job as its target table.
  • The new Parquet files written by the data purge step contain shuffled data; the original row order is not preserved.
  • Only physical tables and MVs can be used as exclusion or inclusion datasets.
  • Only data columns can be used when defining an exclusion or inclusion dataset.
  • Columns used for synchronizing delete operations between a data source and Incorta using exclusion or inclusion sets cannot be encrypted.