Concepts → Data Retention
Overview
Data retention is the practice of keeping data for a specific period. Starting with 2024.7.x, Incorta provides table-level settings that let schema managers better control the data stored on disk based on predefined criteria. This feature promotes efficient data management, improves performance, and optimizes resource and disk space usage.
Data retention policies can apply only to physical schema tables and materialized views (MVs). A purge job is required to remove data not meeting the retention criteria.
Creating a data retention policy
Data retention settings can be configured on the Advanced Settings tab of the Table Editor for any physical table or MV. You can define data retention policies using time-window configurations or custom conditions.
Exercise caution when setting criteria for data retention. Purged data cannot be restored directly. To recover from an accidental purge operation, you must perform a full load of the affected tables and MVs.
Data retention via a time-window configuration
For time-window retention policies, you specify the time window based on a date or timestamp column in the table or MV. Records within the defined time window will be retained while those outside will be marked for deletion during the next purge job.
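The time-window behavior described above can be sketched in a few lines of Python. This is only an illustration of the retain-or-mark logic, not Incorta's implementation: the function name `split_by_time_window`, the `order_date` column, the 90-day window, and the inclusive cutoff are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def split_by_time_window(rows, column, window, now):
    """Partition rows into (retained, marked_for_deletion).

    Hypothetical helper: a row is retained when its date/timestamp value
    falls on or after the cutoff (now - window). The inclusive boundary
    is an assumption, not documented product behavior.
    """
    cutoff = now - window
    retained = [row for row in rows if row[column] >= cutoff]
    marked = [row for row in rows if row[column] < cutoff]
    return retained, marked

# Illustrative data only.
rows = [
    {"order_id": 1, "order_date": datetime(2024, 1, 15)},
    {"order_id": 2, "order_date": datetime(2024, 6, 1)},
    {"order_id": 3, "order_date": datetime(2024, 7, 20)},
]

retained, marked = split_by_time_window(
    rows, "order_date", timedelta(days=90), now=datetime(2024, 8, 1)
)
retained_ids = [row["order_id"] for row in retained]
marked_ids = [row["order_id"] for row in marked]
```

In this sketch, the rows marked for deletion are only identified. As in the product, nothing is removed until a purge step actually runs.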
Data retention via a custom condition
If a dataset lacks a date or timestamp column, you can create a custom condition that defines which data to retain. Records satisfying the custom condition will remain while those not satisfying the condition will be marked for deletion.
Custom conditions offer more flexibility than time-window configurations. Within a custom condition, you can:
- Reference columns of different data types
- Use system variables
- Use different types of built-in functions
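As a rough illustration of what a custom condition expresses, the following Python sketch models a condition as a predicate over a row. It does not use Incorta's formula syntax. The column names (`status`, `balance`, `created`), the `current_date` argument standing in for a system variable, and the threshold are all hypothetical.

```python
from datetime import date

def retention_condition(row, current_date):
    """Hypothetical condition: keep active records, plus any record
    created in the current year whose balance exceeds 1000.

    Combines a string column, a numeric column, and a date column,
    with current_date playing the role of a system variable.
    """
    is_active = row["status"].upper() == "ACTIVE"
    is_recent_and_large = (
        row["created"].year == current_date.year and row["balance"] > 1000
    )
    return is_active or is_recent_and_large

# Illustrative data only.
rows = [
    {"id": 1, "status": "active", "balance": 50, "created": date(2020, 3, 1)},
    {"id": 2, "status": "closed", "balance": 5000, "created": date(2024, 2, 10)},
    {"id": 3, "status": "closed", "balance": 10, "created": date(2021, 9, 5)},
]

today = date(2024, 8, 1)
retained_ids = [r["id"] for r in rows if retention_condition(r, today)]
marked_ids = [r["id"] for r in rows if not retention_condition(r, today)]
```

Rows for which the predicate is true remain; the others are marked for deletion, mirroring the behavior described above.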
Purging unneeded data
After configuring a data retention policy, data that does not meet the retention criteria can be removed via a data purge job.
You can run a purge job manually for individual tables or MVs, or for all tables and MVs in a schema, from the same dialogs you use to perform other load actions. Alternatively, you can schedule purge jobs through a load plan to clean up data from one or more physical schemas, either simultaneously or sequentially.
- When data retention and exclusion set configurations are defined for one or more objects in a purge job, the job deletes data that does not meet the retention policy and data shared between the source table and the exclusion set.
- After deleting data, the purge job creates a new version of the object’s files. Parquet files created by the data purge job have shuffled data.
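The combined effect of a retention policy and an exclusion set can be sketched as follows. This is a simplified model under stated assumptions, not the actual purge job: the function `purge`, matching rows against the exclusion set by an `id` key, and representing a file version as a new Python list are all hypothetical.

```python
def purge(rows, retain, exclusion_keys, key):
    """Return a new list of rows (a new "version") and leave the input untouched.

    A row survives only if it satisfies the retention predicate AND its
    key is not present in the exclusion set (assumed key-based matching).
    """
    return [
        row for row in rows
        if retain(row) and row[key] not in exclusion_keys
    ]

# Illustrative data only.
current_version = [
    {"id": 1, "year": 2019},
    {"id": 2, "year": 2023},
    {"id": 3, "year": 2024},
    {"id": 4, "year": 2024},
]

new_version = purge(
    current_version,
    retain=lambda row: row["year"] >= 2023,  # hypothetical retention policy
    exclusion_keys={4},                      # hypothetical exclusion set
    key="id",
)
new_ids = [row["id"] for row in new_version]
```

Here row 1 is dropped because it fails the retention policy, and row 4 is dropped because it also appears in the exclusion set. The original list is left as it was, loosely mirroring how the purge job writes a new version of the files rather than editing them in place.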
For more details about data purge jobs, refer to Concepts → Data Purge.