Data Quality Recipe
Data quality is paramount for ensuring the reliability, integrity, and utility of data. High-quality data underpins accurate analysis and informed decision-making. Organizations can reduce risks associated with errors, inefficiencies, and regulatory non-compliance by ensuring data is accurate, complete, consistent, and timely.
Configuration
Configuration | Description |
---|---|
Recipe Name | A freeform name for the recipe. |
Input | Select a previously constructed recipe. The input itself is not processed; instead, the data quality checks run against the table configured below. |
Schema | The schema that contains the table to perform data quality checks against. |
Base Table | The table to perform data quality checks against. |
Add Rules | Description ● A user-assigned value that declares the intent of the data quality rule. Category ● A high-level categorization of what the rule is intended to address. Column ● The column that should be considered in violation of the rule. |
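Conceptually, each rule pairs a Spark SQL expression with a description, category, and column. The recipe evaluates rules internally, but the following minimal PySpark sketch shows the idea, assuming a hypothetical SALES.ORDERS base table, an invented rule expression, and that a row violates a rule when its expression evaluates to false:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = spark.table("SALES.ORDERS")  # hypothetical Schema + Base Table

# Hypothetical rule: "Order amount must be populated and non-negative"
# (Category: Validity, Column: ORDER_AMOUNT)
rule_expr = "ORDER_AMOUNT IS NOT NULL AND ORDER_AMOUNT >= 0"

# Count rows that fail the check, assuming a row violates the rule when
# the expression evaluates to false for that row.
violations = base.filter(~F.expr(rule_expr)).count()
print(f"Rows violating the rule: {violations}")
```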
Category definitions
Each data quality rule must have an accompanying category. The category is required, but it does not change the function of the rule itself; it is intended to help identify what the rule is trying to achieve without a user needing to read the rule syntax.
- Completeness - the degree to which the necessary data is available for use.
- Uniqueness - the degree to which data records are unique and not duplicates.
- Timeliness - the degree to which data is up to date and available when it is needed.
- Validity - the degree of records' conformance to format, type, and range.
- Accuracy - the degree to which data values align with real values.
- Consistency - the degree to which data is consistent across different sources.
- Relevance - the degree to which the dataset's level of detail aligns with its intended purpose.
- Conformity - the degree to which data follows standard data definitions such as data type, size, and format.
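For illustration, the sketch below lists one hypothetical Spark SQL expression per category. The column names are invented, and each expression is written so that it evaluates true for rows that pass the check:

```python
# Hypothetical Spark SQL rule expressions by category. Column names are
# illustrative; adapt them to your base table. Note that the Uniqueness
# example uses a window function, which must be evaluated as a projected
# column rather than directly inside a WHERE clause.
category_examples = {
    "Completeness": "CUSTOMER_ID IS NOT NULL",
    "Uniqueness":   "count(*) OVER (PARTITION BY ORDER_ID) = 1",
    "Timeliness":   "LAST_UPDATED >= date_sub(current_date(), 30)",
    "Validity":     "DISCOUNT_PCT BETWEEN 0 AND 100",
    "Accuracy":     "UNIT_PRICE * QUANTITY = LINE_TOTAL",
    "Consistency":  "SHIP_DATE >= ORDER_DATE",
    "Relevance":    "REGION IN ('NA', 'EMEA', 'APAC')",
    "Conformity":   "EMAIL RLIKE '^[^@]+@[^@]+[.][^@]+$'",
}
```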
Output
In the result set, a new column named Incorta_DQ_Violations is prepended to the dataset. For each row, this column shows the number of rule violations. Hovering over the row value displays the DQ rule(s) in violation. The hover-over includes:
- A unique identifier
- The description
- The category
The unique identifier can be used as a list item in the remainder of the workflow. For example, to filter all records that contain a certain violation, copy the unique identifier from Incorta_DQ_Violations and paste it as the operand in a filter tool (see the sketch below).
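As a hedged PySpark sketch of that filter, assuming a hypothetical output table name and that the rule identifier (here "DQ_001", an invented format) appears in the Incorta_DQ_Violations value for violating rows:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
result_df = spark.table("SALES.ORDERS_DQ")  # hypothetical recipe output table

# Keep only rows whose Incorta_DQ_Violations value references rule "DQ_001".
flagged = result_df.filter(F.col("Incorta_DQ_Violations").contains("DQ_001"))
flagged.show()
```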
Data Quality rule storage
When the MV is exported to a schema, a CSV file is built and saved in Data/Rules/{workflowname}. This file contains the following attributes from the data flow:
Column Name | Description |
---|---|
Rule_ID | The automatically assigned unique identifier for the data quality rule. |
Schema_Name | The base schema in which the data quality rules were applied. |
Table_Name | The base table in which the data quality rules were applied. |
Rule_SQL_Expression | The written Spark SQL expression for the data quality rule. |
Rule_Description | The written description of the purpose of the data quality rule. |
Rule_Category | The assigned category to which the data quality rule belongs. |
Rule_Owner | The name of the user who created the data quality rule. |
Is_Active_Flag | A true/false value indicating whether the data quality rule is active in the MV. |
Output_Table | The schema and table where the data quality rule is deployed. |
Timestamp | The time at which the data quality rule was deployed. |
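Since the rules file is a plain CSV, it can be inspected with standard tooling. The following minimal PySpark sketch reads it back; the {workflowname} placeholder must be substituted with your own workflow name, and the assumption that Is_Active_Flag is stored as the string "true" is illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the exported rules file; replace {workflowname} with your workflow.
rules = spark.read.option("header", True).csv("Data/Rules/{workflowname}")

# List active rules (assumes the flag is stored as the string 'true').
(rules.filter("lower(Is_Active_Flag) = 'true'")
      .select("Rule_ID", "Rule_Category", "Rule_SQL_Expression")
      .show(truncate=False))
```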