Concepts → Physical Schema Table

About a physical schema table

A physical schema table is one of several types of entity objects in a physical schema. These object include:

A physical schema table represents the metadata about an entity of structured data. As an entity object in a physical schema, the metadata and properties of a physical schema table also describe how the Loader Service stores the table's data in shared storage and how the Analytics Loader service loads the table into memory. The metadata includes the following:

Object data lineage

Starting with 2023.7.0, you can view the data lineage of physical schema tables in the Data Lineage Viewer v2 that displays a diagram with the entities referenced in the table (upstream lineage) and entities where the table is referenced (downstream lineage). Additionally, you can view the data lineage per column or formula column in the table.

Depending on the data source(s) configuration, the following additional table level properties may be available in the Table Editor:

The specified metadata and defined properties for a physical schema table govern the workload for the table in relation to a load job for the table itself or for the parent physical schema as a whole.

Name

A physical schema table name must adhere to the following name validation rules:

  • Must be unique to the physical schema
  • Must be between 1 and 128 characters in length
  • Must begin with an alpha character, lower or upper case
  • After the first alpha character, can contain zero or more alphanumeric characters in lower, upper, or mixed case
  • After the first alpha character, can contain zero or more underscore (_) or Dollar sign ($) characters
  • Besides underscore (_) and Dollar sign ($), cannot contain special characters, symbols, or spaces
Important

The name for a physical schema table name is immutable.

Data Source(s)

A physical schema table requires at least one defined data source. A data source can be a local data file, local data folder, or an existing external data source.

Once selected, the Type property exposes the configurable child properties known as table data source properties. For the Data Source property, you can specify an external data source or LocalFiles. LocalFiles refers to uploaded local data files and local data folders. To learn more about an External Data Source, see Concepts → External Data Source

Here are some of the additional considerations for a table data source:

  • Singe-source
  • Multi-source

Single-source

A single-source table has exactly one source also known as a single dataset. The Type property determines the dataset type. You use the Data Source property to specify the external data source or local data files and folder.

To learn more, see the following:

Union Files

For the data source types File System and Data Lake, it is possible to specify either a single File or Directory of files. To select a Directory, you must first enable either the Union or Wildcard Union property. The Union property is analogous to defining a multi-source physical schema table that specifies individual datasets as files. In both cases, the Loader Service loads the multiple files in parallel which, in comparison to a sequential loading, greatly reduces the extraction time.

Recommended practice: Add Filename

When using Union Files, enable the Add Filename as a column property and specify the name of the Filename column, such as File_Name. Although duplicative, the column value will allow you to identify the file source, and in turn, track down any data integrity issues with a specific file.

Chunking Files

When the File Type is Text (csv, tsv, tab, txt) and the source files are large, you can also enable chunking and specify the size of the chunks. In this manner, you can reduce the extraction time as the Loader Service chunks the file by the desired size. This is especially helpful when loading text files larger than 1GB.

Warning: Avoid chunking for incremental load

Chunking can greatly improve the performance for loading large files or running queries with large result sets. However, for incremental loads where the number of rows in the load is much smaller than the initial load, chunking can be ineffective.

Multi-source

A physical schema table may have more than one data source. A multi-source table allows a schema developer to union disparate datasets into one physical schema table. For example, a multi-source table may define one source as a File System dataset and another as a SQL Database dataset. For more information, see Concepts → Multi-Source Table.

Columns

A physical schema table column has several properties that determine how the Loader Service extract and stores data as well as how the Analytics Service loads data. To learn more about a physical schema table column, see Concepts → Physical Schema Table Column.

Formula Columns

A table formula column contains an expression that returns a scalar value of a specific data type. As such, the Loader Service computes and persists the physical schema table formula column to shared storage in a Direct Data Mapping file format. To learn more about a physical schema table formula column, see Concepts → Physical Schema Table Formula Column.

Runtime security filters

As a schema developer, you can apply one or more runtime security filters to restrict row access to a table. Any dependent object such as a runtime business view, dashboard insight, dashboard filter, or internal session variables will automatically apply the runtime security filter. To learn more about a physical schema table runtime security filter, see Concepts → Physical Schema Table Runtime Security Filter.

Parent Joins

The physical schema table is the parent table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

A parent table typically has one key column or composite key columns. In both cases, the key column functions as a primary key. Where there is a key column, the join condition typically contains a join operator an = equals join operator.

It is possible to create a join from a child to parent without using defined key column in the parent table. A table without a key column allows for duplicate values. If the parent table contains duplicate non-key rows, the Loader Service randomly selects a parent table row based on the non-key join. To help simulate a key for the parent table, consider using one or more conditions in the join definition such as > greater than and < less than. Multiple join operators may significantly narrow join values from the parent table so as to create row uniqueness. A join condition with multiple join operators requires internal sorting which may require high calculation costs for the join path.

For a parent table with a composite key, remove the unnecessary key column if possible. Otherwise, use the key column with the highest selectivity in the join. To narrow the join range, consider using additional join operators.

Recommended practice

Whenever possible, create a key column or a composite key column for a parent table. Avoid making a join to a parent table that does not have a key column or composite key columns.

Warning

Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.

Child Joins

The physical schema table is the child table in a join relationship to another entity object in the physical schema or another physical schema within the tenant. A child table typically has a column that serves as a foreign key. By default, the join condition represents a left outer join where the child is on the left side of the join and the parent is on the right side of the join.

Warning

Avoid implicit casting or conversions for disparate data types in a join between a child and parent table. In other words, the child table column and the parent table column should be of the same data type.

Load Filter

For a given physical schema table, you can specify a load filter in the Table Editor. A load filter reduces the number of rows that the Analytics Service loads into memory from shared storage, but does not affect how many rows the Loader Service extract from the data source of the table.

Recommended practice

Create a load filter to help optimize memory use in the Analytics Service for large fact tables. For example, if the Finance team only needs the past 3 months of data in a large fact table, create a load filter for the fact table. Here is an example.

addMonths(OrderDate,-3)

To learn more, see Concepts → Load Filter.

Performance Optimized

When enabled, the Performance Optimized property instructs the Analytics Service to load the physical schema table into memory, and if applicable, apply a specified Load Filter. The Table Editor disables the Performance Optimized property automatically for a physical schema table that uses a Data Lake data source with the Remote Table property enabled.

With a performance optimized table, you can reference the physical schema table in the runtime context of the Analytics Service. This includes the following:

  • a physical schema table runtime security filter
  • an Incorta Analyzer Table column
  • an Incorta Analyzer Table column
  • a business schema view column
  • a business schema view formula column
  • an Incorta Analyzer View column
  • an Incorta Analyzer View formula column
  • an internal session variable
  • a filter expression session variable
  • an insight dimension or measure
  • an insight formula column
  • any insight filter such as an individual filter, a distinct filter, an aggregated filter, or a measure filter)
  • dashboard filter such as a prompt, an applied filter, or a filter option
Recommended practice

If a physical schema table is not being used in a runtime context, disable the Performance Optimized property. In this regard, a schema developer may purposefully designate a physical schema table as a non-performance optimized table. For example, a materialized view can read the Apache Parquet files from a non-performance optimized physical schema table. In addition, the SQLi Interface supports querying a non-performance optimized physical schema table via Apache Spark. In these instances, consider using a standardized naming convention to indicate your intentions to users. A table prefix such as npo helps users of external integration tools such as Microsoft PowerBI and Tableau know that the table is queryable, but the performance may be sub-optimal. An example of non-performance optimized table name is npo_tbl_Sales.

Important

After the completion of a successful load job, the following operations are available for a non-performance optimized physical schema table:

  • an external integration tool can read the table via the SQLi interface, the sqlApp, and Apache Spark
  • a materialized view can read the table using a SQL query or data frame reference
  • a physical schema table can reference the table in the context of a PostgreSQL external data source that specifies the Incorta tenant itself as the database

Disable Full Load

The Disable Full Load property is available for a physical schema table only if it has the Incremental property enabled. A schema developer may enable the Incremental property for the data source of the physical schema table. For an incremental load of a physical schema, a physical schema table with the Incremental property enabled observes the related data source configurations such as using a specific file or executing the specified Update Query. For a full load of a physical schema, the Loader Service will exclude a physical schema table with the Disable Full Load property enabled. Typically, a schema developer will perform a table full load at least once before enabling the Disable Full Load property.