Configure Spark for Use with the SQL Interface

The SQL interface is an Incorta module that lets external clients connect to Incorta as if it were a PostgreSQL database. This allows external BI tools (such as Tableau and Power BI) to act as front ends while Incorta serves as their data source. Under the hood, the SQL interface uses both the Incorta engine and Spark to fulfill client queries, favoring the Incorta engine (for speed) where possible.

The basic arrangement of this layout is as follows:

As far as clients are concerned, Incorta is the only interface: they send standard PostgreSQL queries to Incorta, which then decides whether:

  • To run the query against the Incorta engine and return the result from its in-memory stores, or
  • To delegate the query to Spark, which runs it directly over the corresponding Parquet files

Clients can choose to run queries directly against Spark by connecting to Incorta using a different port.
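
As a rough illustration of the two entry points, the sketch below builds libpq-style connection strings for each route. The port numbers are the defaults listed in the configuration table in this section; the host, database, and user values, and the `build_dsn` helper itself, are hypothetical placeholders rather than part of Incorta. Any PostgreSQL client library should accept a string of this shape.

```python
# Default ports from the configuration table; adjust if your deployment differs.
ENGINE_PORT = 5436      # default SQL interface port: the Incorta engine handles queries
DATA_STORE_PORT = 5442  # Data Store (DS) port: Spark handles queries


def build_dsn(host: str, database: str, user: str, direct_to_spark: bool = False) -> str:
    """Return a libpq-style keyword/value connection string (hypothetical helper)."""
    port = DATA_STORE_PORT if direct_to_spark else ENGINE_PORT
    return f"host={host} port={port} dbname={database} user={user}"


# Placeholder values for illustration only.
engine_dsn = build_dsn("incorta.example.com", "demo", "analyst")
spark_dsn = build_dsn("incorta.example.com", "demo", "analyst", direct_to_spark=True)
print(engine_dsn)
print(spark_dsn)
```

Keeping the port choice in one helper makes it easy to switch a client between the engine route and the direct Spark route without touching the rest of its connection settings.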

The number of queries you can run at the same time is bounded by the parallel thread limit of the Tomcat server (30 parallel threads). Beyond that limit, queries can slow down and fail.
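
One way a client application might stay under that limit is to cap its own in-flight queries. The sketch below is a client-side illustration only, assuming the limit of 30 stated above; `run_query` is a hypothetical stand-in for whatever function actually sends a query, and nothing here is an Incorta API.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL_QUERIES = 30  # Tomcat parallel thread limit stated above

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0


def run_query(sql: str) -> str:
    """Hypothetical stand-in for sending a query; it only tracks concurrency."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    time.sleep(0.01)  # simulate query latency
    with _lock:
        _in_flight -= 1
    return f"done: {sql}"


def run_all(queries):
    """Run queries with at most MAX_PARALLEL_QUERIES in flight at once."""
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_QUERIES) as pool:
        return list(pool.map(run_query, queries))


results = run_all([f"SELECT {i}" for i in range(100)])
print(len(results), "queries finished; peak concurrency:", peak_in_flight)
```

A bounded worker pool queues the excess queries on the client side, so the server never sees more than the configured number at once.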

SQL Interface Configuration

| Configuration | Default Value | Description |
| --- | --- | --- |
| Default SQL interface port * | 5436 | <Port Number> If clients connect to Incorta over this port, the Incorta engine handles the queries. This port is encrypted with SSL by default. |
| Data Store (DS) port * | 5442 | <Port Number> If clients connect to Incorta over this port, Spark handles the queries. |
| Connection pooling * | ON | <Toggle> A connection to the SQL interface (as with all databases) can be an expensive resource. Connection pooling enables sharing and reusing a set of connections among all clients connecting to the SQL interface. |
| Connection pool size * | 10 | <Number> The number of connections available in the pool for clients to use. If all connections are busy (being used by clients), the next client trying to connect has to wait until a connection is released back to the pool. |
| Concurrency * | 10 | <Number> Controls how many concurrent operations Incorta executes to gather metadata. |
| Default Schemas | EMPTY | <Text> A comma-separated list of schema names to use when a table name is missing its schema name. The listed schemas are searched for the table in list order, and the first schema that contains the table is used. For example, if you set this option to “HR,Sales”, a table missing a schema name is looked up in the “HR” schema first; if it is not found there, it is looked up in the “Sales” schema. |
| Cache * | OFF | <Toggle> Toggles caching of results and metadata in the SQL interface to serve recurrent queries efficiently. |
| Cache size (in gigabytes) * | 8 | <Number> Memory in gigabytes. Maximum memory allocated for the cache. If this limit is exceeded, the cache evicts the least recently used entries. |
| Cached query result max size * | 100000 | <Number> Maximum size of a single result set allowed in the cache. If a result set is larger than this limit, it is not cached even if the cache is enabled. |
| Enable cache auto refresh * | OFF | <Toggle> If enabled, cached results are refreshed automatically at a regular interval (see the next option). A refresh inspects the tables for changes; if anything has changed since the last check, the query is rerun and the new result set is cached. |
| Refresh cache interval (in minutes) * | 60 | <Number> Time in minutes between cache refresh operations. |
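
The lookup order described for Default Schemas can be sketched as follows. This is an illustration of the documented behavior, not Incorta's implementation: the `resolve_table` helper and the in-memory `catalog` of schemas and tables are hypothetical.

```python
from typing import Dict, Optional, Set


def resolve_table(name: str, default_schemas: str, catalog: Dict[str, Set[str]]) -> Optional[str]:
    """Qualify an unqualified table name using a comma-separated schema list.

    Schemas are searched in list order; the first one containing the table wins.
    Returns None when no listed schema contains the table.
    """
    if "." in name:  # already schema-qualified, nothing to resolve
        return name
    for schema in (s.strip() for s in default_schemas.split(",")):
        if schema and name in catalog.get(schema, set()):
            return f"{schema}.{name}"
    return None


# Hypothetical catalog for illustration.
catalog = {"HR": {"EMPLOYEES"}, "Sales": {"ORDERS", "EMPLOYEES"}}

print(resolve_table("EMPLOYEES", "HR,Sales", catalog))  # HR.EMPLOYEES
print(resolve_table("ORDERS", "HR,Sales", catalog))     # Sales.ORDERS
print(resolve_table("MISSING", "HR,Sales", catalog))    # None
```

Because the first match wins, the order of the list matters when the same table name exists in more than one schema, as with `EMPLOYEES` in this example.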