You are viewing content for 4.5 | 4.4 | 4.3 | Previous Releases


Configure Standalone Spark

Standalone Spark is bundled with Incorta so no installation is necessary. You can configure standalone Spark to meet your requirements.

By default, Spark is installed on the same node as Incorta; however, you must install Spark on a separate server in most cases.

  1. Verify that the following files use the correct machine name, CNAME, localhost or IP address. These files can be found from the <incorta_home>/IncortaNode directory.

    • spark/conf/spark-env.sh - SPARK_MASTER_IP=<ip-address-or-DNS-name>
    • spark/conf/spark-defaults.conf - spark.master value: spark://<ip-address-or-DNS-name>:7077
  2. Verify that the Spark master URL is set in the Admin UI under System Configuration > Server Configs > Spark Integration > Spark master URL.
  3. Create a Spark home environment variable. In Linux, set the environment variable in your .bash_profile file. For example, export SPARK_HOME=/home/incorta/IncortaAnalytics/IncortaNode/spark.
  4. Configure options in Spark. For details, see Configuration Options.
  5. Start and Validate Spark. For details, see Start and Validate Standalone Spark.
  6. Test Spark. For details, see Test Spark.

Configuration Options

Set the following configuration option in the <incorta_home>/IncortaNode/spark/conf/spark-env.sh file:

  • SPARK_WORKER_MEMORY: For standalone Spark, this sets the total memory that can be used by the Spark instance on this machine.

Set the following configuration option in the <incorta_home>/IncortaNode/spark/conf/spark-defaults.conf file:

  • spark.cores.max: The maximum number of cores an executor can use to run a job.
  • spark.executor.cores: The number of cores assigned to each executor. This number must be equal to or less than spark.cores.max and the total number of executors allowed to run — floor(spark.cores.max / spark.executor.cores)
  • spark.sql.shuffle.partitions: Defaults to 200. The best setting can vary depending on the Materialized View. You can find it by testing. If this value is set too low and data volume is high, the executor may not have enough memory to complete the job.
  • spark.driver.memory: The driver Incorta services use to communicate with Spark. You may need to increase this value from the default of 4GB for higher data volumes.

Set the following optional configuration options:

  • spark.worker.cleanup.enabled: Enables periodic cleanup of worker and application directories. Disabled by default. Set to true to enable it.
  • spark.worker.cleanup.interval: The frequency, in seconds, that the worker cleans up old application work directories. The default is 30 minutes. Modify the value as you deem appropriate.
  • spark.worker.cleanup.appDataTtl: Controls how long, in seconds, to retain application work directories. The default is 7 days, which is generally inadequate if Spark jobs are run frequently. Modify the value as you deem appropriate.
  • Spark tmp directory: Incorta works best when you use a local disk instead of a shared mounted NFS disk to store temporary files. Ensure that the tmp folder has at least 5GB.

Start and Validate Standalone Spark

After you change the Spark master URL in the Admin UI under System Configuration > Server Configs > Spark Integration > Spark master URL:

  1. Restart all Loader and Analytics services.
  2. Use the startSpark.sh script to start the Master and Worker Spark processes.
  3. Validate the Spark Master by opening the Master Web UI in a browser: http://<ip-address-or-DNS-name>:9091 and confirm the page loads and a worker is running:

spark22

Test Spark

Once you have Spark up and running, create a materialized view in Incorta that uses a pySpark script to test that Spark is working. You must have at least one schema and a table defined to perform this test.

  1. Navigate to the Schemas tab of the Schema page.
  2. Select the schema where you want to add a materialized view.
  3. Select + New, then Materialized View.
  4. Select Python from the Language drop-down list.
  5. Paste the following script in the Script field. The most basic Spark Python script reads a table in Incorta and saves it, duplicating the table.
df = read("SCHEMA.TABLE")
save(df)
  1. Select Add Property to provide the “key” and “value” in the respective fields.
  2. Select Add.
  3. Perform a full load on the materialized view to start the Spark job and load the materialized view table (Load Table).
  4. Verify that the materialized view contains the same number of rows/columns as the source SCHEMA.TABLE that you duplicated.

Configure Standalone Spark on Separate Server

Standalone Spark performs optimally on a dedicated server. To move or install the bundled Spark version on a new node, perform the following steps:

  1. Set up Incorta and Spark nodes with a shared disk accessible by both nodes.
  2. Install Incorta on Incorta-Node.
  3. Create a Tenant in Incorta and specify the tenant folder path to be on a shared disk.
  4. Copy SPARK_HOME from Incorta-Node to the shared disk on Spark-Node.
  5. Verify the values in the Spark configuration files the same way you verify the values for a single node installation.
  6. Start the Spark master and worker servers from Spark-Node, not from Incorta-Node.

Port Requirements

By default Spark is installed on the same node as Incorta. However for performance improvement, Incorta and Spark can be installed on separate nodes.

The following table outlines the ports you can use for Spark. All are incoming/outgoing with no installation parameter.

Port # Purpose Notes
7077 Master Port If Spark exists on a remote server, this port must be open for remote access.
7078 Worker Port If Spark exists on a remote server, this port must be open for remote access.
8080 MasterWebUI port
8081 workerWebUI port
9091 Spark Master Web UI Port For Spark master, this port must be open for remote access to monitor the Spark jobs and logs.
9092 Spark Worker Web UI Port For Spark worker, this port must be open for remote access to monitor the Spark jobs and logs.

Minimum Configuration Setup

Configuring Spark to deliver optimum performance heavily depends on the environment it’s running on, i.e. resources available to use: cores, memory, disk, … etc. To make sure you avoid the most common problems, you need to make sure that at minimum you have the following configurations set. Disk Space for Spark Working Directory Spark relies on disk to shuffle data among executors while executing a query, so, the larger the dataset, the more disk space will likely be required to accommodate those needs. Also it is highly recommended that this disk should be SSD for faster performance, and we recommend a minimum of 500GB free space for a standard production deployment.

Worker Memory

Worker memory is the entire memory space available for Spark executors to allocate from upon initiation. It needs to be large enough to accommodate all required executors. This amount should be larger than the collective amount of memory allocated for the application and driver.

If Spark and Incorta are running on the same machine, make sure that memory allocated to Incorta + memory allocated to Spark (worker memory) + other necessary memory spaces (for the OS and for Spark Driver application and MV’s Driver application) don’t exceed the physical memory of the machine. Both configurations can be set in SPARK_HOME/conf/spark-env.sh , you can  start by setting them like this:

  • SPARKWORKERMEMORY=1024g
  • SPARKLOCALDIRS=/path/to/directory

Where: /path/to/directory is any path to an empty directory on a disk that has at least 500GB of free disk space. Check the available disk space using the following command: df -lh  which returns an output that looks like this:

Filesystem              Size  Used Avail Use% Mounted on devtmpfs                961G   96K  961G   1% /dev tmpfs                   961G     0  961G   0% /dev/shm /dev/xvda1               50G   42G  7.1G  86% / /dev/xvdh               246G     0  600G   1% /disk-1

So, you can set the configuration as follows: SPARKLOCALDIRS=/disk-1. Incorta SQL App configuration Parameters

As explained in the Detailed Configuration section below, Incorta lets you control a lot of configuration, here’s a safe starting point for those configs:

  • SQL App Cores: 32 (number of cores allocated to all executors on all spark nodes)
  • SQL App executors: 16 (minimum 2 cores per executor)
  • SQL App Shuffle partitions: 32 (same as number of cores or double, recommended not to exceed 200)
  • SQL App Memory: 320 (Minimum 4GB per executor)
  • SQL App Driver Memory: 32 (between 4GB and 32GB, should be set to large value if you are expecting ETL queries that returns large datasets)
  • N.B.: Driver memory is allocated on the machine where the Spark app was submitted from, i.e. Incorta machine, so, make sure the machine has enough memory to accommodate the Driver app

Those values are just guidelines for production environments, you may want to change them according to the resources available, the datasets in question and the nature of the queries.

Detailed Configurations

Incorta admin page provides a number of configurations to control Spark allocated resources and communication with Spark. The admin page is located at: http:///incorta/admin/#/system-config. Spark configurations are found under two tabs: SQL Interfaceand Spark Integration. spark23

Some configurations can be modified and take effect without the need to restart the server, others can be modified but won’t take effect until Incorta server has been restarted. Configurations that require a server restart will be marked by an asterisk (*) in the tables below.

Advantages of Using Spark

Using Spark as your canonical data source, aggregating data from multiple, heterogeneous data sources, and serving data through a unified API, leverages Incorta powerful features, for example:

  • Incorta capacity to handle large volumes of data (versus accessing your original data sources directly)
  • Incorta capability to model joins across heterogeneous data coming from different sources
  • Boosting query times, harnessing the power of Incorta fast engine
  • You can send SQL queries directly to Spark, which will hit Incorta data files to gather result sets
  • Simplicity of Incorta business schemas, where users can define virtual schemas of tables and query using those schemas
  • Convenience of Incorta security filters, where you can restrict access to certain records in tables according to the user running the queries

Execution Modes

Incorta comes shipped with Spark in case you don’t have a running Spark cluster or a standalone Spark installed. You can use it for Standalonemode. That said, Spark can operate on external Spark installations, whether they’re Standaloneor Cluster. Appendix B contains more information about Spark deployment options.

Note that in case of using Cluster mode, you should make sure parquet files are stored in a shared medium, either an NFS, HDFS or a similar medium.

Spark can be set up to use a remote managed Spark cluster, for more information about those, see Cluster Setup.

Communication with Spark

How you communicate with Spark determines its behavior. Spark exposes two ports for clients to bind to. Depending on which port the client uses, the query will be routed as follows:

  • One port, will direct the query to a decision point where it’s either executed against Incorta engine or Spark according to the query structure
  • The other, will push queries directly to Spark Typically, queries running against Incorta engine will execute faster than those directed to Spark.

Possible Query Paths

Below, are diagrams showing possible paths taken by different queries during execution.

Case #1: Query Will Be Fulfilled by Incorta engine

spark18

Case #2: Query Sent to Incorta But Will Be Delegated to Spark

spark2

Case #3: Query Sent Directly to Spark

spark15

Caching

To accelerate recurring queries, Spark caches results of queries (matching criteria set in configuration). If the query result set is cached, and the data didn’t change since the cached results were captured, the query won’t be run and the cached result set will be returned immediately.

There is another option to refresh the cached result sets periodically. If set, users will always get data from cache, while the system will refresh the cache based on in the defined schedule.