Troubleshoot Spark

Troubleshoot common issues when using Spark with Incorta.

You might experience one or more of the following issues when you run queries in Incorta using Spark.

Before you troubleshoot using the common issues on this page, verify that you configured Spark and Incorta according to the minimum configuration standards.

Configuration Issues

The following configuration issues can occur when you use Spark with Incorta.

Problem: Disk Space

The disk space of one or more workers is full, so queries fail.

Symptoms

  • You send a query to Spark
  • The query fails with either:
  • A “No disk space left on device” error
  • A failure to get the metadata for a query due to a “No disk space left on device” error

Side effect: the Spark application restarts automatically, in the hope that cleaning itself up frees some disk space for the next run

Solution

You may encounter this problem in one of the following scenarios:

  • Spark is poorly configured, for example, the executors are given too little memory, so they spill to disk a lot, writing data over and over again
  • The Spark application has been running for a long time without cleanup, so it has accumulated a lot of logs and metadata
  • The disk space assigned to the worker machine is too small for the query at hand

Potential fixes

Inspect which worker disk is full, then try one or more of the following (see the sketch after this list):

  • Check whether the Spark working directory is on a disk with enough available space
  • This configuration is set in SPARK_HOME/conf/spark-defaults.conf and SPARK_HOME/conf/spark-env.sh
  • For more details, check this section
  • Free some space by deleting unneeded logs and metadata
  • Mount a new disk and add it as a Spark working directory
  • Tune the Spark configuration to be less prone to spilling to disk
  • Keep in mind that other applications may be running on the same Spark instance, so you don’t want to be greedy in consuming resources
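
A minimal sketch of these checks from the worker machine, assuming the working directory is configured through spark.local.dir in spark-defaults.conf or SPARK_LOCAL_DIRS in spark-env.sh (adjust the paths to your installation):

# Find where the Spark working directory is configured
grep -i "spark.local.dir" $SPARK_HOME/conf/spark-defaults.conf
grep -i "SPARK_LOCAL_DIRS" $SPARK_HOME/conf/spark-env.sh

# Check free space on the disk that holds that directory
df -h /path/to/spark/working/dir    # replace with the directory found above

# Find the largest work and log subdirectories that may be safe to clean up
du -sh $SPARK_HOME/work/* $SPARK_HOME/logs/* 2>/dev/null | sort -h | tail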

Problem: Spark Did Not Start

The Spark application failed to start due to port binding problems. By default, Spark binds to port 25925 (or a port you configured; see Spark Integration configurations). If this port is busy (another process is bound to it) or is not enabled in the first place, Incorta cannot start the Spark application.

Symptoms

  • You send a query to Spark
  • Returned error: “Connection error: [org.postgresql.Driver.connect]”

Solution

Possible causes

  • Another process is bound to the port
  • An earlier Spark application did not shut down cleanly (for example, while stopping or restarting Incorta) and is now a zombie process that occupies the port but serves no purpose because it is not connected to the running Incorta process
  • The port is not enabled

Potential fixes

  • Make sure the port is enabled
  • Check whether a Spark application is running and occupying the port by running the following command in the machine terminal: netstat -tupln | grep 25925 # or the port number you configured
  • If an instance is found, kill it using: kill PID # replacing PID with the process ID
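
Put together, the check-and-kill sequence might look like the following, assuming the default port 25925 (substitute the port you configured):

# Check whether a process is already bound to the Spark port
netstat -tupln | grep 25925

# If a stale Spark application holds the port, kill it after confirming the PID from the output above
kill <PID>

# If the process does not exit, force-kill it
kill -9 <PID>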

Problem: Failure Due to Memory Shortage

Queries sent to Spark may fail due to an out-of-memory problem in one of the executors.

Symptoms

  • Users run a query through Spark
  • Returned error: “Out of Memory Exception”

Side effects

  • Spark will kill the affected executor(s)

Solution

Possible causes

  • The Spark configuration is not suited to the queries being run, for example, executors are given too little memory to handle the query

Potential fixes

  • Increase the memory assigned to executors so that they can handle the query, keeping in mind the resources available on the worker machines and any other applications sharing the Spark instance
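
As an illustration only, executor memory is typically raised in SPARK_HOME/conf/spark-defaults.conf; the values below are placeholders and must be sized to your workload and worker hardware:

# SPARK_HOME/conf/spark-defaults.conf -- illustrative values only
spark.executor.memory    4g     # memory per executor
spark.executor.cores     2      # cores per executor

Remember that the total memory requested by all executors on a worker must fit within the memory available to that worker.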

Problem: External Shuffle Service Is Not Enabled

You have enabled dynamic allocation (elastic scaling) for Spark executors, but queries running against Spark fail.

Symptoms

  • A user sends a query that will be fulfilled by Spark
  • Query fails

Solution

Possible causes

  • External shuffle service is not running

Potential fixes

  • Run external shuffle service using: /path/to/spark/sbin/start-shuffle-service.sh
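
Dynamic allocation also expects the shuffle service to be enabled in the Spark configuration. A minimal sketch of the related settings in SPARK_HOME/conf/spark-defaults.conf:

# SPARK_HOME/conf/spark-defaults.conf
spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true

After enabling these settings, run the start-shuffle-service.sh script, as shown above, on each worker machine.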

Problem: The Connection Attempt Failed

Incorta cannot connect to Spark.

Symptoms

  • Queries fail with a “Connection attempt failed” error.

Solution

Possible causes

  • The Spark master host cannot be resolved
  • The OS limit for number of processes is lower than required
  • The OS limit for number of open files is lower than required

Potential fixes

  • Make sure the Spark master host is reachable; you may want to check /etc/hosts
  • Check the OS limits
  • For number of processes using: ulimit -u
  • For number of open files using: ulimit -n
  • If either limit is too small for the current workload, increase it. In Ubuntu, the file controlling these values is /etc/security/limits.conf
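
For example, you can inspect and raise the limits roughly as follows, assuming the Spark processes run as a user named spark (the user name and values are illustrative):

# Check the current limits for the user that runs Spark
ulimit -u    # max user processes
ulimit -n    # max open files

# Example entries in /etc/security/limits.conf (values are illustrative)
spark  soft  nproc   65535
spark  hard  nproc   65535
spark  soft  nofile  65535
spark  hard  nofile  65535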

Problem: Missing Python Libraries Error

If you receive an error message about missing Python libraries or modules, you may need to install the missing libraries or modules. To resolve this issue:

# from pip /bin
sudo pip install <module>
# from miniconda /bin
sudo conda install <module>
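
If the error persists after installing, confirm that you installed the module into the Python environment Spark actually uses; a minimal check, assuming PYSPARK_PYTHON is set in spark-env.sh and using pandas only as an example module:

# See which interpreter Spark is configured to use
grep -i "PYSPARK_PYTHON" $SPARK_HOME/conf/spark-env.sh

# Verify the module imports with that interpreter (pandas is only an example)
/path/to/python -c "import pandas; print(pandas.__version__)"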

Troubleshoot the Spark Environment

The following issues can occur with the Spark environment when you use Spark in Incorta.

Problem: All Materialized View jobs are failing

In Incorta UI, you see the following error (or similar):

Transformation error: INC_005005001:Failed to load data from spark://frc-incortatest05:7077 at <Materialized view> with properties [error, 2019-01-18 18:39:07 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable]

Solution

Try one of the following solutions to fix the issue:

  • Use the Incorta cluster management console (CMC) to see if the Spark cluster is running on the same server as Incorta.
  • Check whether Incorta and Spark can access the tenant folder. Log in to the Spark server machine to verify.
  • Check whether Incorta and Spark can connect to each other. Ping one from the other to find out.
  • Check that the ports, such as 7077, are open.
  • If a machine only supports IPv6, verify that Spark and Incorta can access each other using IPv6.
  • Check that you are using the same version of Spark that was shipped with your instance of Incorta. Run <Spark Home>/bin/spark-shell --master <Spark master URL> to verify.
  • The SPARK_HOME variable determines which Spark installation a machine uses. On the Incorta server, ensure that it is set in the .bash_profile of the user that installs and starts Incorta. On the Spark machine, set it for the user that installs and launches the Spark master and worker processes.
  • To use IPv6, set the following variable in .bash_profile: export _JAVA_OPTIONS='-Djava.net.preferIPv6Addresses=true'
  • Spark 2.3.0 and 2.3.1 require an additional fix to support IPv6:
    1. Navigate to <Spark Home>/python/lib.
    2. Create a folder named tmp.
    3. Unzip the file py4j-0.10.7-src.zip into the tmp folder.
    4. Edit the java_gateway.py file.
    5. Replace 127.0.0.1 with ::1.
    6. Zip the py4j folder back into the zip file.
    7. Move the fixed zip file back to <Spark Home>/python/lib.
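
As a rough sketch, the steps above can be scripted as follows, assuming the zip contains only the py4j folder; back up the original zip first:

cd <Spark Home>/python/lib
cp py4j-0.10.7-src.zip py4j-0.10.7-src.zip.bak     # keep a backup of the original
mkdir tmp
cd tmp
unzip ../py4j-0.10.7-src.zip                        # unzip into the tmp folder
sed -i 's/127.0.0.1/::1/g' py4j/java_gateway.py     # replace 127.0.0.1 with ::1
zip -r py4j-0.10.7-src.zip py4j                     # zip the py4j folder back into a zip file
mv py4j-0.10.7-src.zip ..                           # move the fixed zip file back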

Problem: You need to kill a Spark job

Solution

When you need to kill a materialized view Spark job that has already started, kill the schema load job from the Incorta UI. If you kill the Spark job from the Spark Web UI instead of in Incorta, the driver process can continue running on the Incorta machine.
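
If a driver process was left behind on the Incorta machine after killing a job from the Spark Web UI, one way to find and stop it (the grep pattern is only a starting point; verify the PID before killing):

# Look for leftover Spark driver processes on the Incorta machine
ps -ef | grep -i spark | grep -i driver

# After confirming the process is the orphaned driver, kill it
kill <PID>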

Problem: It is not clear if a Materialized View is running

Solution

In the Spark Web UI, you can see Spark jobs in one of two modes: WAITING or RUNNING. Running jobs produce log messages in the stderr file; check the latest timestamp to see whether the Spark job is making progress. If the Spark job is in WAITING mode, it could be waiting for an available resource. Check that the materialized view defines the executor cores, max cores, and executor memory. If it does not, check the defaults defined in the spark-defaults.conf file. You can either wait, or kill the Incorta schema job from the Incorta UI, adjust the definitions, and try again.
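
For reference, the relevant defaults live in SPARK_HOME/conf/spark-defaults.conf; the values below are purely illustrative and should be sized to your cluster:

# SPARK_HOME/conf/spark-defaults.conf -- illustrative values only
spark.executor.cores     2      # cores per executor
spark.cores.max          4      # max cores an application can use
spark.executor.memory    4g     # memory per executor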

Problem: Materialized View started, but is not visible in the Spark Web UI

You can see that the materialized view job started in Incorta UI, but there is no corresponding Spark job visible in Spark Web UI.

Solution

If you set the Always Compact option to off, materialized view jobs that show a "Started" status in the Incorta UI do not display in the Spark Web UI because Incorta is running compaction. To monitor compaction status, view the Incorta tenant log. The Incorta tenant log shows when the compaction started. Compaction builds indexes, which can take a long time for large tables. For compaction issues, try to add resources, like memory, to compaction jobs, and ensure that the Spill to Disk option is off.
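
One way to watch compaction activity is to follow the tenant log and filter for compaction entries; the log path and the grep pattern below are placeholders, so adjust them to your environment:

tail -f <tenant log file> | grep -i "compact"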

Problem: Unhelpful Error Saving or Running a Materialized View

Solution

Where the error occurs determines how you address the issue:

  • An error occurs on the Incorta job and job history page. When Incorta displays a red plus sign in the loader UI, click on it to see the error message. If you do not see a red plus sign, or if you cannot click on it to view an error message, navigate to Schemas > <Schema> > Last Load Status > Select a Job > Check the Job Details and select the red plus sign.
  • An error displays in the Incorta tenant log. Use the grep command to extract the specific log entries for a schema table (see the example after this list).
  • Spark Web UI. The Spark Web UI runs on the same machine as the Spark master. Find the Spark master URL you are using by navigating in the CMC to Select Clusters > <cluster_name> > CMC > Spark Integration. In a browser, try navigating to http://<spark master host>:9091 to view the Spark Master Web UI. Click on the Application ID of the task you are debugging. Click on stdout to display the log file.
  • An error displays in the Spark Master and Spark log files from the Spark machine. The issue may be caused by an environment issue. For example, the worker process crashed or is not connected to master. Navigate to the Spark home > logs directory to see the log files.
  • Check if Spark executors are created in the Spark machine by using the following command: ps -ef | grep spark
  • Check if the Spark driver is created. Check that the Python program is on the Loader machine. For a new materialized view, save the new materialized view, then check that the Python program is on the Analytics machine.
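
For the tenant log option above, a grep along the following lines narrows the output to a single schema table; the log path and names are placeholders:

grep -i "<schema_name>.<table_name>" <tenant log file> | tail -n 50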

Problem: Missing table

Solution

Spark materialized views and Incorta Spark SQL run against parquet files in the compacted folder under <Incorta Tenant directory>/compacted. If a table is missing, check that its parquet files exist in the compacted folder.
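
A quick check from the Spark or Incorta machine; the folder layout below assumes the compacted folder is organized by schema and table, so adjust it to what you see on disk:

# List the compacted parquet files for the table in question (layout is an assumption)
ls -lh <Incorta Tenant directory>/compacted/<schema_name>/<table_name>/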

Problem: A Spark materialized view displays differently in the Incorta UI and the Spark Web UI

Solution

Check the following areas:

  • Incorta Job
  • Spark Driver process
  • Spark executor process

A job may be created but not yet submitted to Spark.