SparkX Access to Cloud Storage

Context

This content applies to 2024.1.x On-Premises installations.

Starting 2024.7.x, the unified Spark version bundled with Incorta includes the required JAR files under /<incorta_installation_path>/IncortaNode/spark/jars.

Starting 2024.1.3, SparkX requires access to the Parquet files to enable the discovery of non-optimized tables via the Advanced SQLi. For Cloud installations and On-Premises installations on local servers, SparkX can access Parquet files automatically once the Advanced SQLi is configured. However, for On-Premises tenants that use a cloud storage file system, such as Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Storage (GCS), you must manually set additional configurations to allow SparkX to access Parquet files on these cloud storage services.

Note

These configurations are required to allow Incorta to monitor the queries run via the Advanced SQLi against tenants stored on these cloud storage services.

Restart all services

After setting the required configurations, you must restart all services, including the Loader, Analytics, and Advanced SQLi services. You must also restart SparkX by running ./stopSparkX.sh and then ./startSparkX.sh, as shown in the sketch below.
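
A minimal restart sketch follows; it assumes you run the scripts from the directory that contains them, which depends on your installation layout:

  # Restart SparkX so it picks up the new configurations.
  # Run from the directory that contains the SparkX start/stop scripts.
  ./stopSparkX.sh
  ./startSparkX.sh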

GCS configurations

Here are the steps required to allow SparkX to access Parquet files on Google Cloud Storage; a consolidated shell sketch follows the list:

  1. Copy the core-site.xml file (from /<incorta_installation_path>/IncortaNode/runtime/lib/, for example) to /IncortaNode/sparkX/conf.
  2. Copy the gcs-connector-hadoop3-2.2.11-shaded.jar from the /IncortaNode/spark/jars directory to /IncortaNode/sparkX/custom-jars.
  3. Navigate to the following path: /<incorta_installation_path>/IncortaNode/kyuubi/services/<service_GUID>/conf/ and add the following configurations to kyuubi-defaults.conf:
    • spark.driver.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
    • spark.executor.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
  4. Restart the services.
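
The following shell sketch consolidates steps 1 through 3. It is an illustration only: it reuses the <incorta_installation_path> and <service_GUID> placeholders from above, which you must replace with your actual values before running it:

  # Sketch of the GCS steps above; replace the placeholders before running.
  INCORTA_HOME="<incorta_installation_path>"   # root of the Incorta installation
  SERVICE_GUID="<service_GUID>"                # GUID of the Advanced SQLi service

  # Step 1: copy core-site.xml into the SparkX conf directory
  cp "$INCORTA_HOME/IncortaNode/runtime/lib/core-site.xml" \
     "$INCORTA_HOME/IncortaNode/sparkX/conf/"

  # Step 2: copy the GCS connector jar into the SparkX custom-jars directory
  cp "$INCORTA_HOME/IncortaNode/spark/jars/gcs-connector-hadoop3-2.2.11-shaded.jar" \
     "$INCORTA_HOME/IncortaNode/sparkX/custom-jars/"

  # Step 3: add the custom jars to the Spark driver and executor classpaths
  {
    echo "spark.driver.extraClassPath=$INCORTA_HOME/IncortaNode/sparkX/custom-jars/*"
    echo "spark.executor.extraClassPath=$INCORTA_HOME/IncortaNode/sparkX/custom-jars/*"
  } >> "$INCORTA_HOME/IncortaNode/kyuubi/services/$SERVICE_GUID/conf/kyuubi-defaults.conf"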

Azure configurations

Here are the steps required to allow SparkX to access Parquet files on Microsoft Azure; a shell sketch follows the list:

  1. Copy the core-site.xml file (from /<incorta_installation_path>/IncortaNode/runtime/lib/, for example) to /IncortaNode/sparkX/conf.
  2. Copy the following .jar files from /IncortaNode/spark/jars to /IncortaNode/sparkX/custom-jars:
    • azure-data-lake-store-sdk-2.3.9.jar
    • azure-keyvault-1.0.0.jar
    • azure-storage-7.0.1.jar
    • hadoop-azure-3.3.4.jar
    • hadoop-azure-datalake-3.3.4.jar
  3. Navigate to the following path: /<incorta_installation_path>/IncortaNode/kyuubi/services/<service_GUID>/conf/ and add the following configurations to kyuubi-defaults.conf:
    • spark.driver.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
    • spark.executor.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
  4. Restart the services.
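
Steps 1 and 3 can be scripted exactly as in the GCS sketch above; the sketch below covers only the Azure jar copy in step 2, again using the <incorta_installation_path> placeholder:

  # Step 2 for Azure: copy the required Azure jars into SparkX custom-jars.
  INCORTA_HOME="<incorta_installation_path>"   # root of the Incorta installation
  for jar in \
      azure-data-lake-store-sdk-2.3.9.jar \
      azure-keyvault-1.0.0.jar \
      azure-storage-7.0.1.jar \
      hadoop-azure-3.3.4.jar \
      hadoop-azure-datalake-3.3.4.jar; do
    cp "$INCORTA_HOME/IncortaNode/spark/jars/$jar" \
       "$INCORTA_HOME/IncortaNode/sparkX/custom-jars/"
  done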

AWS configurations

Here are the steps required to allow SparkX to access Parquet files on AWS; a shell sketch follows the list:

  1. Copy the core-site.xml file (from /<incorta_installation_path>/IncortaNode/runtime/lib/, for example) to /IncortaNode/sparkX/conf.
  2. Copy the following .jar files from /IncortaNode/spark/jars to /IncortaNode/sparkX/custom-jars:
    • aws-java-sdk-1.12.262.jar
    • aws-java-sdk-core-1.12.262.jar
    • aws-java-sdk-dynamodb-1.12.262.jar
    • aws-java-sdk-s3-1.12.262.jar
    • hadoop-aws-3.3.4.jar
  3. Navigate to the following path: /<incorta_installation_path>/IncortaNode/kyuubi/services/<service_GUID>/conf/ and add the following configurations to kyuubi-defaults.conf:
    • spark.driver.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
    • spark.executor.extraClassPath=<incorta_installation_path>/IncortaNode/sparkX/custom-jars/*
  4. Restart the services.
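
As with Azure, steps 1 and 3 match the GCS sketch above; this sketch covers only the AWS jar copy in step 2:

  # Step 2 for AWS: copy the required AWS jars into SparkX custom-jars.
  INCORTA_HOME="<incorta_installation_path>"   # root of the Incorta installation
  for jar in \
      aws-java-sdk-1.12.262.jar \
      aws-java-sdk-core-1.12.262.jar \
      aws-java-sdk-dynamodb-1.12.262.jar \
      aws-java-sdk-s3-1.12.262.jar \
      hadoop-aws-3.3.4.jar; do
    cp "$INCORTA_HOME/IncortaNode/spark/jars/$jar" \
       "$INCORTA_HOME/IncortaNode/sparkX/custom-jars/"
  done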