Configure Spark to Work with Hadoop on Windows
If you plan to use Spark on Windows, you must perform the following steps to configure Spark to work with Hadoop on your Windows machine. On a Linux machine, no additional steps are required to use Spark.
Requirements:
- Visual C++ 2015 Redistributable (VC++ 2015)
- Spark without Hadoop
- Hadoop 3.2
To configure external Spark to work with Hadoop on Windows:
- Install Incorta, but do not start the Cluster Management Console (CMC).
- Copy winutils.exe and hadoop.dll to the bin folder of Hadoop 3.2.
- Set the HADOOP_HOME environment variable to the Hadoop 3.2 folder.
- Add %HADOOP_HOME%\bin to the PATH environment variable.
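For example, from a Command Prompt, a minimal sketch assuming Hadoop 3.2 is extracted to C:\hadoop-3.2.0 (adjust the path to your installation):

```
rem Apply to the current session, then persist for new terminals
set HADOOP_HOME=C:\hadoop-3.2.0
setx HADOOP_HOME "%HADOOP_HOME%"
rem setx truncates values over 1024 characters; for a long PATH,
rem edit the variable through System Properties instead
setx PATH "%PATH%;%HADOOP_HOME%\bin"
```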
- In a terminal, browse to the Hadoop 3.2 bin directory, then run hadoop classpath.
- Copy the classpath value to a text file.
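For example, from the same terminal (the %TEMP% file name here is arbitrary):

```
cd /d %HADOOP_HOME%\bin
hadoop classpath > %TEMP%\hadoop-classpath.txt
```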
- Run hostname.
- Copy the hostname value to a text file. You will use this value wherever (hostname) appears below.
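For example (again, the file name is arbitrary):

```
hostname > %TEMP%\spark-hostname.txt
```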
- Copy the output to spark-env.sh, which should look like this:

```
set SPARK_PUBLIC_DNS=(hostname)
set SPARK_MASTER_IP=(hostname)
set SPARK_MASTER_PORT=7077
set SPARK_MASTER_WEBUI_PORT=9091
set SPARK_WORKER_PORT=7078
set SPARK_WORKER_WEBUI_PORT=9092
set SPARK_WORKER_MEMORY=8g
set SPARK_DIST_CLASSPATH=(value of hadoop classpath copied as is)
```
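For illustration only, here is the same file with hypothetical values filled in: a machine named WIN-SPARK01 and a heavily abbreviated classpath (your hadoop classpath output will be much longer and must be pasted as a single line):

```
set SPARK_PUBLIC_DNS=WIN-SPARK01
set SPARK_MASTER_IP=WIN-SPARK01
set SPARK_MASTER_PORT=7077
set SPARK_MASTER_WEBUI_PORT=9091
set SPARK_WORKER_PORT=7078
set SPARK_WORKER_WEBUI_PORT=9092
set SPARK_WORKER_MEMORY=8g
set SPARK_DIST_CLASSPATH=C:\hadoop-3.2.0\etc\hadoop;C:\hadoop-3.2.0\share\hadoop\common\lib\*;C:\hadoop-3.2.0\share\hadoop\common\*
```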
- In the sbin folder of Spark 2.4.3, create two cmd files, as shown in the sketch below:
  - start-master.cmd with the content: ../bin/spark-class org.apache.spark.deploy.master.Master
  - start-slave.cmd with the content: ../bin/spark-class org.apache.spark.deploy.worker.Worker spark://(hostname):7077
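Written out, the two files look like this (both live in sbin, so the relative path points one level up into the bin folder):

```
rem sbin\start-master.cmd
../bin/spark-class org.apache.spark.deploy.master.Master
```

```
rem sbin\start-slave.cmd
../bin/spark-class org.apache.spark.deploy.worker.Worker spark://(hostname):7077
```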
- Run start-master.cmd.
- Run start-slave.cmd.
- In the CMC, install the Loader and Analytics services.
- For Spark, select the external version and use spark://(hostname):7077 as the master URL.
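As an optional sanity check (a sketch, assuming Spark's bin folder is on your PATH), point a Spark shell at the standalone master you just started, substituting the hostname you recorded earlier:

```
spark-shell --master spark://(hostname):7077
```

You can also browse to http://(hostname):9091, the SPARK_MASTER_WEBUI_PORT set above, to confirm that the master is up and the worker has registered.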