Spark Installation on Hadoop


In this tutorial I will show you how to secure Spark with Kerberos and SSL when it is integrated with YARN. I will use self-signed certificates for this example. Before you begin, ensure you have installed a Kerberos server and Hadoop.

This tutorial assumes your hostname is “hadoop”.

Create Kerberos Principals

cd /etc/security/keytabs/

sudo kadmin.local

#You can list existing principals
listprincs

#Create the following principals
addprinc -randkey spark/hadoop@REALM.CA

#Create the keytab file.
#You will need this for Spark to be able to log in
xst -k spark.service.keytab spark/hadoop@REALM.CA
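
Type quit to leave kadmin.local. You can then confirm the keytab contains the principal you just exported:

sudo klist -kt /etc/security/keytabs/spark.service.keytab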

Set Keytab Permissions/Ownership

sudo chown root:hadoopuser /etc/security/keytabs/*
sudo chmod 750 /etc/security/keytabs/*
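
You can double-check the ownership and permissions before moving on:

ls -l /etc/security/keytabs/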

Download

Go to the Apache Spark download page and get the link for the release you want. I am using Spark 2.4.4 built for Hadoop 2.7.

wget http://apache.forsale.plus/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
sudo mv spark-2.4.4-bin-hadoop2.7 /usr/local/spark
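
Optionally verify the download and confirm the unpacked build runs:

#Compare against the checksum published on the Apache download page
sha512sum spark-2.4.4-bin-hadoop2.7.tgz

#Print the Spark version from the new install
/usr/local/spark/bin/spark-submit --version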

Update .bashrc

sudo nano ~/.bashrc

#Ensure we have the following in the Hadoop section
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

#Add the following

#SPARK VARIABLES START
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
#SPARK VARIABLES STOP

source ~/.bashrc
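
A quick check that the new variables took effect:

echo $SPARK_HOME
which spark-submit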

Setup Configuration

cd /usr/local/spark/conf
mv spark-defaults.conf.template spark-defaults.conf
nano spark-defaults.conf

#Add to the end
spark.master                            yarn
spark.yarn.historyServer.address        ${hadoopconf-yarn.resourcemanager.hostname}:18080
spark.yarn.keytab                       /etc/security/keytabs/spark.service.keytab
spark.yarn.principal                    spark/hadoop@REALM.CA
spark.yarn.access.hadoopFileSystems     hdfs://NAMENODE:54310
spark.authenticate                      true
spark.driver.bindAddress                0.0.0.0
spark.authenticate.enableSaslEncryption true
spark.eventLog.enabled                  true
spark.eventLog.dir                      hdfs://NAMENODE:54310/user/spark/applicationHistory
spark.history.fs.logDirectory           hdfs://NAMENODE:54310/user/spark/applicationHistory
spark.history.fs.update.interval        10s
spark.history.ui.port                   18080

#SSL
spark.ssl.enabled                       true
spark.ssl.keyPassword                   PASSWORD
spark.ssl.keyStore                      /etc/security/serverKeys/keystore.jks
spark.ssl.keyStorePassword              PASSWORD
spark.ssl.keyStoreType                  JKS
spark.ssl.trustStore                    /etc/security/serverKeys/truststore.jks
spark.ssl.trustStorePassword            PASSWORD
spark.ssl.trustStoreType                JKS
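
The SSL settings above assume the keystore and truststore already exist (you may have created them when securing Hadoop). If not, a minimal self-signed setup with keytool could look like the following; the alias, CN, validity, and PASSWORD values are placeholders to adapt:

sudo mkdir -p /etc/security/serverKeys

#Generate a self-signed key pair in the keystore
sudo keytool -genkeypair -alias spark -keyalg RSA -keysize 2048 -validity 365 \
-dname "CN=hadoop" -keystore /etc/security/serverKeys/keystore.jks \
-storepass PASSWORD -keypass PASSWORD

#Export the certificate and import it into the truststore
sudo keytool -exportcert -alias spark -keystore /etc/security/serverKeys/keystore.jks \
-storepass PASSWORD -file /etc/security/serverKeys/spark.cer
sudo keytool -importcert -alias spark -file /etc/security/serverKeys/spark.cer \
-keystore /etc/security/serverKeys/truststore.jks -storepass PASSWORD -noprompt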

Kinit

kinit -kt /etc/security/keytabs/spark.service.keytab spark/hadoop@REALM.CA
klist
hdfs dfs -mkdir /user/spark/
hdfs dfs -mkdir /user/spark/applicationHistory
hdfs dfs -ls /user/spark
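
If the directories end up owned by a different (superuser) principal, you may need to hand them over to the spark user; adjust the owner and group below to match your setup:

hdfs dfs -chown -R spark:hadoopuser /user/spark
hdfs dfs -chmod -R 770 /user/spark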

Start The Service

$SPARK_HOME/sbin/start-history-server.sh
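
You can confirm the history server came up (it shows as HistoryServer in jps) and check its log if it did not:

jps | grep HistoryServer

#Log file name varies with user and host
tail -n 50 $SPARK_HOME/logs/spark-*-org.apache.spark.deploy.history.HistoryServer-*.out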

Stop The Service

$SPARK_HOME/sbin/stop-history-server.sh

Spark History Server Web UI

With the service running, browse to the history server UI on the port configured above (18080). Note that because spark.ssl.enabled is set, the UI is served over HTTPS rather than plain HTTP.
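
To see an application appear in the UI, you can submit a quick test job to YARN; this assumes the stock SparkPi example jar that ships with this Spark build:

spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn --deploy-mode cluster \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.4.4.jar 10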

References

I used a lot of different resources and reference material for this tutorial. Below are just a few of them.

https://spark.apache.org/docs/latest/running-on-yarn.html#configuration

https://spark.apache.org/docs/latest/security.html

https://www.linode.com/docs/databases/hadoop/install-configure-run-spark-on-top-of-hadoop-yarn-cluster/