How to Set Up a Multi-Node Apache Spark Cluster with Quobyte

This tutorial walks you through installing Apache Spark on multiple nodes for distributed processing, with Quobyte as the storage backend.

The initial preparation of the cluster is the same as with Apache Hadoop. Please follow steps 1 through 4 of our Hadoop tutorial to prepare your nodes and install and configure Quobyte.

Once the basic cluster setup and Quobyte installation are done, you can start with Apache Spark. Spark knows two roles for your machines: master node and worker node. The master coordinates the distribution of work, and the workers do the actual work.

By default, a Spark cluster has no security. Make sure your cluster is not accessible from the outside world.

Install Apache Spark

Log in to the master node as user hadoop to install Apache Spark. We tested this tutorial with version 3.1.2:

Download Spark "Pre-built for Apache Hadoop" from the Apache Spark download page. Copy the URL for the suggested mirror and download the archive on the master node, e.g. with wget
Extract the tar archive
Rename the directory to spark
mv spark-3.1.2-bin-hadoop3.2 spark
Change into the spark directory
cd spark
Copy configuration templates
cp conf/ conf/
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
cp conf/workers.template conf/workers
Add the following line to conf/ and replace the IP with the IP of your master node:
Edit conf/spark-defaults.conf and add the following lines. Make sure to replace the registry address with yours!
Add your workers (one IP per line) to conf/workers. You can remove the line with localhost if you don't want the master node to also act as a worker.
Log into any node with a Quobyte client, e.g. the master or any worker. Run the following command to download the contents of the book Peter Pan as sample data for our demo.
cd /quobyte/demo-1
mkdir demo-data
cd demo-data
wget -O peter-pan.txt
Time for the first test! Run bin/spark-shell locally on the master node to check that your Quobyte driver is working as expected. Execute the following code snippet. It should output the number of characters in each Spark partition:
val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt")
// Runs once per partition and sums the length of every line in it
textFile.foreachPartition(iter => {
  var i = 0
  while (iter.hasNext) { i += iter.next().length }
  println("Num characters in partition: " + i)
})
If your test was successful, you can continue: pack the configured Spark directory and copy it to the worker nodes:
cd ..
tar czf spark-configured.tgz spark
# Skip comment lines so that only worker addresses are used as hosts
for h in $(grep -v '^#' /home/hadoop/spark/conf/workers); do
  scp spark-configured.tgz "$h":
  ssh "$h" "tar xzf spark-configured.tgz"
done
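Start the Spark master with the start-master.sh script from Spark's sbin directory:
/home/hadoop/spark/sbin/start-master.sh
This will output something similar to this: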
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-quobyte-hadoop-4l3h.out
Check the log file to make sure the Spark master is running. It also shows the URL of the web interface and the master URL. Copy the master URL; you'll need it later.
Start the Spark workers with the start-workers.sh script. This script logs into the worker nodes via ssh (that's why we configured ssh keys) and starts the workers:
/home/hadoop/spark/sbin/start-workers.sh
Open your browser and go to the Spark web interface (see the master log for the URL). You should see your workers listed there.
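If a worker does not show up in the web interface, one possible cause is that passwordless ssh from the master to that host is not working. The following is a minimal sketch for checking this, not part of the tested tutorial. The function names list_workers and check_worker_ssh are hypothetical, and the sketch assumes that conf/workers contains one host per line plus optional comment lines starting with #.

```shell
# Hypothetical helper: print the hosts from a workers file,
# skipping blank lines and comment lines that start with '#'.
list_workers() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Hypothetical helper: try a non-interactive ssh login to every worker.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_worker_ssh() {
  for h in $(list_workers "$1"); do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" true; then
      echo "$h ok"
    else
      echo "$h FAILED"
    fi
  done
}

# Example usage on the master node:
#   check_worker_ssh /home/hadoop/spark/conf/workers
```

Hosts reported as FAILED need their ssh keys fixed before the workers can be started on them.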

Run a Distributed Test

Now that your multi-node Spark cluster with Quobyte is up and running, we can run a distributed Spark job:

Run spark-shell with the master option. In this mode, the Spark shell talks to the master to run the actual computation on the worker nodes. Replace MASTER_URL with your master URL (see above):

bin/spark-shell --master MASTER_URL
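
If you did not copy the master URL earlier, it can usually be recovered from the master log. The following is a minimal sketch rather than part of the tested tutorial. The helper name extract_master_url is hypothetical, and it assumes the log mentions the master URL in the spark://host:port form.

```shell
# Hypothetical helper: print the first spark://host:port URL found in a log file.
# Assumes the master log contains its URL in the spark://host:port form.
extract_master_url() {
  grep -o 'spark://[^[:space:]]*' "$1" | head -n 1
}

# Example usage (log path as printed by the start-master output; adjust to yours):
#   MASTER_URL=$(extract_master_url /home/hadoop/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-quobyte-hadoop-4l3h.out)
#   bin/spark-shell --master "$MASTER_URL"
```

If the helper prints nothing, the log contains no such URL and the master has probably not started correctly.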

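Run a distributed word count to see how many words are in the book Peter Pan. Enter the snippet with the shell's :paste command so that the chained calls are read as one expression:

val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
// Transformations are lazy; the actions below trigger the distributed job
println("Distinct words: " + counts.count())
println("Total words: " + counts.values.reduce(_ + _))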
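
As a rough sanity check on the Spark word count, the same file can be counted locally with standard tools on any node where the Quobyte volume is mounted. This is only a sketch: the function names count_words_local and count_distinct_words_local are made up for illustration. Because these helpers skip empty tokens while split(" ") keeps them, the numbers may differ slightly from the Spark job.

```shell
# Hypothetical helper: count space-separated words in a text file.
# Puts every token on its own line, then counts the non-empty lines.
count_words_local() {
  tr ' ' '\n' < "$1" | grep -c .
}

# Hypothetical helper: count distinct space-separated words in a text file.
count_distinct_words_local() {
  tr ' ' '\n' < "$1" | sort -u | grep -c .
}

# Example usage with the demo file downloaded earlier:
#   count_words_local /quobyte/demo-1/demo-data/peter-pan.txt
#   count_distinct_words_local /quobyte/demo-1/demo-data/peter-pan.txt
```

A large gap between the local numbers and the Spark output would suggest that the job read a different file or only part of it.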


You now have a multi-node Spark cluster with Quobyte. You can use any of the interfaces Quobyte offers to load data into your Quobyte cluster: the file system driver for Linux or Windows, or the S3/object interface. Thanks to Quobyte's scalable performance, you can run Spark jobs across hundreds or thousands of worker nodes without any storage bottlenecks.
