The initial preparation of the cluster is the same as with Apache Hadoop. Please follow steps 1 through 4 of our Hadoop tutorial to prepare your nodes and install and configure Quobyte.
Once the basic cluster setup and Quobyte installation is done, you can start with Apache Spark. Spark knows two roles for your machines: master node and worker node. The master coordinates the distribution of work, and the workers do the actual work.
By default a Spark cluster has no security. Make sure your cluster is not accessible from the outside world.
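How you lock the cluster down depends on your environment. As a rough sketch, assuming Ubuntu hosts with ufw and the 10.128.0.0/24 internal subnet used in the examples below, you could allow traffic only from that subnet on every node:
# Hypothetical firewall setup - adjust the subnet and tooling to your environment,
# and make sure you can still reach the node from your admin network before enabling
sudo ufw default deny incoming
sudo ufw allow from 10.128.0.0/24
sudo ufw enable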
Install Apache Spark
Log into the master node as user hadoop to install Apache Spark. We tested this tutorial with version 3.1.2:
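For example, assuming the master node IP used in the examples of this tutorial (10.128.0.3), the login would look like this:
ssh hadoop@10.128.0.3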
-
Download Spark
Download Spark "Pre-built for Apache Hadoop" from https://spark.apache.org/downloads.html. Copy the URL for the suggested mirror and download on the master node, e.g. with
wget
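Your mirror URL will differ; as an illustration, the release is also kept in the Apache archive, so a download could look like this:
# Example only - use the mirror URL suggested on the download page if possible
wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz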
-
Extract the tar archive
tar xzf spark-3.1.2-bin-hadoop3.2.tgz
-
Rename the directory to spark
mv spark-3.1.2-bin-hadoop3.2 spark
-
Change into the spark directory
cd spark
-
Copy configuration templates
cp conf/spark-env.sh.template conf/spark-env.sh
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
cp conf/workers.template conf/workers
-
Add the following line to conf/spark-env.sh and replace the IP with the IP of your master node:
SPARK_MASTER_HOST=10.128.0.3
-
Edit conf/spark-defaults.conf and add the following lines. Make sure to replace the registry address with yours and set the volume to the Quobyte volume you will use for the demo data!
spark.hadoop.fs.quobyte.impl=com.quobyte.hadoop.interfaces.QuobyteFileSystemAdapter
spark.hadoop.com.quobyte.hadoop.backend=JNI
spark.hadoop.com.quobyte.hadoop.registry=YOUR_QNS_ID.myquobyte.net
spark.hadoop.com.quobyte.hadoop.volume=test-1
-
Add your workers (one IP per line) to conf/workers. You can remove the line with localhost if you don't want the master node to also act as a worker.
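For illustration, a conf/workers file with two dedicated workers could look like this (the addresses are placeholders, use your own worker IPs):
# conf/workers - one worker host or IP per line
10.128.0.4
10.128.0.5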
-
Log into any node with a Quobyte client, e.g. the master or any worker. Run the following commands to download the contents of the book Peter Pan as sample data for our demo.
cd /quobyte/demo-1
mkdir demo-data
cd demo-data
wget -O peter-pan.txt https://www.gutenberg.org/files/16/16-0.txt
-
Time for the first test! Run bin/spark-shell locally on the master node to check that your Quobyte driver is working as expected. Execute the following code snippet; it should output the number of characters in each Spark partition:
sc.setLogLevel("INFO") val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt") textFile.foreachPartition(iter => { var i = 0; while (iter.hasNext) { i += iter.next().length; }; println("Num characters in partition: " + i); })
-
If your test was successful, pack the configured spark directory and copy it to the worker nodes:
cd ..
tar czf spark-configured.tgz spark
for h in `cat /home/hadoop/spark/conf/workers`
do
  scp spark-configured.tgz $h:
  ssh $h "tar xzf spark-configured.tgz"
done
-
Start the Spark master:
sbin/start-master.sh
This will output something similar to this:
starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-quobyte-hadoop-4l3h.out
-
Check the log file to make sure the Spark master is running. It will also show you the URL for the web interface and the master URL. Copy the master URL; you'll need it later.
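If you don't want to scroll through the whole file, a quick grep usually surfaces both addresses. This is just a convenience sketch; the exact log wording can vary between Spark versions:
# Pull the master URL (spark://...) and the web UI address out of the master log
grep -E "Starting Spark master|MasterWebUI" logs/spark-*Master*.out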
-
Start the Spark workers. This script will log into the worker nodes via ssh (that's why we configured ssh keys) and start the workers:
sbin/start-workers.sh
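To verify that a Worker process actually came up on every node, you can run jps (part of the JDK) over ssh; this is an optional sanity check:
# Look for a "Worker" entry in the Java process list of each worker node
for h in $(cat conf/workers); do
  echo "== $h =="
  ssh "$h" jps
done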
-
Open your browser and go to the Spark web interface (see the master log for the URL). You should see your workers listed there.
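If you prefer the command line, worker registrations also show up in the master log; again, the exact message text may differ between Spark versions:
# Each successfully registered worker produces a "Registering worker" line
grep "Registering worker" logs/spark-*Master*.out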
Run a Distributed Test
-
Now that your multi-node Spark cluster with Quobyte is up and running, we can run a distributed Spark job:
Run spark-shell with the master option. In this mode the Spark shell talks to the master to run the actual computation on the worker nodes. Replace MASTER_URL with your master URL (see above):
bin/spark-shell --master MASTER_URL
-
Run a distributed word count to see how many distinct words are in the book Peter Pan:
val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt");
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
println(counts.count())
-
Congratulations
You now have a multi-node Spark cluster with Quobyte. You can take advantage of all the interfaces Quobyte offers to easily load data into your Quobyte cluster via the file system driver for Linux or Windows, or the S3/object interface. Thanks to Quobyte’s scalable performance you can run Spark jobs across hundreds or thousands of worker nodes without any storage bottlenecks.