Tutorial

How to Set Up a Multi Node Apache Spark Cluster with Quobyte

This tutorial walks you through the steps of installing Apache Spark on multiple nodes for distributed processing with Quobyte as the storage backend.


The initial preparation of the cluster is the same as with Apache Hadoop. Please follow steps 1 through 4 of our Hadoop tutorial to prepare your nodes and install and configure Quobyte.

Once the basic cluster setup and Quobyte installation are done, you can start with Apache Spark. Spark knows two roles for your machines: Master Node and Worker Node. The master coordinates the distribution of work, and the workers do the actual work.

By default, a Spark cluster has no security. Make sure your cluster is not accessible from the outside world.
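One simple way to do that, assuming your nodes run ufw and share a private network such as 10.128.0.0/24 (both assumptions; adjust to your firewall and address range), is to only accept traffic from that internal network:

    # deny everything by default, then allow the internal network (example range)
    sudo ufw default deny incoming
    sudo ufw allow from 10.128.0.0/24
    # keep your own SSH access open if you manage the nodes from outside this network
    sudo ufw enable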

Install Apache Spark

Log into the master node as user hadoop to install Apache Spark. We tested this tutorial with Spark version 3.1.2:

  1. Download Spark

    Download Spark "Pre-built for Apache Hadoop" from https://spark.apache.org/downloads.html. Copy the URL for the suggested mirror and download it on the master node, e.g. with wget:
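    For example, with Spark 3.1.2 the download could look like this (the mirror URL from the download page will differ; the Apache archive URL below is only an illustration):

    wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz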

  2. Extract the tar archive

    tar xzf spark-3.1.2-bin-hadoop3.2.tgz

  3. Rename the directory to spark

    mv spark-3.1.2-bin-hadoop3.2 spark

  4. Change into the spark directory

    cd spark

  5. Copy configuration templates
    cp conf/spark-env.sh.template conf/spark-env.sh
    cp conf/spark-defaults.conf.template conf/spark-defaults.conf
    cp conf/workers.template conf/workers
  6. Add the following line to conf/spark-env.sh and replace the IP with the IP of your master node:

    SPARK_MASTER_HOST=10.128.0.3

  7. Edit conf/spark-defaults.conf and add the following lines. Make sure to replace the registry address with yours!
    spark.hadoop.fs.quobyte.impl=com.quobyte.hadoop.interfaces.QuobyteFileSystemAdapter
    spark.hadoop.com.quobyte.hadoop.backend=JNI
    spark.hadoop.com.quobyte.hadoop.registry=YOUR_QNS_ID.myquobyte.net
    spark.hadoop.com.quobyte.hadoop.volume=test-1
  8. Add your workers (one IP per line) to conf/workers. You can remove the line with localhost if you don't want the master node to also act as a worker.
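    For example, with two dedicated worker nodes, conf/workers could look like this (placeholder IPs; use your own workers' addresses):
    10.128.0.4
    10.128.0.5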
  9. Log into any node with a Quobyte client, e.g. the master or any worker. Run the following commands to download the book Peter Pan as sample data for our demo. The data needs to live on the volume you configured in step 7 (test-1 in this example):
    cd /quobyte/test-1
    mkdir demo-data
    cd demo-data
    wget -O peter-pan.txt https://www.gutenberg.org/files/16/16-0.txt
  10. Time for the first test! Run bin/spark-shell locally on the master node to check that your Quobyte driver is working as expected. Execute the following code snippet; it should output the number of characters in each Spark partition:
    sc.setLogLevel("INFO")
    val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt")
    textFile.foreachPartition(iter => {
      var i = 0
      while (iter.hasNext) { i += iter.next().length }
      println("Num characters in partition: " + i)
    })
  11. If your test was successful, you can continue: pack the configured spark directory and copy it to the worker nodes:
    cd ..
    tar czf spark-configured.tgz spark
    for h in `cat /home/hadoop/spark/conf/workers`
    do
      scp spark-configured.tgz $h:
      ssh $h "tar xzf spark-configured.tgz"
    done
  12. Start the Spark master:

    sbin/start-master.sh

    This will output something similar to this:
    starting org.apache.spark.deploy.master.Master, logging to /home/hadoop/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-quobyte-hadoop-4l3h.out
  13. Check the log file to make sure the Spark master is running. It will also show you the URL for the web interface and the master URL. Copy the master URL; you'll need it later.
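    The master URL contains spark://, so a quick grep over the log directory will find it, for example:

    grep spark:// logs/*.out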
  14. Start the Spark workers. This script will log into the worker nodes via ssh (that's why we configured ssh keys) and start the workers:

    sbin/start-workers.sh

  15. Open your browser and go to the Spark web interface (see the master log for the URL). You should see your workers listed there.

Run a Distributed Test

  1. Now that your multi-node Spark cluster with Quobyte is up and running, you can run a distributed Spark job:

    Run spark-shell with the --master option. In this mode the Spark shell talks to the master, which runs the actual computation on the worker nodes. Replace MASTER_URL with your master URL (see step 13 above):

    bin/spark-shell --master MASTER_URL
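    For example, if your master is the node from step 6 and listens on Spark's default standalone port 7077, the call would look like this (use the exact URL from your master log):

    bin/spark-shell --master spark://10.128.0.3:7077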
  2. Run a distributed word count to see how many distinct words appear in the book Peter Pan:
    val textFile = sc.textFile("quobyte:///demo-data/peter-pan.txt");
    val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    println(counts.count())
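  3. If you also want the total number of words, not just the number of distinct ones, you can sum up the per-word counts. A small follow-up using the same counts RDD:
    val totalWords = counts.map(_._2).reduce(_ + _)
    println(totalWords)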

Congratulations

You now have a multi-node Spark cluster with Quobyte. You can take advantage of all the interfaces Quobyte offers to easily load data into your Quobyte cluster via the file system driver for Linux or Windows, or the S3/object interface. Thanks to Quobyte’s scalable performance you can run Spark jobs across hundreds or thousands of worker nodes without any storage bottlenecks.

Talk to Us

Quobyte can do a lot more for you than what you’ve seen so far.

To find out what, contact us to set up a quick demo.