How to Set Up a Multi Node Hadoop Cluster with Quobyte

In this short tutorial we show you how to set up a multi-node Hadoop big data anlytics cluster with YARN using Quobyte instead of HDFS as the storage backend. You'll be able to share data between Hadoop and other applications through the Quobyte POSIX file system client, the S3 proxy or our Windows client.

Apache Hadoop is the implementation of MapReduce and Apache YARN is a job scheduler for Hadoop. YARN will take care of running your Hadoop jobs in parallel on all your worker nodes.

You can run the Quobyte services on the same machines as the Hadoop workers (hyperconverged) or on dedicated machines. Both scenarios have advantages, see our blog post on Dedicated storage nodes or not? (TODO!!!)

Step 1: Prerequisites

We assume that you have a cluster with three or more nodes:

  • Each node has an IP address, if you work on the cloud use the private IP.
  • You can resolve the hostname for each node, e.g. with dns or by creating a host file on each node (DNS is preferred in a distributed system to avoid having to create and manage the host files on the nodes).
  • You can ssh into each of the nodes and run sudo (or login as root) to do the actual installation.

There will be two roles in the cluster:

  • Master Node
    runs the YARN resource manager. Since we replaced HDFS with Quobyte there is no need for a NameNode here. The master node can also be a worker node.
  • Worker nodes
    run the NodeManager to execute the jobs. Again, no DataNode required since we use Quobyte instead of HDFS.

Please remember that YARN doesn't come with any security enabled by default. You should make sure that your cluster is not accessible from the outside world.

Step 2: Install and Configure Quobyte

If you don't have a Quobyte cluster installed, see our installation quick start here and the recommended hardware here. You can use the installer for both deployment modes: dedicates storage servers or hyperconverged.

Next, make sure to install the Quobyte client on all your hadoop nodes. If you are installing a new Quobyte cluster you can provide a list of client nodes to the Quobyte installer. Create a file called clients.txt with one IP/hostname per line. When the installer asks for the list of clients you can specify this file.

If you already have a running Quobyte cluster you can use the Quobyte installer, too. Run ./install_quobyte add-client --qns-id qns-id user@hostname for each hadoop worker node. The installer tells you the command to add clients and your qns-id:

Step 3: Quobyte Policies for Hadoop

Once your Quobyte cluster is up and running, the devices have been formatted and clients installed you can continue and set the policy for your hadoop volumes. The hadoop policy is designed for any big data application that benefits from data locality, including Apache Spark or HBase. The choice of policy depends on whether you run hyperconverged (worker nodes are also Quobyte storage servers) or with dedicated storage servers.

Policy for Hyperconverged Setups

Go to the Quobyte Webconsole and select the "Policy Rules" tab. Select "Create rule with predefined policy..." from the dropdown

Give your policy a name and add a restriction in the Policy Scope section to limit this policy to volumes that have a hadoop label. Then pick the "Hadoop/Mapreduce on HDDs" policy from the policy definition dropdown.

This policy uses erasure coding to create the redundancy for the files and optimizes placement for hadoop: Stripes are placed locally on the machine that creates the data. All data is stored on HDDs only. Alternatively, you can use the "Hadoop/Mapreduce on SSDs" policy if you have flash. Finally, click on create.

Policy for Dedicated Storage Servers

With dedicated storage servers we don't have the option of storing data locally, as hadoop normally does. We simply use a policy that stores data with erasure coding on HDDs. Alternatively, if you want to use SSDs you can pick the "Automatic EC Redundancy for Files on SSDs" policy.

Create a Quobyte Volume

Next, we have to create a volume to store the data to process and the results. Go to the Volumes tab and select "Create Volume…" from the Volumes dropdown:

Choose a name for your new volume and don't forget to add a label "hadoop" (empty value is ok) so the hadoop policy is used for your new volume.

Finally, get your registry address from the Quobyte UI (you'll need it later):

Step 4: Prepare the Hadoop Install

Before we can start with the actual installation we have to prepare the cluster by installing java, creating a user hadoop and enable ssh login with a key from the master node. Pick any of the machines to be your master node, then execute the follwing steps:

1
Install the OpenJDK on all nodes, e.g. on centos you just run
sudo yum install -y java-8-openjdk-devel
2
Install the Quobyte hadoop driver on all nodes:
sudo yum install -y quobyte-hadoop
3
Create the hadoop user on each node with
sudo useradd hadoop
. This user will be used to install and run Apache Hadoop and YARN.
4
YARN uses ssh to control the worker nodes. Create a ssh key for the hadoop user on the master node:
sudo -u hadoop ssh-keygen
5
Add the contents of /home/hadoop/.ssh/id_rsa.pub to /home/hadoop/.ssh/authorized_keys on all nodes.
Make sure not to overwrite the file and if you have to create the .ssh directory or the authorized_keys file make sure to set the permissions correctly:
sudo -u hadoop bash
mkdir /home/hadoop/.ssh/
chmod 0700 /home/hadoop/.ssh
echo "" >> /home/hadoop/.ssh/authorized_keys
chmod 0600 /home/hadoop/.ssh/authorized_keys
                            
6
ssh from your master node to all other nodes (including the master node itself) to verify that you can log in witout password.

Step 5: Install Hadoop

Finally, we can install Hadoop itself:

1
Log into the master node as user hadoop
2
Download hadoop on the master (pick a mirror that works for you here)
wget https://apache-mirror.tld/apache/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
3
Extract the tar archive
tar xzf hadoop-3.3.1.tar.gz
4
Rename the directory to hadoop
mv hadoop-3.3.1 hadoop
5
Add hadoop binaries to path for convenience. Add the following line to /home/hadoop/.profile:
PATH=/home/hadoop/hadoop/bin:/home/hadoop/hadoop/sbin:$PATH
and the following to your .bashrc file:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
Don't froget to run . ~/.bashrc to update the PATH variable in your current bash shell.
6
Find your the path for your JAVA_HOME, e.g. with alternatives --display java.
Take the path which says "link currently points to" and remove the trailing java/bin Configure the JAVA_HOME in /home/hadoop/etc/hadoop/hadoop-env.sh. Add a line like this and replace with your JDK
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/jre"

Step 6: Configure Hadoop and YARN

Next, we have to configure Hadoop to use Quobyte as the default file system and YARN. Execute the following steps on the master node.

1
Log into the master node as user hadoop
2
Copy the Quobyte driver JAR into the Hadoop directory
cp /opt/quobyte/hadoop/quobyte_hadoop_plugin.jar /home/hadoop/hadoop/share/hadoop/hdfs/
3
Configure Quobyte as the file system. Some applications, like YARN, use the AbstractFileSystem - Quobyte provides both drivers. Edit /home/hadoop/hadoop/etc/hadoop/core-site.xml and add the following to the <configuration> section of the file:
<property>
    <name>fs.defaultFS</name>
    <value>quobyte:///</value>
</property>
<property>
  <name>fs.quobyte.impl</name>
  <value>com.quobyte.hadoop.interfaces.QuobyteFileSystemAdapter</value>
</property>
<property>
  <name>fs.AbstractFileSystem.quobyte.impl</name>
<value>com.quobyte.hadoop.interfaces.QuobyteAbstractFileSystemAdapter</value>
</property>
<property>
  <name>com.quobyte.hadoop.backend</name>
  <value>JNI</value>
</property>
<property>
  <name>com.quobyte.hadoop.registry</name>
  <!-- Replace this with your registries address -->
  <value>qns-id.myquobyte.net</value>
</property>
<property>
    <name>com.quobyte.hadoop.volume</name>
    <!-- Replace hadoop-volume with your Quobyte volume name-->
    <value>demo-1</value>
</property>
4
Configure YARN as the default scheduler. Add the following to the <configuration> section in /home/hadoop/hadoop/etc/hadoop/mapred-site.xml:
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
5
Configure YARN in /home/hadoop/hadoop/etc/hadoop/yarn-site.xml and add the following to the <configuration> section:
<property>
        <name>yarn.acl.enable</name>
        <value>0</value>
</property>
<property>
        <name>yarn.resourcemanager.hostname</name>
        <!-- Replace with your master node address/IP -->
        <value>Insert Your Master Node's IP address</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
6
Put the list of IPs of your workers in /home/hadoop/hadoop/etc/hadoop/workers
7
Tar up the hadoop directory, copy to workers and extract there:
cd /home/hadoop
tar czf hadoop-configured.tgz hadoop
for h in `cat /home/hadoop/hadoop/etc/hadoop/workers`
do
  scp hadoop-configured.tgz $h:
  ssh $h "tar xzf hadoop-configured.tgz"
done

Step 6: Test the Quobyte HDFS Driver

Time to check that your Quobyte driver is working properly and is able to contact your Quobyte cluster. Execute the following command on the master node and check that it doesn't report any errors. You'll see some logging output from the Quobyte driver in the console.

hdfs dfs -mkdir -p /testdir

Step 7: Start YARN and run a Test

The last step is to start YARN and submit a mapreduce job to the scheduler.

1
Start the YARN scheduler on the master node. The master will ssh into the worker nodes and start the YARN processes there.
start-yarn.sh
2
Ensure that all workers are ready by running this command on the master node:
yarn node -list
3
Log into any node with a Quobyte client, e.g. the master or any worker. Run the following commands to download sample data to use for our demo. This will download the text of three books:
cd /quobyte/demo-1
mkdir demo-data
cd demo-data
wget -O two-cities.txt https://www.gutenberg.org/files/98/98-0.txt
wget -O metamorphosis.txt https://www.gutenberg.org/files/5200/5200-0.txt
wget -O peter-pan.txt https://www.gutenberg.org/files/16/16-0.txt
4
Submit the mapreduce job to YARN from the master node. This job will count the occurrence of each word in the books:
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar \
wordcount "/demo-data/*" wordcount-result
The output will be on the volume in /quobyte/demo01/wordcount-result

You can now run distributed hadoop jobs directly on Quobyte and benefit from the easy data sharing between Hadoop, S3 and the Linux world.

Learn More about Quobyte for Big Data and Enterprise Analytics