Should I run my storage on the same nodes that I use as Hadoop workers, or should I have dedicated storage servers and worker nodes that only do compute?
This question has been puzzling data administrators since Hadoop first became popular in the late 2000s. Even if you don’t use Hadoop but use other big data frameworks or applications like Spark or HBase, or Kubernetes with or without big data, the problem is the same: To combine or not to combine storage and compute?
So why did Hadoop popularize running storage and compute on the same servers, what has changed since Hadoop was launched, and what benefits does the combined model still have today?
To answer these questions, we must first understand Hadoop’s design.
When Hadoop was released in 2006, networks were slow – 1 Gbit/s was the norm even for servers – and bandwidth for east-west traffic between racks was very limited.
Moving large amounts of data between racks was a lengthy process, so bringing the job to the data was the only option. Storing replicas of data on the same machine where the compute jobs run was an important feature of Hadoop to deal with the slow network connections.
Another aspect of the Hadoop design, which was inspired by Google’s infrastructure and MapReduce (https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf), was to rely entirely on commodity hardware. The entire Hadoop system, including the storage, was 100% software that would run on almost any hardware platform.
Finally, the only way to economically store large amounts of data was hard drives – and even today that is true since the amount of data stored has increased dramatically.
Hadoop does batch processing and issues only large sequential reads, which play well with hard drives. The IO workload created by Hadoop is homogeneous and fairly easy to handle, and variance in latency caused by compute jobs doesn’t matter much.
Advantages of Combined Storage and Compute Nodes in Hadoop
The decision to run the Hadoop File System (HDFS) on every Hadoop node was the logical consequence of these design goals and resulted in four main advantages:
- Data locality
When each storage node is also a compute node you can schedule the processing on the same node as the data. No data needs to be read over the network and computation can be as fast as the local drives.
- Uniform hardware
When servers do both compute and storage, you can standardize your Hadoop cluster (or even your entire data center) on a single server configuration. This reduces cost (you get discounts when buying in larger quantities) and management overhead (maintaining a single type of hardware is easier than dealing with many different models). Hyperscalers like Google have demonstrated how much this can save.
- Easy scaling
When you add servers, both compute and storage grow in lockstep, which is often an advantage in Hadoop clusters, where demand for the two tends to be correlated.
- Less hardware
You don’t need extra hardware for the storage servers, which reduces cost and often the space footprint in the racks.
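The data locality idea from the list above can be sketched in a few lines of Python. This is a simplified illustration, not Hadoop's actual scheduler: the `Node` class and `pick_worker` function are hypothetical names, and the only assumption is the general preference order of running a task on a node that holds the data, then on a node in the same rack, then anywhere else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str  # hypothetical host name
    rack: str  # hypothetical rack identifier


def pick_worker(replica_nodes, free_workers):
    """Pick a worker for a task that reads one block.

    replica_nodes: nodes that store a copy of the block.
    free_workers:  nodes that currently have a free compute slot.
    Returns (worker, locality_level).
    """
    replica_names = {n.name for n in replica_nodes}
    replica_racks = {n.rack for n in replica_nodes}

    # 1. Best case: the task runs on a machine that holds the data,
    #    so nothing has to cross the network.
    for worker in free_workers:
        if worker.name in replica_names:
            return worker, "node-local"

    # 2. Next best: stay inside a rack that holds a copy.
    for worker in free_workers:
        if worker.rack in replica_racks:
            return worker, "rack-local"

    # 3. Fallback: any free worker, reading across racks.
    if free_workers:
        return free_workers[0], "off-rack"
    return None, "none"
```

With combined storage and compute nodes, the first branch can be hit often because every worker also holds data. With dedicated storage servers, compute nodes hold no data, so the second branch (rack-local) is the best achievable level.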
It almost sounds like this is a no-brainer: Combined storage and compute servers have so many advantages. However, a lot has changed since Hadoop was initially released.
Fast Forward 16 Years …
A major change was that networks became much faster (and the hardware cheaper). Today 100G Ethernet has become the norm for connecting servers, and within a rack we have massive bandwidth available between servers.
To put that into perspective: In 2006 your hard drive probably delivered 150MB/s on sequential reads, and your 1G Ethernet pushed about 100MB/s. Today, your hard drive is still at around 150MB/s, but your 100G Ethernet can transfer about 10GB/s.
Of course, there is flash storage – but going all-flash is often too expensive, especially when you have to manage tens of petabytes of storage.
So, reading from other servers in the same rack is not an issue anymore. However, east-west traffic between racks is still limited – often to 100G – compared to the bandwidth within a rack.
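To make the comparison concrete, here is a back-of-the-envelope sketch in Python that uses only the round numbers quoted above. The function names are made up for illustration, and real-world throughput will vary with hardware and protocol overhead.

```python
# Round throughput figures from the text, in MB/s (decimal units).
HDD_MB_S = 150          # sequential read rate of one hard drive
ETH_1G_MB_S = 100       # 1G Ethernet, as quoted for 2006
ETH_100G_MB_S = 10_000  # 100G Ethernet, as quoted for today


def seconds_to_move(size_gb, rate_mb_s):
    """Time in seconds to stream size_gb gigabytes at rate_mb_s."""
    return size_gb * 1000 / rate_mb_s


def drives_to_saturate(link_mb_s, drive_mb_s=HDD_MB_S):
    """Number of drives streaming in parallel needed to fill the link."""
    return link_mb_s / drive_mb_s


if __name__ == "__main__":
    for label, link in (("1G", ETH_1G_MB_S), ("100G", ETH_100G_MB_S)):
        print(label, "Ethernet:",
              seconds_to_move(1000, link), "s per TB,",
              round(drives_to_saturate(link), 1), "drives to fill the link")
```

With these numbers, a single hard drive could already outrun a 1G link, while it takes roughly 67 drives streaming in parallel to fill a 100G link. The network has stopped being the bottleneck for reads within a rack.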
Therefore, staying inside the same rack is still a good idea. With this change, you now have the option to put dedicated storage servers in each rack and benefit from data locality within the rack – if your storage system supports this (btw, Quobyte does, wink, wink).
The massive intra-rack bandwidth enabled another change: rack-aware erasure coding for data protection instead of Hadoop’s default 3x replication.
When you have enough servers in a rack that do storage, you can erasure code the data across 11 machines (8+3 encoding) and reduce your storage footprint from a factor of 3 to a factor of 1.375.
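The footprint numbers follow directly from the encoding parameters. The short Python sketch below shows the arithmetic; the function names and the 10 PB example are illustrative assumptions, not part of any storage product's API.

```python
def replication_overhead(copies):
    """Raw bytes stored per usable byte with N full copies."""
    return float(copies)


def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per usable byte with a data+parity encoding."""
    return (data_fragments + parity_fragments) / data_fragments


def raw_capacity_needed(usable_pb, overhead):
    """Raw capacity in PB required to hold usable_pb of data."""
    return usable_pb * overhead


if __name__ == "__main__":
    usable_pb = 10  # hypothetical data set size
    print("3x replication:",
          raw_capacity_needed(usable_pb, replication_overhead(3)), "PB raw")
    print("8+3 erasure coding:",
          raw_capacity_needed(usable_pb, erasure_overhead(8, 3)), "PB raw")
```

For the hypothetical 10 PB of data, 3x replication needs 30 PB of raw capacity, while 8+3 erasure coding needs 13.75 PB. The trade-off is that each stripe is spread over 11 machines, so reads and rebuilds depend on fast links between them, which is why this works best inside a rack.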
The other thing that changed is that analytics isn’t just Hadoop anymore. New frameworks like Apache Spark or Flink have replaced Hadoop or live side-by-side in the same cluster.
These frameworks target real-time or stream processing, and the IO patterns have shifted to smaller IOs compared to Hadoop’s purely streaming IO. In addition, other tools for analyzing large amounts of data, such as machine learning, have made it into the enterprise.
Often, the data used for machine learning is the same data you’d run big data analytics on. However, machine learning has yet again changed the IO patterns, this time to massive numbers of small files, and it brings the need for lower latency and flash.
OK, which option should I pick today? Dedicated or shared? Let’s look at the pros and cons of both models in modern infrastructures and with a broader range of use cases:
Shared Servers for Storage and Compute
Dedicated Storage Server
The short answer is that if you use the storage system for other applications besides Hadoop, in particular ML/AI or databases, then you should strongly consider going for dedicated storage servers.
Quobyte for Analytics Workloads
Quobyte is a distributed parallel file system implemented entirely in software. That means you can run Quobyte in both deployment modes: on dedicated storage servers or together with your compute.
Unlike HDFS, Quobyte is a full POSIX file system that you can use for a broader range of applications, including transactional databases, VMs, or machine learning.
When using Quobyte with the native HDFS driver, you get the same locality benefits that you get with HDFS: Hadoop jobs run close to the data. In addition, Quobyte supports locality-aware erasure coding, e.g., keeping all stripes inside the same rack.
If your storage is purely for Hadoop and other applications from the Big Data ecosystem (like Spark) you are better off with shared compute and storage servers.
Otherwise, if you share the storage with other applications, in particular those requiring high performance and low latency, dedicated storage servers are the better choice.
Read more about how to use Quobyte with Hadoop, Spark, or our native HDFS driver here.