Should I run my storage on the same nodes that I use as Hadoop workers, or should I have dedicated storage servers and worker nodes that only do compute?
This question has puzzled data administrators since Hadoop first became popular in the late 2000s. Even if you don’t use Hadoop but run other big data frameworks or applications like Spark or HBase, or Kubernetes with or without big data, the question is the same: to combine or not to combine storage and compute?
So why did Hadoop make running storage and compute on the same server popular, what has changed since Hadoop was launched, and what benefits does this design still offer today?
To answer these questions, we must first understand Hadoop’s design.
Advantages of Combined Storage and Compute Nodes in Hadoop
The decision to run the Hadoop Distributed File System (HDFS) on every Hadoop node was the logical conclusion of the two design goals and resulted in four main advantages:
- Data Locality
When each storage node is also a compute node, you can schedule processing on the same node that holds the data. No data needs to be read over the network, and computation can run as fast as the local drives allow.
- Standardized hardware
When servers handle both compute and storage, you can standardize your Hadoop cluster (or even your entire data center) on a single server configuration. This reduces your cost – discounts when buying in larger quantities – and management overhead – maintaining a single type of hardware is easier than dealing with many different models. Hyperscalers like Google have demonstrated how much this can save.
- Easy scaling
When you add servers, compute and storage grow in lock-step, which is often an advantage in Hadoop clusters where the two needs are correlated.
- Less hardware
You don’t need extra hardware for dedicated storage servers, which reduces cost and often the space footprint in the racks.
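The data-locality advantage above can be sketched as a toy scheduler. This is an illustrative model only – the function, data structures, and node names are assumptions, not Hadoop’s actual scheduler or API: given which hosts hold replicas of each input block, prefer to run each task on a host that already stores its data.

```python
# Toy sketch of data-locality scheduling (illustrative, not Hadoop's real
# scheduler): assign each task to a node that already holds a replica of its
# input block, falling back to any node (remote read) when none is available.

def schedule(tasks, block_replicas, nodes):
    """tasks: {task_id: block_id}; block_replicas: {block_id: [hosts]}."""
    assignment = {}
    for task, block in tasks.items():
        local_hosts = [h for h in block_replicas.get(block, []) if h in nodes]
        # Prefer a node holding the data; otherwise pick any node.
        assignment[task] = local_hosts[0] if local_hosts else nodes[0]
    return assignment

replicas = {"blk1": ["node-a", "node-b"], "blk2": ["node-c"]}
print(schedule({"t1": "blk1", "t2": "blk2"}, replicas,
               ["node-a", "node-b", "node-c"]))
# → {'t1': 'node-a', 't2': 'node-c'}
```

In a real cluster the scheduler would also weigh rack locality and node load, but the core preference – local replica first, remote read as fallback – is the same.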
It almost sounds like a no-brainer: combined storage and compute servers have many advantages. However, a lot has changed since Hadoop was initially released.
Shared Servers for Storage and Compute vs. Dedicated Storage Servers
The short answer is that if you use the storage system for other applications besides Hadoop, in particular ML/AI or databases, then you should strongly consider dedicated storage servers.
Quobyte for Analytics Workloads
Quobyte is a distributed parallel file system implemented entirely in software. That means you can run Quobyte in both deployment modes: on dedicated storage servers or colocated with your compute nodes.
Unlike HDFS, Quobyte is a full POSIX file system that you can use for a broader range of applications, including transactional databases, VMs, or machine learning.
When using Quobyte with its native HDFS driver, you get the same locality benefits as with HDFS: Hadoop jobs run close to the data. In addition, Quobyte supports locality-aware erasure coding, e.g. placing all stripes inside the same rack.
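The idea of rack-local erasure coding can be illustrated with a small placement sketch. This is a simplified model under assumed names, not Quobyte’s actual placement logic: all k data and m parity stripes of a file are constrained to distinct nodes within one rack, so reads and rebuilds never cross the rack boundary.

```python
# Simplified sketch of locality-aware erasure-coded placement (illustrative,
# not Quobyte's implementation): put all k+m stripes of a file on distinct
# nodes inside a single rack, keeping reconstruction traffic rack-local.

def place_stripes(racks, k, m):
    """racks: {rack_id: [node, ...]}. Returns (rack_id, nodes) holding
    the k data stripes plus m parity stripes."""
    needed = k + m
    for rack, nodes in racks.items():
        if len(nodes) >= needed:  # the whole stripe set must fit in one rack
            return rack, nodes[:needed]
    raise ValueError("no rack has enough nodes for k+m stripes")

racks = {"rack1": ["n1", "n2", "n3"],
         "rack2": ["n4", "n5", "n6", "n7", "n8", "n9"]}
rack, nodes = place_stripes(racks, k=4, m=2)
print(rack, nodes)  # → rack2 ['n4', 'n5', 'n6', 'n7', 'n8', 'n9']
```

The trade-off is availability: a full rack failure takes the file offline, whereas spreading stripes across racks survives it at the cost of cross-rack traffic on every read and rebuild.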
If your storage serves only Hadoop and other applications from the big data ecosystem (like Spark), you are better off with shared compute and storage servers.
Otherwise, if you share the storage with other applications, in particular those requiring high performance and low latency, dedicated storage servers are the better choice.
Read more about how to use Quobyte with Hadoop, Spark, or our native HDFS driver here.