What is HDFS, the Hadoop File System?
Hadoop Distributed File System (HDFS for short) is the primary data storage system used by Apache Hadoop applications to manage large amounts of data and support related big data analytics applications. The Hadoop File System (inspired by Google's large-scale software-based infrastructure) was developed mainly to provide cost-effective and scalable storage for MapReduce workloads. It is most commonly used to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
The HDFS Architecture
HDFS has three components:
There is a single NameNode per cluster. This service stores the file system metadata (e.g., the directory structure), and the file names, and also, the location where the blocks of a file are stored.
These services usually run on each server that also does the compute. Each DataNode stores blocks from many files. The DataNodes are also responsible for replicating these blocks to other DataNodes to protect files against data loss.
The client provides access to the file system and the Hadoop/MapReduce jobs that want to access the data. The client, a piece of software embedded in the Hadoop distribution, communicates with the NameNode to find the file and retrieve the block's locations. Then the client reads or writes the blocks directly from the DataNodes.
Often, you will read the term HDFS-API. HDFS is not the only storage system that works with Hadoop. Other storage systems implement their own clients that have the same API as the HDFS client.
How HDFS is Different from "normal" File Systems
While HDFS is called a file system, it is quite different from traditional POSIX file systems. HDFS has been designed for a single workload - MapReduce - and doesn't support generic workloads. HDFS is also write-once, which means that files can be written once and then only be read. This is perfect for Hadoop and other big data analytics applications; however, you wouldn't be able to run a transactional database or virtual machines on HDFS.
In addition, HDFS is tuned towards large sequential reads from large files - it has a chunk size of 64MB. Hadoop was designed to work with large amounts of data, and it had to be optimized to work efficiently on hard drive storage.
Limitations of HDFS
The focus on this IO pattern was perfect for Hadoop/MapReduce, but the IO patterns have changed due to frameworks like Apache Spark and the omnipresence of machine learning. The drop in flash prices has made - at least partial - flash tiers very feasible for large-scale analytics clusters. There are two limitations in HDFS for those newer analytic workloads:
- NameNode limitations
Each HDFS cluster has one name node responsible for storing metadata information, like the filename and where the file’s content is stored. This single node limits the number of files you can have in each cluster. This is a severe limitation when you deal with millions of small files, typical for machine learning.
- Smaller and random IO
Hadoop has not been designed for concurrent high IOPS, which is the IO pattern where flash drives excel. Again, many small files and low latency requirements from machine learning are sources of such IO patterns.
If you decide that limitations of HDFS are an issue for you, there are HDFS alternatives for analytics cluster storage:
- Scale-out File System
These file systems have been designed for scale-out workloads, like high-performance computing or analytics. They deliver the performance required to get the most from your computing nodes or GPUs. Some of them have been optimized to take advantage of flash; these are suitable for machine learning use cases. Finally, as they offer standard file system interfaces, they are great for sharing data also outside a Hadoop cluster.
- NAS storage via NFS
Hadoop comes with an integrated driver for NFS storage. This is a suitable option for data sharing from outside analytics clusters. However, the dated NFS protocol was designed for communication with a single NFS server. Therefore, it is a major performance bottleneck and is not suited for scale-out workloads like Hadoop and other analytics applications. Read more about the limitations of NFS in our blog post here.
- Object Storage
Object Storage is a good option to overcome the problem with a large number of files/objects. However, object storage is designed for "cheap & deep" storage, so it doesn't really check the boxes for small file use cases, like machine learning or scale-out high throughput. Also, many object stores are unable to share objects as files, which is a pretty big limitation for data sharing.
Quobyte is a parallel, scale-out file system with a native HDFS compatible driver. Quobyte can handle all IO workloads, including small file and random IO, and gives you the scalability to grow your cluster to billions of files.
Are you looking for a hands-on approach? Use our Free Edition and learn how to install a multi-node Hadoop or Spark cluster with Quobyte.