What is HDFS, the Hadoop File System?
Hadoop Distributed File System (HDFS for short) is the primary data storage system used by Apache Hadoop applications as a means for managing pools of big data and supporting related big data analytics applications. Developed mainly to provide cost-effective and scalable storage for MapReduce workloads, the Hadoop File System (inspired by Google's large scale software based infrastructure) is most commonly used to implement a distributed file system that provides high-performance access to data across highly scalable Hadoop clusters.
The HDFS Architecture
HDFS has three components:
There is a single NameNode per cluster. This service stores the file system metadata (e.g. the directory structure), the file names and also where the blocks of a file are stored.
These services usually run on each server that also does the compute. Each DataNode stores blocks from many files. The DataNodes are also responsible for replicating these blocks to other DataNodes to protect files against data loss.
The client provides access to the file system and to the Hadoop/MapReduce jobs that want to access the data. The client, which is a piece of software that is embedded in the Hadoop distribution, communicates with the NameNode to find the file and retrieve the block locations. Then the client reads or writes the blocks directly from the DataNodes.
Often, you will read the term HDFS-API. HDFS is not the only storage system that works with Hadoop. Other storage systems implement their own clients that have the same API as the HDFS client.
How HDFS is Different from "normal" File Systems
While HDFS is called a file system, it is quite different from traditional POSIX file systems as HDFS has been designed for a single workload - MapReduce - and doesn't support generic workloads. HDFS is also write-once, which means that files can be written once and then only be read. This is perfect for Hadoop and other big data analytics applications, however, you wouldn't be able to run a transactional database or virtual machines on HDFS.
In addition, HDFS is tuned towards large sequential reads from large files. Hadoop was designed to work with large amounts of data and had to be optimized to work efficiently on hard drive storage.
Limitations of HDFS
The focus on this one IO pattern was perfect for Hadoop/MapReduce, but the IO patterns have changed due to frameworks like Apache Spark and the omnipresence of machine learning. The drop in flash prices has made - at least partial - flash tiers very feasible for large-scale analytics clusters. There are two limitations in HDFS for those newer analytic workloads:
- NameNode limitations
Each HDFS cluster has one name node that is responsible for storing metadata information, like the filename and where the file contents is stored. This single node limits the number of files you can have in each cluster. This is a severe limitation when you deal with millions of small files, typical for machine learning.
- Smaller and random IO
Hadoop has not been designed for concurrent high IOPS, which is the IO pattern where flash drives excel. Again, lots of small files and low latency requirements from machine learning are a source of such IO patterns.
If you decide that limitations of HDFS are an issue for you there are HDFS alternatives for analytics cluster storage:
- Scale-out File System
These file systems have been designed for scale-out workloads, like high performance computing or analytics. They deliver the performance required to get the most from your compute nodes or GPUs. Some of them have been optimized to take advantage of flash, these are suitable for machine learning use cases. Finally, as they offer regular file system interfaces they are great for sharing data also outside a Hadoop cluster.
- NAS storage via NFS
Hadoop comes with an integrated driver for NFS storage. This option is good for data sharing from outside analytics clusters. However, the dated NFS protocol was designed for communication with a single NFS server and it is a major performance bottleneck and not suited for scale-out workloads like Hadoop and other analytics applications. Read more about the limitations of NFS in our blog post here.
- Object Storage
These are a good option to overcome the problem with a large amount of files/objects. However, object storage is designed for "cheap&deep" storage, so it doesn't really check the boxes for small file use cases, like machine learning, nor for scale-out high throughput. Many object stores are also unable to share objects as files, which is a pretty big limitation for data sharing.
Quobyte is a parallel, scale-out file system with a native HDFS compatible driver. Quobyte can handle all IO workloads, including small file and random IO, and gives you the scalability to grow your cluster to billions of files.
Looking for a hands-on approach? Use our Free Edition and learn how to install a multi-node Hadoop or Spark cluster with Quobyte.