The Hadoop Distributed File System (HDFS) is the primary data storage system used by Apache Hadoop applications to manage large amounts of data and to support big data analytics applications. HDFS, inspired by the Google File System (GFS), was developed mainly to provide cost-effective and scalable storage for MapReduce workloads. It is most commonly used as a distributed file system that provides high-throughput access to data across highly scalable Hadoop clusters.
The HDFS Architecture
HDFS has three components:
- NameNode
There is a single NameNode per cluster. This service stores the file system metadata: the directory structure, the file names, and the locations of the blocks that make up each file.
- DataNodes
DataNodes usually run on every server that also does the compute work. Each DataNode stores blocks from many files. The DataNodes are also responsible for replicating these blocks to other DataNodes to protect files against data loss.
- Client
The client provides access to the file system for the Hadoop/MapReduce jobs that want to read or write data. The client, a piece of software shipped with the Hadoop distribution, contacts the NameNode to locate a file and retrieve its block locations; it then reads or writes the blocks directly from or to the DataNodes. A minimal sketch of this interaction follows below.
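To make the client's role more concrete, here is a minimal sketch using the Hadoop FileSystem API (org.apache.hadoop.fs.FileSystem); the NameNode address and file path are placeholders, and error handling is omitted:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally taken from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt"); // placeholder path

        // The client asks the NameNode for the file's metadata ...
        FileStatus status = fs.getFileStatus(file);
        // ... including which DataNodes store each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }

        // Reads and writes then go directly to the DataNodes holding the blocks.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int bytesRead = in.read(buffer);
            System.out.println("Read " + bytesRead + " bytes from " + file);
        }
    }
}
```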
You will often come across the term HDFS API. HDFS is not the only storage system that works with Hadoop: other storage systems plug in by implementing their own clients that expose the same API as the HDFS client.
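For example, the very same FileSystem API can address a non-HDFS backend simply by changing the URI scheme. A minimal sketch, assuming an S3-compatible connector such as s3a is on the classpath and credentials are configured; the addresses are placeholders:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SameApiDifferentBackend {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The URI scheme selects the client implementation behind the same API
        // (hdfs://, s3a://, file://, ...).
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode.example.com:8020/"), conf);
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        Path path = new Path("/data/example.txt");
        System.out.println("On HDFS: " + hdfs.exists(path));
        System.out.println("On S3:   " + s3.exists(path));
    }
}
```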
How HDFS is Different from “normal” File Systems
While HDFS is called a file system, it is quite different from traditional POSIX file systems. HDFS has been designed for a single workload – MapReduce – and doesn’t support generic workloads. HDFS is also write-once, which means that files can be written once and then only be read. This is perfect for Hadoop and other big data analytics applications; however, you wouldn’t be able to run a transactional database or virtual machines on HDFS.
In addition, HDFS is tuned towards large sequential reads from large files: it uses a large block size of 64 MB (128 MB in newer Hadoop versions). Hadoop was designed to work with large amounts of data and had to be optimized to work efficiently on hard drive storage.
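As a small illustration of the write-once model, the sketch below creates a file through the FileSystem API with an explicit block size and replication factor, writes it sequentially, and closes it; the path and values are only illustrative:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/output.bin"); // placeholder path
        long blockSize = 64L * 1024 * 1024;       // 64 MB blocks
        short replication = 3;                    // typical HDFS replication factor

        // A file is written once, sequentially, and is then read-only;
        // there are no random in-place updates as in a POSIX file system.
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.write("one sequential write, then read-only".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```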
Limitations of HDFS
This focus on large sequential IO was perfect for Hadoop/MapReduce, but IO patterns have changed with frameworks like Apache Spark and the omnipresence of machine learning. The drop in flash prices has also made flash tiers, at least partial ones, feasible for large-scale analytics clusters. HDFS has two limitations for these newer analytics workloads:
- NameNode limitations
Each HDFS cluster has a single NameNode responsible for storing metadata information, such as file names and where each file's content is stored. This single node limits the number of files you can have in a cluster, which becomes a severe limitation when you deal with millions of small files, as is typical for machine learning (see the back-of-the-envelope sketch after this list).
- Small and random IO
HDFS was not designed for concurrent, high-IOPS access, which is exactly the IO pattern where flash drives excel. Again, the many small files and low-latency requirements of machine learning are sources of such IO patterns.
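To put the NameNode limitation in perspective, the back-of-the-envelope sketch below estimates NameNode heap usage for a large number of small files. The figure of roughly 150 bytes of heap per namespace object (file, directory, or block) is a commonly cited rule of thumb, not an exact number, and the file count is purely illustrative:

```java
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Assumption: ~150 bytes of NameNode heap per namespace object
        // (file, directory, or block). A rough, commonly cited rule of thumb.
        final long bytesPerObject = 150;

        long files = 200_000_000L; // e.g. 200 million small files
        long blocksPerFile = 1;    // a small file fits into a single block

        // Each file contributes one file object plus its block objects.
        long objects = files + files * blocksPerFile;
        double heapGib = objects * bytesPerObject / (1024.0 * 1024 * 1024);

        System.out.printf("Roughly %.0f GiB of NameNode heap for %,d small files%n",
                heapGib, files);
    }
}
```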