HDFS Must Die

Posted by Quobyte

You would never send a kid to college with a toddler bed. They’ve outgrown it, right? The Lightning McQueen bed that fit 15 years ago obviously no longer suits present needs. Why, then, wouldn’t an enterprise reach the same conclusion about its data storage?

Apache’s Hadoop Distributed File System (HDFS) and its associated MapReduce engine started in 2006 as a few thousand lines of code inspired by the Google File System paper published in 2003. The software’s core job is to take large amounts of data, split that data into smaller blocks, and distribute those blocks across cluster nodes for faster, parallelized processing before the results are merged back into a final output. Hadoop proved perfect for kicking off the boom in big data analytics, and many companies poured significant resources into building infrastructure around it.
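
For readers who never wrote a MapReduce job directly, here is a minimal sketch of the pattern in Hadoop’s Java API: the classic word count, in which each mapper processes one input split in parallel and the reducers merge the partial counts into a final result. The input and output paths are supplied on the command line and are purely illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper works on one input split (block) independently.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1)
      }
    }
  }

  // Reduce phase: mapper outputs are shuffled, grouped by word, and summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```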

The problem begins with defining “large amounts of data.” According to Statista, the global datasphere — the “volume of data/information created, captured, copied, and consumed worldwide” — measured 2 zettabytes in 2010. That number hit 64.2ZB in 2020 and is projected to be 181ZB in 2025. Beyond size, the nature of data has grown more complex, with structured data being dwarfed by unstructured workloads. Hadoop flourished on the promise of fast, cheap analytics. Over 15 years, though, the amount of hardware scaling needed to accommodate today’s exploding, complex datasets left HDFS reeling. It is neither fast enough nor cost-effective.

The kid has outgrown the bed.


Key Hadoop Shortcomings

In the early Hadoop days, scale-up architecture dominated while scale-out remained largely dormant, waiting for network fabric and hyperconvergence technologies to mature. To meet its performance goals, Hadoop needed compute and storage on the same machine, and HDFS was built around this locality paradigm. And because the big data workloads of the time were batch-centric, HDFS was optimized for large, sequential I/O with 64 MB block sizes. It was never designed to be a general-purpose file system.
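
This design shows up directly in the HDFS client API: block size and replication are per-file write parameters, and each block is then placed on (ideally) a different DataNode for data-local sequential reads. The sketch below uses Hadoop’s standard FileSystem API; the NameNode address and path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeBlockWrite {
  public static void main(String[] args) throws Exception {
    // Placeholder NameNode address; a real deployment reads this from core-site.xml.
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode:8020"), new Configuration());

    long blockSize = 64L * 1024 * 1024; // the classic 64 MB HDFS block
    short replication = 3;              // default three-way replication
    int bufferSize = 4096;

    // Each 64 MB block of this file lands on a (preferably) different DataNode,
    // which is what enables parallel, data-local sequential reads.
    try (FSDataOutputStream out = fs.create(
        new Path("/datasets/clickstream.log"),
        true, bufferSize, replication, blockSize)) {
      out.writeBytes("example record\n");
    }
  }
}
```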

Fast forward to the 2014 arrival of Apache Spark, an analytics engine for mass-scale data processing. Spark also distributes data over machine clusters, but it does so in a way that lets algorithms touch the same data several times throughout processing. Hadoop’s MapReduce, by contrast, is far more linear: data must be read from storage, mapped, reduced, and written back, in that order. As a result, Spark offers dramatically lower latency than Hadoop and thus opens up real-time analytics possibilities for very large datasets. Machine learning also capitalizes on this faster processing, a boon in a decade in which seemingly any project that wants funding needs an ML component.
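
Here is a small sketch of that difference in Spark’s Java API (Java is used here to match the Hadoop examples above): the dataset is loaded once, cached in memory, and then traversed repeatedly by an iterative computation, whereas each MapReduce pass would be a separate job re-reading from storage. The path and iteration count are illustrative.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
  public static void main(String[] args) {
    // Local mode so the sketch runs standalone; a real job would be
    // submitted to a cluster via spark-submit.
    SparkConf conf = new SparkConf().setAppName("iterative-sketch").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load once and keep the derived dataset in memory across iterations.
    JavaRDD<String> lines = sc.textFile("data/clickstream.log"); // placeholder path
    JavaRDD<Long> lengths = lines.map(s -> (long) s.length()).cache();

    // An iterative computation touches the same cached data many times;
    // with MapReduce, each pass would be a separate job re-reading from HDFS.
    for (int i = 0; i < 10; i++) {
      long totalBytes = lengths.reduce(Long::sum);
      System.out.println("iteration " + i + ": total bytes = " + totalBytes);
    }

    sc.stop();
  }
}
```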

Another Hadoop issue lies in its NameNode limitations. The NameNode is where HDFS maintains its directory tree and tracks block locations, and any application that needs a file must consult it first. In its original design the NameNode was a single point of failure, and even with the high availability (HA) configurations added later, every metadata operation still funnels through a single active NameNode. That means the entire storage architecture can be impaired if that server becomes overloaded, which is exactly what happens when it has to track exploding amounts of data, particularly if the dataset is composed of many small files.
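
The sketch below shows why the NameNode sits on every code path, again using Hadoop’s standard FileSystem API: both the directory listing and the block-location lookup are pure metadata calls answered by the NameNode before any DataNode is contacted. The host name and path are placeholders.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeLookups {
  public static void main(String[] args) throws Exception {
    // Placeholder NameNode address.
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode:8020"), new Configuration());

    // Directory listing: one NameNode round trip. With millions of small
    // files, this metadata traffic alone can overwhelm the NameNode.
    for (FileStatus status : fs.listStatus(new Path("/datasets"))) {

      // Block-location lookup: also answered entirely by the NameNode,
      // before a single byte is read from any DataNode.
      BlockLocation[] blocks =
          fs.getFileBlockLocations(status, 0, status.getLen());
      System.out.println(status.getPath() + " -> " + blocks.length + " block(s)");
    }
  }
}
```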

Over time, HDFS became a victim of its initial advantages. Data centralization can be a sound strategy, but only if the data framework under that centralization can adapt to changing workload types.


Quobyte Learned Hadoop’s Lessons

One element of Hadoop’s beginnings remains valid: Locality matters. No matter how fast the fabric and how wide the backbone pipes, sending mass-scale data volumes over the network, and especially into public cloud storage, can throttle application responsiveness. Even that little jaunt to the top of the server rack and back adds delay. In this regard, Hadoop had the right idea, and it’s a principle that Quobyte carries forward. Users can put Quobyte directly on the compute server or, if needed, adjacent in the same rack. Because Quobyte delivers data locality awareness, users get all the locality advantages of Hadoop without the scaling and performance issues.

An obvious reservation here is, “But all of my applications are built for the HDFS driver.” That’s another Quobyte advantage. The Quobyte driver is a native replacement for HDFS: it exposes the same API, so applications simply point at Quobyte instead of HDFS, with zero code changes.
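
A minimal sketch of what “same API, different back end” means in practice: code written against Hadoop’s generic FileSystem interface does not change, only the file system URI (or fs.defaultFS in core-site.xml) does. The quobyte:// scheme below is a hypothetical placeholder for illustration; the exact scheme and driver configuration come from Quobyte’s Hadoop driver documentation.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SameApiDifferentBackend {

  // Identical application code regardless of which file system sits behind
  // the URI scheme; only the FileSystem implementation resolved for that
  // scheme changes.
  static long countBytes(URI clusterUri, String file) throws Exception {
    FileSystem fs = FileSystem.get(clusterUri, new Configuration());
    long bytes = 0;
    byte[] buf = new byte[8192];
    try (FSDataInputStream in = fs.open(new Path(file))) {
      int n;
      while ((n = in.read(buf)) != -1) {
        bytes += n;
      }
    }
    return bytes;
  }

  public static void main(String[] args) throws Exception {
    // Same call against two back ends; the quobyte:// URI is a hypothetical
    // placeholder and requires the corresponding Hadoop driver on the classpath.
    System.out.println(countBytes(URI.create("hdfs://namenode:8020"),
        "/datasets/clickstream.log"));
    System.out.println(countBytes(URI.create("quobyte://registry"),
        "/datasets/clickstream.log"));
  }
}
```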

And unlike Hadoop, Quobyte works with small clouds, hybrid clouds, big and small files, object storage, and random I/O, all with rock-bottom latency. Everything that modern analytics and ML workloads need, Quobyte delivers. Very few competing file systems can make these claims, and none can match Quobyte for flexibility, cost-effectiveness, and ease of implementation.

But what about the hassle of migration? When data lakes scale into the petabytes, it can become seemingly impossible, or at least cost-prohibitive, to transition to a new storage platform. On that basis, staying with Hadoop can seem like the path of least resistance and highest uptime. Fortunately, there are ways to fish strategically from the lake. A review of data retention policies may expose opportunities for deleting (or at least deeply archiving) a significant percentage of total holdings. Third-party deduplication technologies can help siphon off a lot more. From there, orchestrated automation tools can dive into the remaining data and extract the needed information.

Clearly, there are some migration waves to cut through, but the short-term cost and inconvenience will be more than compensated by the long-term value and ROI benefits of moving from Hadoop to a better, truly scalable storage platform able to meet tomorrow’s demands.

Posted by

Quobyte enables companies to run their storage with Google-like efficiency and ease. It offers fully automated, scalable, high-performance software for implementing enterprise-class storage infrastructure for all workloads on standard server hardware.