You would never send a kid off to college with a toddler bed. They’ve outgrown it, right? The Lightning McQueen bed that fit 15 years ago is obviously no longer up to the job. Why, then, wouldn’t an enterprise reach the same conclusion about its data storage?
What is HDFS?
The Apache Hadoop Distributed File System (HDFS) and its associated MapReduce engine started in 2006 as a few thousand lines of code, inspired by the Google File System paper published in 2003 and, on the MapReduce side, by Google’s 2004 MapReduce paper.
The software’s core job is to take large amounts of data, divide that data into smaller blocks, and distribute those blocks across cluster nodes for faster, parallelized processing before merging the results back into a final output. Hadoop proved perfect for kicking off the boom in big data analytics, and companies poured enormous resources into building infrastructure around it.
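To make that divide-process-merge model concrete, here is a minimal sketch of the canonical word-count job, the “hello world” of Hadoop’s MapReduce tutorials. HDFS splits the input across the cluster, each mapper emits a (word, 1) pair for every word in its local block, and the reducers sum those partial counts into the final output. The class name and input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Runs once per input split (roughly, per HDFS block), in parallel across nodes.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1) for each token
      }
    }
  }

  // Receives every count emitted for a given word and merges them into one total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-merge counts locally on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, the framework handles all the splitting, shuffling, and merging; the application code defines only the map and reduce steps.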
The problem begins with defining “large amounts of data.”
According to Statista, the global datasphere — the “volume of data/information created, captured, copied, and consumed worldwide” — measured 2 zettabytes (ZB) in 2010. That number hit 64.2 ZB in 2020 and is projected to reach 181 ZB by 2025.
Beyond sheer size, the nature of data has grown more complex, with structured data dwarfed by unstructured workloads. Hadoop flourished on the promise of fast, cheap analytics. Over the past 15 years, though, the hardware scaling required to accommodate today’s exploding, increasingly complex datasets has left HDFS reeling.
It is neither fast enough nor cost-effective.
The kid has outgrown the bed.