Hortonworks: Thinking about HDFS vs. Other Storage Technologies
August 7, 2012 | SOURCE: Hortonworks
As Apache Hadoop has risen in visibility and ubiquity, we’ve seen a lot of other technologies and vendors put forth as replacements for some or all of the Hadoop stack. Recently, GigaOM listed eight technologies that can be used to replace HDFS (the Hadoop Distributed File System) in some use cases. HDFS is not without flaws, but I predict a rosy future for it. Here is why…
To compare HDFS to other technologies, one must first ask: what is HDFS good at?
- Extreme low cost per byte
HDFS uses commodity direct-attached storage and shares the cost of the network & computers it runs on with the MapReduce / compute layers of the Hadoop stack. HDFS is open source software, so an organization can, if it chooses, run it with zero licensing and support costs. This cost advantage lets organizations store and process orders of magnitude more data per dollar than traditional SAN or NAS systems, which is where many of the proposed alternatives sit on price. In big data deployments, the cost of storage often determines the viability of the system.
- Very high bandwidth to support MapReduce workloads
HDFS can deliver data into the compute infrastructure at a huge data rate, which is often a requirement of big data workloads. HDFS can easily exceed 2 gigabits per second per computer into the MapReduce layer, on a very low-cost shared network. Hadoop can go much faster on higher-speed networks, but 10GigE, InfiniBand, SAN and other high-end technologies double the cost of a deployed cluster. These technologies are optional for HDFS. 2+ gigabits per second per computer may not sound like a lot, but it means that today’s large Hadoop clusters can easily read/write more than a terabyte of data per second continuously to the MapReduce layer.
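To see how those per-node and cluster-wide figures line up, here is a quick back-of-the-envelope check. It is only a sketch of the arithmetic; the 4,000-node cluster size is an illustrative assumption, not a figure from this article.

    // Back-of-the-envelope check of the aggregate-bandwidth claim above.
    // The 4,000-node cluster size is an illustrative assumption.
    public class BandwidthCheck {
        public static void main(String[] args) {
            double gbitsPerSecPerNode = 2.0;                                      // claimed per-node rate into MapReduce
            double bytesPerSecPerNode = gbitsPerSecPerNode * 1_000_000_000L / 8;  // 2 Gb/s = 250 MB/s
            int nodes = 4_000;                                                    // hypothetical large cluster
            double clusterBytesPerSec = bytesPerSecPerNode * nodes;
            System.out.printf("Per node: %.0f MB/s%n", bytesPerSecPerNode / 1e6);
            System.out.printf("Cluster:  %.2f TB/s%n", clusterBytesPerSec / 1e12);
            // Prints roughly 250 MB/s per node and 1.00 TB/s for the cluster,
            // which is how a large cluster sustains a terabyte per second.
        }
    }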
- Rock solid data reliability
When deploying large distributed systems like Hadoop, the laws of probability are not on your side. Things will break every day, often in new and creative ways. Devices will fail and data will be lost or subtly mutated. The design of HDFS is focused on taming this beast. It was designed from the ground up to correctly store and deliver data while under constant assault from the gremlins that huge scale out unleashes in your data center. And it does this in software, again at low cost. Smart design is the easy part; the difficult part is hardening a system in real use cases. The only way you can prove a system is reliable is to run it for years against a variety of production applications at full scale. Hadoop has been proven in thousands of different use cases and cluster sizes, from startups to Internet giants and governments.
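The core mechanism behind that resilience is block replication: HDFS stores each block on multiple DataNodes (three copies by default, controlled by dfs.replication), so a failed disk or node does not mean lost data. As a minimal sketch of how an application can dial this up for especially important data using the standard org.apache.hadoop.fs API (the file path here is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: HDFS protects data by replicating each block across DataNodes
    // (3 copies by default). Here we ask for 5 replicas of one hypothetical file.
    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path critical = new Path("/data/critical/events.log"); // hypothetical path
            boolean ok = fs.setReplication(critical, (short) 5);   // ask the NameNode for 5 copies
            System.out.println("Replication change accepted: " + ok);
            fs.close();
        }
    }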
How does the HDFS competition stack up?
This is an article about Hadoop, so I’m not going to call out the other systems by name, but I assert that every system listed in the “8 ways” article falls short of Hadoop on at least one of the dimensions above. Let me list some of the failure modes:
- Systems not designed for Hadoop’s scale
Many systems simply don’t work at Hadoop scale. They haven’t been designed or proven to work with very large data or many commodity nodes, and they often will not scale to petabytes of data or thousands of nodes. If you have a small use case and value other attributes, such as integration with existing apps in your enterprise, maybe this is a good trade-off, but something that works well in a 10-node test system may fail utterly as your system scales up. Other systems don’t scale operationally or rely on non-scalable hardware. Traditional NAS storage is a simple example of this problem: a NAS can replace Hadoop in a small cluster, but as the cluster scales up, cost and bandwidth issues come to the fore.
- System that don’t use commodity hardware or open source software
Many proprietary software / non-commodity hardware solutions are well tested and great at what they were designed to do. But these solutions cost more than free software on commodity hardware. For small projects this may be acceptable, but most activities have a finite budget, and a system that allows much more data to be stored and used at the same cost often becomes the obvious choice. The disruptive cost advantage of Hadoop & HDFS is fundamental to the current success and growing popularity of the platform. Many Hadoop competitors simply don’t offer the same cost advantage. Vendor price lists speak for themselves in this area (where the prices are even published).
- Not designed for MapReduce’s I/O patterns
Many of these systems are not designed from the ground up for Hadoop’s big sequential scans & writes. Sometimes the limitation is in hardware; sometimes it is in software. Systems that don’t organize their data for large reads cannot keep up with MapReduce’s data rates. Many databases and NoSQL stores are simply not optimized for pumping data into MapReduce.
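For contrast, the access pattern HDFS is built around looks roughly like this: open a file stored as large blocks, then stream it front to back in big buffers with no random seeks, which is what a MapReduce record reader does. The sketch below uses the standard FileSystem API; the input path and buffer size are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the large sequential scan MapReduce depends on:
    // read the whole file front to back in big chunks, no random seeks.
    // Path and buffer size are illustrative, not from the article.
    public class SequentialScan {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path input = new Path("/data/clicks/part-00000");     // hypothetical input file
            byte[] buffer = new byte[4 * 1024 * 1024];            // 4 MB read buffer
            long total = 0;
            try (FSDataInputStream in = fs.open(input)) {
                int n;
                while ((n = in.read(buffer)) != -1) {             // large, contiguous reads
                    total += n;                                   // a real job would parse records here
                }
            }
            System.out.println("Bytes scanned sequentially: " + total);
        }
    }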
- Unproven technology
Hadoop is interesting because it is used in production at extreme scale in the most demanding big data use cases in the world. As a result, thousands of issues have been identified and fixed; this represents several hundred person-centuries of software development investment. It is easy to design a novel alternative system, but a paper, a prototype, or even a history of success in a related domain or a small set of use cases does not prove that a system is ready to take on Hadoop. Tellingly, along with listing some new and interesting systems, the “8 ways” article says goodbye to some systems that vocal advocates previously put forward as HDFS contenders. I’ve got a Rolodex full of folks who used to work on such systems and who are now major players in the Apache Hadoop community.
It is easy to find example use cases where some other storage system is a better choice than Hadoop. But I assert that HDFS is the best system available today at doing exactly what it was built for: being Hadoop’s storage system. It delivers rock solid data reliability and very high sequential read/write bandwidth at the lowest possible cost. As a result, HDFS is, and I predict will remain, THE storage infrastructure for the vast majority of Hadoop clusters.