From Big Data Swamps to Big Data Value with the Right Infrastructure
January 30, 2015
By Tom Phelan, Co-Founder and Chief Architect, BlueData
Big Data promises a lot—new insights about customers, shorter product development cycles, new and increased streams of revenue, and more. But capitalizing on all that promise requires radical shifts in the way data is collected, managed, stored, accessed and used.
Big Data and Big Challenges
A.T. Kearney forecasts global spending on Big Data hardware, software and services will grow at an average annual rate of 30 percent through 2018, reaching a total market size of $114 billion. But even though businesses are spending millions on Big Data-related initiatives, their return on investment is no sure thing. According to Gartner, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage this year. And according to Forrester Research, most companies estimate they’re only analyzing 12 percent of the data they have.
What’s holding these companies back? The infrastructures used today were not purpose-built to handle dynamic Big Data workloads or the changing needs of data scientists. They are extremely rigid, complex and expensive. Mapping a Big Data enterprise plan requires new considerations for enterprise applications, storage, and the supporting hardware/infrastructure. Enterprises need new solutions that help them navigate their existing disparate, disjoint data silos, which are currently difficult to secure and manage.
Finding the Right Solution
Several new technologies have emerged, designed to make Big Data, and its underlying infrastructure, more manageable, accessible and useful. One approach, known as a data lake, has gotten a lot of attention.
A data lake is a repository for large quantities and varieties of data, both unstructured and structured. Data lakes store data in its original, native format, for later transformation. As PwC describes, “customer, supplier, and operations data are consolidated with little or no effort from data owners.” Proponents argue that, compared to data warehouses, data lakes are a lot less expensive on a per-terabyte basis to set up and maintain.
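To make the “store now, transform later” idea concrete, here is a minimal schema-on-read sketch in PySpark: raw JSON events are landed in the lake in their native format, and a schema is applied only when the data is queried. The paths, field names and Spark setup are illustrative assumptions, not details of any particular data lake product.

```python
# Minimal schema-on-read sketch (paths and field names are illustrative).
# The raw files were written to the lake untouched; structure is imposed at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Hypothetical schema, defined long after the data was collected.
event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

# The schema is applied only at read time; the stored JSON stays in its native form.
events = spark.read.schema(event_schema).json("hdfs:///datalake/raw/events/")
events.groupBy("event_type").count().show()

spark.stop()
```

The point of the sketch is that no transformation or modeling work is required when the data is landed; the cost of imposing structure is deferred to whoever eventually reads it, which is exactly the trade-off Gartner flags below.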
While data lakes can help enterprises break down age-old information silos and better deal with varied information, Gartner cautions that they may not work for everyone. In a release, Gartner VP and distinguished analyst Andrew White said, “Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used—data is simply dumped into the data lake. However, getting value out of the data remains the responsibility of the business end user. Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place.” A data swamp.
Specifically, data lakes offer no way to manage, govern or secure the data. In addition, to populate a data lake, data frequently needs to be replicated from its existing storage silo, so the amount of data an enterprise must store, already massive, only gets bigger. Storage may be cheap, but it still costs money. An equally important consideration is network performance: if all the data is centralized in one locale, will there be performance penalties when that data is accessed and leveraged from remote sites?
The Big Data Private Cloud
There is a new Big Data infrastructure approach emerging—one that provides enterprises with a private cloud for Big Data applications. It tackles head-on the overwhelming complexity of Big Data, which most enterprises simply do not have the IT and DevOps staffs to handle. As our co-founder and CEO Kumar Sreekanti puts it, Hadoop was initially designed by engineers, for engineers. “In fact those companies that are successful with Hadoop are the Silicon Valley who’s who of companies with many large engineering teams that tune-up these systems and make them valuable. Unfortunately, most enterprises like banks and pharmaceuticals and media companies do not have that kind of engineering talent.”
A Big Data private cloud provides the benefits of a data lake: it works with all types of data, structured and unstructured, and eliminates data silos by providing centralized access to that data. But it does so through virtualization techniques, so unlike data lakes, there’s no need to copy all the data into a single, centralized repository. By running Big Data private clouds, enterprises can keep data where it has always resided while creating a “centralized” experience for all that data so it can be easily managed, governed and accessed. The underlying software-defined infrastructure creates a virtualization platform for Big Data that separates analytics processing from the data storage.
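As a rough illustration of that separation of analytics processing from storage, the hypothetical PySpark job below analyzes data that remains in two different storage silos, with nothing copied into a central repository first. The hostnames, bucket names and column names are assumptions made for the example; this is a generic sketch of compute/storage separation, not a description of BlueData’s platform.

```python
# Hypothetical compute-storage separation sketch: the analytics cluster reads
# data in place from the silos where it already lives (all names are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-place-analytics-sketch").getOrCreate()

# One dataset stays on a departmental HDFS cluster, another in an object store
# (reading s3a:// paths assumes the hadoop-aws connector is on the classpath).
sales = spark.read.parquet("hdfs://finance-silo.example.com:8020/warehouse/sales/")
clicks = spark.read.json("s3a://marketing-silo-bucket/clickstream/")

# Both silos are queried together without consolidating the files first;
# the assumed join key "customer_id" exists in both illustrative datasets.
sales.join(clicks, "customer_id").groupBy("region").count().show()

spark.stop()
```

The design choice the sketch highlights is that the compute layer is pointed at the data, rather than the data being moved to the compute layer, which is what keeps replication and storage growth in check.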
Simplify and Democratize Big Data
Enterprises are already sold on initiatives that serve up better customer information, improve and streamline product development cycles, increase competitiveness and generate more revenue. And they want to believe in and use Big Data, which promises to deliver on so many of these. But Big Data infrastructure is complex, rigid and expensive, and the fundamental applications that leverage Big Data, such as Hadoop, can be equally difficult. This complexity is impeding Big Data’s value.
What enterprises need are technology solutions that simplify and democratize Big Data, and its uses, so even the least sophisticated users can benefit. Data lakes and Big Data private clouds are two such solutions.
Tom Phelan, Co-Founder and Chief Architect, BlueData
Tom has spent the last 25 years as a senior architect, developer and team lead in the computer software industry in Silicon Valley. Prior to co-founding BlueData, Tom spent 10 years at VMware as a senior architect and team lead on the core R&D Storage and Availability team. Most recently, Tom led one of its key projects, vFlash, focused on integrating server-based flash into the vSphere core hypervisor. Prior to VMware, Tom was part of the early team at Silicon Graphics that developed XFS, one of the most successful open source file systems. Earlier in his career, he was a key member of the Stratus team that ported the Unix operating system to its highly available computing platform.
Tom received his Computer Science degree from the University of California, Berkeley.