
A New Data Warehouse Architecture for the Brave New BI World

February 18, 2015

Featured article by Paul Moxon, Senior Director of Product Management at Denodo Technologies

Big Data, Internet of Things, Data Lakes, Streaming Analytics, Machine Learning…these are just a few of the buzzwords being thrown around in the world of data management today. They provide us with new sources of data, new forms of analytics, and new ways of storing, managing, and utilizing our data. The reality, however, is that traditional data warehouse architectures are no longer able to handle many of these new technologies, and a new data architecture is required.

Traditional data warehouse architectures are designed to replicate data from operational systems into a central data warehouse on a scheduled basis – daily, weekly, monthly, etc. During this replication process, the data is transformed, cleansed, and often aggregated. The resultant data also conforms to the pre-defined data model of the target data warehouse. (This is referred to as ‘schema on write’, i.e. the data must conform to the pre-defined target schema as it is written to the data warehouse.) Once the data is in the warehouse, it is considered trustworthy and can be used to create the numerous reports that oil the organizational workings. However, the whole process of replicating and cleansing the data takes time to create and execute, and time is something that is in short supply in today’s business world.
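To make ‘schema on write’ concrete, here is a minimal Python sketch; the table name, columns, and cleansing rules are illustrative assumptions rather than anything from the article. Each record is cleansed, transformed, and forced to conform to the pre-defined warehouse schema at the moment it is written, and anything that cannot conform is rejected.

```python
import sqlite3
from datetime import datetime

# Pre-defined target schema: data must conform to it at write time ("schema on write").
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id  INTEGER PRIMARY KEY,
        full_name    TEXT NOT NULL,
        country_code TEXT NOT NULL,      -- two-letter country code
        loaded_at    TEXT NOT NULL
    )
""")

def load_customer(raw: dict) -> None:
    """Cleanse and transform a raw operational record, then write it into the
    pre-defined warehouse schema. Records that cannot be made to conform are
    rejected before the write."""
    name = (raw.get("name") or "").strip().title()
    country = (raw.get("country") or "").strip().upper()[:2]
    if not name or not country:
        raise ValueError(f"record rejected during cleansing: {raw}")
    conn.execute(
        "INSERT OR REPLACE INTO dim_customer VALUES (?, ?, ?, ?)",
        (int(raw["id"]), name, country, datetime.utcnow().isoformat()),
    )

# Scheduled batch replication from an operational extract (daily, weekly, ...).
for record in [{"id": 1, "name": " ada lovelace ", "country": "gb"}]:
    load_customer(record)
conn.commit()
```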

These traditional architectures are not designed to cope with the vast amount of data and the varying data formats that can be generated today – whether data from mobile devices, sensor data, web data, and so on. Compared to the structured, clean conformity of the data warehouse, the new data and data sources represent anarchy! All kinds of data can be thrown into your Big Data store without needing to clean it or make it conform to a defined data schema. The idea behind Big Data stores is “store the data and we’ll work out how to read it later”. This is classic ‘schema on read’, i.e. you work out the format of the data when you come to read it. This turns the traditional data warehouse architectures on their head.
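By way of contrast, here is a minimal ‘schema on read’ sketch; the event shapes and field names below are invented for illustration. The raw records land in the store untouched, and structure is only imposed when a reader pulls the data back out.

```python
import json

# Raw events landed in a Big Data store "as is": no cleansing, no agreed schema.
raw_events = [
    '{"device": "mobile-42", "temp_c": 21.5, "ts": "2015-02-18T09:00:00Z"}',
    '{"sensor": "mobile-42", "temperature": "21.7", "ts": "2015-02-18T09:05:00Z"}',
    'not even JSON, just a malformed line from a flaky source',
]

def read_temperatures(lines):
    """Apply a schema at read time ("schema on read"): the reader decides which
    fields matter and how to interpret them, and skips records it cannot use."""
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # tolerate junk in the store; no load-time rejection happened
        device = event.get("device", event.get("sensor"))
        temp = event.get("temp_c", event.get("temperature"))
        if device is None or temp is None:
            continue
        yield device, float(temp)

print(list(read_temperatures(raw_events)))  # [('mobile-42', 21.5), ('mobile-42', 21.7)]
```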

Based on the seeming incompatibility between traditional architectures and the requirements for handling this new data, some vendors and analysts are suggesting that organizations should adopt a ‘data lake’ architecture in which all data is poured into the ‘lake’ (based on Hadoop) to provide a single enterprise-wide data repository. However, the reality is that existing data warehouses are not going anywhere soon. Companies are simply not going to throw away up to three decades’ worth of investment that they have made in these technologies and tools. The data warehouse will remain as a source of clean, verified corporate data for the foreseeable future.

So, the new architecture needs to extend the existing data warehouse to accommodate and incorporate all the good things that are coming from the ‘new data’ world. Leading industry experts Claudia Imhoff and Colin White have expounded the idea of an Extended Data Warehouse architecture which encompasses both the traditional data warehouse and the exciting new data environment. The new Extended Data Warehouse architecture (shown in Figure 1) contains a number of key components, namely:

Traditional EDW Environment – This is exactly what it says: the traditional data warehouse environment, complete with BI and reporting tools. The data stored in this environment is typically aggregated, highly structured, conforms to the data warehouse schema, and has been cleansed and verified. This is the environment for traditional reporting and analysis, e.g. production reporting, historical comparisons, customer analysis, forecasting, etc. In many companies, the data in this environment is the oil that keeps the organization running.

 


Figure 1 – Extended Data Warehouse Architecture

Investigative Computing Platform – The investigative computing platform contains technologies such as Hadoop, in-memory computing, columnar storage, data compression, etc., and is intended to provide the environment for managing and analyzing massive amounts of detailed data – only some of which might actually be useful. This is the environment where the data scientists perform data mining, predictive modeling, cause-and-effect analysis, pattern analysis, and other advanced analytical investigations.

Data Integration Platform – The data integration platform is the place where the heavy lifting of extracting, cleaning, transforming, and loading the data into the data warehouse is performed. Traditionally this has been done as a batch load (ETL/ELT), but it can also be done via trickle feed (Change Data Capture), and data virtualization can also be used within the data integration platform. As the data integration platform is used to load data into the trusted data warehouse, this environment requires more formal data governance policies to manage data security, privacy, data quality, and so on.

Data Refinery – The data refinery allows the users of the investigative computing platform to access and filter the data that they need for their analysis. The data refinery – as its name suggests – refines the raw data to provide useful data to be analyzed (a simple filtering sketch appears after this list). Because of the nature of the investigative computing platform, the data refinery is more flexible with its governance policies – quick access and fail-fast analysis are more important than strict governance. Data virtualization is a technology that is a key part of the data refinery.

Others – In addition to the above components, there are the operational systems and a real-time analysis engine, which feeds off the operational systems and real-time streaming data to support use cases such as real-time fraud detection, stock trading analysis, location-based offers, etc.
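As a rough illustration of the data refinery step referenced above, here is a small Python sketch; the clickstream fields, the bot flag, and the duration threshold are all assumptions made for this example rather than anything prescribed by the architecture.

```python
import csv
import io

# Raw clickstream landed in the investigative platform (illustrative sample;
# in practice this would be files in Hadoop, not an in-memory string).
RAW = """user_id,page,duration_ms,bot
101,/pricing,5400,false
102,/careers,120,false
103,/pricing,9800,true
101,/docs,15000,false
"""

def refine_clickstream(raw_csv, min_duration_ms=1000):
    """A 'data refinery' step: filter and lightly shape raw detail data into
    something a data scientist can analyze, without the strict governance
    applied to the warehouse load path."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    for row in reader:
        if row["bot"] == "true":
            continue                      # drop obvious noise
        duration = int(row["duration_ms"])
        if duration < min_duration_ms:
            continue                      # keep only meaningful visits
        yield {"user_id": int(row["user_id"]),
               "page": row["page"],
               "duration_s": duration / 1000.0}

for visit in refine_clickstream(RAW):
    print(visit)
```

Because the refinery feeds exploratory, fail-fast work rather than the trusted warehouse, a filter like this can be changed or discarded quickly without a formal change process.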

These components cannot work in isolation – that just results in data silos that inhibit the organization’s ability and agility to react to business events. Investigative analysis cannot be performed in a vacuum; the analysis must be performed in the context of the business, and that context is typically contained within the information stored in the traditional data warehouse. So, if you have a data scientist building predictive customer behavior models, the customer data that they need for the context of their model is typically stored within the data warehouse. Figure 2 illustrates this type of scenario.

 


Figure 2 – Extended Data Warehouse Architecture Interactions

Data virtualization is the glue that binds the various components together. Using a data virtualization layer, data scientists can access any data that they need for their modeling and analysis – whether it is in the traditional data warehouse, in newer data sources such as Hadoop or NoSQL databases, or completely external to the organization. The data is presented in the form that is most useful to the data scientists. If they are using advanced visualization tools, such as Tableau, the data appears as if it is a relational table. If they are writing their own statistical algorithms using R, the data looks like a JDBC data source. If they are creating mobile or web applications, the data is available as a web service, and so on. The agility and ease of use of accessing this data through the data virtualization layer means that the data scientist spends more time analyzing the data and less time trying to figure out how to get the data and get it into a usable format. After all, you want your valuable (and expensive) data scientists and statisticians providing insights into the business, not spending their time writing data access and conversion utilities!
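Here is a minimal sketch of what that access might look like from the data scientist’s side, assuming the virtualization layer exposes its virtual views over a standard ODBC connection; the DSN, credentials, and view names below are made up for illustration and are not Denodo’s actual API. An ordinary SQL query can then join warehouse customer attributes with behavioral data held in Hadoop as if they were a single relational source.

```python
import pyodbc  # any ODBC-capable driver would do; the DSN below is hypothetical

# The virtualization layer presents one logical source: a virtual view joining
# warehouse customer attributes with clickstream data stored in Hadoop.
conn = pyodbc.connect("DSN=virtual_layer;UID=analyst;PWD=secret")
cursor = conn.cursor()

cursor.execute("""
    SELECT c.customer_id,
           c.segment,              -- from the traditional data warehouse
           b.pages_viewed,         -- from clickstream data in Hadoop
           b.avg_session_seconds
    FROM   customer_context   AS c     -- virtual view names are illustrative
    JOIN   customer_behaviour AS b
      ON   b.customer_id = c.customer_id
    WHERE  c.segment = ?
""", "high_value")

rows = cursor.fetchall()   # feed these straight into R, pandas, or a model
```

The same virtual view could just as easily back a Tableau connection or be exposed as a web service; the point is that the consumer never needs to know which underlying system each column came from.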

If you are interested in learning more, listen to Claudia Imhoff talk about the Extended Data Warehouse architecture and examples of how companies are using data virtualization to implement an investigative computing platform, combining ‘new data’ with more traditional data from existing systems. To listen, visit “Extended Data Warehouse – a New Data Architecture for Modern BI”.


Paul Moxon is Senior Director of Product Management responsible for product management and solution architecture at Denodo Technologies, a leader in Data Virtualization software. He has over 20 years of experience with leading integration companies such as Progress Software, BEA Systems, and Axway. For more information contact him at pmoxon@denodo.com.
