Swimming in a lake of confusion: Does the Hadoop data lake make sense?

Total Shares

Originally published on SAS Voices

Pig SwimmingAll hail the data lake, destroyer of  enterprise data warehouses and the solution to all our enterprise data access problems! Ok – well, maybe not. In part four of this series I want to talk about the confusion in the market I am seeing around the data lake phrase, including a look at how the term seems to be evolving within organizations based on my recent interactions.

In my previous post, I discussed three common use cases for deploying Hadoop alongside existing enterprise data warehouse infrastructures, along with some of the benefits and reasons for doing so. All three cases either required no change to existing approaches, or they required only small changes in where the data should be flowing when Hadoop was introduced. Effectively, they are less disruptive approaches.

Over the past six months, however, I’ve started hearing more questions around something that many refer to as the data lake (you might also hear it called the enterprise data hub or other derivatives of that term). As you’ll commonly find during emerging phases of a popular new technology, there’s great confusion about the meaning of this term, and I think it is evolving.

The original data lake view

The wiktionary definition for a data lake is perhaps a good starting point:

datalake1

If you develop that idea further the aim of deploying a data lake, based on Hadoop, is to move the majority of an organization’s data into Hadoop, with a long-term goal of Hadoop becoming the predominant and only data store over time, removing the need for the enterprise data warehouse (EDW) and data marts. While the practicalities, from both a security and ecosystem maturity standpoint, are still open to debate, the concept is worth contemplation.

When I think about the idea of a data lake, I am not convinced that the dream of doing everything in Hadoop ONLY is going to become a reality anytime soon, especially given the amount of re-engineering that would be required for most organizations. Even if we negate the re-engineering issues, we still have to consider whether Hadoop is ready to support all the applications that the EDW is supporting today, with appropriate service level agreements and regulatory requirements. Who knows if this will be the case in three to five years time.

Most organizations I have worked with have also reached the same conclusion, and hence I have noticed a distinct shift in what people perceive as the data lake.

The evolving data lake view

On some of my visits with larger organizations, I have started to see a new definition of the data lake emerging. In this new definition, organizations are redefining the data lake to represent the complete new data ecosystem they want to build. At the heart of this, Hadoop still retains a role as the predominant data store to collect everything up front, but the thinking does not stop there.

datalake21

 

In these new style data lakes, previous investments made in RDBMS technologies and data appliances still play a role in storing data (along with emerging structured stores in Hadoop such as Hive, Cloudera Impala and Pivotal Hawq) and providing structure for certain applications downstream. If you like, Hadoop is the single source for all downstream applications and it is the place most of the “transformation” of data is intended to take place so that the end result can mostly just be copied into the downstream store. In this picture, event detection, real-time analytics and much more is linked to that initial Hadoop store and the amount of data that is stored and accessed downstream is likely to decrease greatly.

To make things more confusing, some people use the term data reservoir to describe this process of going from raw data, to Hadoop, and then out to structured stores. In this approach, the idea is that you slowly refine the data over time to get to what is made available. To this extent, the phrase data reservoir is emerging to describe a Hadoop centric platform that is still built with the idea of structure, as opposed to the data lake, which lacks structure or control.

Hopefully the market settles on a definition soon. I prefer the broader picture as I think it reflects the reality of where we will be for some time.

Other important technologies, not covered in this blog, are Data Federation/Virtualization and Streaming Analytics technologies. In almost all cases these two technologies play a major role in the path forwards around a Data Lake no matter which definition you apply. More on their role in future blogs.

Starting point

The organizations we are working with have all generally started out looking at one of the three first scenarios to build out their knowledge of Hadoop and to deliver some quick wins. From there, the more ambitious ones have had their enterprise architects looking at whether it’s possible to implement the data lake concept (mostly with the view of Hadoop as part of an overall new data ecosystem not a total replacement). Many of these organizations are approaching their efforts with the assumption that some of the Hadoop limitations we see today will be rapidly overcome, and they will be positioned as the first movers right when the next generation digital data wave hits.

What are you seeing around the data lake term? Feel free to use the comments part of this post to share your views on data lakes (or their equivalents). I am interested to see how many more definitions there are in the market and to talk with organizations that are looking at data lakes as the path forwards.

The Big Data Lab!

In my next blog post I will discuss the concept of a “big data lab.” Many organizations are starting out their journey in the Hadoop world. Their hope is that a big data lab will allow them to setup innovation centers that are not bogged down by the weight of the production regulated worlds that generally define how fast they can innovate today.

Image Credit

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.