The Cloudera Scoop: One Platform Initiative, RecordService & Kudu


Cloudera has recently announced the One Platform Initiative, RecordService and Kudu. During a recent keynote speech at the Cloudera Sessions in Munich, Mike Olson of Cloudera spent some time talking about the motivations for those moves.

This blog post shares what he spoke about, mixes in some of my own reading, and then adds my own take on the announcements. It also covers some of the other information about Cloudera that Mike shared during his speech. To find out more about the first half of his keynote, see this post.

After reading this post you will understand what the One Platform Initiative is all about, what RecordService is and what it is trying to do, what Kudu is (as well as what it is not), and you will know a little bit more about Cloudera thanks to the facts Mike shared as part of his keynote.

The recent Cloudera announcements

One Platform Initiative

One Platform Initiative – The common thinking is that Apache Spark will eventually replace MapReduce, solving a lot of the batch-mode issues that MapReduce created. Spark was developed by academics, however, and it ignored a number of enterprise concerns such as security (securing the cluster, encrypting data, user- and role-based authentication and access, etc.). It still does not scale to the degree MapReduce does, and it needs to be easier to deploy, operate and manage. This is the focus of the One Platform Initiative: making Spark a first-class member of the Cloudera enterprise data hub, which means dealing with all the issues outlined above. In short, making Spark enterprise class.
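To make the batch-mode contrast concrete, here is a minimal word count written against Spark's Java API. This is only a sketch: the HDFS paths are placeholders of mine, and I am assuming a Spark 2.x-style API (older releases used a slightly different flatMap signature).

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read the input from HDFS; the path is an example placeholder.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

            // Split lines into words, pair each word with a count of 1,
            // then sum the counts per word.
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

            counts.saveAsTextFile("hdfs:///data/output");
        }
    }
}
```

The whole pipeline is a handful of lambda calls; the equivalent MapReduce job would need separate Mapper and Reducer classes plus a driver, which is a big part of why the shift feels inevitable.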

.@Cloudera has more contributors and committers on the Spark project than all other vendors combined - @mikeolson #Hadoop #Bigdata

My View: It is clear that the whole world is moving towards in-memory processing, and the shift from MapReduce to Spark feels somewhat inevitable. With that said, there is much to be done, and Cloudera outlined many of the needed pieces above. This ambitious project could be the making of Hadoop for the next few years, as it will push Spark beyond the analytics engine many see it as today and bring it to the level where it can be deployed reliably, securely and at scale within enterprises.

RecordService

RecordService – Cloudera is very focused on securing data and providing appropriate privacy. Today it is possible to use the same data in many different ways within a Hadoop cluster:

  • Run a MapReduce job
  • Send a SQL query to Hive or Impala
  • Launch a search
  • Run analytics in Spark
  • Use third-party partner products

Essentially there are many different frameworks which can use the same data without you needing to move it around. The problem is that today there is no real way to apply a data policy once and have it respected by every framework or route to the data. In the past this has meant a major piece of work, manually tweaking the various engines so they respect the security being managed by the system.

It is this issue that the RecordService project is trying to address. The idea is to provide a new layer in the platform as an extension to Apache Sentry, which today allows you to secure data and control which users can access and use it. Ultimately this layer could support ANY of the current or emerging Hadoop frameworks. The prime goal is to let data security policies be set once and have all the frameworks respect them.
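RecordService was brand new at the time of the announcement and I have not coded against it, so the sketch below is purely hypothetical: none of these types exist in the real API. It is only meant to illustrate the idea of every framework reading through one shared, policy-enforcing layer instead of touching HDFS files directly.

```java
import java.util.List;
import java.util.Map;

// HYPOTHETICAL sketch only: these types are invented for illustration
// and are not part of the real RecordService API.
interface SecureRecordReader {

    // Every framework (MapReduce, Spark, Impala, search, partner tools)
    // would fetch data through a call like this instead of reading raw
    // HDFS files. The layer looks up the Sentry-managed policy for the
    // given user and returns only the rows and columns that user may
    // see, so the policy is defined once and enforced everywhere.
    List<Map<String, Object>> scan(String table, String user);
}
```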

My View: I can understand why Cloudera is seeking an approach that provides a single security layer across all the frameworks, and partner frameworks, that are part of its ecosystem. It makes a great deal of sense, as it will speed development, make the overall platform more secure and help customers. What is interesting is that while Cloudera speaks about Apache Sentry, which this will extend, as the industry standard, Hortonworks is currently more a proponent of Apache Ranger. There are people who follow, and write about, Hadoop security far more than I do, but I think this post is a good one to get a view. So now the question is: will Hortonworks respond? For example, is it already working on extending Ranger to frameworks beyond those it supports today? I expect the answer is yes. If that is the case we will still have competing security frameworks with the same goals :-(. It would be nice to see some consolidation.

In the meantime this is great news for Cloudera customers, who should start to see a lot more security consistency across the various frameworks, and every one of them will welcome that!

Kudu

Kudu – The first thing I learnt about Kudu is that it is not an SQL engine. In fact it provides no SQL interface to access data at all. This is important, as for a while I thought it was another competitor to the SQL-on-Hadoop engines and was trying to work out the overlap with Impala.

So let's step back. In Hadoop you can store data in a number of ways. The original way was to write very large (initially log) files and process them in order. The issue with that is you could not quickly deliver individual records on demand, or easily make updates, because once written the data was immutable. To resolve that, Apache HBase was created in 2009. HBase is a NoSQL store where you can store single records and do overwrites pretty quickly.
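To illustrate the record-level access HBase added, here is a minimal single-record write and read using the standard HBase Java client (1.x-style API). The table name "events" and column family "d" are just examples I made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSingleRecord {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("events"))) {

            // Write (or overwrite) a single record keyed by its row key,
            // the kind of in-place update immutable HDFS files cannot do.
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"),
                          Bytes.toBytes("active"));
            table.put(put);

            // Fetch that one record straight back by key.
            Result result = table.get(new Get(Bytes.toBytes("row-001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
        }
    }
}
```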

So today we have HDFS, which is great if you want to scan lots of data. HBase is good at creating lots of records and working with individual records, but not so good when you want to do a full table scan. As a result, Mike asserted that neither HDFS nor HBase lets you both create a table with a collection of records and scan through it quickly. The view is that this capability would help handle the relational workloads companies are looking for today: for example creating a table, loading a collection of records and looking at sales by region. That is not something that can be done quickly today. When you then look forward to things like time series and sensor data, there will be a continuous stream of data you need to be able to store, query and scan to compute various things.

Kudu is aimed at dealing with this hybrid store, update and scan issue. It is not a replacement for HDFS, which remains great for workloads where data is not really being updated in place but there is a LOT of it. It is not a replacement for HBase, which works great on smaller data volumes where you need to do updates and work at the record level but without too many full scans. So the idea is that you now have three different possible storage frameworks you can use to host your data. The Cloudera view is that this will open Hadoop to new workloads that could not really be run sensibly before.

I spent time watching the announcements and reading this paper, which is what helped me understand this was not another SQL engine. Today Kudu offers APIs only in Java and C++, along with some experimental support for Python. It offers

“a simple API for row-level inserts, updates, and deletes, while providing table scans at throughputs similar to Parquet, a commonly-used columnar format for static data.” – Taken from the paper linked above

From what I can tell, Kudu can currently work with MapReduce, Spark and Impala. I might be missing something though.
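For the curious, here is roughly what that row-level API looks like using the Kudu Java client. This is a sketch under assumptions: the master address, table name and schema are mine, the table is assumed to already exist, and the exact package and method names may differ between the blog-era beta (org.kududb) and later Apache releases (org.apache.kudu).

```java
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;

public class KuduInsertAndScan {
    public static void main(String[] args) throws Exception {
        KuduClient client =
            new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            // Assumes a table with columns (string key, string region,
            // int32 amount) has already been created.
            KuduTable table = client.openTable("sales");

            // Row-level insert: the single-record write that plain HDFS
            // files cannot do in place.
            KuduSession session = client.newSession();
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            row.addString("key", "txn-001");
            row.addString("region", "EMEA");
            row.addInt("amount", 250);
            session.apply(insert);
            session.close();

            // Fast scan over the same table: the HDFS-style strength.
            KuduScanner scanner = client.newScannerBuilder(table).build();
            while (scanner.hasMoreRows()) {
                RowResultIterator results = scanner.nextRows();
                while (results.hasNext()) {
                    RowResult r = results.next();
                    System.out.println(
                        r.getString("region") + ": " + r.getInt("amount"));
                }
            }
        } finally {
            client.close();
        }
    }
}
```

The same table would then be queryable from Impala, which is where the "sales by region" style of workload Mike described comes in.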

My View: Cloudera basically states that there has been little innovation in the storage frameworks of the Apache Hadoop ecosystem for some time. Kudu is a project they have incubated for three years to try to address this. It does feel like there is a bit of catch-up here to the proprietary storage framework under MapR, for example.

At the moment it feels to me like, in the early days, the main benefit will fall to Impala customers, who will gain from using Kudu and the extra capabilities it brings. It is unclear to me how you trigger storing data in Kudu. It seems obvious that via the frameworks above (Impala, MapReduce and Spark) you might be able to do that, but it is not clear if any of the data loading and integration tools are stepping into that space. Without that it will come down to someone writing code.

Once again this is an interesting move, one which could open the doors for things like Hive to add some additional capabilities once Kudu makes it into the Apache foundation. Anything that shakes up the ecosystem and offers support for potentially more use cases is welcome.

The Cloudera Role & Commitment

Mike then went on to speak about the open source community. He showed that in 2008 Hadoop was just two things: HDFS and MapReduce. Since then, new open source projects have been created and joined the family every year.

At @Cloudera our job is to be experts on this (open source) progression - @mikeolson

Cloudera needs to know what is coming in that world and work out which projects to collect and aggregate into the platform. He then stated that there are more than 25 open source projects in the Cloudera platform today, and that number will grow. The job of Cloudera is to pull together the right things to provide a reliable, stable and secure platform for the enterprise. He said the pace of innovation far outstrips anything we have seen in the RDBMS world, and it is happening because there are so many people in the community contributing.

Lastly, Mike shared that Big Data is going to change the world. He then said that this change is not because of Cloudera, though. Instead he heaped praise on the 1900+ partners of Cloudera that help take their great platform into the enterprise and deliver valuable solutions.

He then finished off by talking about the 18-person team they now have on the ground in the region (with plans to grow) and the new customer service centre coming online in Budapest.

With that he made a final impassioned plea that everyone should start their journey today!

My Conclusion

I found Mike's presentation demonstrated his passion for using technology for change (something I share). He has been on the Hadoop trail for a lot longer than I have, and that is obvious when he discusses the relative merits of one component over another.

The new developments are all clearly great steps forward for Cloudera, but there are still situations where Hadoop vendors will diverge on which Apache projects they use. This divergence has the potential to lock a company into one vendor up front, so it becomes even more important to be clear about what you need to do, and where the Hadoop vendors are headed, before you start on your big data (Hadoop) journey.

I would love your feedback on this blog or any of my comments.
