The first decade is over and we’re entering the second. One industry watcher makes a great point: Awkward teenage years ahead?
I don’t believe we’ll be one of those ‘difficult’ teenagers. We might be a bit of a nerd, but we’ll be the well-balanced one. The one with friends, the one that goes to the prom and off to college.
Let me explain why.
First, let’s pause a second. We’ve come a long way.
Back in 2006, Hadoop was relatively simple. The use case was batch analytics for web search, and we started off with just a few hundred lines of code. We went from there, and a whole bunch of new projects soon showed up, such as Hive and HBase. ‘It takes a village,’ and no matter what you did, whether you wrote code, used the project, wrote documentation, or answered questions on the mailing list, you played a big part. For that, thank you.
That was just the beginning. In 2011 we then stood up and talked about YARN. We said: ‘we’ve made a quantum leap.’ Rather than just a batch system, now you could put all your data in one place and, as the data develops what I call ‘gravity’, pull different applications to it, both batch and interactive. Here’s my favorite bat slap graphic from that time.
The last five years have flown by. Since 2011, the ecosystem has gone even further. A lot of other new and exciting things have come in, and a huge ecosystem has built up around Hadoop. What’s been exciting about this is that not only do we have the open source community on board and contributing, but also the enterprise vendors via ODPi.
So what’s next? Let’s look a little bit ahead. How do we make sure we aren’t that particularly troublesome teenager?
For one, we continue to push for Hadoop to be effective both in the data center and in the cloud. Partners like Microsoft have been amazing collaborators in bringing HDInsight into the open community.
Second, you have to step back and see what people are doing with Hadoop today. Five years ago, we had components in the single digits. Today we have more than 25.
What’s happening as we push all of these technologies out is that customers are trying to solve real business use cases, usually in the form of an application that deals with vast amounts of data and comes up with an insight via BI, a reporting tool, or a predictive app.
I believe every modern app is now fundamentally a data app: one that uses data in a really interesting way to come up with insights and drive business outcomes.
Take, for example, a modern credit card fraud detection app. This kind of app pulls in a lot of data from different sources on user behavior to drive a predictive model. For example, to go back to the customer and say: ‘I saw your credit card was used in London but you are based in San Jose. Is this you?’
But if you peek under the hood, these apps are still using technologies like Kafka and NiFi to ingest data, Spark to build the models, and HBase to store and serve the models and push results back. Until now it has not been particularly easy to stand all of this up, govern it, secure it, make sure it’s highly available, provide disaster recovery, and so on.
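To make the London/San Jose example concrete, here is a toy sketch of the core fraud check. In a real deployment the rule would be a predictive model built with Spark, fed by Kafka/NiFi and served from HBase, as described above; the function and field names below are purely hypothetical illustrations.

```python
# Toy sketch of the fraud-check idea: flag a transaction whose location
# doesn't match the cardholder's home city. A real app would score events
# with a trained model rather than a single rule; names are hypothetical.

def flag_suspicious(transaction: dict, home_city: str) -> bool:
    """Return True if the transaction happened away from the user's home city."""
    return transaction["city"] != home_city

txn = {"city": "London", "amount": 250.0}
if flag_suspicious(txn, home_city="San Jose"):
    print("I saw your credit card was used in London "
          "but you are based in San Jose. Is this you?")
```

The point is not the rule itself but the plumbing around it: ingesting the events, training and serving the model, and doing all of that securely is where the hard work lives.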
Here’s how we are avoiding the ‘troublesome teenager’ thing.
Wouldn’t it be great if you could just download that fraud detection app and run it in your HDP cluster?
That’s exactly where we are going with Connected Data Platforms: to make it so you can select engines and services via a user-friendly experience and run, secure, and operate them as a whole.
In other words, so you can see ‘fraud detection’ as an application, not a set of individual technologies, and make it all secure. This means that, using Docker to isolate the application from its environment, you can take your modern data app, burn it as a container, and then just upload and run it on your Hadoop and YARN clusters.
So the design principle we are moving towards is to be easy to use and operate, portable, and repeatable: to be able to look at this as a single business application and scale it up and down based on the number of transactions, or run a beta and a production version of HBase on the same cluster, regardless of what version of HDP or cluster you might be running.
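To sketch what ‘burn it as a container and run it on YARN’ might look like, here is a hypothetical service descriptor in the general style of a YARN service spec. Every field name, image name, and value here is an illustrative assumption, not a guaranteed schema.

```python
# Hypothetical sketch: describe a containerized data app as a service spec
# you could hand to the cluster. Field names and values are illustrative
# assumptions, not the exact YARN services schema.
import json

def fraud_app_spec(version: str, containers: int) -> dict:
    """Describe one instance (e.g. 'beta' or 'prod') of the containerized app."""
    return {
        "name": f"fraud-detection-{version}",
        "components": [{
            "name": "scoring-service",
            "number_of_containers": containers,
            "artifact": {"id": f"myrepo/fraud-detection:{version}",  # hypothetical image
                         "type": "DOCKER"},
            "resource": {"cpus": 1, "memory": "2048"},
        }],
    }

# Beta and production can coexist on the same cluster, because each version
# runs in its own isolated containers; scaling is just a number in the spec.
beta = fraud_app_spec("beta", containers=1)
prod = fraud_app_spec("prod", containers=8)
print(json.dumps(beta, indent=2))
```

The design choice to notice is that the whole application becomes one declarative unit: scaling up for more transactions means changing one number, not re-plumbing a set of individual technologies.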
Last but not least.
Another sign of us growing up is to build in security and governance from day one in this new world of modern data applications.
If you go into an enterprise today, what they want to do is set policies on that data: prohibition, classification, lineage, provenance, and so on.
So I’m really excited about all the things we are doing with the Atlas and Ranger communities. You can’t do security without governance and vice versa, so working with some key partners we have married these things together.
Now we have the capability to track metadata across all the components in the ecosystem and to control the apps: to tag all your data sets and put access policies around individual tables and columns at the metadata level.
So, for example, if we tag a data set as ‘personally identifiable information’, the tag is inherited regardless of who copies it, and we can put policies on the tags themselves, which means you don’t have to govern individual data sets one by one.
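The tag-inheritance idea can be sketched as a toy model: tags live on the data set, policies attach to tags, and copies inherit tags automatically. This is a simplified illustration of the concept, not the actual Atlas or Ranger API.

```python
# Toy model of tag-based governance: one policy per tag instead of one per
# data set, and copies inherit tags. Simplified illustration only; not the
# real Atlas/Ranger interfaces.
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    tags: set = field(default_factory=set)

    def copy(self, new_name: str) -> "Dataset":
        # Tags follow the data: a copy inherits every tag of the original.
        return Dataset(new_name, tags=set(self.tags))

# Policies are defined once, on the tag itself (policy text is hypothetical).
POLICIES = {"PII": "restrict access to authorized roles"}

def applicable_policies(ds: Dataset) -> list:
    return [POLICIES[t] for t in sorted(ds.tags) if t in POLICIES]

customers = Dataset("customers", tags={"PII"})
backup = customers.copy("customers_backup")   # still carries the PII tag
assert applicable_policies(backup) == applicable_policies(customers)
```

Because the policy rides on the tag rather than on any one table, every copy of the data set is governed the moment it exists.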
Big props to the community again for having brought a lot of this effort together.
My role in life in the first ten years was to transform Hadoop from doing one thing well to multi-workload. Now it’s to enable apps that need access to components and data to run on the platform, efficiently and effectively: to make it 100x easier to wire things together and to deploy and operate them.
As a community we have to take Hadoop into the next decade and make it quicker and easier to get value out of the technologies. This is how we as a community will make a larger impact on the world and allow everyone to get value out of their Hadoop deployments.