June 21, 2017

Data Lake 3.0 Part 7 – What’s a self-driving car got to do with Data Lake 3.0?

This blog has contributions from: Vinod Vavilapalli, Wangda Tan, Gour Saha, Priyanka Nagwekar, Sunil Govindan

You have probably wondered what makes a self-driving car intelligent enough to process live camera feeds, navigate busy streets, and distinguish objects on the street, such as cars, trucks, traffic lights, or pedestrians. A self-driving car is a perfect example of a modern data application that combines big data with smart algorithms. To understand the underpinnings of such a modern data app, we will start with a recap of our blog series titled “Data Lake 3.0” (pt1, pt2, pt3, pt4, pt5, pt6) and then conclude with the key takeaways from the keynote demo at DataWorks Summit, San Jose, 2017.


We are seeing the emergence of modern data applications that exploit big data; are architected as containerized microservices; are compute/GPU-intensive; and are deployed on commodity infrastructure. Our Data Lake 3.0 architecture sits at the intersection of all these major trends, and we want to walk you through a simplified example of a self-driving car. If you want to familiarize yourself with what Data Lake 3.0 is, refer to pt1 of our blog series.


A self-driving car generates a massive amount of video that needs to be captured and stored in a centralized active archive for access by data scientists and analysts. This requires a storage layer that can scale to billions of files and exabytes of data, remain accessible to end users, and stay friendly on total cost of ownership (TCO). The Hadoop storage layer (Apache HDFS) in Hadoop 3.0 provides erasure coding to store the data at half the cost (vs. the 3-replica approach), while allowing linear scale and a unified namespace with NameNode Federation and the View File System. It also has device-behavior analytics built in, so that a slow commodity server or a slow commodity network switch will not interrupt a latency-sensitive operation.
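The "half the cost" claim is simple arithmetic. As a back-of-the-envelope sketch, compare 3-way replication against the RS-6-3 Reed-Solomon erasure-coding policy in Hadoop 3.0 (6 data blocks protected by 3 parity blocks):

```python
# Storage overhead comparison: 3-way replication vs. the RS-6-3
# Reed-Solomon erasure-coding policy (6 data blocks + 3 parity blocks).
def replication_overhead(replicas: int = 3) -> float:
    """Bytes of raw storage per byte of user data under n-way replication."""
    return float(replicas)

def erasure_coding_overhead(data: int = 6, parity: int = 3) -> float:
    """Bytes of raw storage per byte of user data under RS(data, parity)."""
    return (data + parity) / data

rep = replication_overhead()    # 3.0x raw storage
ec = erasure_coding_overhead()  # 1.5x raw storage
print(f"replication: {rep}x, RS-6-3: {ec}x -> {rep / ec:.0f}x cheaper")
# prints: replication: 3.0x, RS-6-3: 1.5x -> 2x cheaper
```

The same data survives the loss of any 3 blocks under RS-6-3, at half the raw storage of 3-way replication.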

Now, a data scientist needs to train distributed deep learning models (using frameworks like TensorFlow) that process natural signals such as video before the model gets deployed in the car. This is an ongoing task: the more the model trains, the better the self-driving car gets. Training is a very compute-intensive process. This is where Apache Hadoop YARN comes in: it pools the compute and memory across a cluster of commodity servers to run those training jobs. YARN in Hadoop 3.0 can also pool expensive GPUs (graphics processing units) and isolate the GPU devices between multiple users (YARN-6223 tracks first-class GPU support in YARN). For many models, one can see up to a 50-100x reduction in compute processing time on the data-intensive video files.
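To give a feel for what first-class GPU support means, here is an illustrative sketch of the kind of resource specification an application would attach to a container request once GPUs are a countable, schedulable resource. The `yarn.io/gpu` key follows YARN's resource-types naming convention; the helper function itself is hypothetical, not a real client API:

```python
# Illustrative only: builds the kind of resource specification a YARN
# application would include in a container request with first-class GPU
# support (YARN-6223). The "yarn.io/gpu" key follows YARN's resource-types
# naming; gpu_container_request is a hypothetical helper, not a client API.
def gpu_container_request(memory_mb: int, vcores: int, gpus: int) -> dict:
    return {
        "resource": {
            "memory-mb": memory_mb,
            "vcores": vcores,
            "yarn.io/gpu": gpus,  # scheduler isolates these devices per user
        }
    }

# e.g. one training container: 8 GB of memory, 4 vcores, 2 GPUs
req = gpu_container_request(memory_mb=8192, vcores=4, gpus=2)
```

The point is that GPUs are requested and accounted for just like memory and vcores, so multiple analysts can share the same expensive devices without stepping on each other.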

Sample YARN assembly in the modern data application store

That brings us to a key aspect of our Data Lake 3.0 story: the concept of an assembly. We do not expect every analyst to understand the infrastructure complexity in order to run their modern data applications. Instead, we want them to go to an application store, similar to the iPhone or Android app stores, download an application created and published by data scientists, and just run it. This is where our assembly store helps. An analyst can now deploy a modern data application (in this case, a self-driving-car assembly, which is a templated application), assign the required GPU and memory, and run it.
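To make the templated-application idea concrete, here is a hedged sketch of what such an assembly might look like. The component names, container counts, and resource figures are invented for illustration, loosely modeled on a YARN services-style JSON specification; the analyst would only pick the template and adjust the GPU/memory knobs, not author it by hand:

```python
import json

# Hypothetical assembly template for the self-driving-car application.
# All component names and numbers are illustrative.
assembly = {
    "name": "self-driving-car",
    "components": [
        {
            "name": "tensorflow-worker",  # runs the deep learning model
            "number_of_containers": 4,
            "resource": {"memory-mb": 8192, "vcores": 4, "yarn.io/gpu": 1},
        },
        {
            "name": "video-io",  # reads input frames, writes annotated output
            "number_of_containers": 2,
            "resource": {"memory-mb": 4096, "vcores": 2},
        },
    ],
}

# The platform can tally the resources the whole assembly will request.
total_gpus = sum(
    c["number_of_containers"] * c["resource"].get("yarn.io/gpu", 0)
    for c in assembly["components"]
)
print(json.dumps(assembly["name"]), "requests", total_gpus, "GPUs")
```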


The top video above is the raw input video (source: Udacity data sets). Inside the self-driving-car assembly, the video is split into 15 frames per second, a TensorFlow-based deep learning model annotates the objects in each frame (car vs. traffic light vs. pedestrian vs. truck, etc.), and the resulting output frames are stitched back into the output video shown at the bottom. The analyst does not have to understand the complexity of TensorFlow, GPUs, or the Hadoop infrastructure, and can simply focus on his or her job with access to the input and output videos. Our Data Lake 3.0 enables this entire life cycle, with up to 100x faster analyst productivity at a lower TCO (2x storage reduction, plus sharing of expensive GPU resources across analysts).
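The split-annotate-stitch flow can be sketched in plain Python. Everything here is a simplified stand-in: `model` plays the role of the TensorFlow detector, and video decoding/encoding is reduced to lists of frames:

```python
# Simplified sketch of the assembly's pipeline: split the video into
# frames, annotate each frame with an object-detection model, and stitch
# the annotated frames back together. "model" is a stand-in for the
# TensorFlow detector; a real implementation would decode/encode video.
FPS = 15  # the assembly samples the input video at 15 frames per second

def split_into_frames(video_seconds, fps=FPS):
    """Each 'second' of video yields up to `fps` frames."""
    return [frame for second in video_seconds for frame in second[:fps]]

def annotate(frame, model):
    """Attach the detected objects (car, pedestrian, ...) to the frame."""
    return {"frame": frame, "objects": model(frame)}

def stitch(frames):
    """Stand-in for re-encoding annotated frames into the output video."""
    return frames

def run_assembly(video_seconds, model):
    frames = split_into_frames(video_seconds)
    return stitch([annotate(f, model) for f in frames])

# Toy run: two 'seconds' of footage, a model that 'detects' a car per frame.
out = run_assembly([["f0", "f1"], ["f2"]], model=lambda f: ["car"])
```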


A self-driving car is one of many modern data applications that exemplify our Data Lake 3.0 use cases. We are working with our partners to bring more real-world modern data applications, such as IBM Data Science Experience, to our Data Lake 3.0 assembly store.

Please stay tuned for future Data Lake 3.0 updates from us. We are planning an early access program so you can build a Data Lake 3.0 architecture and give us feedback. If you want to participate, please reach out to me or your account management team.

