February 09, 2016

Google Dataflow, Hortonworks DataFlow – What’s in a name? Are they the same?

People have been asking us: is Google Cloud Dataflow the same thing as Hortonworks DataFlow (HDF)? So we thought we'd take the opportunity to share how we see these two products working together. Both have the word "dataflow" in their names, and both systems are rooted in the premise of dataflow programming, but beyond that there are significant differences.

Google Cloud Dataflow provides an abstraction layer for systems that process and analyze data streams, such as MapReduce

Google Cloud Dataflow provides an abstraction layer for systems that process and analyze data streams, such as MapReduce, and is designed strictly for the Google Compute Cloud (i.e., a virtual data center). Google has donated this technology to the Apache Software Foundation, an open source community, as the currently incubating Apache Beam project.

Apache Beam can be considered an API for writing high-scale analytics and processing tasks that operate naturally on streams or batches, and that ride above any particular underlying execution engine. Why is this cool? Because you can write an analytic once and run it on Spark or Flink without rewriting it, or migrate it to a cloud-provided service like Google Cloud Dataflow.
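To make the "write once, run on any engine" idea concrete, here is a toy sketch in plain Python. This is not the real Apache Beam API; the `Pipeline` and runner classes are hypothetical stand-ins that illustrate how one pipeline definition can be handed to different execution engines unchanged.

```python
# Toy illustration (NOT the actual Apache Beam API): the same pipeline
# definition runs on two different "execution engines" without rewriting.

class Pipeline:
    """Holds a chain of transforms, independent of any execution engine."""
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        self.transforms.append(fn)
        return self

class InProcessRunner:
    """Executes every transform eagerly over an in-memory list (batch-style)."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

class GeneratorRunner:
    """Executes the same transforms lazily, element by element (stream-style)."""
    def run(self, pipeline, data):
        def stream():
            for x in data:
                for fn in pipeline.transforms:
                    x = fn(x)
                yield x
        return list(stream())

# One pipeline definition, two execution engines, identical results.
p = Pipeline().apply(lambda x: x * 2).apply(lambda x: x + 1)
print(InProcessRunner().run(p, [1, 2, 3]))   # [3, 5, 7]
print(GeneratorRunner().run(p, [1, 2, 3]))   # [3, 5, 7]
```

In real Beam the pipeline is described against the Beam SDK and a runner for Spark, Flink, or Google Cloud Dataflow does the execution; the portability shown above is the point of the abstraction.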

Hortonworks DataFlow is a product designed to solve data acquisition and delivery challenges – data logistics

Hortonworks DataFlow (HDF) is a product designed to solve dataflow challenges, inside or outside of a data center. It's data logistics, a concept similar to supply chain logistics. HDF provides a fast, easy and secure way to move data from anywhere it originates (e.g., a remote branch office or a Raspberry Pi-enabled camera) to anywhere it needs to go (e.g., Google Cloud Dataflow or your on-premises data center), and is powered by Apache NiFi.


Hortonworks DataFlow easily, securely moves data to where it needs to go.

Why is this cool? Because getting data from where it originates to where it needs to go isn't an easy feat. In theory it's simple, like a map showing directions from A to B. In reality it's complicated, with unforeseen events just like real-world traffic: road construction, detours, traffic jams, no-left-turn signs, broken traffic lights, flooded roadways, toll highways, one-way streets. And like real life, it's not 100% predictable, so dataflows need to adapt in real time in order to serve the analytics systems processing the data. That's what Hortonworks DataFlow does: it's a smart, interactive, real-time control system that makes sure your data gets to where it needs to go, with full event-level data provenance to boot, for those concerned about how trustworthy their data is. (More about what provenance is here)
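The "adapt in real time, with provenance" idea can be sketched in a few lines. This is a hypothetical illustration, not the Apache NiFi API: the `Destination` and `deliver` names are invented for the example, which fails over from an unreachable primary route to a backup and records a provenance entry for every hop.

```python
# Hypothetical sketch (NOT the Apache NiFi API): deliver each event via a
# primary route, fail over to a backup when the primary is down, and keep
# an event-level provenance trail of every routing decision.

class Destination:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def send(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} unreachable")
        self.received.append(event)

def deliver(event, routes, provenance):
    """Try each route in order; log every failure and the final delivery."""
    for dest in routes:
        try:
            dest.send(event)
            provenance.append((event, "DELIVERED", dest.name))
            return dest.name
        except ConnectionError:
            provenance.append((event, "ROUTE_FAILED", dest.name))
    raise RuntimeError(f"no route available for {event}")

provenance = []
cloud = Destination("google-cloud-dataflow", healthy=False)  # simulated outage
onprem = Destination("on-prem-datacenter")
for event in ["sensor-1", "sensor-2"]:
    deliver(event, [cloud, onprem], provenance)

# The provenance trail shows exactly where each event went and why.
for entry in provenance:
    print(entry)
```

A real NiFi flow expresses this visually with processors and connections, and its provenance repository records the lineage of every FlowFile automatically; the point here is only the shape of the problem HDF solves.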

So the answer is that Google Cloud Dataflow and Hortonworks DataFlow have similar-sounding names that describe important but very different things. One is a processing framework waiting for data to be delivered to it; the other, Hortonworks DataFlow, delivers data to all kinds of processing systems: Google Cloud Dataflow, Storm, Spark, Flink, etc.

For more info:

Hortonworks DataFlow delivers data to streaming analytics systems such as Google Cloud Dataflow, Flink, Spark and Storm


Marco bot says:

So why did you give your product the same name as a product that already existed? I guess fewer people would be asking you the difference, and this blog post would be unnecessary.

sindhu seenivasan says:

This is an interesting topic. I know that we have tools like Apache Flume to ingest unstructured data onto HDFS, from which information can be processed as needed. I want to know the real value or importance of a separate product/project like Hortonworks DataFlow. Trying to understand the additional benefit this project provides.

Anna says:

Hortonworks DataFlow provides event streaming ingest and analytics powered by Apache NiFi, Kafka and Storm. With Apache NiFi there is a significant reduction in cost and effort: a real-time, drag-and-drop interface leverages "off the shelf" processors to eliminate the manual coding and scripts often associated with data ingest. Apache NiFi is a relatively new Apache project but was spun out of the NSA, where it was first invented 9 years ago. For more info, please refer to Hortonworks Community Connection.

haojia says:

very useful
