People have been asking us: is Google Cloud Dataflow the same thing as Hortonworks DataFlow (HDF)? So we thought we'd take the opportunity to share how we see these two products working together. Both have the word dataflow in their name, and both systems are rooted in the premise of dataflow programming, but beyond that there are significant differences.
Google Cloud Dataflow provides an abstraction to systems processing and analyzing data streams, such as MapReduce
Google Cloud Dataflow provides an abstraction layer to systems processing and analyzing data streams, such as MapReduce, and is designed strictly for the Google Compute Cloud (i.e., a virtual data center). Google has donated this technology to the Apache Software Foundation, an open source community, as the currently incubating Apache Beam project.
Apache Beam can be considered an API for writing large-scale analytics and processing tasks that operate naturally on streams or batches, and that ride above any particular underlying execution engine. Why is this cool? Because you can write an analytic once and run it on Spark or Flink without rewriting it, or migrate it to a managed cloud service like Google Cloud Dataflow.
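To make the idea concrete, here is a minimal conceptual sketch (plain Python, not the actual Beam API) of what "riding above the execution engine" means: the pipeline is defined once as a list of transforms, and any runner that knows how to execute transforms can run it. Beam applies this same separation to its Spark, Flink, and Google Cloud Dataflow runners.

```python
# Conceptual sketch only: the class and method names here are illustrative,
# not Apache Beam's real API.

class Pipeline:
    """Holds an ordered list of transforms, independent of any engine."""
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        self.transforms.append(fn)
        return self

class DirectRunner:
    """A trivial local runner. A hypothetical SparkRunner or FlinkRunner
    would translate the same transforms into its engine's native jobs."""
    def run(self, pipeline, data):
        for fn in pipeline.transforms:
            data = [fn(x) for x in data]
        return data

# The analytic is written once...
p = Pipeline().apply(lambda x: x * 2).apply(lambda x: x + 1)

# ...and executed by whichever runner is available.
print(DirectRunner().run(p, [1, 2, 3]))  # [3, 5, 7]
```

Swapping `DirectRunner` for another runner changes *where* the work happens without touching the pipeline definition, which is the portability Beam promises.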
Hortonworks DataFlow is a product designed to solve data acquisition and delivery challenges – data logistics
Hortonworks DataFlow (HDF) is a product designed to solve dataflow challenges, inside or outside of a data center. It's data logistics, a concept similar to that of supply chain logistics. HDF provides a fast, easy, and secure way to move data from anywhere it originates (e.g., a remote branch office or a Raspberry Pi-enabled camera) to anywhere it needs to go (e.g., Google Cloud Dataflow or your on-prem datacenter), and is powered by Apache NiFi.
Why is this cool? Because getting data from where it originates to where it needs to go isn't an easy feat. In theory it's simple, like a map showing directions from A to B. In reality, it's complicated, with unforeseen events just like real-world traffic patterns: road construction, detours, traffic jams, no-left-turn signs, broken traffic lights, flooded roadways, toll highways, one-way streets. And like real life, it's not 100% predictable, so dataflows need to adapt in real time in order to serve the analytics systems processing the data. That's what Hortonworks DataFlow does: it's a smart, interactive, real-time control system that makes sure your data gets to where it needs to go, with full event-level data provenance to boot, for those concerned about how trustworthy their data is. (More about what provenance is here)
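A minimal sketch of what event-level provenance means in practice (illustrative only; NiFi's actual provenance repository records far more detail): each piece of data carries an ordered trail of entries saying which component touched it and what it did, so the full journey of any event can be reconstructed after the fact. The component names below are hypothetical.

```python
# Illustrative sketch of event-level data provenance; not NiFi's real data model.
from datetime import datetime, timezone

class FlowEvent:
    def __init__(self, payload):
        self.payload = payload
        self.provenance = []  # ordered trail of (timestamp, component, action)

    def record(self, component, action):
        """Append one provenance entry as the event passes through a component."""
        self.provenance.append(
            (datetime.now(timezone.utc).isoformat(), component, action))

event = FlowEvent(b"sensor reading")
event.record("EdgeCamera", "CREATE")       # where the data originated
event.record("CompressContent", "MODIFY")  # what transformed it en route
event.record("PutCloudStorage", "SEND")    # where it was delivered

# The trail answers: where did this data come from, and what touched it?
for _, component, action in event.provenance:
    print(component, action)
```

Being able to replay that trail for every individual event, rather than only in aggregate, is what lets downstream consumers judge how trustworthy the data is.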
So the answer is that Google Cloud Dataflow and Hortonworks DataFlow have similar-sounding names that describe important but very different things. One is a processing framework waiting for data to be delivered to it; the other, Hortonworks DataFlow, delivers data to all kinds of processing systems: Google Cloud Dataflow, Storm, Spark, Flink, etc.
For more info: