Hortonworks DataFlow has seen great success across multiple use cases. We recently shared a set of real-world use cases in a webinar, and we also want to share them here so readers can see which types of use cases are being implemented and whether there are parallels between current and future users of this fast, easy, and secure data ingestion technology.
To quickly recap, Hortonworks DataFlow is a platform designed to serve the needs of data in motion, from data collection and edge intelligence to high-scale event stream processing of real-time data. It is complementary to Hortonworks Data Platform, which is designed to serve the needs of data at rest. Hortonworks DataFlow was first released in Sep 2015, following the acquisition of Onyara, the creator of and key contributor to Apache NiFi, an open source technology made available through the NSA Tech Transfer program.
With Release 1.2, Hortonworks DataFlow expanded to include Apache Kafka and Apache Storm to handle the demanding needs of real-time complex event processing. Because dataflow and stream processing are closely related, this created a single integrated offering for data in motion. Most recently, HDF 2.0 was released, which added Apache Ambari and Apache Ranger as additional enterprise capability options.
Here, collected in one place, are some of the use cases we’ve seen over the past year.
Hortonworks DataFlow Use Cases
- Royal Mail: “Lee-Warren speaks about how he wanted to free up his data insights team from spending ninety percent of their time ‘ferrying data backwards and forwards’ from its Teradata data warehouse, to spending ninety percent of their time exploiting that data and making it available to the rest of the business.” More info and video
- Open Energi uses HDF for electricity demand response, reducing costs with 10-15% less data being transmitted across a mobile network, creating a fully transparent trail for data provenance that Open Energi can share with customers, enabling line-of-business teams to contribute to building dataflow rules and processes, and standardizing the output of data across various endpoint devices. More info
- Prescient pulls data and tunes algorithms from 49,000+ data sources to identify threats to traveller safety and has seen a 700% improvement in analyst productivity in determining actual threats. More info
- Centrica uses both Hortonworks Data Platform (HDP™) and Hortonworks DataFlow (HDF™) to power its data analytics and simplify the estate of its IT portfolio. More info.
Apache NiFi Use Cases shared at Hadoop Summit San Jose
Apache NiFi Use Cases shared at Hadoop Summit Tokyo
- Coca-Cola East Japan: Runs Hortonworks Data Platform for 20TB on Azure and has used Apache NiFi since 2015 to stream data
Other Apache NiFi Use Cases:
If you have more questions at any time, we encourage you to check out the Data Ingestion & Streaming track of Hortonworks Community Connection, where an entire community is monitoring and responding to questions. You may also want to check out our HDF home page, HDF support and HDF operations training pages.
Q&A from Oct 18 Webinar:
- Is there a processor for SFDC?
- No, but you are able to integrate with SFDC via the REST API. More info here: https://community.hortonworks.com/questions/12892/salesforce-integration-with-hortonworks-data-flow.html
- Which version of HDF were you showing during the webinar?
- How does NiFi scale up/down vs scale out?
- In typical use cases we see between 3 and 20 NiFi nodes. On each node, NiFi is designed to scale up well and take full advantage of the CPU and disk resources you have. We typically see roughly 50 MB/sec of processing per node, which is equivalent to several terabytes of data per day on a single node. More specifically:
- Scale-out (Clustering): NiFi is designed to scale out by clustering many nodes together as described above. If a single node is provisioned and configured to handle hundreds of MB per second, then a modest cluster could be configured to handle GB per second. This brings about interesting challenges of load balancing and fail-over between NiFi and the systems from which it gets data. Use of asynchronous, queuing-based protocols such as messaging services, Kafka, etc., is recommended, but other approaches are supported as well. NiFi’s site-to-site feature is also very effective: it is a protocol that allows NiFi and a client (including another NiFi cluster) to talk to each other, share information about loading, and exchange data on specific authorized ports with automated load balancing and fail-over.
- Scale-up & down: NiFi is also designed to scale up and down in a very flexible manner. To increase throughput from the standpoint of the NiFi framework, you can increase the number of concurrent tasks for a processor under the Scheduling tab when configuring it. This allows more tasks to execute simultaneously, providing greater throughput. On the other side of the spectrum, you can very effectively scale NiFi down to run on edge devices where a small footprint is desired due to limited hardware resources. For the first mile data collection challenge and edge use cases specifically, see MiNiFi (pronounced “minify”, [min-uh-fahy]), a child project of Apache NiFi: https://cwiki.apache.org/confluence/display/NIFI/MiNiFi
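To make the throughput figures above concrete, here is a back-of-the-envelope sketch in Python. It assumes a sustained rate around the clock, decimal units (1 TB = 1,000,000 MB), and perfectly linear scaling across nodes, which real clusters will only approximate. The function names are illustrative and not part of NiFi.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def daily_volume_tb(mb_per_sec, nodes=1):
    """Terabytes (decimal) processed per day at a sustained per-node rate.

    Assumes every node sustains mb_per_sec all day and that throughput
    adds up linearly across nodes (an idealization).
    """
    return mb_per_sec * nodes * SECONDS_PER_DAY / 1_000_000

def cluster_rate_gb_per_sec(mb_per_sec, nodes):
    """Aggregate cluster rate in GB/sec under the same linear assumption."""
    return mb_per_sec * nodes / 1_000

if __name__ == "__main__":
    # One node at about 50 MB/sec: several terabytes per day
    print(daily_volume_tb(50))
    # Five nodes at 200 MB/sec each: about one GB/sec in aggregate
    print(cluster_rate_gb_per_sec(200, 5))
```

At 50 MB/sec a single node works out to a little over 4 TB per day, which matches the "several terabytes" figure quoted above.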
- Where can we apply business rules and policies in the pipeline?
- There are a number of ways to accomplish this. First, the flow definition itself is an expression of business rules and policies, and it is dynamically adjustable. It is common for users to have external services interact with NiFi’s REST-based API at runtime to make changes to those rules and policies. Other common techniques include processors that act as enrichment steps or routing checks: they interact with a back-end service or database that holds the rules, and the processor itself enforces those rules as data flows through. Some users have also integrated Drools or other rules engines into processors, with normal NiFi configuration then used to hold and enforce those rules at runtime. This is an area where NiFi can really shine and offers a lot of flexibility.
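The "routing check" idea described above can be illustrated with a small, framework-free Python sketch. This is not NiFi code or the NiFi API: `FlowRecord`, `RULES`, and `route` are hypothetical names, and the rule table is hard-coded here where a real flow might load it from a back-end service or database.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Hypothetical stand-in for a unit of data moving through a flow."""
    attributes: dict
    content: bytes = b""

# Hypothetical rule table: (route name, predicate over attributes).
# In practice these rules might live in an external service or database
# and be refreshed at runtime; here they are inlined for illustration.
RULES = [
    ("quarantine", lambda a: a.get("source") == "unknown"),
    ("priority", lambda a: int(a.get("severity", 0)) >= 8),
]

def route(record, rules=RULES, default="standard"):
    """Return the name of the first rule the record matches, else default."""
    for name, predicate in rules:
        if predicate(record.attributes):
            return name
    return default

if __name__ == "__main__":
    print(route(FlowRecord({"source": "sensor-7", "severity": "9"})))
    print(route(FlowRecord({"source": "unknown"})))
    print(route(FlowRecord({"source": "sensor-7"})))
```

Keeping the rules in a table separate from the routing function mirrors the point made above: the policy can change at runtime without changing the code that enforces it.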