
Introducing Hortonworks DataFlow

Recorded on September 23rd, 2015

This webinar will provide an overview of Hortonworks DataFlow (HDF), how it complements Hortonworks Data Platform, and the future roadmap. Join us to learn more about how to securely and easily collect, conduct, and curate dynamic Internet of Anything data into actionable insights for your business.


We have responded to the remaining questions you previously submitted during the live webinar:

Q: Is there a separate support license for HDF vs. HDP?

A: Hortonworks DataFlow (HDF) is a completely separate product from Hortonworks Data Platform (HDP). They can certainly work together, but they are covered by separate support subscriptions. Please contact sales to learn more about pricing.


Q: Is there anything proprietary that will be included in HDF but not in Apache NiFi?

A: No. HDF, just like HDP, will follow the same Hortonworks philosophy of innovation in the open as described here.


Q: Will this deck be available to the attendees?

A: The webinar recording is featured below as well as on the Hortonworks Channel.


Q: Does HDF address delta load from Oracle database to HDFS?

A: HDF powered by Apache NiFi does support interaction with databases, though that support is currently narrowly focused. The SQL processor set available today does not yet offer a complete change data capture solution. At a framework level, this use case is readily supportable, and we expect to place increasing priority on providing a high-quality user experience around database-oriented change data capture as we move forward.
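
As a rough illustration of the incremental-load pattern such processors implement, here is a minimal sketch (plain JDBC, not NiFi code) that polls a table for rows beyond a high-water mark. The table, column names, and connection details are hypothetical, and an Oracle JDBC driver is assumed to be on the classpath:

```java
import java.sql.*;

// Minimal sketch of the delta-load pattern (not NiFi code): poll a table
// for rows whose ID exceeds the highest value seen so far. Table, columns,
// and JDBC URL are hypothetical placeholders.
public class DeltaPoll {
    public static void main(String[] args) throws SQLException {
        long lastMaxId = 0; // in a real flow this watermark must be persisted
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//db-host:1521/ORCL", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                "SELECT id, payload FROM source_table WHERE id > ? ORDER BY id")) {
            ps.setLong(1, lastMaxId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");
                    // hand the new row off (e.g., to be written to HDFS downstream)
                    System.out.println(id + " -> " + payload);
                    lastMaxId = Math.max(lastMaxId, id);
                }
            }
        }
    }
}
```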


Q: I think it’s page 34 – it shows ‘clone’ – could you please explain that slide with an example?

A: ‘Clone’ is one of the types of provenance primitives that are supported. It means we created a new ‘flow file’ that was wholly derived from an already existing one. It is important to understand that under the covers we didn’t actually have to clone anything. We simply increased the number of references to the underlying content of that flow file and created a new flow file reference pointing to it. This means the extremely common case of cloning and multi-routing data has very high performance and we’re able to retain more data.
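
A simplified sketch of that copy-on-reference idea may help (illustrative only, not NiFi’s actual internals): cloning a flow file just adds a reference to the shared, immutable content claim, so no bytes are copied.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of the 'clone' primitive (not NiFi's real internals):
// flow files share an immutable content claim, so cloning adds a reference
// rather than copying bytes.
class ContentClaim {
    final byte[] bytes;                         // immutable payload
    final AtomicInteger refs = new AtomicInteger(0);
    ContentClaim(byte[] bytes) { this.bytes = bytes; }
}

class FlowFile {
    final ContentClaim claim;
    FlowFile(ContentClaim claim) {
        this.claim = claim;
        claim.refs.incrementAndGet();           // new reference, no byte copy
    }
    FlowFile cloneFlowFile() {                  // the 'clone' provenance event
        return new FlowFile(claim);             // O(1): share the same claim
    }
}
```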


Q: How do you do deeper curation of the data, for example metadata extraction and master and reference data management?

A: The ‘flow file’ construct allows us to keep track of ‘attributes’ of each object in the flow as well as their ‘content’ or ‘payload’. It is common to have processors that exist solely to extract features from content and store them as attributes. This is very powerful because these attributes are then available for fast in-memory decision-making and routing, and we don’t need to keep scanning content to do it. This means NiFi is accumulating context about an object as it goes through a flow. Furthermore, NiFi is tracking the lineage of the object as it flows through the system, which is also an important part of this context. Now, with these accumulated attributes, you can also have processors that actually alter the content of the data (if need be).

How does this relate to master data management? First, a master dataset could be used as a reference set in the flow to validate data being seen, or to enrich data as it flows. Second, NiFi could be updating or notifying the master dataset in real time as data flows.
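
A minimal sketch of that extract-once, route-many pattern (illustrative only, not a real NiFi processor): features are pulled from the payload a single time and stored as attributes, so subsequent routing decisions never re-scan the content.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Sketch of the extract-once, route-many pattern (not a real NiFi
// processor): features are extracted from the payload once and stored as
// attributes; later routing reads only the attributes.
public class ExtractAttributes {
    static Map<String, String> extract(byte[] content) {
        Map<String, String> attributes = new HashMap<>();
        String text = new String(content, StandardCharsets.UTF_8);
        attributes.put("char.count", String.valueOf(text.length()));
        attributes.put("looks.like.json", String.valueOf(text.trim().startsWith("{")));
        return attributes;
    }

    static String route(Map<String, String> attributes) {
        // fast in-memory decision based on attributes alone; content untouched
        return Boolean.parseBoolean(attributes.get("looks.like.json"))
                ? "json-queue" : "raw-queue";
    }

    public static void main(String[] args) {
        Map<String, String> attrs = extract("{\"id\": 1}".getBytes(StandardCharsets.UTF_8));
        System.out.println(route(attrs)); // prints: json-queue
    }
}
```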


Q: I do not see GetTwitter on the NiFi Processor list. Where can I find that?

A: Please try downloading the latest HDF or Apache NiFi release; GetTwitter is part of the latest release. When you drag a processor onto the graph in NiFi, you can search for ‘get’ or ‘twitter’ and it will be available.


Q: How are the node failures in the cluster handled? Does the cluster manager keep track of the data being processed on the failed node?

A: In considering high availability, let’s look at ‘flow control’ and ‘data’ as two separate things. In HDF powered by Apache NiFi, the control plane does not have high availability, so if the cluster manager goes down the flow cannot be seen or altered. However, the system is designed to be quite fault-tolerant, and each of the nodes continues to operate the flow despite not having access to the manager. The nodes cache several things with the expectation that the manager may become unavailable. Once the manager is restored to operation, normal control resumes.

On the data plane there are also two dimensions to consider: what to do with data already in the flow, and what to do with new/live data feeds coming in. For existing data, if the node that holds the data goes down, we rely on traditional RAID techniques or reliable attached storage mechanisms to keep the data safe while the node is offline. For live data, we promote the use of protocols, and offer our own (called site-to-site), that honor backpressure and have solid fault-tolerance behaviors like automated load balancing and fail-over. Whether nodes drop out of the cluster due to failure or planned maintenance, or new nodes are added, the cluster is designed to adapt automatically as needed.
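
For intuition, here is a minimal sketch of the client-side load balancing and fail-over behavior described above (illustrative only; this is not the site-to-site protocol itself, and the send() call is a hypothetical transport):

```java
import java.util.List;

// Illustrative sketch (not the site-to-site protocol): a sender that
// round-robins across known cluster nodes and fails over when one is down,
// so live data keeps flowing despite node outages.
public class FailoverSender {
    private final List<String> nodes;
    private int next = 0;

    FailoverSender(List<String> nodes) { this.nodes = nodes; }

    void sendWithFailover(byte[] data) {
        for (int attempt = 0; attempt < nodes.size(); attempt++) {
            String node = nodes.get(next);
            next = (next + 1) % nodes.size();   // round-robin load balancing
            try {
                send(node, data);               // hypothetical transport call
                return;                         // delivered successfully
            } catch (Exception e) {
                // node unavailable: fall through and try the next one
            }
        }
        // no node accepted the data: signal backpressure to the producer
        throw new IllegalStateException("no node available; applying backpressure");
    }

    private void send(String node, byte[] data) throws Exception {
        /* open a connection to the node and transmit the data */
    }
}
```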


Q: You mentioned the 90 processors offered out of the box in Apache NiFi. Do all of them have standard data flow and tuning standards built in?

A: All processors come with default values configured as determined by the developers who created them. However, there are often additional settings for which no ‘good’ default is known. This is acceptable because, as a processor is added to the flow, NiFi provides helpful in-line indicators of what still needs to be configured, along with in-line context help and ready access to a usage guide specific to each component.
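
For reference, this is roughly how a processor developer declares properties in the NiFi API (assuming the nifi-api and processor utility libraries are on the classpath): one property with a known-good default, and one required property with no safe default, which the UI flags as needing configuration before the processor can run.

```java
import org.apache.nifi.components.PropertyDescriptor;
import org.apache.nifi.processor.util.StandardValidators;

// Sketch of property declarations in a NiFi processor: defaults where a
// good value is known, and a required property with no default, which the
// UI highlights as invalid until the user configures it.
public class ExampleProperties {
    static final PropertyDescriptor BATCH_SIZE = new PropertyDescriptor.Builder()
            .name("Batch Size")
            .description("Number of flow files to pull per scheduling cycle")
            .required(true)
            .defaultValue("100")                 // known-good default
            .addValidator(StandardValidators.POSITIVE_INTEGER_VALIDATOR)
            .build();

    static final PropertyDescriptor REMOTE_URL = new PropertyDescriptor.Builder()
            .name("Remote URL")
            .description("Endpoint to fetch from; no sensible default exists")
            .required(true)                      // flagged in the UI until set
            .addValidator(StandardValidators.URL_VALIDATOR)
            .build();
}
```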


Q: How different is this from Flume, Kafka, or other data ingestion frameworks?

A: Kafka is a messaging system. Messaging systems are generally focused on providing mailbox-like semantics whereby the ‘provider’ of data is decoupled from the ‘consumer’ of that data, at least at the physical connectivity level. In enterprise dataflows, however, there are many other forms of decoupling to consider that are also critical. Protocol, format, schema, priority, and interest are all examples of important ‘separations of concern’ to consider. HDF powered by Apache NiFi is designed to address all of these forms of decoupling. In doing so, NiFi is often used with a system like Kafka, which is aimed at addressing one of those forms of decoupling but does so in a manner that can yield very high performance under specific usage patterns. Kafka doesn’t address the user experience, real-time command and control, or data lineage capabilities offered by HDF powered by Apache NiFi. The type of security that messaging-based systems can offer is largely limited to transport security, encryption of data at rest, and white-list-style authorization to topics. HDF offers similar approaches as well, but since it actually operates on and with the data, it can also perform fine-grained security checks and rule-based contextual authorization. In the end, these systems are designed to tackle different parts of the dataflow problem and are often used together as a more powerful whole.

The comparison of HDF (and a flow using the “GetFile” processor along with a “PutHDFS” processor) to Flume is more direct, in that they were designed to address very similar use cases. HDF offers data provenance as well as a powerful and intuitive user experience, with a drag-and-drop UI for interactive command and control. From the management and data-tracking perspectives, HDF and Flume offer quite different feature sets. That said, Flume has been used considerably for some time now and, as is true with any system, the goal of HDF is to integrate with it in the best manner possible. As a result, HDF powered by Apache NiFi supports running Flume sources and sinks right in the flow itself. You can now wire in existing Flume sources and sinks in a way that combines Flume’s configuration-file approach with NiFi’s UI-driven approach, offering a best-of-both-worlds solution.


Q: Does NiFi (as part of the framework) provide a way to distribute the load among the consumers?

A: Certainly. Out of the box, NiFi offers a protocol called ‘site-to-site’ with an associated client library available in Java. Using this protocol, a client can communicate with a NiFi node or cluster with automated load balancing and fail-over of live data flow. As an example, this is exactly how the “Spark Receive” processor works, and it is how two NiFi clusters exchange data. This feature was originally designed to provide robust delivery of data between data centers, but it has been used extensively within the data center as well.
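
A minimal sketch of sending data with that Java client library (assuming the nifi-site-to-site-client dependency is available; the URL and port name are placeholders for your environment):

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.nifi.remote.Transaction;
import org.apache.nifi.remote.TransferDirection;
import org.apache.nifi.remote.client.SiteToSiteClient;

// Sketch of pushing data to a NiFi input port via the site-to-site client;
// the client handles load balancing and fail-over across cluster nodes.
public class SiteToSiteSend {
    public static void main(String[] args) throws Exception {
        try (SiteToSiteClient client = new SiteToSiteClient.Builder()
                .url("http://nifi-host:8080/nifi")   // placeholder NiFi URL
                .portName("From External Clients")   // placeholder input port name
                .build()) {
            Transaction transaction = client.createTransaction(TransferDirection.SEND);
            transaction.send("hello nifi".getBytes(StandardCharsets.UTF_8),
                    Collections.singletonMap("source", "example-client"));
            transaction.confirm();   // two-phase commit: verify checksum with server
            transaction.complete();  // finalize delivery
        }
    }
}
```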
