A real-time integrated data logistics and simple event processing platform
Apache NiFi automates the movement of data between disparate data sources and systems, making data ingestion fast, easy and secure.
Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination. It is data source agnostic, supporting disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes such as machines, geo location devices, click streams, files, social feeds, log files and videos and more. It is configurable plumbing for moving data around, similar to how Fedex, UPS or other courier delivery services move parcels around. And just like those services, Apache NiFi allows you to trace your data in real time, just like you could trace a delivery.
Apache NiFi is based on technology previously called “Niagara Files” that was in development and used at scale within the NSA for the last eight years and was made available to the Apache Software Foundation through the NSA Technology Transfer Program. As such, it was designed from the beginning to be field ready—flexible, extensible and suitable for a wide range of devices from a small lightweight network edge device such as a Raspberry Pi to enterprise data clusters and the cloud. Apache NiFi is also able to dynamically adjust to fluctuating network connectivity that could impact communications and thus the delivery of data.
While the term dataflow is used in a variety of contexts, we’ll use it here to mean the automated and managed flow of information between systems. This problem space has been around ever since enterprises had more than one system, where some of the systems created data and some of the systems consumed data. The problems and solution patterns that emerged have been discussed and articulated extensively. A comprehensive and readily consumed form is found in the Enterprise Integration Patterns [eip].
Some of the high-level challenges of dataflow include:
Networks fail, disks fail, software crashes, people make mistakes.
Data access exceeds capacity to consume
Sometimes a given data source can outpace some part of the processing or delivery chain – it only takes one weak-link to have an issue.
Boundary conditions are mere suggestions
You will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format.
What is noise one day becomes signal the next
Priorities of an organization change – rapidly. Enabling new flows and changing existing ones must be fast.
Systems evolve at different rates
The protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not-at-all designed to work together.
Compliance and security
Laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted and accountable.
Continuous improvement occurs in production
It is often not possible to come even close to replicating production environments in the lab.
Over the years dataflow has been one of those necessary evils in an architecture. Now though, there are a number of active and rapidly evolving movements making dataflow a lot more interesting and a lot more vital to the success of a given enterprise. These include things like; Service Oriented Architecture [soa], the rise of the API [api][api2], Internet of Things [iot], and Big Data [bigdata]. In addition, the level of rigor necessary for compliance, privacy, and security is constantly on the rise. Even still with all of these new concepts coming about, the patterns and needs of dataflow are still largely the same. The primary differences then are the scope of complexity, the rate of change necessary to adapt, and that at scale the edge case becomes common occurrence. NiFi is built to help tackle these modern dataflow challenges.
NiFi’s fundamental design concepts closely relate to the main ideas of Flow Based Programming [fbpfbp]. Here are some of the main NiFi concepts and how they map to FBP:
|NiFi Term||FBP Term||Description|
|FlowFile||Information Packet||A FlowFile represents each object moving through the system fand for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes.|
|FlowFile Processor||Black Box||Processors actually perform the work. In [eip] terms a processor is doing some combination of data Routing, Transformation, or Mediation between systems. Processors have access to attributes of a given FlowFile and its content stream. Processors can operate on zero or more FlowFiles in a given unit of work and either commit that work or rollback.|
|Connection||Bounded Buffer||Connections provide the actual linkage between processors. These act as queues and allow various processes to interact at differing rates. These queues then can be prioritized dynamically and can have upper bounds on load, which enable back pressure.|
|Flow Controller||Scheduler||The Flow Controller maintains the knowledge of how processes actually connect and manages the threads and allocations thereof which all processes use. The Flow Controller acts as the broker facilitating the exchange of FlowFiles between processors.|
|Process Group||Subnet||A Process Group is a specific set of processes and their connections, which can receive data via input ports and send data out via output ports. In this manner process groups allow creation of entirely new components simply by composition of other components.|
This design model, also similar to [seda], provides many beneficial consequences that help NiFi to be a very effective platform for building powerful and scalable dataflows. A few of these benefits include:
NiFi executes within a JVM living within a host operating system. The primary components of NiFi then living within the JVM are:
The purpose of the web server is to host NiFi’s HTTP-based command and control API.
The flow controller is the brains of the operation. It provides threads for extensions to run on and manages their schedule of when they’ll receive resources to execute.
There are various types of extensions for NiFi which will be described in other documents. But the key point here is that extensions operate/execute within the JVM.
The FlowFile Repository is where NiFi keeps track of the state of what it knows about a given FlowFile that is presently active in the flow. The implementation of the repository is pluggable. The default approach is a persistent Write-Ahead Log that lives on a specified disk partition.
The Content Repository is where the actual content bytes of a given FlowFile live. The implementation of the repository is pluggable. The default approach is a fairly simple mechanism, which stores blocks of data in the file system. More than one file system storage location can be specified so as to get different physical partitions engaged in order to reduce contention on any single volume.
The Provenance Repository is where all provenance event data is stored. The repository construct is pluggable with the default implementation which includes one or more physical disk volumes. Within each location event data is indexed and searchable.
NiFi is also able to operate within a cluster.
A NiFi cluster is comprised of one or more NiFi Nodes (Node) controlled by a single NiFi Cluster Manager (NCM). The design of clustering is a simple master/slave model where the NCM is the master and the Nodes are the slaves. The NCM’s reason for existence is to keep track of which Nodes are in the cluster, their status, and to replicate requests to modify or observe the flow. Fundamentally, then, the NCM keeps the state of the cluster consistent. While the model is that of master and slave, if the master dies the Nodes are all instructed to continue operating as they were to ensure the data flow remains live. The absence of the NCM simply means new nodes cannot join the cluster and cluster flow changes cannot occur until the NCM is restored.
NiFi is designed to fully leverage the capabilities of the underlying host system it is operating on. This maximization of resources is particularly strong with regard to CPU and disk. Many more details will be provided on best practices and configuration tips in the Administration Guide.
The throughput or latency one can expect to see will vary greatly on how the system is configured. Given that there are pluggable approaches to most of the major NiFi subsystems, the performance will depend on the implementation. But, for something concrete and broadly applicable, let’s consider the out-of-the-box default implementations that are used. These are all persistent with guaranteed delivery and do so using local disk. So being conservative, assume roughly a 50 MB/s read/write rate on modest disks or RAID volumes within a typical server. NiFi for a large class of dataflows then should be able to efficiently reach 100 or more MB/s of throughput. This is because linear growth is expected for each physical partition and content repository added to NiFi. This will bottleneck at some point on the FlowFile repository and provenance repository. We plan to provide a benchmarking/performance test template to include in the build, which will allow users to easily test their system and to identify where bottlenecks are and at which point they might become a factor. It should also make it easy for system administrators to make changes and to verify the impact.
The Flow Controller acts as the engine dictating when a particular processor will be given a thread to execute. Processors should be written to return the thread as soon as they’re done executing their task. The Flow Controller can be given a configuration value indicating how many threads there should be for the various thread pools it maintains. The ideal number of threads to use will depend on the resources of the host system in terms of numbers of cores, whether that system is running other services as well, and the nature of the processing in the flow. For typical IO heavy flows, though, it would be quite reasonable to set many dozens of threads to be available if not more.
NiFi lives within the JVM and is thus generally limited to the memory space it is afforded by the JVM. Garbage collection of the JVM becomes a very important factor to both restricting the practical size of the heap as well as how well the application runs over time.
A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. This is achieved through effective use of a purpose-built persistent write-ahead log and content repository. Together they are designed in such a way as to allow for very high transaction rates, effective load-spreading, copy-on-write, and to play to the strengths of traditional disk read/writes.
Data Buffering w/ Back Pressure and Pressure Release
NiFi supports buffering of all queued data as well as the ability to provide back pressure as those queues reach specified limits or to age off data as it reaches a specified age (its value has perished).
NiFi allows the setting of one or more prioritization schemes for how data is retrieved from a queue. The default is oldest first, but there are times when data should be pulled newest first, largest first, or some other custom scheme.
Flow Specific QoS (latency v throughput, loss tolerance, etc.)
There are points of a dataflow where the data is absolutely critical and it is loss intolerant. There are also times when it must be processed and delivered within seconds to be of any value. NiFi enables the fine-grained flow specific configuration of these concerns.
NiFi automatically records, indexes, and makes available provenance data as objects flow through the system – even across fan-in, fan-out, transformations, and more. This information becomes extremely critical in supporting compliance, troubleshooting, optimization, and other scenarios.
Recovery / Recording a rolling buffer of fine-grained history
NiFi’s content repository is designed to act as a rolling buffer of history. Data is removed only as it ages off the content repository or as space is needed. This combined with the data provenance capability makes for an incredibly useful basis to enable click-to-content, download of content, and replay, all at a specific point in an object’s lifecycle which can even span generations.
Visual Command and Control
Dataflows can become quite complex. Being able to visualize those flows and express them visually can help greatly to reduce that complexity and to identify areas that need to be simplified. NiFi enables not only the visual establishment of dataflows but it does so in real-time. Rather than being design and deploy it is much more like molding clay. If you make a change to the dataflow that change immediately takes effect. Changes are fine-grained and isolated to the affected components. You don’t need to stop an entire flow or set of flows just to make some specific modification.
Dataflows tend to be highly pattern oriented and while there are often many different ways to solve a problem, it helps greatly to be able to share those best practices. Templates allow subject matter experts to build and publish their flow designs and for others to benefit and collaborate on them.
System to system
A dataflow is only as good as it is secure. NiFi at every point in a dataflow offers secure exchange through the use of protocols with encryption such as 2-way SSL. In addition, NiFi enables the flow to encrypt and decrypt content and use shared-keys or other mechanisms on either side of the sender/recipient equation.
User to system
NiFi enables 2-Way SSL authentication and provides pluggable authorization so that it can properly control a user’s access and at particular levels (read-only, dataflow manager, admin). If a user enters a sensitive property like a password into the flow, it is immediately encrypted server side and never again exposed on the client side even in its encrypted form.
Designed for Extension
NiFi is at its core built for extension and as such it is a platform on which dataflow processes can execute and interact in a predictable and repeatable manner.
Points of extension
Processors, Controller Services, Reporting Tasks, Prioritizers, Customer User Interfaces
For any component-based system, dependency nightmares can quickly occur. NiFi addresses this by providing a custom class loader model, ensuring that each extension bundle is exposed to a very limited set of dependencies. As a result, extensions can be built with little concern for whether they might conflict with another extension. The concept of these extension bundles is called NiFi Archives and will be discussed in greater detail in the developer’s guide.
NiFi is designed to scale-out through the use of clustering many nodes together as described above. If a single node is provisioned and configured to handle hundreds of MB/s then a modest cluster could be configured to handle GB/s. This then brings about interesting challenges of load balancing and fail-over between NiFi and the systems from which it gets data. Use of asynchronous queuing based protocols like messaging services, Kafka, etc., can help. Use of NiFi’s site-to-site feature is also very effective as it is a protocol that allows NiFi and a client (could be another NiFi cluster) to talk to each other, share information about loading, and to exchange data on specific authorized ports.
MiNiFI is a subproject of NiFi designed to solve the difficulties of managing and transmitting data feeds to and from the source of origin, often the last mile of digital signal, enabling edge intelligence to adjust flow behavior/bi-directional communication
Since the last mile (the far edge), is very distributed and likely involves a very large number of end devices (ie. IoT), MiNiFi carries over all the main capabilities of NiFi, with the exception of concept of immediate command and control, and creating a design and deploy paradigm that make uniform management of a vast number of devices more practical.
It also means that MiNiFi has a much smaller footprint than NiFi, with a range less than 40MB, depending which option is selected – MiNiFi with the Java Agent, or a Native C++ agent.
To participate in the Apache MiNiFi community : https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi