As the architectural center of Apache™ Hadoop, YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery.
Hortonworks Focus for Flume
The Apache Flume community is working to add connectors that make it easier for Flume users to push data from any source into Hadoop and its YARN-enabled ecosystem components. These plans include streaming sinks for Apache Storm, Apache Solr, Apache Kafka and Apache Spark.
Recent Progress in Apache Flume
|Apache Flume trunk, used in HDP 2.2||
What Flume Does
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:
|Stream data||Ingest streaming data from multiple sources into Hadoop for storage and analysis|
|Insulate systems||Buffer the storage platform against transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination|
|Guarantee data delivery||Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed delivery semantics|
|Scale horizontally||To ingest new data streams and additional volume as needed|
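The two-transaction handoff described in the table can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Flume's actual Channel or Transaction API: the `Channel` class, its `put_transaction` and `take_transaction` methods, the `forward` helper and the `link_up` flag are all hypothetical names chosen for illustration. It assumes both agents live in one process, so it ignores cases a real cross-agent handoff must handle, such as a lost acknowledgement.

```python
from collections import deque
from contextlib import contextmanager


class Channel:
    """Toy in-memory channel (hypothetical, not Flume's Channel API)."""

    def __init__(self):
        self._queue = deque()

    def pending(self):
        """Return the events currently buffered, oldest first."""
        return list(self._queue)

    @contextmanager
    def put_transaction(self):
        staged = []                   # staged events are not yet visible to sinks
        yield staged
        self._queue.extend(staged)    # commit; skipped if the body raised

    @contextmanager
    def take_transaction(self, batch_size):
        count = min(batch_size, len(self._queue))
        taken = [self._queue.popleft() for _ in range(count)]
        try:
            yield taken               # normal exit commits: events stay removed
        except Exception:
            # Rollback: restore the batch at the head in its original order.
            self._queue.extendleft(reversed(taken))
            raise


def forward(upstream, downstream, batch_size, link_up=True):
    """Hand one batch from one agent's channel to the next agent's channel."""
    with upstream.take_transaction(batch_size) as events:    # delivering agent
        if not link_up:                                      # simulated outage
            raise ConnectionError("downstream agent unreachable")
        with downstream.put_transaction() as staged:         # receiving agent
            staged.extend(events)
    return len(events)


first, second = Channel(), Channel()
with first.put_transaction() as staged:
    staged.extend(["e1", "e2", "e3"])

try:
    forward(first, second, batch_size=2, link_up=False)
except ConnectionError:
    pass
after_failure = first.pending()       # the failed handoff lost nothing

forward(first, second, batch_size=2)  # the retry succeeds
```

After the failed attempt, all three events are still in the first channel in their original order. After the retry, two events have moved to the second channel and one remains in the first. The take is committed only once the put on the receiving side has succeeded, which is the property the table describes.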
Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. They can also feed business dashboards that Apache HBase serves with continuously updated data.
In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the high-volume log data can stream through Flume for same-day analysis with Apache Storm. Alternatively, months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
How Flume Works
Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. Its transactional channels make the project highly reliable and are designed to prevent data loss in transit. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.
The following components make up Apache Flume:
|Event||A singular unit of data that is transported by Flume (typically a single log entry)|
|Source||The entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.|
|Sink||The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.|
|Channel||The conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.|
|Agent||Any Java virtual machine (JVM) process running Flume. It hosts a collection of sources, sinks and channels.|
|Client||The entity that produces and transmits the Event to the Source operating within the Agent.|
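To make the relationships between these components concrete, here is a minimal sketch in plain Python. It is a toy model, not Flume's Java API: the `Event` and `Agent` classes and the `add_flow`, `source` and `drain` names are hypothetical. It assumes each sink is bound to exactly one channel and that the source copies every event into all of the agent's channels. The `hdfs` list is only a stand-in for a real destination.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """A single unit of data: a byte payload plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Agent:
    """Toy agent: one source, named channels, and one sink per channel."""

    def __init__(self, name):
        self.name = name
        self.channels = {}    # channel name -> deque of buffered Events
        self.sinks = {}       # channel name -> callable that delivers an Event

    def add_flow(self, channel_name, sink):
        """Create a channel and bind the sink that drains it."""
        self.channels[channel_name] = deque()
        self.sinks[channel_name] = sink

    def source(self, event):
        """Source: accept an event and place a copy in every channel."""
        for queue in self.channels.values():
            queue.append(event)

    def drain(self):
        """Sinks: empty each channel into its destination; return the count."""
        delivered = 0
        for channel_name, queue in self.channels.items():
            while queue:
                self.sinks[channel_name](queue.popleft())
                delivered += 1
        return delivered


hdfs = []                                  # stand-in for an HDFS destination
collector = Agent("collector")
collector.add_flow("c1", hdfs.append)

edge = Agent("edge")
edge.add_flow("c1", collector.source)      # sink chained to the next agent's source

# The client produces events and hands them to the edge agent's source.
edge.source(Event(b"line 1", {"host": "web01"}))
edge.source(Event(b"line 2", {"host": "web01"}))

edge.drain()          # edge sink forwards into the collector's source
collector.drain()     # collector sink writes to the destination
```

Because the edge agent's sink is simply the collector agent's source, the same sketch also shows how agents can be chained: events flow from the client through two channels before reaching the destination list in their original order.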
Flume components interact in the following way:
- A flow in Flume starts from the Client.
- The Client transmits the Event to a Source operating within the Agent.
- The Source receiving this Event then delivers it to one or more Channels.
- One or more Sinks operating within the same Agent drain these Channels.
- Channels decouple the ingestion rate from drain rate using the familiar producer-consumer model of data exchange.
- When spikes in client-side activity generate data faster than the provisioned destination capacity can handle, the number of events buffered in the Channel grows. This allows Sources to continue normal operation for the duration of the spike.
- The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies.
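The way a Channel absorbs a spike can be worked through with simple arithmetic. The sketch below is an illustrative model only: the function name `simulate_channel_depth`, the per-tick arrival numbers, the drain rate and the optional `capacity` limit are all assumptions made for the example, not Flume settings. It assumes the sink removes at most `drain_rate` events per tick and that a put into a full channel fails.

```python
def simulate_channel_depth(arrivals, drain_rate, capacity=None):
    """Return the number of events left in the channel after each tick.

    arrivals   -- events the source adds in each tick
    drain_rate -- maximum events the sink removes per tick
    capacity   -- optional upper bound on buffered events (assumed behavior:
                  exceeding it raises instead of silently dropping events)
    """
    depth = 0
    history = []
    for tick, arriving in enumerate(arrivals, start=1):
        depth += arriving                      # source ingests into the channel
        if capacity is not None and depth > capacity:
            raise OverflowError(f"channel full at tick {tick}")
        depth -= min(depth, drain_rate)        # sink drains what it can
        history.append(depth)
    return history


# Steady load of 50 events per tick with a two-tick spike of 200,
# against a sink that can write 100 events per tick.
depths = simulate_channel_depth(
    arrivals=[50, 50, 200, 200, 50, 50, 50, 50],
    drain_rate=100,
)
print(depths)  # [0, 0, 100, 200, 150, 100, 50, 0]
```

During the spike the backlog climbs to 200 events, then falls by 50 per tick once arrivals return to normal, reaching zero four ticks after the spike ends. The source never had to slow down. The price is buffer space, which is why sizing the channel for the expected spike matters in this model.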
Flume’s distributed architecture requires no central coordination point. Each agent runs independently of the others, so there is no inherent single point of failure, and Flume can easily scale horizontally.