Apache Flume

A service for streaming logs into Hadoop

As the architectural center of Apache™ Hadoop, YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and it is robust and fault tolerant, with tunable reliability mechanisms for failover and recovery.

Hortonworks Focus for Flume

The Apache Flume community is working to add connectors that make it easier for Flume users to push data from any source into Hadoop and its YARN-enabled ecosystem components. These plans include streaming sinks for Apache Storm, Apache Solr, Apache Kafka and Apache Spark.

Recent Progress in Apache Flume

Apache Flume trunk (used in HDP 2.2)
  • Apache Hive streaming sink
  • Asynchronous Apache HBase sink
Version 1.5.0
  • ElasticSearch/HTTP support
  • Introduction of the SpillableChannel
Version 1.4.0
  • JMS source for Flume NG
  • Encrypted transport mechanism
  • Support for running simple configurations embedded in a host process

What Flume Does

Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:

  • Stream data: ingest streaming data from multiple sources into Hadoop for storage and analysis.
  • Insulate systems: buffer the storage platform from transient spikes when the rate of incoming data exceeds the rate at which it can be written to the destination.
  • Guarantee data delivery: Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and one on the agent that receives it. This ensures guaranteed delivery semantics (see the transaction sketch after this list).
  • Scale horizontally: ingest new data streams and additional volume as needed.
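The sketch below illustrates this delivery guarantee. It is a minimal sketch rather than the actual source and sink implementations that ship with Flume: the Channel, Transaction and Event types come from the Flume SDK (org.apache.flume), while the method names and the HdfsLikeWriter interface are hypothetical and exist only to keep the example self-contained.

  import java.nio.charset.StandardCharsets;

  import org.apache.flume.Channel;
  import org.apache.flume.ChannelException;
  import org.apache.flume.Event;
  import org.apache.flume.Transaction;
  import org.apache.flume.event.EventBuilder;

  public class ChannelTransactionSketch {

      // Source side: stage an event in the channel and commit, or roll back so the
      // upstream client can retry.
      static void putEvent(Channel channel, String logLine) {
          Transaction tx = channel.getTransaction();
          tx.begin();
          try {
              Event event = EventBuilder.withBody(logLine, StandardCharsets.UTF_8);
              channel.put(event);   // staged inside the transaction
              tx.commit();          // the channel now owns the event
          } catch (ChannelException e) {
              tx.rollback();        // nothing was added; the client may retry
              throw e;
          } finally {
              tx.close();
          }
      }

      // Sink side: remove an event only after it has been written to the destination.
      static void drainOne(Channel channel, HdfsLikeWriter writer) {
          Transaction tx = channel.getTransaction();
          tx.begin();
          try {
              Event event = channel.take();
              if (event != null) {
                  writer.write(event.getBody());   // e.g. append to a file in HDFS
              }
              tx.commit();          // the event leaves the channel only on success
          } catch (Exception e) {
              tx.rollback();        // the event stays in the channel and will be retried
              throw new ChannelException("Delivery failed; event returned to channel", e);
          } finally {
              tx.close();
          }
      }

      // Hypothetical destination writer, included only to keep the sketch self-contained.
      interface HdfsLikeWriter {
          void write(byte[] body);
      }
  }

When two agents are chained, the delivering agent runs the take-and-commit transaction against its own channel while the receiving agent runs the put-and-commit transaction against its channel; this is the pair of transactions described above.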

Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive, or they can feed business dashboards served with ongoing data by Apache HBase.

In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the large volume of log file data can stream through Flume into a tool for same-day analysis with Apache Storm. Alternatively, months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.

[Figure: Flume illustration]

How Flume Works

Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is highly reliable, with tunable mechanisms that protect against data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.

The following components make up Apache Flume (the configuration sketch after this list shows how they fit together):

  • Event: a singular unit of data transported by Flume (typically a single log entry).
  • Source: the entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.
  • Sink: the entity that delivers the data to its destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS.
  • Channel: the conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel.
  • Agent: any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.
  • Client: the entity that produces and transmits the Event to the Source operating within the Agent.
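These component definitions map directly onto Flume’s properties-file configuration. The snippet below is a minimal, illustrative single-agent definition in the style of the standard single-node examples; the agent, source, channel and sink names (a1, r1, c1, k1) and the port are arbitrary:

  # One agent (a1) with one source, one channel and one sink
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # Source: listen for newline-separated events on a TCP port
  a1.sources.r1.type = netcat
  a1.sources.r1.bind = localhost
  a1.sources.r1.port = 44444

  # Channel: buffer events in memory between source and sink
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000
  a1.channels.c1.transactionCapacity = 100

  # Sink: log events at INFO level (useful for testing)
  a1.sinks.k1.type = logger

  # Wire the source and sink to the channel
  a1.sources.r1.channels = c1
  a1.sinks.k1.channel = c1

An agent defined this way would typically be started with the flume-ng agent command, pointing it at this file and the agent name a1.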

Flume components interact in the following way:

  1. A flow in Flume starts from the Client.
  2. The Client transmits the Event to a Source operating within the Agent.
  3. The Source receiving this Event then delivers it to one or more Channels.
  4. One or more Sinks operating within the same Agent drain these Channels.
  5. Channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange.
  6. When spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue normal operation for the duration of the spike.
  7. The Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies, as shown in the configuration sketch below.
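As a sketch of the chaining described in step 7, the configuration below defines two hypothetical agents: an "edge" agent that tails a log file and forwards events over Avro, and an "agg" agent that receives those events and writes them to time-partitioned directories in HDFS. The hostnames, ports and paths are placeholders:

  # Agent "edge": tail a local log and forward over Avro to the aggregator host
  edge.sources = r1
  edge.channels = c1
  edge.sinks = k1
  edge.sources.r1.type = exec
  edge.sources.r1.command = tail -F /var/log/app/app.log
  edge.sources.r1.channels = c1
  edge.channels.c1.type = file
  edge.sinks.k1.type = avro
  edge.sinks.k1.hostname = aggregator.example.com
  edge.sinks.k1.port = 4545
  edge.sinks.k1.channel = c1

  # Agent "agg": receive Avro events and land them in HDFS, partitioned by date
  agg.sources = r1
  agg.channels = c1
  agg.sinks = k1
  agg.sources.r1.type = avro
  agg.sources.r1.bind = 0.0.0.0
  agg.sources.r1.port = 4545
  agg.sources.r1.channels = c1
  agg.channels.c1.type = file
  agg.sinks.k1.type = hdfs
  agg.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y/%m/%d
  agg.sinks.k1.hdfs.fileType = DataStream
  agg.sinks.k1.hdfs.useLocalTimeStamp = true
  agg.sinks.k1.channel = c1

The date escapes in hdfs.path are resolved from an event timestamp, which is why hdfs.useLocalTimeStamp (or a timestamp interceptor) is set here; data landed this way can then be queried with Hive or served through HBase as described earlier.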

Flume’s distributed architecture requires no central coordination point. Each agent runs independently of the others, with no inherent single point of failure, and Flume can easily scale horizontally.
