A service for streaming logs into Hadoop
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows; and is robust and fault tolerant with tunable reliability mechanisms for failover and recovery.
YARN coordinates data ingest from Apache Flume and other services that deliver raw data into an Enterprise Hadoop cluster.
Flume lets Hadoop users ingest high-volume streaming data into HDFS for storage. Specifically, Flume allows users to:
|Stream data||Ingest streaming data from multiple sources into Hadoop for storage and analysis|
|Insulate systems||Buffer storage platform from transient spikes, when the rate of incoming data exceeds the rate at which data can be written to the destination|
|Guarantee data delivery||Flume NG uses channel-based transactions to guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started, one on the agent that delivers the event and the other on the agent that receives the event. This ensures guaranteed delivery semantics|
|Scale horizontally||To ingest new data streams and additional volume as needed|
Enterprises use Flume’s powerful streaming capabilities to land data from high-throughput streams in the Hadoop Distributed File System (HDFS). Typical sources of these streams are application logs, sensor and machine data, geo-location data and social media. These different types of data can be landed in Hadoop for future analysis using interactive queries in Apache Hive. Or they can feed business dashboards served ongoing data by Apache HBase.
In one specific example, Flume is used to log manufacturing operations. When one run of product comes off the line, it generates a log file about that run. Even if this occurs hundreds or thousands of times per day, the large volume log file data can stream through Flume into a tool for same-day analysis with Apache Storm or months or years of production runs can be stored in HDFS and analyzed by a quality assurance engineer using Apache Hive.
Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is highly reliable, without the risk of data loss. Flume also supports dynamic reconfiguration without the need for a restart, which reduces downtime for its agents.
The following components make up Apache Flume:
|Event||A singular unit of data that is transported by Flume (typically a single log entry)|
|Source||The entity through which data enters into Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslogs.|
|Sink||The entity that delivers the data to the destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink that writes events to HDFS.|
|Channel||The conduit between the Source and the Sink. Sources ingest events into the channel and the sinks drain the channel.|
|Agent||Any physical Java virtual machine running Flume. It is a collection of sources, sinks and channels.|
|Client||The entity that produces and transmits the Event to the Source operating within the Agent.|
Flume components interact in the following way:
Because Flume’s distributed architecture requires no central coordination point. Each agent runs independently of others with no inherent single point of failure, and Flume can easily scale horizontally.
The Apache Flume community is working to add connectors to make it easier for Flume users to push data from any source into Hadoop and its YARN-enabled ecosystem components. These plans include sinks for streaming for Apache Storm, Apache Solr, Apache Kafka and Apache Spark.
|Apache Flume trunk, used in HDP 2.2||
Introduction R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback to R is that it’s designed to run on in-memory data, which makes it unsuitable for large datasets. Spark is […]
A very common request from many customers is to be able to index text in image files; for example, text in scanned PNG files. In this tutorial we are going to walkthrough how to do this with SOLR. Prerequisites Download the Hortonworks Sandbox Complete the Learning the Ropes of the HDP Sandbox tutorial. Step-by-step guide […]
Introduction In this tutorial, you will learn about the different features available in the HDF sandbox. HDF stands for Hortonworks DataFlow. HDF was built to make processing data-in-motion an easier task while also directing the data from source to the destination. You will learn about quick links to access these tools that way when you […]
Introduction JReport is a embedded BI reporting tool can easily extract and visualize data from the Hortonworks Data Platform 2.3 using the Apache Hive JDBC driver. You can then create reports, dashboards, and data analysis, which can be embedded into your own applications. In this tutorial we are going to walkthrough the folllowing steps to […]
Apache Zeppelin on HDP 2.4.2 Author: Vinay Shukla In March 2016 we delivered the second technical preview of Apache Zeppelin, on HDP 2.4. Meanwhile we and the Zeppelin community have continued to add new features to Zeppelin. These features are now available in the final technical preview of Apache Zeppelin. This technical preview works with […]
Introduction Hadoop has always been associated with BigData, yet the perception is it’s only suitable for high latency, high throughput queries. With the contribution of the community, you can use Hadoop interactively for data exploration and visualization. In this tutorial you’ll learn how to analyze large datasets using Apache Hive LLAP on Amazon Web Services […]
Introduction The Azure cloud infrastructure has become a common place for users to deploy virtual machines on the cloud due to its flexibility, ease of deployment, and cost benefits. Microsoft has expanded Azure to include a marketplace with thousands of certified, open source, and community software applications and developer services, pre-configured for Microsoft Azure. This […]
The Hortonworks Sandbox is delivered as a Dockerized container with the most common ports already opened and forwarded for you. If you would like to open even more ports, check out this tutorial.
Apache, Hadoop, Falcon, Atlas, Tez, Sqoop, Flume, Kafka, Pig, Hive, HBase, Accumulo, Storm, Solr, Spark, Ranger, Knox, Ambari, ZooKeeper, Oozie, Phoenix, NiFi, HAWQ, Zeppelin, Atlas, Slider, Mahout, MapReduce, HDFS, YARN, Metron and the Hadoop elephant and Apache project logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States or other countries.