Announcing HDP Tech Preview Component: Apache Kafka
We are excited to announce that Apache Kafka 0.8.1.1 is now available as a technical preview with Hortonworks Data Platform 2.1. Kafka was originally developed at LinkedIn and incubated as an Apache project in 2011. It graduated to a top-level Apache project in October of 2012.
Many organizations already use Kafka for their data pipelines, including Hortonworks customers like Spotify and Tagged.
What is Apache Kafka?
Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers, such as those based on JMS or AMQP, because of its higher throughput, replication, and fault tolerance.
Apache Kafka Use Cases
As a general-purpose messaging system, Kafka supports a wide range of use cases, particularly in scenarios where high throughput, reliable delivery, and horizontal scalability are important. Common use cases include:
- Stream Processing
- Website Activity Tracking
- Metrics Collection and Monitoring
- Log Aggregation
Some of the important qualities that make Kafka such an attractive option for these use cases include:
- Scalability: Kafka is a distributed system that scales easily with no downtime.
- Durability: Kafka persists messages on disk, and provides intra-cluster replication.
- Reliability: Kafka provides reliability of data through replication. It supports multiple subscribers and automatically rebalances consumers when one of them fails.
- Performance: Kafka offers high throughput for both publishing and subscribing, with O(1) disk structures that provide constant-time performance, even with many terabytes of stored messages.
Kafka Under the Hood
Kafka’s system design can be thought of as that of a distributed commit log, where incoming data is written sequentially to on-disk data structures. To get a better understanding of how it works, let's begin by looking at the main components involved in moving data in and out of Kafka.
In Kafka, a Topic is a user-defined category to which messages are published. Kafka Producers publish messages to one or more Topics, and Consumers subscribe to Topics and process the published messages. Finally, a Kafka cluster consists of one or more servers, called Brokers, that manage the persistence and replication of message data (i.e., the commit log).
One of the keys to Kafka’s high performance is the simplicity of the Brokers’ responsibilities. In Kafka, Topics consist of one or more Partitions that are ordered, immutable sequences of messages. Since writes to a Partition are sequential, this design greatly reduces the number of hard disk seeks and the resulting latency.
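To make the idea of a Partition as an ordered, append-only sequence more concrete, here is a minimal in-memory sketch in Python. It is only an illustration: the `PartitionLog` class and its method names are hypothetical, and it is not Kafka's on-disk format or any Kafka client API.

```python
from typing import List


class PartitionLog:
    """Toy stand-in for a single partition (hypothetical, not Kafka's API)."""

    def __init__(self) -> None:
        self._messages: List[bytes] = []

    def append(self, message: bytes) -> int:
        """Write a message at the end of the log and return its position."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int = 10) -> List[bytes]:
        """Return up to max_messages starting at offset.

        Reading never removes or reorders anything in the log.
        """
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
first = log.append(b"page_view")
second = log.append(b"click")
```

Because new messages only ever go on the end, a writer never has to search for a place to put them. The same property is what lets a real Broker turn writes into sequential disk I/O.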
Another factor contributing to Kafka’s performance and scalability is the fact that Kafka Brokers are not responsible for keeping track of what messages have been consumed – that responsibility falls on the Consumer. In traditional messaging systems such as JMS, the Broker bore this responsibility, severely limiting the system’s ability to scale as the number of Consumers increased.
For Kafka Consumers, keeping track of which messages have been consumed (processed) is simply a matter of keeping track of an Offset, which is a sequential id number that uniquely identifies a message within a partition. Because Kafka retains all messages on disk (for a configurable amount of time), Consumers can rewind or skip to any point in a partition simply by supplying an Offset value. Finally, this design eliminates the potential for back-pressure when consumers process messages at different rates.
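The following sketch illustrates consumer-side offset tracking under simplified assumptions. A plain Python list stands in for a partition, and the `Consumer` class with its `poll` and `seek` methods is hypothetical, chosen for illustration rather than taken from any Kafka client library.

```python
from typing import List


class Consumer:
    """Toy consumer that remembers its own position (hypothetical class)."""

    def __init__(self, partition: List[str], start_offset: int = 0) -> None:
        self.partition = partition
        self.offset = start_offset

    def poll(self, max_messages: int = 1) -> List[str]:
        """Read the next messages and advance this consumer's offset."""
        batch = self.partition[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Rewind or skip ahead by choosing a new offset."""
        self.offset = offset


# One shared partition; each consumer only holds an integer position into it.
partition = ["m0", "m1", "m2", "m3"]
fast = Consumer(partition)
slow = Consumer(partition)

fast_batch = fast.poll(max_messages=4)  # the fast consumer reads everything
slow_batch = slow.poll(max_messages=1)  # the slow consumer is not affected

fast.seek(1)                            # rewind to replay from offset 1
replayed = fast.poll(max_messages=2)
```

Note that the shared list holds no per-consumer state at all. Each consumer's progress is a single integer, which is why adding consumers or letting one fall behind puts no extra bookkeeping on the log itself.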
Kafka and Apache Storm – A Match Made in Heaven
Kafka’s qualities make it an ideal companion to Apache Storm, a distributed real-time computation system. Combining the two technologies enables stream processing at linear scale and ensures that every message is reliably processed in real time. Apache Storm ships with full support for using Kafka 0.8.1 as a data source with both Storm’s core API (Spouts and Bolts) and the higher-level, micro-batching Trident API. This allows you to choose the messaging reliability guarantees that best fit your individual use case, either:
- At-Most-Once processing,
- At-Least-Once processing, or
- Exactly-Once processing
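One general way to think about the first two guarantees is the order in which a consumer records its progress and does its work. The sketch below simulates a crash between those two steps. It is a simplified model, not Storm's or Trident's API; the `consume` function and its parameters are hypothetical. Exactly-once processing typically needs more, such as idempotent or transactional updates to the result store, and is not shown here.

```python
from typing import List, Optional, Tuple


def consume(log: List[str], start: int, commit_first: bool,
            crash_at: Optional[int] = None) -> Tuple[List[str], int]:
    """Process log[start:] and return (processed messages, committed offset).

    crash_at simulates a failure between the commit step and the
    processing step for the message at that offset.
    """
    processed: List[str] = []
    committed = start
    for offset in range(start, len(log)):
        crash = offset == crash_at
        if commit_first:
            committed = offset + 1          # record progress first
            if crash:
                return processed, committed  # message was never processed
            processed.append(log[offset])
        else:
            processed.append(log[offset])   # do the work first
            if crash:
                return processed, committed  # progress was never recorded
            committed = offset + 1
    return processed, committed


log = ["a", "b", "c"]

# Commit before processing: a crash on "b" loses it (at-most-once).
done, resume_from = consume(log, 0, commit_first=True, crash_at=1)
rest, _ = consume(log, resume_from, commit_first=True)
at_most_once = done + rest

# Commit after processing: a crash on "b" replays it (at-least-once).
done, resume_from = consume(log, 0, commit_first=False, crash_at=1)
rest, _ = consume(log, resume_from, commit_first=False)
at_least_once = done + rest
```

In this model, the at-most-once run never sees "b", while the at-least-once run sees it twice. Which trade-off is acceptable depends on whether losing a message or handling a duplicate is the cheaper failure for your use case.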
Storm’s Kafka integration also includes support for writing data to Kafka, enabling complex data flows between components in a Hadoop-based architecture.
Given this degree of flexibility, a common architectural pattern has emerged where Kafka is used to front an unreliable data source, such as the Twitter firehose, in order to provide fault tolerance and guaranteed message delivery.
Watch this blog for future posts on further developments with Apache Kafka in Hortonworks Data Platform.
Discover and Learn More
- Presentation on Apache Kafka by Storm committer Michael Noll
- Download the Apache Kafka CentOS 6 RPM
- Try the Hortonworks Kafka updated tutorial