Apache Kafka

A fast, scalable, fault-tolerant messaging system

Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. Kafka is often used in place of traditional message brokers built on JMS or AMQP because of its higher throughput, reliability, and replication.

Kafka works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data. Kafka can stream geospatial data from a fleet of long-haul trucks or sensor data from heating and cooling equipment in office buildings. Whatever the industry or use case, Kafka brokers massive message streams for low-latency analysis in Enterprise Apache Hadoop.

What Kafka Does

Apache Kafka supports a wide range of use cases as a general-purpose messaging system for scenarios where high throughput, reliable delivery, and horizontal scalability are important. Apache Storm and Apache HBase both work very well in combination with Kafka. Common use cases include:

  • Stream Processing
  • Website Activity Tracking
  • Metrics Collection and Monitoring
  • Log Aggregation

Some of the important characteristics that make Kafka such an attractive option for these use cases include the following:

  • Scalability: Distributed system that scales easily with no downtime
  • Durability: Persists messages on disk and provides intra-cluster replication
  • Reliability: Replicates data, supports multiple subscribers, and automatically balances consumers in case of failure
  • Performance: High throughput for both publishing and subscribing, with disk structures that provide constant performance even with many terabytes of stored messages

How Kafka Works

Kafka’s system design can be thought of as that of a distributed commit log, where incoming data is written sequentially to disk. There are four main components involved in moving data in and out of Kafka:

  • Topics
  • Producers
  • Consumers
  • Brokers

Kafka Cluster Diagram

In Kafka, a Topic is a user-defined category to which messages are published. Kafka Producers publish messages to one or more topics, and Consumers subscribe to topics and process the published messages. Finally, a Kafka cluster consists of one or more servers, called Brokers, which manage the persistence and replication of message data (i.e., the commit log).
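As a minimal sketch of these roles (the broker address, topic name, and consumer group below are placeholders), the Java kafka-clients API lets a producer publish messages to a topic and a consumer subscribe to it:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TruckTelemetryExample {
        public static void main(String[] args) {
            // Producer: publish a message to the (hypothetical) "truck-telemetry" topic.
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("truck-telemetry", "truck-42", "{\"lat\":40.7,\"lon\":-74.0}"));
            } // close() flushes any buffered messages to the brokers

            // Consumer: subscribe to the same topic and process the published messages.
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "broker1:9092");
            consumerProps.put("group.id", "telemetry-dashboard"); // placeholder consumer group
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("truck-telemetry"));
                // poll(long) is the 0.9/0.10-era new-consumer call; newer clients use poll(Duration).
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }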

Kafka Partition Diagram

One of the keys to Kafka’s high performance is the simplicity of the brokers’ responsibilities. In Kafka, topics consist of one or more Partitions that are ordered, immutable sequences of messages. Since writes to a partition are sequential, this design greatly reduces the number of hard disk seeks (with their resulting latency).
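To illustrate how this looks from a producer (again with placeholder broker, topic, and key names), a record can carry a key; all records with the same key hash to the same partition and are appended to that partition's log in order:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class KeyedWriteExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records keyed "truck-42" always land in the same partition,
                // so this truck's messages are written sequentially to one log.
                RecordMetadata metadata = producer.send(
                        new ProducerRecord<>("truck-telemetry", "truck-42", "{\"speed\":55}")).get();
                System.out.printf("appended to partition %d at offset %d%n",
                        metadata.partition(), metadata.offset());
            }
        }
    }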

Another factor contributing to Kafka’s performance and scalability is the fact that Kafka brokers are not responsible for keeping track of what messages have been consumed – that responsibility falls on the consumer. In traditional messaging systems such as JMS, the broker bore this responsibility, severely limiting the system’s ability to scale as the number of consumers increased.

Kafka Broker Diagram

For Kafka consumers, keeping track of which messages have been consumed (processed) is simply a matter of keeping track of an Offset, which is a sequential id number that uniquely identifies a message within a partition. Because Kafka retains all messages on disk (for a configurable amount of time), consumers can rewind or skip to any point in a partition simply by supplying an offset value. Finally, this design eliminates the potential for back-pressure when consumers process messages at different rates.
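Reading from an offset is equally direct. As a sketch with placeholder broker, topic, group, and offset values, a consumer can assign itself a partition and seek to any retained offset to rewind or replay:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReplayFromOffsetExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
            props.put("group.id", "replay-example");        // placeholder consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition partition = new TopicPartition("truck-telemetry", 0);
                consumer.assign(Collections.singletonList(partition));
                consumer.seek(partition, 42L); // rewind to offset 42 and re-read from there
                for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }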

Recent Kafka Releases

Hortonworks is working to make Kafka easier for enterprises to use. New focus areas include the creation of a Kafka admin panel to create and delete topics and manage user permissions, easier and safer distribution of security tokens, and support for multiple ways of publishing and consuming data via a Kafka REST server/API.

Apache Kafka 0.10.0.1 (HDP 2.5, HDF 2.0)

  • Message Timestamps
  • Automated Replica Leader Election
  • Rack Awareness
  • New Consumer APIs
  • More stable Producer APIs

Apache Kafka 0.9.0.1 (HDP 2.4, HDF 1.2)

  • Wire encryption using SSL
  • SASL support
  • User-defined quotas
  • New Producer APIs

Latest Developments

  • Rack awareness for increased resilience and availability: replicas are isolated so that they are guaranteed to span multiple racks or availability zones.
  • Automated replica leader election for an even distribution of partition leaders across the cluster: Kafka detects uneven distribution, where some brokers serve more data than others, and makes adjustments.
  • Message timestamps: every message in Kafka now has a timestamp field that indicates the time at which it was produced (a short consumer sketch follows this list).
  • SASL improvements, including external authentication servers and support for multiple types of SASL authentication on one server.
  • Ambari Views for visualization of Kafka operational metrics.
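The message-timestamp feature is visible directly from the consumer API. As a minimal sketch (the broker address, topic, and consumer group below are placeholders), a 0.10+ consumer can read each record's timestamp and timestamp type:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TimestampExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder broker address
            props.put("group.id", "timestamp-example");     // placeholder consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("truck-telemetry"));
                for (ConsumerRecord<String, String> record : consumer.poll(1000)) {
                    // timestamp() reports when the message was produced (or appended to the log,
                    // depending on the topic's configured timestamp type).
                    System.out.printf("timestamp=%d (%s) value=%s%n",
                            record.timestamp(), record.timestampType(), record.value());
                }
            }
        }
    }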

Kafka Security

  • Kafka security encompasses multiple needs: encrypting the data flowing through Kafka, preventing rogue agents from publishing data to Kafka, and managing access to specific topics at an individual or group level.
  • As a result, the latest updates in Kafka support wire encryption via SSL, Kerberos-based authentication, and granular authorization options via Apache Ranger or another pluggable authorization system.
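As a rough sketch of how a client connects to a secured cluster, the properties below enable Kerberos (SASL) authentication over an SSL-encrypted channel; the broker address, truststore path, and password are placeholders, and the exact settings depend on how the cluster is secured:

    import java.util.Properties;

    public class SecureClientConfig {
        public static Properties secureClientProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9093");   // placeholder secured listener
            props.put("security.protocol", "SASL_SSL");        // Kerberos (SASL) over an encrypted (SSL) channel
            props.put("sasl.kerberos.service.name", "kafka");
            props.put("ssl.truststore.location", "/etc/security/kafka.client.truststore.jks"); // placeholder path
            props.put("ssl.truststore.password", "changeit");  // placeholder password
            return props;
        }
    }

These settings would be merged into the producer or consumer properties shown earlier, while topic-level access policies are administered separately, for example through Apache Ranger.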
