In Part 1 of this series, we discussed how data-in-motion solutions require both flow management and stream analytics capabilities. We also introduced an exciting new technology that Hortonworks is in the process of releasing, which helps users build streaming analytics apps faster and caters to three personas in the enterprise: app developers, operations teams, and business analysts.
So you are probably asking the following questions:
The rest of this blog answers these questions. Read on!
When we set out to build the tools necessary to allow developers to create streaming apps faster, one of the first challenges we had to solve was how to manage schemas for events in the streaming world. A typical streaming app connects to a stream source (e.g., Kafka), applies filtering rules, performs aggregations over a time window, applies transformations, and so on; a schema is required for every one of these functions.
Today, most streaming apps hard-code the schema and its serialization/deserialization logic into the streaming app itself. This is an anti-pattern: it prevents schemas from being reused and does not support the governance and operational needs of most enterprises.
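To make the contrast concrete, here is a minimal Python sketch. The in-memory registry class, subject names, and field layout are all hypothetical illustrations, not the actual Schema Registry API; the point is only that the externalized version resolves the schema at runtime instead of baking it into the app:

```python
import json

# Anti-pattern: the schema is baked into the app, so every copy of the
# consumer must be rebuilt and redeployed whenever the schema changes.
HARDCODED_SCHEMA = {"fields": ["driver_id", "speed"]}

def parse_hardcoded(raw_bytes):
    event = json.loads(raw_bytes)
    return {f: event.get(f) for f in HARDCODED_SCHEMA["fields"]}

# Externalized: the app asks a shared registry for the schema at runtime.
# This class stands in for a remote registry service (hypothetical sketch).
class InMemorySchemaRegistry:
    def __init__(self):
        self._schemas = {}

    def register(self, subject, version, schema):
        self._schemas[(subject, version)] = schema

    def get(self, subject, version):
        return self._schemas[(subject, version)]

def parse_with_registry(registry, subject, version, raw_bytes):
    schema = registry.get(subject, version)   # schema lives outside the app
    event = json.loads(raw_bytes)
    return {f: event.get(f) for f in schema["fields"]}

registry = InMemorySchemaRegistry()
registry.register("truck_events", 1, {"fields": ["driver_id", "speed"]})
parsed = parse_with_registry(registry, "truck_events", 1,
                             b'{"driver_id": 42, "speed": 65}')
```

Because the schema is fetched by subject and version, many apps can share one definition, and governance (who changed which version, when) happens in one place rather than in every codebase.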
Hortonworks is working on an open source project focused on a shared Schema Registry that aims to solve these challenges.
The shared Schema Registry provides a central schema repository that allows applications and HDF components (NiFi, Storm, Kafka, and others) to flexibly interact with each other.
Applications built using HDF often need a way to share metadata across three dimensions:
The fundamental design principle driving the Schema Registry effort is to tackle the challenges of managing and sharing schemas between HDF components and applications so that schema evolution is supported. The goal is to allow a consumer and a producer to understand different schema versions, read all information shared between both versions, and safely ignore the rest.
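That compatibility rule can be sketched in a few lines of Python. This is a simplified illustration of the principle, assuming JSON-encoded events and schemas modeled as field-name-to-default maps; it is not Avro's actual schema resolution algorithm, and the field names are invented for the example:

```python
import json

# Schemas modeled as {field_name: default_value}. Version 2 adds a
# field with a default so records written under v1 can still be read.
SCHEMA_V1 = {"driver_id": None, "speed": None}
SCHEMA_V2 = {"driver_id": None, "speed": None, "route_id": -1}

def resolve(record_bytes, reader_schema):
    """Read a record written under any compatible version: fill fields
    the writer omitted from the reader's defaults, and safely ignore
    fields the reader does not know about."""
    record = json.loads(record_bytes)
    return {field: record.get(field, default)
            for field, default in reader_schema.items()}

# A v2 consumer reading a record produced under v1 gets the default:
old_record = b'{"driver_id": 7, "speed": 55}'
resolved = resolve(old_record, SCHEMA_V2)

# A v1 consumer reading a v2 record simply ignores the extra field:
new_record = b'{"driver_id": 7, "speed": 55, "route_id": 3}'
downgraded = resolve(new_record, SCHEMA_V1)
```

The registry's job is to store these versions centrally and enforce that each new version stays compatible with the rule above, so producers and consumers can upgrade independently.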
Hence, a shared Schema Registry provides the following value for HDF and the applications that integrate with it:
When Apache Kafka is deployed in the enterprise, you typically have many different Kafka topics used by different apps and users. As Kafka adoption grows within an organization, frequent key questions include:
While Kafka topics themselves do not carry a schema, an external store that tracks this metadata for a given topic helps answer these common questions, and a shared Schema Registry addresses exactly this use case. You can integrate with the Schema Registry in a number of ways, including:
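One common integration pattern is to prefix each Kafka message with a schema identifier, so any consumer can fetch the right schema at read time instead of hard-coding it. The sketch below illustrates the idea; the 4-byte wire format, the in-memory registry dict, and the field names are all invented for illustration and do not reflect the registry's actual protocol:

```python
import json
import struct

# Stand-in for the remote registry: schema id -> list of field names.
REGISTRY = {1: ["driver_id", "speed"]}

def encode(schema_id, event):
    # Prefix the payload with a big-endian 4-byte schema id
    # (illustrative wire format, not the real one).
    return struct.pack(">I", schema_id) + json.dumps(event).encode()

def decode(message):
    schema_id, = struct.unpack(">I", message[:4])
    schema = REGISTRY[schema_id]      # registry lookup at read time
    event = json.loads(message[4:])
    return {f: event.get(f) for f in schema}

msg = encode(1, {"driver_id": 42, "speed": 65})
decoded = decode(msg)
```

With this pattern, the topic's bytes stay compact (an id rather than a full schema per message), and the question "what data is in this topic?" is answered by looking the id up in the registry.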
The image below showcases the Schema Registry user interface, showing a schema registered for the values in a Kafka topic called truck_events_avro:
One key point to highlight is that the Schema Registry is not just a schema metastore for Kafka. The shared Schema Registry was designed to provide a centralized, versioned schema store for any type of event data store (e.g., log files, other messaging systems, cloud services). In addition, the architecture treats a schema as just one type of entity in the registry. We envision a registry of different versioned entities, including the following:
We will share our vision of these different types of registries in the coming months.
The shared Schema Registry will ship in the next release of the Hortonworks DataFlow (HDF) platform. Here is a preview of some of the HDF integrations with the Schema Registry that will be available in that release:
The next blog in this series will provide more details on the new streaming analytics platform that Hortonworks has been working on for the last 8 months. Stay tuned!