May 16, 2017

Part 2 of HDF Blog Series: A Shared Schema Registry: What is it and Why is it Important?

In Part 1 of this series, we discussed how data-in-motion solutions require both flow management and stream analytics capabilities. We also introduced an exciting new technology that Hortonworks is in the process of releasing, which helps users build streaming analytics apps faster and caters to three different personas in the enterprise: app developers, operations teams, and business analysts.

So you are probably asking the following questions:

  • What is the shared Schema Registry?
  • What does it have to do with a streaming analytics platform?
  • Why is it important?

The rest of this blog answers these questions. Read on!

Streaming Apps Require a Schema

When we set out to build the tools necessary to allow developers to create streaming apps faster, one of the first challenges we had to solve was how to manage schemas for events in the streaming world. When a developer builds a typical streaming app, which includes connecting to a stream source (e.g., Kafka), applying filtering rules, performing aggregations over a time window, applying transformations, and so on, a schema is required for every one of these functions.

Today, most streaming apps hard-code the schema and the serialization/deserialization logic into the app itself. This is an anti-pattern: it prevents schemas from being reused and does not support the governance and operational needs of most enterprises.
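To make this concrete, here is a minimal Java sketch of the anti-pattern (the class, topic, and field names are illustrative, and Apache Avro is assumed on the classpath): the schema literal and serialization logic are compiled into the app, which is exactly what a shared registry avoids.

```java
import org.apache.avro.Schema;

public class HardCodedSchemaApp {
    // Anti-pattern: the schema literal lives inside the app. Every app that
    // touches this topic must copy it and redeploy whenever it changes, so
    // nothing is shared, versioned, or governable.
    private static final Schema TRUCK_EVENT_SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
      + "{\"name\":\"driverId\",\"type\":\"long\"},"
      + "{\"name\":\"speed\",\"type\":\"int\"}]}");

    public static void main(String[] args) {
        // With a shared registry, the literal above would instead be fetched
        // by name, e.g. via a (hypothetical) registry.getLatestSchema("truck_events_avro").
        System.out.println(TRUCK_EVENT_SCHEMA.toString(true));
    }
}
```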

Hortonworks is working on an open source project focused on a shared Schema Registry that aims to solve these challenges.

An Open Source Shared Schema Registry

The shared Schema Registry provides a central schema repository that allows applications and HDF components (NiFi, Storm, Kafka, and others) to flexibly interact with each other.

Applications built using HDF often need a way to share metadata across three dimensions:

  • Data format
  • Schema
  • Semantics or meaning of the data

The fundamental design principle driving the Schema Registry effort is to tackle the challenges of managing and sharing schemas between HDF components and applications so that schema evolution is supported. The goal is to allow a producer and a consumer that understand different versions of a schema to still read all the information shared between those versions and safely ignore the rest.
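To illustrate what it means to read the shared information and safely ignore the rest, here is a minimal sketch using Apache Avro's standard schema resolution (the record and field names are made up; this shows the general Avro mechanism rather than HDF-specific code): a producer writes with schema v1 while a consumer reads with a v2 that adds a defaulted field.

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class SchemaEvolutionSketch {
    // Writer schema: version 1 of a hypothetical truck-event record.
    static final String V1 = "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
        + "{\"name\":\"driverId\",\"type\":\"long\"}]}";
    // Reader schema: version 2 adds a field with a default, so v2 consumers
    // can still read v1 data (backward compatibility).
    static final String V2 = "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
        + "{\"name\":\"driverId\",\"type\":\"long\"},"
        + "{\"name\":\"speed\",\"type\":\"int\",\"default\":0}]}";

    public static void main(String[] args) throws Exception {
        Schema writer = new Schema.Parser().parse(V1);
        Schema reader = new Schema.Parser().parse(V2);

        // The producer serializes a record with schema v1...
        GenericRecord rec = new GenericData.Record(writer);
        rec.put("driverId", 42L);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
        enc.flush();

        // ...while the consumer deserializes with schema v2: Avro's schema
        // resolution fills the missing 'speed' field from its default.
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded =
            new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
        System.out.println(decoded); // {"driverId": 42, "speed": 0}
    }
}
```

A registry makes this work at scale: it stores every version, so a consumer can always resolve the schema a producer wrote with by its registered version.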

Hence, a shared Schema Registry provides the following value for HDF and the applications that integrate with it:

  • Centralized registry – Provides reusable schemas so that a full schema does not need to be attached to every piece of data (see the sketch after this list).
  • Version management – Defines the relationships between schema versions so that consumers and producers can evolve at different rates.
  • Schema validation – Enables generic format conversion, generic routing, and data quality checks.
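One common way the first two bullets are realized on the wire (an assumption for illustration, not a documented HDF format) is to prepend a small header holding a registry-assigned schema-version ID, so each message carries a pointer to its schema rather than the schema itself:

```java
import java.nio.ByteBuffer;

public class VersionedPayloadSketch {
    // Prepend an 8-byte schema-version ID to the serialized payload.
    static byte[] encode(long schemaVersionId, byte[] avroBytes) {
        return ByteBuffer.allocate(8 + avroBytes.length)
                .putLong(schemaVersionId)
                .put(avroBytes)
                .array();
    }

    // A consumer reads the ID first, fetches that schema version from the
    // registry (typically cached locally), then deserializes the remainder.
    static long readSchemaVersionId(byte[] message) {
        return ByteBuffer.wrap(message).getLong();
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};        // stand-in for Avro-encoded bytes
        byte[] msg = encode(7L, payload);  // pretend version 7 of some schema
        System.out.println(readSchemaVersionId(msg)); // prints 7
    }
}
```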

Common Use Case for Schema Registry: Schema for Apache Kafka Topics/Events

When Apache Kafka is integrated into enterprise deployments, you typically end up with many different Kafka topics used by different apps and users. As Kafka adoption grows within the enterprise, some frequent key questions include:

  • What are the different events in a given Kafka topic?
  • What do I put into a given Kafka topic?
  • Do all Kafka events have a similar type of schema?
  • How do I parse and use the data in a given Kafka topic?

While Kafka topics themselves do not carry a schema, having an external store that tracks this metadata for a given topic helps answer these common questions. A shared Schema Registry addresses this use case. You can integrate with the Schema Registry in a number of ways, including the following (a REST lookup is sketched after this list):

  • REST APIs
  • Schema Registry Java client
  • Schema Registry User Interface
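For instance, a consumer could look up the latest schema registered under a topic's name over HTTP. The sketch below is plain Java; the host, port, endpoint path, and response shape are assumptions for illustration, so consult the registry's REST API documentation for the exact resources.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RegistryLookupSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: fetch the latest schema registered under
        // the name "truck_events_avro".
        URL url = new URL("http://registry-host:9090/api/v1/schemaregistry"
            + "/schemas/truck_events_avro/versions/latest");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                body.append(line);
            }
            // The response is expected to carry the schema text plus version
            // metadata; a consumer would parse the Avro schema out of it.
            System.out.println(body);
        }
    }
}
```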

The image below showcases the Schema Registry User Interface, describing a schema registered for the values in a Kafka topic called truck_events_avro:

[Image: Schema Registry UI showing the schema registered for the truck_events_avro topic]

A Registry Designed for More Than Just Kafka or Schemas

One key point to highlight is that the Schema Registry is not just a schema metastore for Kafka. The shared Schema Registry was designed to provide a centralized, versioned schema store for any type of event data store (e.g., log files, other messaging systems, cloud services). In addition, the architecture allows a schema to be just one type of entity in the registry. We envision a registry of different versioned entities, including the following:

  • Template Registry – A registry service for templates for stream apps, NiFi flows, etc.
  • Model Registry – A registry service for managing predictive models (e.g., PMML, POJO-based, etc.).
  • Rules Registry – A registry service for declarative rules.

We will share our vision of these different types of registries in the coming months.

Powerful HDF Integration with the Shared Schema Registry

The shared Schema Registry will be part of the next release of the Hortonworks DataFlow (HDF) platform. Here is a preview of some of the HDF integrations with the Schema Registry that will be available in that release:

  • Apache NiFi Integration (as part of the Apache NiFi 1.2.0 release) – A key feature introduced in the Apache NiFi 1.2.0 release is the notion of RecordReaders and RecordWriters. These new controller services allow you to convert events from one type (JSON, XML, CSV, Avro) to another (JSON, XML, CSV, Avro) and perform processing on them (e.g., SQL). To enable this, a schema needs to be projected onto the events, and NiFi's integration with the shared Schema Registry to look up that schema is the key enabler of the feature. For more details on this integration, check out the following HCC blog: Record based processors in Apache NiFi 1.2. The image below shows what this integration looks like:

  • Schema Registry and Stream Processing Integration – In Part 1 of our blog series, we introduced Hortonworks' thoughts on building the next generation of streaming apps. The shared Schema Registry is an integral part of the streaming analytics platform that Hortonworks is building. For example, when a streaming developer creates a stream by connecting to a Kafka broker and selecting a Kafka topic, it is important that the tool connect to the shared Schema Registry and retrieve the schema for that topic. This is the schema the developer will use to implement analytics (joining streams, aggregations over windows, CEP, etc.). The image below gives a glimpse of how this integration would work; a consumer-side code sketch follows this list.

[Image: Streaming analytics tool retrieving a Kafka topic's schema from the shared Schema Registry]

  • Apache Atlas Integration (coming in the future) – Integrating Atlas with the Schema Registry allows schemas for Kafka topics to become discoverable data sets, similar to Hive tables, addressing various governance, audit, and security requirements.
  • Apache Ranger Integration (coming in the future) – Via the Atlas and Ranger integration with the Schema Registry, tag-based Ranger policies can deny a given user or group access to a Kafka topic at the field level.
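Here is a hedged sketch of the consumer side of the stream-processing flow in plain Java (the broker address, topic, group ID, and schema literal are placeholders; it assumes the Apache Kafka client library 2.x or later plus Apache Avro, and a real tool would fetch the schema from the registry rather than inlining it):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class StreamSourceSketch {
    public static void main(String[] args) throws Exception {
        // Assumption: the value schema for the topic was fetched once from the
        // shared Schema Registry and cached; inlined here for self-containment.
        Schema valueSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"TruckEvent\",\"fields\":["
          + "{\"name\":\"driverId\",\"type\":\"long\"}]}");

        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");
        props.put("group.id", "truck-analytics");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("truck_events_avro"));
            GenericDatumReader<GenericRecord> reader =
                new GenericDatumReader<>(valueSchema);
            while (true) {
                for (ConsumerRecord<byte[], byte[]> rec :
                        consumer.poll(Duration.ofSeconds(1))) {
                    // Deserialize each raw value using the registry-provided
                    // schema; downstream analytics operate on GenericRecords.
                    GenericRecord event = reader.read(null,
                        DecoderFactory.get().binaryDecoder(rec.value(), null));
                    System.out.println(event.get("driverId"));
                }
            }
        }
    }
}
```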


What’s Next?

The next blog in this series will provide more details on the new streaming analytics platform that Hortonworks has been working on for the last eight months. Stay tuned!
