Stream Data Processing

Bringing Stream Data Processing to Hortonworks Data Platform

YARN opened up Hadoop for data access by applications other than MapReduce. One of the most commonly demanded use cases was the antithesis of batch: stream processing in Hadoop. Apache Storm is a fully certified component of HDP, and our customers are using stream processing for real-time analysis of some of the most common new types of data, such as sensor and machine data.

Initiative Goals

Streams in HDP
Bringing stream data processing to enterprise Apache Hadoop and Hortonworks Data Platform.
Storm on YARN
Use YARN, the Hadoop operating system, to allow multiple workloads to be applied to Hadoop data simultaneously.
Enterprise Readiness
Bring baseline high availability, management, authentication and advanced scheduling to Storm.

Already Delivered

The team at BackType/Twitter originally conceived Storm to analyze the tweet stream in real time. Storm became an official Apache Incubator project in September 2013. Hortonworks engineering is deeply committed to integrating Storm with Hadoop.

Beginning with Hortonworks Data Platform version 2.1, Apache Storm is a fully certified component of HDP. The current version of Storm:

  • Replaces 0MQ data transport with a pure-Java, Netty-based transport
  • Eliminates the challenge of installing the 0MQ native binaries
  • Includes built-in support for Windows
  • Uses Ambari for simplified installation and management of clusters
  • Improves connectivity with Kafka, HBase and HDFS
  • Is easily monitored with Ganglia and Nagios in Ambari

Coming Next

Phase 2

  • Improved resource utilization with Apache Storm running in YARN
    • Share cluster resources between Storm and Hadoop
    • Minimize data movement
  • Integration with JMS-based enterprise tools
  • Results easily sent to operational dashboards powered by EDWs and RDBMSs
  • Improved security with Kerberos authentication and ACLs for restricted access to topology data
  • Immediate availability of streaming results for interactive query with Apache Hive

Phase 3

  • Simplified HA setup & monitoring with Ambari
  • Scheduler improvements for better load management & SLA guarantees
  • Ambari for topology management & monitoring
  • Simplified topology development with declarative wiring of data sources, for re-use and sharing


Essential Timeline

Phase 1
  • Manage & Monitor via Ambari
  • Kafka, HBase & HDFS Connectors
  • Windows Support
Delivered: Storm 0.9.1 (HDP 2.1)
Phase 2
  • Storm-on-YARN
  • Ingest & Notification for JMS
  • Data Persistence: EDWs & RDBMSs
  • Kerberos Support for Nimbus
  • User Authorization for Topologies
  • Hive Connector for Hive Table Updates
Coming Soon
Phase 3
  • Nimbus HA Management & Setup w/ Ambari
  • Advanced Scheduler
  • Ambari for Topology Management & Monitoring
  • Simplified topology development

Technical Resources

Hortonworks Data Platform
The Hortonworks Data Platform is a 100% open source distribution of Apache Hadoop that is truly enterprise grade, having been built, tested and hardened with enterprise rigor.
Get started with Sandbox
Hortonworks Sandbox is a self-contained virtual machine with Apache Hadoop pre-configured alongside a set of hands-on, step-by-step Hadoop tutorials.
Modern Data Architecture
Tackle the challenges of big data. Hadoop integrates with existing EDW, RDBMS and MPP systems to deliver lower cost, higher capacity infrastructure.