YARN changed the game for all data access engines in Apache Hadoop. As part of Hadoop 2, YARN took the resource management capabilities that were in MapReduce and packaged them for use by new engines. Now Apache Storm is one of those data-processing engines that can run alongside many others, coordinated by YARN.
YARN’s architecture makes it much easier for users to build and run multiple applications in Hadoop, all sharing a common resource manager. We see those applications emerging and are excited by the additional opportunities this brings for Storm.
Now let’s talk about Apache Storm…
Apache Storm is a distributed, fault-tolerant, and highly scalable platform for processing streaming data. Storm supports a wide range of use cases, including real-time analytics, machine learning, continuous computation, and more. It is also extremely fast, with the ability to process over a million records per second per node on a modestly sized cluster.
With the explosion of data sources in recent years, many Apache Hadoop users have recognized the necessity to process data in real time while also maintaining traditional batch and interactive data workflows. Apache Storm fills that real-time role and offers tight integration with many of the tools and technologies commonly found in the Hadoop ecosystem.
Nathan Marz originally developed Storm while he was at BackType (later acquired by Twitter), and it was open-sourced in September of 2011. In September 2013, Storm entered the Apache Software Foundation (ASF) as an incubator project. Since then, the Storm community has flourished and grown significantly, delivering a number of software releases in accordance with the strict licensing guidelines required of any Apache project.
These rules are important to both software developers and users.
We are pleased to say that over the course of the past year, the Apache Storm community has demonstrated the ability to adhere to these requirements, and is finalizing the steps necessary to graduate to an Apache Top Level Project.
Like any promising student on the verge of graduation, the Apache Storm community is looking ahead to a bright future. So what can we expect of Apache Storm?
YARN fundamentally changed Hadoop resource management and enabled the deployment of new services in the Hadoop infrastructure. Many users are eager to see Apache Storm take advantage of these YARN capabilities.
There are a number of efforts underway to bring YARN support to Apache Storm. In the near term, Apache Slider will bring YARN support not just to Storm, but also to virtually any long-running application. This significantly lowers the barrier to adding YARN support to an existing application or framework.
In the longer term, we will see a much deeper, native integration between Apache Storm and YARN. There are countless opportunities in this area including dynamic, elastic scaling of YARN-based Storm clusters in response to resource utilization.
Today, you can think of a Storm cluster as an infrastructure to which you deploy applications (topologies). Tomorrow, this distinction will be blurred significantly, allowing developers and data scientists to focus more on their applications, and less on the infrastructure in which they run.
Much like the early days of Hadoop, Apache Storm originally evolved in an environment where security was not a high-priority concern. Rather, it was assumed that Storm would be deployed to environments suitably cordoned off from security threats. While a large number of users were comfortable setting up their own security measures for Storm, this proved a hindrance to broader adoption among larger enterprises where security policies prohibited deployment without specific safeguards.
Yahoo! hosts one of the largest Storm deployments in the world, and the engineering team recognized the need for security early on, so it implemented many of the features necessary to secure its own Apache Storm deployment. Yahoo!, Hortonworks, Symantec, and the broader Apache Storm community have been working on integrating those security innovations into the main Apache code base.
That work is nearing completion and is slated for inclusion in an upcoming Apache Storm release.
In the future, you can expect to see further integration between Apache Storm and security-focused projects like Apache Argus (formerly XA Secure).
Apache Storm is already highly scalable and fast enough that most common use cases require clusters of fewer than 20 nodes, even with SLAs that require processing more than a million records per second. Some users, however, have experienced limitations when trying to deploy larger clusters in the range of several thousand nodes.
Each Apache Storm release has seen incremental improvement in terms of performance and scalability, and you can expect this trend to continue. In the future, Apache Storm will scale to several thousand nodes for those who need that level of real-time processing power.
Experienced Storm users will recognize that the Storm Nimbus service is not a single point of failure in the strictest sense (i.e. loss of the Nimbus node will not affect running topologies). However, the loss of the Nimbus node does degrade functionality for deploying new topologies and reassigning work across a cluster.
Upcoming releases will eliminate this “soft” point of failure by supporting an HA Nimbus: multiple instances of the Nimbus service will run in a cluster and perform leader election when the active Nimbus node fails.
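The failover idea can be sketched with a toy coordination service. Storm coordinates through Apache ZooKeeper, and the pattern below mirrors ZooKeeper-style leader election with ephemeral sequential nodes; the class and method names here are illustrative stand-ins, not Storm’s or ZooKeeper’s actual API.

```python
# Minimal sketch of leader election among standby Nimbus instances,
# assuming a ZooKeeper-like coordination service. All names are
# hypothetical; this is not Storm's real HA Nimbus implementation.

class Coordinator:
    """Stands in for a coordination service such as ZooKeeper."""
    def __init__(self):
        self.seq = 0
        self.candidates = {}  # sequence number -> candidate name

    def register(self, name):
        """Register a candidate under an ever-increasing sequence number
        (analogous to creating an ephemeral sequential znode)."""
        self.seq += 1
        self.candidates[self.seq] = name
        return self.seq

    def deregister(self, seq):
        """The candidate process died; its 'ephemeral node' disappears."""
        self.candidates.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        if not self.candidates:
            return None
        return self.candidates[min(self.candidates)]

coord = Coordinator()
n1 = coord.register("nimbus-1")
n2 = coord.register("nimbus-2")
print(coord.leader())  # nimbus-1, the first registrant, is elected

coord.deregister(n1)   # simulate failure of the active Nimbus
print(coord.leader())  # nimbus-2 takes over automatically
```

Because topology state lives in ZooKeeper rather than on the Nimbus host, a newly elected leader can resume scheduling without affecting running topologies.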
Whether you are developing a batch-processing application, a stream-processing application, or a hybrid of the two (e.g. a Lambda architecture), higher-level abstractions and visualization tools enable developers and data scientists to more easily understand and manipulate data flows.
Tools such as Pig, Cascading, and the SQL on Hadoop ecosystem are excellent examples of how those abstractions can significantly improve productivity. This applies equally to stream processing. In the future we will see many of the same concepts implemented on top of Storm, along with more advanced tools to visualize and manage streaming data flows.
We will also see broader language support for Storm. Apache Storm has always embraced polyglot programming to better support the languages with which developers are most comfortable and productive.
The recent surge of interest in Apache Spark and the fledgling Spark Streaming project have raised awareness of the concept of micro-batching, where real-time data streams are partitioned into smaller “mini batches” for processing. With micro-batching, you trade the inherent low latency of one-at-a-time processing for higher throughput (and higher latency).
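The trade-off is easy to see in a few lines: records wait until a batch fills, which adds latency, but each processing call then amortizes its overhead across the whole batch. The `batch_size` and `process_batch` below are hypothetical, illustrating only the general micro-batching pattern.

```python
# Illustrative sketch of micro-batching: incoming records are grouped
# into fixed-size mini-batches before processing. Not tied to any
# particular framework's API.

def micro_batches(stream, batch_size):
    """Partition an (ordinarily unbounded) stream into mini-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # records wait here -> added latency
            batch = []
    if batch:                # flush the final partial batch
        yield batch

def process_batch(batch):
    # One invocation amortizes per-call overhead across the whole
    # batch -> higher throughput than one call per record.
    return sum(batch)

stream = range(1, 11)        # stand-in for a real-time stream
results = [process_batch(b) for b in micro_batches(stream, 4)]
print(results)               # [10, 26, 19]
```

One-at-a-time processing corresponds to `batch_size=1`: minimal latency per record, but per-call overhead is paid on every record.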
One of the interesting properties of micro-batching is that given the right primitives, it’s possible to implement exactly-once processing, which is just what Storm’s Trident API does.
Storm has supported both one-at-a-time and micro-batch processing (the latter via the Trident API) since (pre-Apache) version 0.8.0. Trident differentiates Storm from other stream-processing frameworks because it supports Distributed Remote Procedure Calls (DRPCs). DRPCs leverage Storm’s inherent parallelism in synchronous request-response invocations. In the future you can expect to see broader adoption of Storm’s Trident API, with API enhancements and improved tooling and language support.
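The exactly-once idea behind Trident can be sketched in a few lines: tag each batch with a transaction id, and store that id atomically alongside the state it updates, so a batch replayed after a failure is applied at most once. The class below is a conceptual sketch of this pattern, not Trident’s real API.

```python
# Conceptual sketch of exactly-once state updates built on
# micro-batches, in the spirit of Trident's transactional state.
# Class and method names are illustrative, not Trident's actual API.

class TransactionalCount:
    """A counter that remembers the txid of the last applied batch, so
    a replayed batch (same txid) is never counted twice."""
    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        if txid == self.last_txid:
            return                # replay after a failure: skip it
        self.count += len(batch)
        self.last_txid = txid     # txid and count stored together

state = TransactionalCount()
state.apply_batch(1, ["a", "b", "c"])
state.apply_batch(2, ["d"])
state.apply_batch(2, ["d"])       # batch 2 replayed -> no double count
print(state.count)                # 4
```

This sketch assumes batches are replayed with identical contents and applied in txid order, which is the guarantee a transactional spout provides.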
The future of Apache Storm is very bright indeed, with exciting times ahead. The Storm community is growing, and we encourage anyone interested to participate by contributing to the project.