July 14, 2016

Microbenchmarking Apache Storm 1.0 Performance

Significant Throughput and Latency Gains Between Apache Storm 0.9 and 1.0

The release of version 1.0 marks another major milestone for Apache Storm. Since Storm became an Apache project in September 2013, much work has gone into maturing its feature set and improving performance by reworking or tweaking various components. (See A Brief History of Apache Storm.)

Some of the notable changes that contribute to improved performance are:

  1. Switching from ZeroMQ to Netty for inter-worker messaging.
  2. Employing batching in the Disruptor queues used for intra-worker messaging.
  3. Optimizations in the Clojure code, such as employing type hints and reducing expensive Clojure lookups in performance-sensitive areas.

In this blog we shall take a look at performance improvements in Storm since its incubation into Apache. To quantify this, we compare the performance numbers of Storm v0.9.0.1, the last pre-Apache release, with the most recent Storm v1.0.1. Storm v0.9.0.1 has also been used as a reference point for performance comparisons against Heron.

Given recent efforts to benchmark Storm “at scale”, here we examine performance from a different angle: we narrow the focus to specific core areas of Storm using a collection of simple topologies. To contain the scope, we limit ourselves to Storm core (i.e. no Trident).

Methodology

Each topology was given at least 4 minutes of “warm up” execution time before measurements were taken. Subsequently, after a minimum of 10 minutes of running, metrics were captured from the Web UI for the last 10-minute window. The captured numbers have been rounded off for readability. In all cases ACK-ing was enabled with 1 acker bolt executor. Throughput (i.e. tuples/sec) was calculated by dividing the total ACK count for the 10-minute window by 600.

Due to some backward incompatibilities (mostly namespace changes) in Storm, two versions of the topologies were written, one for each Storm version. As a general principle, we avoided configuration tweaks to tune performance and stayed with default values. The only config setting we applied was to set the worker’s max heap size to 8GB, to ensure memory was not a limiting factor.
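For reference, these non-default settings could be expressed roughly as follows. This is only a sketch using the Storm 1.x API (in v0.9.0.1 the package prefix was backtype.storm rather than org.apache.storm); the class and method names are illustrative, not part of the actual benchmark code.

```java
import org.apache.storm.Config;

public class BenchmarkConf {
    // The only settings changed from their defaults, per the methodology above.
    public static Config benchmarkConf() {
        Config conf = new Config();
        conf.setNumAckers(1);                                  // one acker bolt executor
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx8g");  // 8 GB max heap per worker
        return conf;
    }
}
```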

Setup:

Cluster:

  • 5-node cluster (1 nimbus and 4 supervisor nodes) running Storm v0.9.0.1
  • 5-node cluster (1 nimbus and 4 supervisor nodes) running Storm v1.0.1
  • 3-node Zookeeper cluster

Hardware:

All nodes had the following configuration:

  • CPU: 2 sockets, 6 cores per socket, hyper-threaded (2 sockets x 6 cores x 2 hyper-threads = 24 logical cores). Intel Xeon CPU E5-2630 0 @ 2.30GHz
  • Memory: 126 GB
  • Network: 10 GigE
  • Disk: 6 disks, each 1TB, 7200 RPM

1. Spout Emit Speed:

Here we measure how fast a single spout can emit tuples.

Topology: This is the simplest topology. It consists of a ConstSpout that repeatedly emits the string “some data” and no bolts. Spout parallelism is set to 1, so there is only one instance of the spout executing.  Here we measure the number of emits per second. Latency is not relevant as there are no bolts.
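To make the setup concrete, below is a rough sketch of what such a constant-emitting spout looks like. It is illustrative rather than the actual benchmark code; the output field name and the use of an incrementing message id (so that ACK-ing is exercised) are assumptions.

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Sketch of a spout that emits the same string on every nextTuple() call.
public class ConstSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long msgId = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Emit with a message id so the tuple is tracked and eventually ACKed.
        collector.emit(new Values("some data"), ++msgId);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("str"));
    }
}
```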

Spout Emit Speed: How fast a single spout can emit tuples

Storm version   Emit rate
v0.9.0.1        108k tuples/sec
v1.0.1          3.2 million tuples/sec

2. Messaging Speed (Intra-worker):

Here we measure the speed at which tuples can be transferred between a spout and a bolt running within the same worker process.

Topology: Consists of a ConstSpout that repeatedly emits the string “some data” and a DevNull bolt which ACKs and then discards every incoming tuple. The spout, bolt and acker were given 1 executor each. The spout and bolt were both run within the same worker.
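A sketch of such a bolt, along with one way the two-component topology might be wired and submitted, is shown below. Again this is illustrative rather than the actual benchmark code: the component names, the shuffle grouping, and the reuse of the hypothetical BenchmarkConf and ConstSpout sketches from above are assumptions.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Sketch of a bolt that ACKs every incoming tuple and discards it.
public class DevNullBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        collector.ack(tuple);  // acknowledge and drop
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output streams
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new ConstSpout(), 1);     // 1 spout executor
        builder.setBolt("devnull", new DevNullBolt(), 1)    // 1 bolt executor
               .shuffleGrouping("spout");

        Config conf = BenchmarkConf.benchmarkConf();        // acker + heap settings from the earlier sketch
        conf.setNumWorkers(1);                              // intra-worker case: everything in one worker process
        StormSubmitter.submitTopology("const-spout-devnull-bolt", conf, builder.createTopology());
    }
}
```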

Intra-Worker Messaging: Speed of tuple transfer between a spout and a bolt within the same worker process

Storm version   Throughput   Latency
v0.9.0.1        87k/sec      16 ms
v1.0.1          233k/sec     3.4 ms

3. Messaging Speed (Inter-Worker 1):

The goal is to measure the speed at which tuples are transferred between a spout and a bolt, when both are running on two separate worker processes on the same machine.

Topology: Same topology as the one used for Intra-worker messaging speed.  The spout and bolt were however run on two separate workers on the same host. The bolt and the acker were observed to be running on the same worker.
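For this and the remaining messaging tests, the only intended change relative to the sketch above is the worker count; exactly where the scheduler places the spout, bolt and acker executors is not forced, which is why the observed placement is noted with each result. A minimal illustration, again reusing the hypothetical BenchmarkConf from earlier:

```java
import org.apache.storm.Config;

public class InterWorkerConf {
    // Same settings as before, but with two worker processes. Which executors land
    // in which worker (and, for the inter-host tests, on which host) is decided by
    // Storm's scheduler; the placements reported here are what was observed.
    public static Config twoWorkerConf() {
        Config conf = BenchmarkConf.benchmarkConf();
        conf.setNumWorkers(2);
        return conf;
    }
}
```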

Inter-Worker 1 Messaging Speed: Spout and bolt run on two separate workers on the same host; bolt and acker share a worker

Storm version   Throughput   Latency
v0.9.0.1        48k/sec      170 ms
v1.0.1          287k/sec     8 ms

4. Messaging Speed (Inter-worker 2):

The goal is to measure the speed at which tuples are transferred when the spout, bolt and acker are all running on separate worker processes on the same machine.

Topology: Same topology as the one used for Intra-worker messaging speed.  The spout, bolt and acker were however run on three separate workers on the same host.

Inter-Worker 2 Messaging Speed: Tuples transferred when the spout, bolt and acker are all running on separate worker processes on the same machine

Storm version   Throughput   Latency
v0.9.0.1        43k/sec      116 ms
v1.0.1          292k/sec     8.6 ms

5. Messaging Speed (Inter-host 1):

The goal is to measure the speed at which tuples are transferred between a spout and a bolt, when both are running in two separate worker processes on two different machines.

Topology: Same topology as the one used for Intra-worker messaging speed but the spout and bolt were run on two separate workers on two different hosts. The bolt and the acker were observed to be running on the same worker.

Inter-Host 1 Messaging Speed: Speed between a spout and a bolt, each in a separate worker process on two different machines

Storm version   Throughput   Latency
v0.9.0.1        48k/sec      845 ms
v1.0.1          316k/sec     13.3 ms

6. Messaging Speed (Inter-host 2):

Here we measure the speed at which tuples are transferred when the spout, bolt and acker are all running on separate worker processes on 3 different machines.

Topology: Again, the same topology as inter-host 1, but this time the acker ran on a separate host.

Inter-Host 2 Messaging Speed: Tuples transferred when the spout, bolt and acker are all running on separate worker processes on 3 different machines

Storm version   Throughput   Latency
v0.9.0.1        50k/sec      1.7 seconds
v1.0.1          303k/sec     7.4 ms

Summary

Throughput (tuples/sec)

Version     Spout Emit   MSGing IntraWorker   MSGing InterWorker 1   MSGing InterWorker 2   MSGing InterHost 1   MSGing InterHost 2
v0.9.0.1    108,000      87,000               48,000                 43,000                 48,000               50,000
v1.0.1      3,200,000    233,000              287,000                292,000                316,000              303,000
Speedup     2863%        168%                 498%                   579%                   558%                 506%

[Chart: Apache Storm 1.0 microbenchmarking throughput (Hortonworks)]

Latency (milliseconds)

Version     MSGing Intra-worker   MSGing Inter-worker 1   MSGing Inter-worker 2   MSGing Inter-host 1   MSGing Inter-host 2
v1.0.1      3                     8                       9                       13                    7
v0.9.0.1    16                    170                     116                     845                   1,700
Speedup     5x                    21x                     13x                     65x                   242x

[Chart: Apache Storm 1.0 microbenchmarking latency (Hortonworks)]

The changes mentioned at the beginning of this blog are major contributors to the improvements we see here. The impact of switching from ZeroMQ to Netty shows up whenever there is more than one worker (which is often the case), while the other changes should benefit almost any topology. Also worth noting is the introduction of the Pacemaker process, which can optionally handle heartbeats to alleviate the load on Nimbus; it does not affect the raw performance of an individual worker, but it helps topologies scale on large clusters.

The numbers suggest that Storm has come a long way in terms of performance, but it still has room to go faster. Here are some of the broad areas that should improve performance in the future:

  • An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in the Clojure code.
  • Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about the way in which topologies are scheduled on the cluster.
  • Based on microbenchmarking and discussions with other Storm developers, there appears to be potential for streamlining the internal queueing for faster message transfer.
  • Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that would reduce inter-task messaging and improve throughput.

Comments

    • As mentioned in the methodology section

      “metrics were captured from the Web UI for the last 10-minute window”

      “Throughput (i.e tuples/sec) was calculated by dividing the total ACKs for the 10 min window by 600.”

      • Hi, Roshan.
        Thank you for the reply.
        I am doing the same measurements with Storm version 1.0.1.
        However, I did not get your numbers. I used the “wordcount example” in Storm and I got a very low number compared to yours. For example, one spout and one bolt in different workers showed “~17000 tuples/sec” at the spout (this is the inter-worker case).
        So, I just wonder what makes the difference, such as configuration or workload.
