August 10, 2015

How Spark and Open Enterprise Hadoop Drive Business Value at WebTrends

Open Enterprise Hadoop is already transforming many industries, accelerating Big Data projects to help businesses translate information into competitive advantage.

I’d like to share a real-world example from the digital marketing powerhouse Webtrends, who’ve used the Hortonworks solution to launch a powerful new product line. First, a little context.

Everywhere you look, you can find companies using Open Enterprise Hadoop in large-scale projects to enable deep data discovery, to capture a single view of customers across multiple data sets, and to help data scientists perform predictive analytics.

In these ways, companies meet current customer needs, anticipate shifting market dynamics and consumer behaviors, and test business hypotheses—all crucial capabilities to help them outmaneuver and outperform their competitors.

The booming demand for Big Data has fueled a dizzying rise in spending on the technologies that make it possible, and Hortonworks leads the Hadoop market. In a recent CIO survey, our platform was named a top imperative for IT spending, with top net spending intentions in both the Analytics/BI/Big Data and Data Warehousing sectors among S&P 500 and Global 1000 respondents alike.

Let’s dig a little deeper into a few of the reasons that Hortonworks has become synonymous with Open Enterprise Hadoop success.

One of the most active and remarkable open source projects in the Apache Software Foundation is Apache Spark™, which makes it possible to run programs up to 100X faster than MapReduce using an advanced DAG (directed acyclic graph) execution engine that supports cyclic data flows and in-memory computing.

Spark is also developer-friendly: it supports Java, Scala, Python and R, with 80 high-level operators that make it easy to build parallel apps. Because Spark combines SQL, streaming and complex analytics, it works with a wide range of tools, a key advantage for running analytics against diverse data sources.
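To give a feel for what "high-level operators" on top of a DAG engine means in practice, here is a minimal sketch in plain Python. It is a toy stand-in, not the Spark API: the `LazyDataset` class, its method names and the sample `events` records are all hypothetical, made up for illustration. The idea it shows is that operators such as `map` and `filter` only record a step in a lineage, and nothing runs until an action such as `collect` or `reduce` asks for a result.

```python
from functools import reduce


class LazyDataset:
    """Toy, single-machine illustration of lazy operators. NOT the Spark API."""

    def __init__(self, source, lineage=()):
        self._source = source      # any re-iterable collection, e.g. a list
        self._lineage = lineage    # recorded (kind, function) steps, in order

    def map(self, fn):
        # Transformation: record the step and return a new dataset; no work yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _iterate(self):
        # Replay the recorded steps as one streaming pass over the source.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: triggers evaluation and returns all results as a list.
        return list(self._iterate())

    def reduce(self, fn):
        # Action: triggers evaluation and folds the results into one value.
        return reduce(fn, self._iterate())


# Hypothetical web events, invented for this example.
events = [
    {"page": "/home", "ms": 120},
    {"page": "/cart", "ms": 45},
    {"page": "/home", "ms": 300},
    {"page": "/checkout", "ms": 80},
]

slow_pages = (
    LazyDataset(events)
    .filter(lambda e: e["ms"] > 100)   # keep slow page loads
    .map(lambda e: e["page"])          # project to the page name
    .collect()                         # only now does anything execute
)
total_ms = LazyDataset(events).map(lambda e: e["ms"]).reduce(lambda a, b: a + b)
```

A real engine would split the source across machines and schedule the recorded steps in parallel. This sketch runs everything in one process and only illustrates the programming style.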

Apache Spark has generated a lot of excitement in the Big Data community, inspiring contributions by more than 400 developers since the project started in 2009. Spark is also a natural complement to YARN, which allows multiple data processing engines to interact with data stored in a single platform.

Apache Hadoop YARN unlocks an entirely new approach to Big Data analytics, and Spark is a key pillar in that approach.

Both Spark and YARN are ideally suited to the strategy that Hortonworks has followed since our inception: to enable a modern data architecture that allows users to store data in a single location and interact with it in multiple ways, using whichever data processing engine best matches the analysis.

We integrated Spark into Hortonworks Data Platform (HDP) to make it easy for our customers to apply consistent governance, security and management policies to Spark in the same way that they can for the other data processing engines within HDP (such as Hive, HBase, Storm or Solr).

This cross-stack integration makes Spark on YARN in HDP one of the best options for unlocking the value of large-scale Big Data repositories and extracting rich insights from a data lake, and our enterprise customers want to take full advantage of YARN’s unique power for multi-tenant Big Data analysis. Now data scientists can substantiate machine-learning insights from Spark with interactive insights from Apache Hive or real-time insights from Apache Storm (to name just two of the multiple engines managed by YARN).

But Hortonworks has always known that broad Hadoop adoption requires not just powerful analytics but also enterprise-grade services for operations, data security and governance.

As part of our integration of Spark into HDP, we’ve met those requirements by making key security and operational capabilities available to our customers that plan to use Spark in combination with other processing engines.

Spark Security in HDP

Typically, our customers begin their journey with Spark use cases on HDP clusters that either don’t contain sensitive data or are dedicated to a single application, meaning that they aren’t subject to broad security requirements. They’re relatively self-contained.

Then, with early Spark successes under their belts, those customers want to capture YARN’s unique multi-tenant value. They seek to deploy Spark-based applications alongside other applications in a single cluster, but with this deeper integration they need to meet higher security standards.

Spark Operations in HDP

Hortonworks also helps customers streamline operations for Spark by integrating it with 100% open source Apache Ambari, which is backed by numerous Hortonworks partners including Microsoft, Teradata, Pivotal and HP.

Ambari provisions, manages and monitors HDP clusters, and our partners use Ambari Stacks to rapidly define new components and services, then add them within a Hadoop cluster. With Stacks, Spark components and services can be managed by Ambari to install, start, stop and fine-tune a Spark deployment through a single interface, across every engine in your Hadoop cluster.

To simplify the operational experience, HDP 2.2.4 also allows administrators to install and manage Spark with Apache Ambari 2.0.

Spark at Webtrends

That’s the backstory—now, here’s the case study I promised about how Webtrends is putting all this to work today.

As a provider of web analytics products, services and solutions for more than 2,000 enterprises, Webtrends processes more than 13 billion online events every day.

Before working with Spark as part of HDP, the company was up against challenges in three key areas:

  • Cost – Storage and processing didn’t scale economically, leaving Webtrends exposed to rising costs as its business grew.
  • Duplication – Unable to leverage cloud-based processing, Webtrends had to maintain two separate clusters—one for Spark and another for Hadoop.
  • Analytics – With only a retrospective view of data, the company had limited predictive capabilities, hampering Big Data’s strategic value to anticipate emerging market trends and customer needs.

Now Webtrends uses HDP to run Spark on YARN, leading to dramatic improvements in each of these areas.

  • Cost – By greatly improving storage and processing scalability, Webtrends cut its costs by 20–40 percent while simultaneously adding 500 terabytes of new data per quarter. The company is approaching 1.5 petabytes stored in its HDP data lake.
  • Duplication – By deploying HDP in the cloud, Webtrends unified its two clusters into one that supports both Spark and Hadoop.
  • Analytics – Spark now processes 13 billion events per day for Webtrends at a blistering analytical pace of 40 milliseconds per event.
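To put the Analytics figures in perspective, here is a rough back-of-the-envelope calculation. It rests on two assumptions that the figures above do not confirm: that the 40 milliseconds is the time each event spends being processed, and that events arrive evenly throughout the day. Under those assumptions, Little's law (average items in flight = arrival rate × time in system) gives a sense of the parallelism involved.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 13_000_000_000   # figure quoted above
latency_s = 0.040                 # 40 ms per event, assumed to be per-event latency

# Average arrival rate, assuming load is spread evenly over the day.
events_per_second = events_per_day / SECONDS_PER_DAY

# Little's law: average number of events being processed at any instant.
events_in_flight = events_per_second * latency_s

print(f"{events_per_second:,.0f} events per second")
print(f"{events_in_flight:,.0f} events in flight on average")
```

That works out to roughly 150,000 events per second and about 6,000 events being processed at any given moment. Real traffic is bursty, so peak numbers would be higher than these averages.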

To support deployments like this, Hortonworks has worked consistently to integrate Spark with the security constructs of the broader Hadoop platform, ensuring that Spark can run on a secure Hadoop cluster and leverage the authorization offered by HDFS. We’ve also worked within the community to ensure that Spark runs on a Kerberos-enabled cluster, so that only authenticated users can submit Spark jobs. We outlined our broader security strategy in a September 2014 blog post, “Extending Spark on YARN for Enterprise Hadoop”, and much of the resulting work is now available in HDP 2.3.

And Webtrends keeps pressing its Hadoop-related advantage. As Peter Crossley, the company’s director of architecture, explained recently:

One of the things that Webtrends is working on right now with Hortonworks is the ability to take Spark and Hive and Hadoop…to be able to execute these jobs in parallel.


Based on its leadership with Spark and Hadoop, Webtrends even launched a new product earlier this year. Webtrends Explore™ allows marketers and analysts to dig deep into customer data to gain a complete understanding of online behaviors. This single view of online customers gives marketers the self-service opportunity to engage their customers in the right place, in the right way, at the right time.

We hear these stories throughout the Hortonworks customer community, as enterprises of all kinds leverage the power of open source development, combined with key value-added capabilities in HDP, to create new advanced analytic applications.

We’ll keep bringing you more technology updates and case studies on the art and science of Open Enterprise Hadoop to share lessons that you can use to keep your business at the forefront of Big Data transformation.

Learn More About Spark in Hortonworks Data Platform

About the Author

Matthew Morgan is the vice president of product and alliance marketing for Hortonworks. In this role, he leads Hortonworks product marketing, alliance marketing, vertical solutions marketing, and worldwide sales enablement. His background includes twenty years in enterprise software, including leading worldwide product marketing organizations for Citrix, HP Software, Mercury Interactive, and Blueprint. Feel free to connect with him on LinkedIn or visit his personal blog.



Paul Zikopoulos says:

In this blog posting you wrote “Apache Ambari, which is backed by numerous Hortonworks partners including Microsoft, Teradata, Pivotal and HP.” I would say it’s backed by anyone that is part of the Open Data Platform (ODP) no? (IBM, Verizon, SAS, Splunk, +++) correct?

Hari Sekhon says:

Ha ha I believe Paul is correct, had to get IBM in there 🙂
