February 17, 2016

Spark Summit: Accelerating Enterprise Spark

I had the pleasure of speaking at Spark Summit in New York today about accelerating the adoption of Spark by mainstream enterprises. I had to admit at the beginning of my talk that I’m an “open source addict”: over the past 12 years I’ve been blessed to have called JBoss, Red Hat, SpringSource, and Hortonworks home. My focus has been the same at each stop: how can we innovate in open source technology and deliver enterprise-scale, easy-to-use products and solutions that mainstream enterprises can consume?

While I’m excited to talk about the technology itself, it’s always important to root the conversation in why enterprises should care. In the case of Apache Spark, the simple answer is: because Spark helps unlock the enormous potential of data for the enterprise.

I have had the pleasure of working with the team at Webtrends, and they are a great example of exactly what I mean. They adopted Hadoop and Spark a while ago and have since consolidated their separate Spark and Hadoop clusters into a single YARN-based Hortonworks Data Platform (HDP) cluster, where Spark on YARN runs as one of many workloads. The company is approaching 1.5 petabytes stored in its HDP data lake, and Spark now processes 13 billion events per day. What I find most compelling is that this modern data architecture enabled them to introduce a new product offering, Webtrends Explore, which lets their customers dive deep into their data and answer important business questions immediately. You can learn more about Webtrends’ use cases and journey by watching the video here.

 


Another example I presented is how a railroad company is using HDP and Spark to deliver a real-time view of the state of its train tracks. Video images and geolocation are key data elements in the solution, which is focused on preventing accidents before they occur. If this example doesn’t underscore the fact that the age of data has truly arrived for any type of business, then I’m not sure what will.

So with that as context, what are the macro trends we’re seeing?


First, Spark is becoming the de facto data API for many big data processing workloads: to date for analytics and reporting, and more recently for workloads like ETL and streaming. It has become one of the key tools in the toolbox and an important element in a modern data architecture.

Second, Spark is getting broad adoption in the enterprise. A series of use cases is developing rapidly, for example using Spark as a query federation engine, or using it together with HDP ecosystem projects such as Hive and HBase. Any new apps will likely be built on Spark, but the enterprise capabilities that are still missing remain a key concern. That’s where we can bring our expertise to bear.

Third, agile analytic development and data science remain the frontier. We need to democratize Spark, not only for those who know Scala, Java, Python, and R but for the broadest community of “developers” possible. We need better tooling for professional developers as well as business “developers.” We need to encourage universities to pay attention to this movement, and we need to reach out to undergrads and encourage them to learn Spark and/or the tools that ride atop it.

In light of this, Hortonworks’ strategy in relation to Apache Spark is threefold:


#1: Make agile analytic development and data science easier and more productive.  Highlights include:

  • Apache Zeppelin: a web-based notebook for agile analytic development. This open source tool provides a visual interactive experience for uncovering insights and sharing those insights with others.
  • Magellan: an open source library for geospatial analytics that uses Spark as the underlying execution engine. Geospatial data is pervasive in mobile devices, sensors, logs, and wearables. If you are working with geospatial data and big data sets that need spatial context, few open source tools make it easy to parse and query that data at scale, which makes it hard to build business intelligence and predictive analytics apps on top of it. Magellan facilitates geospatial queries and builds upon Spark to address the hard problems of dealing with geospatial data at scale.

#2: Accelerate capabilities that harden Spark for enterprise use, in areas ranging from encryption and security to data governance, HA, DR, operations, and debugging. We’re also improving data integration with things like RDD caching in HDFS, and providing a unified Hive and Spark connector for HBase that eliminates complexity and improves overall performance.

#3: Continue to innovate at the core.  We want to make this the best experience and performance possible with HDP.  No secret sauce.  All open and all going back into the community. This includes enhanced support for YARN with dynamic executor allocation support in HDP so Spark runs better within multitenant YARN clusters. We’ve also been quietly working with the talented folks at Hewlett Packard Labs on providing an optimized Spark experience at the core. I can’t go into details now, but I encourage you to tune in on March 1st!
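For readers who have not used dynamic executor allocation, here is a minimal sketch of what turning it on for a Spark-on-YARN job can look like. It is not HDP-specific tooling: the helper names, the executor bounds, and the `my_job.py` application are hypothetical, and only the `spark.*` property names and the `spark-submit` flags come from Spark itself. It also assumes the external shuffle service is already set up on the YARN NodeManagers, which Spark requires for this feature.

```python
# Sketch only: builds a spark-submit command line that enables dynamic
# executor allocation. Helper names and values are illustrative assumptions.

def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that enable dynamic executor allocation."""
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        # Let Spark grow and shrink the executor count with the workload.
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor is released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def spark_submit_args(app, conf):
    """Turn a property dict into a spark-submit argument list for YARN."""
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # "my_job.py" is a placeholder application name.
    print(" ".join(spark_submit_args("my_job.py", dynamic_allocation_conf(2, 20))))
```

Bounding the executor count on both sides is the part that matters in a multitenant cluster: the lower bound keeps a job responsive, while the upper bound stops a single job from crowding out the other YARN workloads.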

Innovation in the Spark community is moving fast, and we plan on staying in lockstep with the community. For example, within a few hours of the community release of Spark 1.6, we made a technical preview available for deployment on our current version of HDP, and we’re marching quickly to GA.

We live in an age where every business is a data business. Tomorrow’s leaders are already mastering the value of data and embracing an open approach. If you’re just getting started, don’t be shy. Join the community and be part of this journey.

——

Shaun Connolly

@shaunconnolly
