August 13, 2015

Microsoft and Hortonworks Do Spark in the Cloud

In this guest blog, Oliver Chiu, Microsoft’s product marketing manager for Hadoop/Big Data and Data Warehousing, explains how customers can benefit from deploying Apache Spark and HDP on Azure HDInsight for their enterprise and mission-critical big data jobs.

On July 10, Microsoft announced the public preview availability of Apache Spark for Azure HDInsight.

Azure HDInsight is Microsoft’s managed Hadoop-as-a-service offering. It takes the Hortonworks Data Platform (HDP) and architects it for the cloud. Customers get the benefits of Big Data without needing to procure hardware, install/tune, or maintain their own Hadoop clusters. By bringing Apache Spark to Azure HDInsight, we make Spark more easily accessible with the same benefits. HDInsight eliminates much of the heavy lifting associated with deploying, managing and executing tasks on Spark, thus raising the bar on what it means to process big data in the cloud.


For customers, we have seen three specific scenarios in which Spark has changed the game:

  1. Run interactive queries over big data in Hadoop using BI tools or open source notebooks
  2. Create a streaming solution for IoT or a real-time application
  3. Use machine learning algorithms to predict outcomes in your analysis

Interactive queries over big data using BI tools or Open Source Notebooks

As more and more data is collected from a variety of sources, enterprises are eager to get deep analytics about their business. With the release of Spark for HDInsight, analysts and BI professionals can analyze large volumes of unstructured data and build reports with their BI tool of choice or with open source notebooks (e.g., Zeppelin or Jupyter).


Create a streaming solution for IoT or a real-time application

Beyond batch and interactive queries, Spark is also well suited to building real-time solutions for challenges such as fraud detection, click-stream analysis, financial alerts, and telemetry from connected sensors and devices (IoT). Spark Streaming APIs can be used to write complex algorithms expressed with streaming functions like join and window. This makes Spark unique in its ability to handle both batch/interactive queries and streaming workloads with a single execution model.
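
To make the window and join functions mentioned above more concrete, here is a minimal plain-Python sketch of the idea. It is not the Spark Streaming API and does not run on a cluster; it only imitates what a windowed count followed by a join does over a sequence of micro-batches. The function names (`windowed_counts`, `join_with_reference`), the click-stream data, and the `page_owner` lookup table are all hypothetical and chosen for illustration.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length, slide_interval):
    """Yield (batch_index, Counter) each time a window result is due.

    batches: iterable of lists of keys, one list per micro-batch.
    window_length: number of most recent batches each window covers.
    slide_interval: emit a result every this many batches.
    """
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    for index, batch in enumerate(batches, start=1):
        window.append(Counter(batch))
        if index % slide_interval == 0:
            total = Counter()
            for counts in window:
                total.update(counts)
            yield index, total


def join_with_reference(counts, reference):
    """Inner-join windowed counts with a static lookup table on the key."""
    return {key: (n, reference[key]) for key, n in counts.items() if key in reference}


# Hypothetical click stream: each inner list is one micro-batch of page IDs.
clicks = [["home", "cart"], ["home"], ["cart", "pay", "home"], ["pay"]]
# Hypothetical static reference data to join against.
page_owner = {"home": "web-team", "pay": "billing-team"}

results = list(windowed_counts(clicks, window_length=3, slide_interval=2))
for index, counts in results:
    print(index, join_with_reference(counts, page_owner))
```

Each emitted result covers the last three micro-batches and is produced every second batch, so consecutive windows overlap. In a real Spark job the same shape of computation would be expressed with the streaming API's own window and join operations and distributed across the cluster.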

Use machine learning algorithms to predict outcomes in your analysis

As part of Spark, customers will also have access to Spark MLlib, a scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. This allows customers to incorporate predictive analytics into their applications. For customers who want to build more machine learning solutions, Azure Machine Learning is also a good fit, thanks to its easy-to-use experience and its ability to deploy an ML model in minutes as a fully managed web service.

Why choose Microsoft to run Spark?

Spark, an open source project in the Apache ecosystem, has been gaining in popularity, and many different offerings now support it. Microsoft has worked with Hortonworks to make a big bet on Spark: putting the end user first, hardening Spark for your mission-critical applications, and making Spark easy to deploy.

  • Enterprise hardening of Spark for mission-critical deployments: By integrating Spark with Azure, we are ensuring it is ready to meet the demands of your mission-critical deployments. At general availability, Azure guarantees that you can run Spark with a 99.9% service level agreement to ensure continuity and protection against catastrophic events. Customers will have peace of mind with our 24/7 enterprise support and cluster monitoring, which help keep you up and running. We have also enabled premium features not available in open source Spark, such as concurrent queries. This allows multiple queries from one person, or from various users and applications, to share the same cluster resources. Finally, we allow you to externalize all of the metadata content and save your notebooks, making the Spark cluster very close to stateless. This allows you to drop and recreate clusters and pick up where you left off.
  • Ease of deployment: With Spark for HDInsight, there is no time-consuming installation or setup. Azure does it for you. You will be up and running in minutes and can deploy Spark without buying new hardware or paying other up-front costs. As you need to scale, Azure allows you to create larger clusters of any size to process big data on demand. The choice is yours: you can pick a VM type with plenty of SSD storage or a VM type with large amounts of RAM. While you run Spark, you can choose to cache data either in memory or on SSDs. This allows you to easily adjust resources across your applications to optimize for particular workloads.

Deploy Spark in the Cloud with Azure HDInsight (Documentation)

Deploy Spark On-Premises with Hortonworks Data Platform



