June 03, 2015

Apache Spark on HDP: Learn, Try and Do

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

As a distributed data processing and compute platform, Spark offers much of what developers desire and delight in, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data; to the data scientist, it offers machine learning libraries through its MLlib component; and to the data analyst, it offers SQL capabilities for interactive queries.
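To make the MLlib point concrete, here is a minimal sketch, assuming a Spark shell session (where sc, the SparkContext, is predefined) and the Spark 1.x-era MLlib KMeans API; the points are toy data for illustration:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // toy 2-D points, made up for illustration
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

    // cluster into k = 2 groups, with at most 10 iterations
    val model = KMeans.train(points, 2, 10)
    model.clusterCenters.foreach(println)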

In this blog, I summarize how you can get started, enjoy Spark’s delights, and embark on a quick journey to Learn, Try, and Do Spark on HDP with a set of tutorials.

Spark on HDP

Spark on Apache Hadoop YARN enables deep integration with Hadoop and gives developers and data scientists alike two modes of development and deployment on the Hortonworks Data Platform (HDP).

In local mode, running on a single node such as the HDP Sandbox, you can get started with a set of tutorials put together by my colleague Saptak Sen.

  1. Hands on Tour with Apache Spark in Five Minutes. Besides introducing basic Apache Spark concepts, this tutorial demonstrates how to use the Spark shell with Python. Often, simplicity does not preclude profundity: in this simple example, a lot is happening behind the scenes and under the hood, but it’s hidden from the developer using the interactive Spark shell. If you are a Python developer and have used the Python shell, you’ll appreciate the interactive PySpark shell.
  2. Interacting with Data on HDP using Scala and Apache Spark. Building on the concepts introduced in the first tutorial, this tutorial explores how to use Spark with a Scala shell to read data from an HDFS file, perform in-memory transformations on an RDD, iterate over the results, and display them inside the shell.
  3. Using Apache Hive with ORC from Apache Spark. While the first two tutorials explored reading data from HDFS and computing in-memory, this tutorial shows how to persist data as Apache Hive tables in the ORC format and how to use SchemaRDDs and DataFrames. Additionally, it shows how to query Hive tables using Spark SQL (see the sketch after this list).
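As a flavor of tutorials 2 and 3 combined, here is a hedged sketch of a Spark 1.3-era Scala shell session that reads a CSV file from HDFS, transforms it into a DataFrame, persists it as an ORC-backed Hive table, and queries it back with Spark SQL; the file path, schema, and table names are hypothetical:

    // start the shell in local mode (./bin/spark-shell) or on YARN
    // (./bin/spark-shell --master yarn-client); `sc` is then predefined
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // hypothetical schema for a stock-quotes CSV
    case class Stock(date: String, open: Double, close: Double)

    // read from HDFS and transform in memory
    val stocks = sc.textFile("hdfs:///tmp/stocks.csv")
      .map(_.split(","))
      .map(r => Stock(r(0), r(1).toDouble, r(2).toDouble))
      .toDF()

    // persist as an ORC-backed Hive table via a temporary table
    stocks.registerTempTable("stocks_temp")
    hiveContext.sql(
      "CREATE TABLE IF NOT EXISTS stocks_orc " +
      "(date STRING, open DOUBLE, close DOUBLE) STORED AS ORC")
    hiveContext.sql("INSERT INTO TABLE stocks_orc SELECT * FROM stocks_temp")

    // query the Hive table back with Spark SQL
    hiveContext.sql("SELECT date, close FROM stocks_orc LIMIT 10")
      .collect().foreach(println)

The same session runs unchanged in either deployment mode; only the --master setting passed to the shell differs.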

What’s Next?

Our commitment to Apache Spark is to ensure it’s YARN-enabled and enterprise-ready with security, governance, and operations, allowing deep integration with Hadoop and other YARN-enabled workloads in the enterprise, all running on the same Hadoop cluster and all accessing the same dataset.

We continue with that steadfast strategy. Last month, we released a technical preview of Apache Spark 1.3.1 on HDP 2.2. Shortly, we’ll follow with a 1.3.1 GA.


Comments

  • It looks like you’ve got an invalid link for “Using Apache Hive with ORC from Apache Spark”.

    • Hello Maciek,

      Whenever we release a version of Spark on HDP, we ensure that it’s enterprise-ready. That means we want our customers to be able to deploy and manage it easily via Ambari, to integrate security via Ranger, to have it support Hive access formats, and to have it run as a first-class citizen on YARN. All these enterprise requirements take a few months to test and validate before general availability. Hence our practice of rolling out technical previews, followed by general availability.

      In due time, we will have 1.4.1 as a technical preview on HDP.

        • I can’t provide any specific dates. However, it’s on our roadmap to offer HDP developers access to tech previews as early as possible.

  • Trying to copy the ‘littlelog.csv’ file to /tmp using the “hadoop fs -put ./littlelog.csv /tmp/” command at the command line. I have placed the csv file in several locations (home and the HDFS directory), but I keep getting an error message saying “No such file or directory exists”.

    Any suggestions on what I am doing wrong?

    • You need to copy and paste the lines shown in the tutorial into a local file called ‘littlelog.csv’, and then use the “hadoop fs -put ./littlelog.csv /tmp/” command to copy it into the HDFS location /tmp.

  • Thanks for the tutorials. These deal with interactive shells. Is there any example of creating a simple batch Spark application that can run on HDP in batch (non-interactive) mode? Typically something like reading a file or a Hive table (ORC?), doing some data processing / transformation, and writing output to a file. Thanks.
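    • For readers with the same question, here is a minimal sketch of such a batch job, assuming Spark 1.3-era APIs; the object name, table, output path, and the spark-submit invocation are hypothetical:

        import org.apache.spark.{SparkConf, SparkContext}
        import org.apache.spark.sql.hive.HiveContext

        // a self-contained batch job: read a Hive table, transform, write to HDFS;
        // submit with something like (hypothetical):
        //   spark-submit --class SimpleBatchJob --master yarn-cluster simple-batch-job.jar
        object SimpleBatchJob {
          def main(args: Array[String]): Unit = {
            val sc = new SparkContext(new SparkConf().setAppName("SimpleBatchJob"))
            val hiveContext = new HiveContext(sc)

            // read an ORC-backed Hive table and filter it
            val rows = hiveContext.sql(
              "SELECT date, close FROM stocks_orc WHERE close > 100")

            // write the result out as CSV lines on HDFS
            rows.rdd.map(r => s"${r(0)},${r(1)}").saveAsTextFile("hdfs:///tmp/stocks_out")

            sc.stop()
          }
        }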
