Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow developers to execute a variety of data-intensive workloads. In this tutorial, we will use an Apache Zeppelin notebook as our development environment to keep things simple and elegant. Zeppelin will […]
This tutorial will get you started with a couple of Spark REPL examples: how to run Spark word count examples, and how to use SparkR. You can choose either the Spark 1.6.x or the Spark 2.x API examples. Prerequisites: This tutorial assumes that you are running an HDP Sandbox. Please ensure you complete the prerequisites […]
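For a feel of what the word count example computes before opening the REPL, here is a minimal pure-Python sketch of the same flatMap/map/reduceByKey logic that the Spark example distributes across a cluster (the function name and sample input are illustrative, not taken from the tutorial):

```python
from collections import Counter

def word_count(lines):
    """Locally mimic Spark's flatMap -> map -> reduceByKey word count:
    split each line into words and sum the occurrences of each word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

print(word_count(["to be or not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark, the same steps run in parallel over HDFS partitions; the per-word aggregation is what `reduceByKey` performs across executors.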
This tutorial will get you started with Apache Spark and will cover: how to use the Spark DataFrame & Dataset API, and how to use the SparkSQL Thrift Server for JDBC/ODBC access. Interacting with Spark will be done via the terminal (i.e., the command line). Prerequisites: This tutorial assumes that you are running an HDP Sandbox. Please […]
This tutorial will help you quickly spin up a cloud environment where you can dynamically resize your cluster from one node to hundreds. HDCloud is ideal for short-lived, on-demand processing, allowing you to quickly perform heavy computation on large datasets. It gives you the ultimate control to allocate and deallocate resources as needed. In […]
Apache Zeppelin is a web-based notebook that enables interactive data analytics. With Zeppelin, you can make beautiful, data-driven, interactive, and collaborative documents with a rich set of pre-built language backends (or interpreters) such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown, Angular, and Shell. With a focus on the enterprise, Zeppelin […]
In this two-part lab-based tutorial, we will first introduce you to Apache Spark SQL. Spark SQL is a higher-level Spark module that allows you to operate on DataFrames and Datasets, which we will cover in more detail later. In the second part of the lab, we will explore an airline dataset using high-level SQL […]
This is a very short tutorial on how to use the SparkSQL Thrift Server for JDBC/ODBC access. Prerequisites: This tutorial assumes that you are running an HDP Sandbox. Please ensure you complete the prerequisites before proceeding: download and install the Hortonworks Sandbox, and review Learning the Ropes of the Hortonworks Sandbox. SparkSQL Thrift […]
In this tutorial, we will explore how you can access and analyze data on Hive from Spark. In particular, you will learn: how to interact with Apache Spark through an interactive Spark shell, how to read a text file from HDFS and create an RDD, and how to interactively analyze a data set through a […]
In this brief tutorial you will run a pre-built Spark example on YARN. Prerequisites: This tutorial assumes that you are running an HDP Sandbox. Please ensure you complete the prerequisites before proceeding: download and install the Hortonworks Sandbox, and review Learning the Ropes of the Hortonworks Sandbox. Pi Example: To test compute […]
This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Python, but Spark also supports development with Java, Scala, and R. The Scala version of this tutorial can be found here, and the Java version here. We’ll be using […]
This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Scala, but Spark also supports development with Java, Python, and R. The Java version of this tutorial can be found here, and the Python version here. We’ll be using […]
This tutorial will teach you how to set up a full development environment for developing and debugging Spark applications. For this tutorial we’ll be using Java, but Spark also supports development with Scala, Python, and R. The Scala version of this tutorial can be found here, and the Python version here. We’ll be using […]
In this tutorial, we will introduce you to Machine Learning with Apache Spark. The hands-on lab for this tutorial is an Apache Zeppelin notebook that has all the steps necessary to ingest and explore data, train, test, visualize, and save a model. We will cover a basic Linear Regression model that will allow us […]
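As a taste of the math involved, here is a hedged, Spark-free sketch of what fitting a basic linear regression means: ordinary least squares for a single feature, in plain Python. The helper name and sample data are illustrative; the tutorial itself trains the model with Spark in a Zeppelin notebook.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b, the single-feature case
    of the linear model that Spark ML fits at scale."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # → 2.0 1.0 (the data lie exactly on y = 2x + 1)
```

Spark's distributed version solves the same minimization over partitioned data, which is what lets the tutorial train on datasets that would not fit on one machine.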
R is a popular tool for statistics and data analysis. It has rich visualization capabilities and a large collection of libraries that have been developed and maintained by the R developer community. One drawback of R is that it is designed to operate on data held in memory, which makes it unsuitable for large datasets. Spark is […]
In this tutorial, we will introduce core concepts of Apache Spark Streaming and run a Word Count demo that computes an incoming list of words every two seconds. Prerequisites: This tutorial is part of a series of hands-on tutorials to get you started with HDP using the Hortonworks Sandbox. Please ensure you complete the prerequisites […]
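To preview the core idea before the demo: Spark Streaming discretizes the stream into micro-batches (here, one every two seconds) and runs a word count on each batch. Below is a minimal local sketch of that per-batch counting, assuming a list of line batches stands in for the live stream (the function name and sample data are illustrative):

```python
from collections import Counter

def micro_batch_counts(batches):
    """Yield one word-count per batch, mirroring how each two-second
    micro-batch in Spark Streaming produces its own word count."""
    for batch in batches:
        yield Counter(word for line in batch for word in line.split())

stream = [["spark streaming demo"], ["demo demo"]]
for counts in micro_batch_counts(stream):
    print(dict(counts))
```

The key contrast with the batch word count is that results are emitted per interval rather than once over the whole dataset; maintaining totals across intervals is the stateful extension the tutorial builds toward.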
This is the third tutorial in a series about building and deploying machine learning models with Apache NiFi and Spark. In Part 1 of the series we learned how to use NiFi to ingest and store Twitter streams. In Part 2 we ran Spark from a Zeppelin notebook to design a machine learning model […]
This tutorial will teach you how to build sentiment analysis algorithms with Apache Spark. We will be doing data transformation using Scala and Apache Spark 2, and we will be classifying tweets as happy or sad using a Gradient Boosting algorithm. Although this tutorial is focused on sentiment analysis, Gradient Boosting is a versatile […]