HDP Analyst: Data Science

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.


Duration

3 days


Prerequisites

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Target Audience

Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.


Format

  • 50% Lecture/Discussion
  • 50% Hands-on Labs

Course Objectives

At the completion of the course, students will be able to:

  • Recognize use cases for data science 
  • Describe the architecture of Hadoop and YARN 
  • Describe the differences between supervised and unsupervised learning
  • List the six machine learning tasks 
  • Use Mahout to run a machine learning algorithm on Hadoop 
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script 
  • Use NumPy to analyze big data 
  • Use the data structure classes in the pandas library 
  • Write a Python script that invokes SciPy machine learning 
  • Describe options for running Python code on a Hadoop cluster 
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script 
  • Write a Python script that invokes scikit-learn 
  • Use the k-nearest neighbor algorithm to predict values 
  • Run a machine learning algorithm on a distributed data set 
  • Describe use cases for Natural Language Processing (NLP) 
  • Perform sentence segmentation on a large body of text 
  • Perform part-of-speech tagging 
  • Use the Natural Language Toolkit (NLTK) 
  • Describe the components of a Spark application 
  • Write a Spark application in Python 
  • Run machine learning algorithms using Spark MLlib 
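To give a flavor of the k-nearest neighbor objective above, here is a minimal sketch in plain Python. It is not taken from the course materials: the function name `knn_predict` and the toy data are hypothetical, and it assumes classification by majority vote over Euclidean distance. The course itself uses scikit-learn for this.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest points.

    `train` is a list of (features, label) pairs, where features are
    equal-length tuples of numbers.
    """
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters.
training_data = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_predict(training_data, (1.1, 1.0)))  # prints "a"
print(knn_predict(training_data, (5.1, 5.0)))  # prints "b"
```

Sorting every training point is fine for a toy example; on larger data sets a library implementation with a spatial index is the more realistic choice.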

Hands-on Labs

  • Setting Up a Development Environment 
  • Using HDFS Commands 
  • Using Mahout for Machine Learning 
  • Getting Started with Pig 
  • Exploring Data with Pig 
  • Using the IPython Notebook 
  • Data Analysis with Python 
  • Interpolating Data Points 
  • Define a Pig UDF in Python 
  • Streaming Python with Pig 
  • K-Nearest Neighbor and K-Means Clustering 
  • Using NLTK for Natural Language Processing 
  • Classifying Text using Naive Bayes 
  • Spark Programming and Spark MLlib 
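As an illustration of the kind of technique behind the "Classifying Text using Naive Bayes" lab, the sketch below trains a tiny multinomial Naive Bayes classifier in plain Python. It is not the lab's code: the function names, the whitespace tokenization, the add-one smoothing, and the four-document corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(documents):
    """Count documents and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus.
corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch meeting on monday", "ham"),
]
model = train_naive_bayes(corpus)

print(classify("buy cheap pills", model))      # prints "spam"
print(classify("monday lunch agenda", model))  # prints "ham"
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.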
