HDP Analyst: Data Science

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers tools and programming languages (Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn), the Natural Language Toolkit (NLTK), and Spark MLlib.

Duration

3 days

Prerequisites

Students must have experience with at least one programming or scripting language, knowledge in statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.

Target Audience

Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.

Format

  • 50% Lecture/Discussion
  • 50% Hands-on Labs

Course Objectives

At the completion of the course, students will be able to:

  • Recognize use cases for data science
  • Describe the architecture of Hadoop and YARN
  • Describe the differences between supervised and unsupervised learning
  • List the six machine learning tasks
  • Use Mahout to run a machine learning algorithm on Hadoop
  • Use Pig to transform and prepare data on Hadoop
  • Write a Python script
  • Use NumPy to analyze big data
  • Use the data structure classes in the pandas library
  • Write a Python script that invokes SciPy machine learning
  • Describe options for running Python code on a Hadoop cluster
  • Write a Pig User-Defined Function in Python
  • Use Pig streaming on Hadoop with a Python script
  • Write a Python script that invokes scikit-learn
  • Use the k-nearest neighbor algorithm to predict values
  • Run a machine learning algorithm on a distributed data set
  • Describe use cases for Natural Language Processing (NLP)
  • Perform sentence segmentation on a large body of text
  • Perform part-of-speech tagging
  • Use the Natural Language Toolkit (NLTK)
  • Describe the components of a Spark application
  • Write a Spark application in Python
  • Run machine learning algorithms using Spark MLlib

Hands-on Labs

  • Setting Up a Development Environment
  • Using HDFS Commands
  • Using Mahout for Machine Learning
  • Getting Started with Pig
  • Exploring Data with Pig
  • Using the IPython Notebook
  • Data Analysis with Python
  • Interpolating Data Points
  • Defining a Pig UDF in Python
  • Streaming Python with Pig
  • K-Nearest Neighbor and K-Means Clustering
  • Using NLTK for Natural Language Processing
  • Classifying Text Using Naive Bayes
  • Spark Programming and Spark MLlib
