HDP Analyst: Data Science

Overview

This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. It covers the following tools and programming languages: Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.

Duration

3 days

Format

50% Lecture/Discussion
50% Hands-on Labs

Prerequisites

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP Overview: Apache Hadoop Essentials course.


Target Audience


Architects, software developers, analysts, and data scientists who need to apply data science and machine learning on Hadoop.


Course Schedule

Hortonworks University provides immersive, real-world experience through scenario-based training courses. Classes are available in the classroom or online, from anywhere in the world.

Course Objectives

At the completion of the course, students will be able to:

Recognize use cases for data science
Describe the architecture of Hadoop and YARN
Describe supervised and unsupervised learning differences
List the six machine learning tasks
Use Mahout to run a machine learning algorithm on Hadoop
Describe the data science life cycle
Use Pig to transform and prepare data on Hadoop
Write a Python script
Use NumPy to analyze big data
Use the data structure classes in the pandas library
Write a Python script that invokes SciPy machine learning
Describe options for running Python code on a Hadoop cluster
Write a Pig User-Defined Function in Python
Use Pig streaming on Hadoop with a Python script
Write a Python script that invokes scikit-learn
Use the k-nearest neighbor algorithm to predict values
Run a machine learning algorithm on a distributed data set
Describe use cases for Natural Language Processing (NLP)
Perform sentence segmentation on a large body of text
Perform part-of-speech tagging
Use the Natural Language Toolkit (NLTK)
Describe the components of a Spark application
Write a Spark application in Python
Run machine learning algorithms using Spark MLlib
Take data science into production

Lab Content

Setting Up a Development Environment
Using HDFS Commands
Using Mahout for Machine Learning
Getting Started with Pig
Exploring Data with Pig
Using the IPython Notebook
Data Analysis with Python
Interpolating Data Points
Define a Pig UDF in Python
Streaming Python with Pig
K-Nearest Neighbor and K-Means Clustering
Using NLTK for Natural Language Processing
Classifying Text using Naive Bayes
Spark Programming and Spark MLlib

Certification

The demand for big data skills is increasing every day. Hortonworks offers a comprehensive certification program to help establish your credentials. Get trained, get certified, get hired!

Hortonworks University

Hortonworks University is your expert source for Apache Hadoop training and certification. Public and private on-site courses are available for IT professionals involved in implementing big data solutions.