Hello World! – An introduction to Hadoop with Hive and Pig

This Hadoop tutorial is from the Hortonworks Sandbox – a single-node Hadoop cluster running in a virtual machine. Download the Sandbox to run this and other tutorials in the series. The tutorials presented here are for Sandbox v2.0.

The tutorials are presented in sections as listed below.

Overview of Apache Hadoop and Hortonworks Data Platform

The Hortonworks Sandbox is a single-node implementation of the Hortonworks Data Platform (HDP). It is packaged as a virtual machine to make evaluation and experimentation with HDP fast and easy. The tutorials and features in the Sandbox are oriented towards exploring how HDP can help you solve your business's big data problems. The Sandbox tutorials will walk you through bringing some sample data into HDP and manipulating it using the tools built into HDP. The idea is to show you how you can get started and how to accomplish tasks in HDP. HDP is free to download and use in your enterprise, and you can download it from the Hortonworks Data Platform download page.

[Image: Yahoo data nodes]

The Apache Hadoop projects provide a series of tools designed to solve big data problems. A Hadoop cluster implements parallel computing on inexpensive commodity hardware. Data is partitioned across the many servers in the cluster to provide near-linear scalability. The philosophy of the cluster design is to bring the computing to the data: each datanode holds part of the overall data and is able to process the data that it holds. The overall framework for the processing software is called MapReduce. Here's a short video introduction to MapReduce: Introduction to MapReduce

[Image: Yahoo MapReduce]

Apache Hadoop can be useful across a range of use cases spanning virtually every vertical industry. It is becoming popular anywhere that you need to store, process, and analyze large volumes of data. Examples include digital marketing automation, fraud detection and prevention, social network and relationship analysis, predictive modeling for new drugs, retail in-store behavior analysis, and mobile device location-based marketing.


The Hadoop Distributed File System

In this section we are going to take a closer look at some of the components we will be using in the Sandbox tutorials. Underlying all of these components is the Hadoop Distributed File System (HDFS™). This is the foundation of the Hadoop cluster. HDFS manages how the datasets are stored in the Hadoop cluster. It is responsible for distributing the data across the datanodes, managing replication for redundancy, and handling administrative tasks such as adding, removing, and recovering datanodes.
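You can also work with HDFS directly from the Sandbox shell using the hadoop fs command. Here is a minimal sketch of the kind of interaction HDFS supports (the /user/hue directory is just an example location):

# List the contents of a directory in HDFS
hadoop fs -ls /user/hue

# Copy a local file into HDFS; the cluster distributes and replicates the blocks
hadoop fs -put NYSE-2000-2001.tsv.gz /user/hue/

# Show how much space the directory uses across the cluster
hadoop fs -du /user/hue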

Apache Hive™

[Screenshot: Hive UI overview]

The Apache Hive project provides a data warehouse view of the data in HDFS. Using a SQL-like language, Hive lets you create summarizations of your data, perform ad-hoc queries, and analyze large datasets in the Hadoop cluster. The overall approach with Hive is to project a table structure on the dataset and then manipulate it with HiveQL. Since you are using data in HDFS, your operations can be scaled across all the datanodes and you can manipulate huge datasets.
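To give a flavor of HiveQL, here is a hedged example of the kind of query you could run against the nyse_stocks table we create later in this tutorial (the column names come from that dataset):

-- Average closing price per symbol, restricted to heavily traded days
SELECT stock_symbol, AVG(stock_price_close)
FROM nyse_stocks
WHERE stock_volume > 1000000
GROUP BY stock_symbol;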

Apache HCatalog

[Screenshot: HCatalog UI overview]

The function of HCatalog is to hold location and metadata about the data in a Hadoop cluster. This allows scripts and MapReduce jobs to be decoupled from data location and metadata like the schema. Additionally, since HCatalog supports tools like Hive and Pig, the location and metadata can be shared between them. Using the open APIs of HCatalog, other tools like Teradata Aster can also use its location and metadata. In the tutorials we will see how we can reference data by name and inherit the location and metadata.
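For example, a Pig script can load a table by name through HCatalog instead of hard-coding the file path and schema itself. The sketch below contrasts the two approaches; the path in the second LOAD is illustrative, and the schema matches the NYSE dataset used in this tutorial:

-- With HCatalog: the location and schema are looked up by table name
a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();

-- Without HCatalog: every script has to repeat the path and schema itself
b = LOAD '/user/hue/NYSE-2000-2001.tsv.gz' USING PigStorage('\t')
    AS (exchange:chararray, stock_symbol:chararray, date:chararray,
        stock_price_open:double, stock_price_high:double, stock_price_low:double,
        stock_price_close:double, stock_volume:long, stock_price_adj_close:double);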

Apache Pig™

[Screenshot: Pig UI overview]

Pig is a language for expressing data analysis and infrastructure processes. Pig scripts are translated into a series of MapReduce jobs that are run on the Hadoop cluster. Pig is extensible through user-defined functions that can be written in Java and other languages. Pig scripts provide a high-level language to create the MapReduce jobs needed to process data in a Hadoop cluster.
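As a hedged illustration of that extensibility, a script can register a jar of user-defined functions and then call them like built-ins (myudfs.jar and the UpperCase function are hypothetical names used only for this sketch):

-- Register a jar of user-defined functions, then call one in a projection
REGISTER myudfs.jar;
a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = FOREACH a GENERATE myudfs.UpperCase(stock_symbol);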

That's all for now… let's get started with some examples of using these tools together to solve real problems!

Using HDP

Here we go! We're going to walk you through a series of step-by-step tutorials to get you up and running with the Hortonworks Data Platform(HDP).

Downloading Example Data

We'll need some example data for our lessons. For our first lesson, we'll be using stock ticker data from the New York Stock Exchange from the years 2000-2001. You can download this file here:

https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz

The file is about 11 megabytes, and may take a few minutes to download. Fortunately, to learn 'Big Data' you don't have to use a massive dataset. You need only use tools that scale to massive datasets. Click and save this file to your computer.
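If you are working from a shell instead of a browser, the same file can be fetched from the command line:

# Download the example dataset (about 11 MB)
wget https://s3.amazonaws.com/hw-sandbox/tutorial1/NYSE-2000-2001.tsv.gz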

Using the File Browser

You can reach the File Browser by clicking its icon:

[Screenshot: File Browser icon]

The File Browser interface should be familiar to you as it is similar to the file manager on a Windows PC or Mac. We begin in our home directory. This is where we'll store the results of our work. File Browser also lets us upload files.

Uploading a File

To upload the example data you just downloaded,

[Screenshot: File Upload]

  • Select the 'Upload' button
  • Select 'Files' and a pop-up window will appear.
  • Click the button which says, 'Upload a file'.
  • Locate the example data file you downloaded and select it.
  • A progress meter will appear. The upload may take a few moments.

When it is complete you'll see this:

[Screenshot: Uploaded stock data]

Now click the file name "NYSE-2000-2001.tsv.gz". You'll see it displayed in tabular form:

[Screenshot: View stock data]

You can use File Browser just like your own computer's file manager. Next register the dataset with HCatalog.

Loading the sample data into HCatalog

Now that we've uploaded a file to HDFS, we will register it with HCatalog to be able to access it in both Pig and Hive.

Select the HCatalog icon in the icon bar at the top of the page:

[Screenshot: HCatalog icon]

Select "Create a new table from file" from the Actions menu on the left.

[Screenshot: HCatalog Create Table]

Fill in the Table Name field with 'nyse_stocks'. Then click on the Choose a file button and select the file we just uploaded, 'NYSE-2000-2001.tsv.gz'.

[Screenshot: HCatalog Choose File]

You will now see the options for importing your file into a table. The File options should be fine. In the Table preview, set all text fields to Column Type 'string' and all decimal fields (for example, 12.55) to Column Type 'float'. The one exception is the 'stock_volume' field, which should be set to 'bigint'. When everything is complete, click on the "Create Table" button at the bottom.

[Screenshot: HCatalog Define Columns]
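Behind the scenes, this wizard produces a Hive/HCatalog table definition. A rough HiveQL sketch of an equivalent definition, using the dataset's column names and the tab-separated format of the file, would look something like this:

CREATE TABLE nyse_stocks (
  `exchange`             STRING,
  stock_symbol           STRING,
  `date`                 STRING,
  stock_price_open       FLOAT,
  stock_price_high       FLOAT,
  stock_price_low        FLOAT,
  stock_price_close      FLOAT,
  stock_volume           BIGINT,
  stock_price_adj_close  FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';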

A Short Apache Hive Tutorial

In the previous sections you:

  • Uploaded your data file into HDFS
  • Used Apache HCatalog to create a table

Apache Hive™ provides a data warehouse function to the Hadoop cluster. Through the use of HiveQL you can view your data as a table and create queries like you would in a database.

To make it easy to interact with Hive we use a tool in the Hortonworks Sandbox called Beeswax. Beeswax gives us an interactive interface to Hive. We can type in queries and have Hive evaluate them for us using a series of MapReduce jobs.

Let's open Beeswax. Click on the bee icon on the top bar.

[Screenshot: Beeswax]

On the right hand side there is a query window and an execute button. We will be typing our queries in the query window. When you are done with a query, click on the execute button. Note: the composition window is limited to a single query; you cannot type multiple queries separated by semicolons.

Since we created our table in HCatalog, Hive automatically knows about it. We can see the tables that Hive knows about by clicking on the Tables tab.

[Screenshot: Tables tab]

In the list of the tables you will see our table, nyse_stocks. Hive inherits the schema and location information from HCatalog. This separates meta information like schema and location from the queries. If we did not have HCatalog we would have to build the table by providing location and schema information.

We can see the records by typing Select * from nyse_stocks in the Query window. Our results would be:

[Screenshot: Data table]

We can see the columns in the table by executing describe nyse_stocks.

[Screenshot: NYSE]

We will then get a description of the nyse_stocks table.

[Screenshot: Describe NYSE table]

We can count the records with the query select count(*) from nyse_stocks. You can click on the Beeswax icon to get back to the query screen. Evaluate the expression by typing it in the query window and hitting execute.

[Screenshot: Select count]

This job takes longer and you can watch the job running in the log. When the job is complete you will see the results posted in the Results tab.

[Screenshot: Select count results]

You can select specific records by using a query like select * from nyse_stocks where stock_symbol="IBM".

[Screenshot: Select IBM]

This will return the records with IBM.

[Screenshot: Select IBM results]

So we have seen how we can use Apache Hive to easily query our data in HDFS using the Hive query language. We took full advantage of HCatalog, so we did not have to specify the schema or the location of the data. Apache Hive allows people who are knowledgeable in query languages like SQL to become immediately productive with Apache Hadoop. Once they know the schema of the data, they can quickly and easily formulate queries.

Pig Basics Tutorial

In this tutorial we will create and run Pig scripts. On the left of the Pig interface is a list of the scripts we have created, and in the middle is an area for composing our scripts. We will load the data from the table we have stored in HCatalog and then filter out the records for the stock symbol IBM. Once we have done that, we will calculate the average volume of IBM stock over this period.

The basic steps will be:

  • Step 1: Create and name the script
  • Step 2: Loading the data
  • Step 3: Select all records starting with IBM
  • Step 4: Iterate and average
  • Step 5: Save the script and execute it

Let's get started…

To get to the Pig interface click on the Pig icon on the icon bar at the top. This will bring up the Pig user interface. On the left is a list of your scripts and on the right is a composition box for your scripts.

A special feature of the interface is the Pig helper at the bottom. The Pig helper will provide us with templates for the statements, functions, I/O statements, HCatLoader() and Python user defined functions.

At the very bottom are status areas that will show the results of our script and log files.

[Screenshot: Pig UI]

Step 1: Create and name the script

  • Open the Pig interface by clicking the Pig icon at the top of the screen

  • Title your script by filling in the title box

Step 2: Loading the data

Our first line in the script will load the table. We are going to use HCatalog because this allows us to share schema across tools and users within our Hadoop environment. HCatalog allows us to factor out schema and location information from our queries and scripts and centralize them in a common repository. Since it is in HCatalog we can use the HCatLoader() function. Pig makes it easy by allowing us to give the table a name or alias and not have to worry about allocating space and defining the structure. We just have to worry about how we are processing the table.

  • On the right hand side we can start adding our code at Line 1
  • We can use the Pig helper at the bottom of the screen to give us a template for the line. Click on Pig helper -> HCatalog->load template
  • The entry %TABLE% is highlighted in red for us. Type the name of the table which is nyse_stocks.
  • Remember to add the a = before the template. This saves the results into a. Note that the = has to have a space before and after it.

Our completed line of code will look like:

a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();

[Screenshot: Line 1]

So now we have our table loaded into Pig and stored in "a".

Step 3: Select all records starting with IBM

The next step is to select a subset of the records so that we just have the records for the stock ticker IBM. To do this in Pig we use the FILTER operator. We tell Pig to filter our table and keep all records where stock_symbol == 'IBM', and store the result in b. With this one simple statement Pig will look at each record in the table and filter out all the ones that do not meet our criteria. The GROUP statement is also important: it normally groups records by one or more fields, but here we specify ALL so that every remaining record ends up in a single group, which lets us compute an aggregate over all of them in the next step.

  • We can use Pig Help again by clicking on Pig helper->Relational Operators->FILTER template
  • We can replace %VAR% with "a" (hint: tab jumps you to the next field)
  • Our %COND% is stock_symbol == 'IBM' (note: single quotes are needed around IBM, and don't forget the trailing semi-colon)
  • Pig helper -> Relational Operators->GROUP BY template
  • The first %VAR% is "b" and the second %VAR% is "all". You will need to correct an irregularity in the Pig syntax here: remove the "BY" in the line of code.
  • Again add the trailing semi-colon to the code.

So the final code will look like:

b = filter a by stock_symbol == 'IBM';
c = group b all;

[Screenshot: Line 3]

Now we have extracted all the records with IBM as the stock_symbol.

Step 4: Iterate and Average

Now that we have the right set of records we can iterate through them and compute the average. We use the FOREACH operator on the grouped data to iterate through all the records. The AVG() function computes the average of the stock_volume field. To wind it up we just print out the result, which will be a single floating point number. If the result were needed by a future job, we could save it back into a table instead.

  • Pig helper ->Relational Operators->FOREACH template will get us the code
  • Our first %VAR% is c and the second %VAR% is "AVG(b.stock_volume);"
  • We add the last line with Pig helper->I/O->DUMP template and replace %VAR% with "d".

Our last two lines of the script will look like:

d = foreach c generate AVG(b.stock_volume);
dump d;

[Screenshot: Line 5]

So the variable "d" will contain the average volume of IBM stock when this
line is executed.
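If the result were needed by a later job rather than just printed, it could be stored back through HCatalog instead of dumped. A sketch, assuming a target table (here called ibm_avg_volume, a hypothetical name) already exists in HCatalog:

-- Store the result into an existing HCatalog table instead of printing it
STORE d INTO 'ibm_avg_volume' USING org.apache.hcatalog.pig.HCatStorer();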

Step 5: Save the script and Execute it

We can save our completed script using the Save button at the bottom and then we can Execute it. This will create one or more MapReduce jobs, and after they run we will get our results. At the bottom there will be a progress bar that shows the job status.

  • At the bottom we click on the Save button again
  • Then we click on the Execute button to run the script
  • Below the Execute button is a progress bar that will show you how things are running.
  • When the job completes you will see the results in the green box.
  • Click on the Logs link to see what happened when your script ran. This is where you will see any error messages. The log may scroll below the edge of your window, so you may have to scroll down.

[Screenshot: Final results]

Summary

Now we have a complete script that computes the average volume of IBM stock. You can download the results by clicking on the green download icon above the green box.

[Screenshot: Answer]

If you look at what our script has done in just five lines, you can see that we:

  • Pulled in the data from our table using HCatalog. Because HCatalog provided the location and schema information, we would not have to rewrite our script if either changed in the future.
  • Pig then went through all the rows in the table and discarded the ones where the stock_symbol field is not IBM
  • Then the remaining records were collected into a single group with GROUP ... ALL so an aggregate could be computed over all of them
  • The average of stock_volume was calculated on the records

We did it with 5 lines of Pig script code!
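For reference, the complete script we built is:

a = LOAD 'default.nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = filter a by stock_symbol == 'IBM';
c = group b all;
d = foreach c generate AVG(b.stock_volume);
dump d;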

Feedback

We are eager to hear your feedback on this tutorial. Please let us know what you think.

Comments

Sharma Ragi (October 29, 2014 at 10:54 pm)

It is an awesome post to get started, i appreciate your efforts.. Thanks allot

Geetha (October 27, 2014 at 1:04 pm)

Great tutorial. Small suggestion, for PIGs basic tutorial, in step 3, we dont need to choose the relational operator -> Group by instead we can choose the relational operator -> GROUP %VAR% ALL so that we can avoid the fixing step.

Max (September 24, 2014 at 1:58 pm)

Looks like the tutorial is out of date. The Pig UI has changed quite a bit.

Matt Tucker (September 19, 2014 at 3:13 pm)

Please update the nyse_stocks table to use “exchange_name” (or similar) instead of “exchange”, as it causes a parsing exception when doing “SELECT exchange FROM nyse_stocks”.

Ras (September 8, 2014 at 4:47 pm)

I’m trying to find files using ssh console connection. I’d like to know the path of files created by hue using browser. I cannot find them (I’m accessing VM as root).

Saurabh Agrawal (September 5, 2014 at 8:20 am)

Unable to interpret

c = GROUP b ALL;

Does this create indexes as per PIG’s syntax?

Eugine (August 27, 2014 at 1:13 am)

This tutorial was great. Not too technical but just enough for myself to understand what Hadoop does. Can’t wait to dive myself into more examples and playing around. Thanks!

August 23, 2014 at 4:21 pm

Here’s how I fixed this one. You need to copy the jar file into the hive/lib directory. In the Sandbox vm in root type:

cp /usr/lib/hadoop/client/slf4j-api-1.7.5.jar /usr/lib/hive/lib

You should now be able to execute your pig script without this error

Sadashiv Dhulashetti (August 20, 2014 at 3:28 am)

Is Hive and Pig is configured with Hortonworks+Sandbox+1.2+1-21-2012-1+vmware.ova? OR do I need to download and configure seperately ?

    September 4, 2014 at 2:34 pm

    Hive and Pig including many other components are pre-configured. You can start using these on the Sandbox as soon as you login.

shan (August 18, 2014 at 6:39 am)

Nice tutorial – Hurrah!!! I did my hello world in Hadoop.

drussell (August 14, 2014 at 3:39 am)

That message doesn’t represent a problem, you get it even on a successful run.

Jagatheesh (August 5, 2014 at 3:18 pm)

Try This.

a = LOAD ‘default.nyse_stocks’ USING org.apache.hcatalog.pig.HCatLoader();
b = filter a by stock_symbol == ‘IBM';
c = group b all;
d = foreach c generate AVG(b.stock_volume);
dump d;

Subhash Gurav (July 26, 2014 at 1:31 pm)

Excellent Tutorial for novice users. Appreciated the efforts behind.

Kemei Lan (July 19, 2014 at 8:31 pm)

Nice simple tutorial! Went through all the setup and scripts smoothly. Thank you!

RAR (July 16, 2014 at 1:19 pm)

Thanks very much – this took me from knowing nothing to being able to do something useful in a few minutes!

Satya (July 15, 2014 at 9:36 pm)

Nice tutorial. Setup went thru without any issues. Gone thru these scripts without an error.

July 14, 2014 at 7:30 am

Beautifully explained, easy to understand Hadoop file system & PIG

Thank you so much

Mike H. (July 1, 2014 at 9:56 am)

When I run the tutorial, the query history says the job succeeded, but there is no output. The logs have the following msg:

ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory

Richard Magahiz (July 1, 2014 at 9:10 am)

You have to set the -useHCatalog parameter and hit return before running the pig example. See http://hortonworks.com/community/forums/topic/sandbox-pig-basic-tutorial-example-is-nbot-working/

Mike H. (June 27, 2014 at 10:55 am)

Ran your example script for pig. Got no output. Kept cutting down number of lines hoping I’d get something. Got 0 bytes output. Got this in one of the logs.

ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory

Mike H. (June 27, 2014 at 10:53 am)

Tried your example pig script. Job said it ran ok; but output file has 0 bytes. Doesn’t matter whether I do execute or check syntax.
Got the following error in the logs:
ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory

ravikanth (June 27, 2014 at 1:36 am)

while creating a table ! i am getting an error
HCatClient error on create table: {“statement”:”use default; create table nyse_stocks(`exchange` string, `stock_symbol` string, `date` string, `stock_price_open` double, `stock_price_high` double, `stock_price_low` double, `stock_price_close` double, `stock_volume` bigint, `stock_price_adj_close` double) row format delimited fields terminated by ‘\\t';”,”error”:”unable to create table: nyse_stocks”,”exec”:{“stdout”:””,”stderr”:”which: no /usr/lib/hadoop/bin/hadoop in ((null))\ndirname: missing operand\nTry `dirname –help’ for more information.\n Command was terminated due to timeout(60000ms). See templeton.exec.timeout property”,”exitcode”:143}} (error 500)
how should i get rid of it ?

Mike H. (June 26, 2014 at 8:54 am)

The tutorial was good, but setting up sandbox and using it proved very difficult. Executing the example in pig has been fruitless. Got thru all the previous tutorial steps, but when it came to pig, I got an empty file for output. Dropped back to just two lines of code: “a = Load…();” and “dump a;”. Still got nothing. Am running on a 4gb machine giving VirtualBox 2gb. All actions are horribly slow.

vijay kokkula (June 26, 2014 at 3:50 am)

very very helpful……

Vibhor (June 25, 2014 at 12:43 am)

Great ! for beginners..
Would you like to cover Oozie and Job Designer as well.

June 17, 2014 at 8:32 pm

The tutorial is easy enough. I encourage everyone interested in Hadoop to follow it. I will do the rest of the tutorials…

Omer (June 12, 2014 at 1:09 am)

Had a lot of fun going through this tutorial. Thanks guys!
I did have to change the ‘hive.security.authorization.enabled’ property in ‘/etc/hive/conf/hive-site.xml’ to false to be able to successfully create the table.

Nandini (May 29, 2014 at 6:06 am)

Its very useful tutorial for beginners. Great Job!!!

satish (May 27, 2014 at 1:25 pm)

Very nice and clear details. Thank you so much.

Raveendra (May 26, 2014 at 4:33 pm)

Very good tutorial to get head start. Excellent job.

Niamul (May 21, 2014 at 1:18 am)

I am a computer science minor and pretty much into UNIX systems. I was randomly trying to play with hadoop for last several days. Now I am in love with it. Thanks for the easy explanations.

Jeff LaMarre (May 20, 2014 at 6:51 am)

I enjoyed this very much. Just enough information was provided to keep it simple and clear – a good point of departure to learning more!

Wayne (May 18, 2014 at 4:35 pm)

I’m trying to run the pig script example, but I keep getting the following error “ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory”. It’s true that there is no such jar on that path.

Any idea how to fix it? Cheers!

Prashant Nilayam (May 13, 2014 at 10:59 am)

when i run pig script am getting error :
ls: cannot access /usr/lib/hive/lib/slf4j-api-*.jar: No such file or directory
2014-05-13 10:53:41,551 [main] INFO org.apache.pig.Main – Apache Pig version 0.12.1.2.1.1.0-385 (rexported) compiled Apr 16 2014, 15:59:00
2014-05-13 10:53:41,552 [main] INFO org.apache.pig.Main – Logging error messages to: /hadoop/yarn/local/usercache/hue/appcache/application_1399920941808_0008/container_1399920941808_0008_01_000002/pig_1400003621548.log
ERROR:-c (-check) option is only valid when executing pig with a pig script file)
any idea how to fix it?

    Sagar Allamdas (July 23, 2014 at 10:46 am)

    Just go to Sandbox shell….
    write…
    pig -useHcatalog
    —insert ur pig script here—-

    it works fast and great

Goun Na (May 3, 2014 at 6:42 pm)

I am using Hortonworks Sandbox 2.1

Regarding Hue, I think it’s better to include the url, http://localhost:8000/ in the tutorial. As a biginner who just started Hadoop there is a no way to figure out where the screen captures come from.

    Long Nguyen Vu (June 29, 2014 at 3:23 am)

    You’re right.
    But actually it’s not really localhost:8000 but http://sandbox_ip_address:8000 for those who install on other machine (for example: using VMware or VirtualBox)

April 28, 2014 at 11:33 am

Fun and painless!

Har Puri (April 22, 2014 at 2:22 pm)

Thank you for this concise distribution and clear tutorial !
The only change is that the “sandbox” user is now (version 2.0) changed to “hue”
I ran this on KVM with 2Gb RAM and it worked great. A perfect hello world.

Diego Pajarito (April 20, 2014 at 8:20 pm)

It was very simple for understanding and just the beginning of tones of Big Data jobs I wanna do.

For me is a little bit difficult not to think in SQL and the alternative solution in the relational world, let’s see what happens on my next tutorial.

Shakirahamd (April 17, 2014 at 2:54 am)

Very very good post and it is very helpful for beginners,very good explanation of Pig script.
Thanks

April 4, 2014 at 5:54 am

This is exactly what I needed. I had read lots of High Level stuff about how hadoop works etc. What I really needed was to just get started. Sandbox provides the opportunity . Looking forward to rest of the tutorials.

Anusha Vysyaraju (April 1, 2014 at 7:34 am)

Simply superb :) This post is quite understandable for beginners :)

Lifna C.S (March 31, 2014 at 6:56 am)

Really helpful tutorial for beginners……….
Thank You very much….

March 27, 2014 at 5:08 pm

As a new learner to big data, and not having much SQL experience at all, outside of VERY SIMPLE teradata scripting, this is pretty helpful.

Parthiban (March 26, 2014 at 7:26 am)

very gud article for the beginners

Ross Updegraff (March 25, 2014 at 1:01 pm)

Great tutorial. It was easy to follow and helpful!

chandramouli (March 8, 2014 at 8:01 pm)

Cool. Entered BIG DATA in 3 hours.

Richard Clapp (February 26, 2014 at 11:15 am)

Need to add how to connect information to start the Sandbox tutorials. Many people have many problems getting to the screens you show in the tutorials. If this is aimed towards a beginner, please add some basic setup info for the beginner, such as Putty setup and what website adr to use. Right now, many people are giving up before they even get started.

    Cheryle Custer (February 26, 2014 at 12:28 pm)

    Hi Richard,

    You don’t need to set up Putty to work through the first 7 or 8 tutorials. The website address is given on the VM start up screen — depending on Virtualization tool, is likely: 127.0.0.1:8888 (see slide 8 of Slideshare). Please review this slide share for help:

    http://www.slideshare.net/hortonworks/sandbox-startup

      Max Mir (September 23, 2014 at 11:24 pm)

      Thank you for pointing that out. I am fairly technical, but I was stuck too – I had already logged into the sandbox and missed the instructions. It would really be helpful to point that out in the tutorial rather than have to search through the comments.

Student (February 22, 2014 at 10:19 pm)

This is awesome :)
Reminded me Below Albert Einstein statement :

“If you can’t explain it simply, you don’t understand it well enough”

Keep up the good work :)

Aninda (February 18, 2014 at 9:23 pm)

Excellent article for a beginner in hadoop and related tool/ technology stacks. Keep it up!

Raman (February 13, 2014 at 3:38 am)

Wonderful job.It took hardly no time for me to setup sandbox and run this samples.

Paul Caballero (February 12, 2014 at 12:38 am)

Easy to follow instructions,

arvin levine (February 9, 2014 at 8:38 pm)

I did the tutorial and enjoyed it. but how would i generalize the PIG example to return all stock_volume averages? I tried a few thoughts based on the tutorial and they all failed syntax:

a = LOAD 'nyse_stocks' USING org.apache.hcatalog.pig.HCatLoader();
b = group a BY stock_symbol;
c = group b all;
d = foreach c generate stock_symbol, AVG(c.stock_volume);
dump d;

and I don't know how to troubleshoot this stuff yet (I looked at the log, but it just tells me there is an error: Invalid scalar projection: c
thnks.

    Amir (March 23, 2014 at 3:52 pm)

    Hi Arvin,

    Not sure if you’re still looking for a solution to the question you posted on Feb 9 but I think I have the solution…

    If I understand correctly you want to calculate the average stock volume for each stock symbol, right?

    This piece of code seems to work for me:
    a = load ‘nyse_stock’ using org.apache.hcatalog.pig.HCatLoader();
    b = group a by stock_symbol;
    c = foreach b generate group, AVG(a.stock_volume);
    dump c;

    This instruction is not needed:
    c = group b all;

    Because it will create a relation with only 1 tuple in it, grouping all stocks into 1 massive row which is not what you want:
    (all, {(IBM, …), (APPLE, …)})

    If you stop at “b = group a by stock_symbol;” then you have a relation with 1 tuple per stock symbol, which enables you to calculate an average for each stock:
    (IBM, { (IBM, …) })
    (APPLE, { (APPLE, …) })

    I’m new to this and have only completed the first tutorial, so maybe I am not explaining clearly but hopefully this will help.

    Regards,

Varun (February 6, 2014 at 10:08 pm)

Good Post ….

SivaPrakash (January 28, 2014 at 2:39 pm)

Very clearly explained.
installed on windows 7 , oracle VM and sandbox version 2.

Mohamed Daif (January 26, 2014 at 2:23 pm)

I believe the right description for this tutorial would be sexy :D.
I always admire the ability to write technical information in an elegant and easy way.
Keep up the great work :)

Nag (January 21, 2014 at 7:33 pm)

Cool Tutorial, was able to understand HDFS/HDP,Hcatalog,beeswax & Pig. Thanks.

Fizal (January 21, 2014 at 2:50 pm)

This is an excellent tutorial for beginners. Looking forward for more advanced stuff.

venkat (January 20, 2014 at 1:52 pm)

Its a great post !!

Highly impressive.

December 24, 2013 at 12:15 pm

Just an FYI as I’m going through the tutorials now…the “sandbox” user doesn’t exist on the Virtual Box. The following is the listing for home/user

ambari-qa ambari-qa hdfs drwxrwx— October 20, 2013 06:07 pm
guest guest guest drwxr-xr-x October 28, 2013 08:34 am
hcat hcat hdfs drwxr-xr-x October 20, 2013 03:12 pm
hive hive hdfs drwx—— October 20, 2013 03:12 pm
hue hue hue drwxr-xr-x December 24, 2013 11:10 am
oozie

Varghese Daniel (December 19, 2013 at 5:50 pm)

Very helpful for a beginner to practice pig commands and understand the process methodology.

Ted Kahn (December 17, 2013 at 12:52 pm)

Very good. A few UI inconsistencies between the tutorial and actual program, due I guess to the program changing. I would have appreciated a bit more explanation about the various Pig commands. Particularly, the GROUP cmd.

