Word Counting with Apache Pig

Community Tutorial

This tutorial is from the Community section of the tutorials for Hortonworks Sandbox, a single-node Hadoop cluster running in a virtual machine. Download the Sandbox to run this and other tutorials in the series.
This community tutorial was submitted by flacrosse, with source available on GitHub. Feel free to contribute edits or your own tutorial and help the community learn Hadoop.


This tutorial describes how to use Pig with the Hortonworks Sandbox to do a word count of an imported text file.

Create a text file with data

This can be anything, but I ended up dumping some textual data I had in SQL into a text file. It’s definitely a little more interesting if you can work with data you know or at least have an interest in.

Import the file into the Sandbox

Go to the File Browser tab and upload the .txt file. Take note of the default location it uploads to (/user/hue).

Write a Pig script to parse the data and dump to a file

I put this code together from snippets I found on the web. The key thing is to make sure your load statement references the location where your file lives and that you specify an output location for the results. Note: I didn’t create the /pig_wordcount folder before I ran this; the script created it, which was a handy feature. Just hit Execute and sit back. You can check the run status on the Query History tab.

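-- Load each line of the uploaded file (adjust the path to where your file lives)
a = load '/user/hue/word_count_text.txt';
-- Split each line into words, one word per record
b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
-- Group identical words together
c = group b by word;
-- Count the occurrences of each word
d = foreach c generate COUNT(b), group;
-- Write (count, word) pairs to the output location
store d into '/user/hue/pig_wordcount';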
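
As a variation on the script above, here is a minimal sketch that uses named relations and fields and an explicit output delimiter, which can make the later table-creation step easier to follow. The relation names (lines, words, grouped, counts), the field name total, and the output path /user/hue/pig_wordcount_named are illustrative assumptions, not part of the original script. Note that Pig's default loader splits on tabs, so, like $0 in the original, line only holds the text up to the first tab on each input line.

```pig
-- Sketch only: relation names and the output path are illustrative assumptions.
-- Load each line as a single chararray field (the default loader splits on tabs).
lines = LOAD '/user/hue/word_count_text.txt' AS (line:chararray);

-- TOKENIZE splits a line into a bag of words; FLATTEN emits one record per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Collect identical words into one group each.
grouped = GROUP words BY word;

-- Emit (count, word), matching the column order of the original script.
counts = FOREACH grouped GENERATE COUNT(words) AS total, group AS word;

-- Write tab-delimited output; the target directory must not already exist.
STORE counts INTO '/user/hue/pig_wordcount_named' USING PigStorage('\t');
```

Because STORE refuses to overwrite an existing location, rerunning either script requires a new output path or removing the old output directory first.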

Use HCatalog to load the file to a “table”

Being a SQL developer by day, I wanted to query the results in a familiar way, so I decided to create a table using HCatalog to make the data easily accessible through Hive. I went into the HCatalog tab, chose the file from the folder I specified, named the table and columns, and hit Create Table. It churned for a while but eventually completed.

Use Hive to query and sort the data for final output

Finally, I went into the Hive tab and wrote a quick query to return and organize the results. Once it completed, I downloaded the results and put them in Excel so I could print and frame them.
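
The Hive query itself is not shown in this tutorial. For readers who would rather stay in Pig, the sorting step could be sketched roughly as follows. The field names total and word and the cutoff of 20 rows are assumptions for illustration, and the schema assumes the (count, word) column order written by the word count script earlier.

```pig
-- Sketch only: field names and the row limit are illustrative assumptions.
-- Read the tab-delimited (count, word) pairs written by the word count script.
counts = LOAD '/user/hue/pig_wordcount' AS (total:long, word:chararray);

-- Sort so the most frequent words come first.
sorted = ORDER counts BY total DESC;

-- Keep only the top 20 words.
top_words = LIMIT sorted 20;

-- Print the result to the console instead of storing it.
DUMP top_words;
```

Replacing DUMP with a STORE statement would write the sorted output to a new directory instead of printing it.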


April 27, 2014 at 7:49 pm

The Pig script executed successfully, but in the HCatalog step, creating the table fails every time with the error
“delimiter preview error : line contains null byte”. Please help me work out what causes this error and what I can do about it.

Jules S. Damji
August 11, 2014 at 4:41 pm

For this tutorial, you need a text file. But generally, log files are text files in which each line can be tokenized. PDF files can first be converted to text and then loaded for analysis.

August 31, 2015 at 12:26 pm

I don’t see this screen on the Ambari/Pig page. We have installed the Hortonworks appliance and I don’t see most of the screens that are in these tutorials. Please help.
We wanted to do a demo.

    November 3, 2015 at 11:45 pm

    The interface used in this example is HUE. You can also do it in Ambari/Pig and Ambari/Hive.

November 11, 2015 at 9:06 am

What if you want to load multiple text files, located in various levels of directories? Example, files may be located:
I’m new to Pig, but would it be something along the lines of this?

a = load 'filepath/folder1'; -- How does this load? simply as a list of text files or something?
b = foreach a generate flatten($0) as textfile;
c = foreach textfile generate flatten(TOKENIZE((chararray)$0)) as word;
d = group c by word;
e = foreach d generate COUNT(c), group;
store e into 'otherfilepath';

    November 11, 2015 at 9:11 am

    Oops. That last line should be
    dump a
