The Hortonworks Blog

Posts categorized by: Pig

Last week, Apache Tez graduated to become a top-level project within the Apache Software Foundation (ASF). This is a major step forward for the project and reflects the momentum built by a broad community of developers, not only from Hortonworks but also from Cloudera, Facebook, LinkedIn, Microsoft, NASA JPL, Twitter, and Yahoo.

What is Apache Tez and why is it useful?

Apache™ Tez is an extensible framework for building YARN-based, high-performance batch and interactive data processing applications in Hadoop that need to handle TB- to PB-scale datasets.…

The Apache Pig community released Pig 0.13 earlier this month. Pig uses a simple scripting language to perform complex transformations on data stored in Apache Hadoop. The Pig community has been working diligently to prepare Pig to take advantage of the DAG processing capabilities in Apache Tez. We also improved usability and performance.

This blog post summarizes the progress we’ve made.

Support for Backends Other Than MapReduce

We made the Pig 0.13 architecture more general to support multiple backends beyond just MapReduce, while maintaining backward compatibility.…

The term BoF session was first used at the Digital Equipment Computer Users’ Society (DECUS) conference in the 1960s. Its essence was to bring together like minds and thought leaders, just as birds of a feather flock together, to share and exchange computing ideas in an informal yet spirited way. Since then, the organizers and sponsors of most computing conferences have been loyal to its essence and spirit.

For ideas and innovation happen in collaboration, not in isolation.…

The Apache Tez team is proud to announce the first release of Apache Tez – version 0.2.0-incubating.

Apache Tez is an application framework, built atop Apache Hadoop YARN, that allows data to be processed as a complex directed acyclic graph (DAG) of tasks. You can learn much more from our Tez blog series tracked here.

Since entering the Apache Incubator project in late February of 2013, there have been over 400 tickets resolved, culminating in this significant release.…

The last couple of weeks have been a period of intense activity around the Apache projects that comprise the Hadoop ecosystem. While most of the headlines were accorded to Apache Hadoop 2 going GA, it would be remiss not to pay attention to the great progress being made in the Apache projects that complement Hadoop.

We have blogged about these over the course of the past week and the list below provides a quick summary of the phenomenal work contributed in the open by the folks driving these diverse and vital communities.…

Today we are proud to announce the general availability of Apache Pig 0.12!

If you are a Pig user and you’ve been yearning for additional languages, more data validation tools, and more expressions, operators and data types, then read on. Version 0.12 includes all of those additions, and Pig now runs on Windows without Cygwin.

This was a great team effort over the past six months with over 30 engineers from Twitter, Yahoo, LinkedIn, Netflix, Microsoft, IBM, Salesforce, Mortar Data, Cloudera and several others (including Hortonworks of course).…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Alan Gates, Hortonworks Co-Founder and Apache Pig Committer, discusses using Apache Pig for efficiently managing MapReduce workloads. Pig is ideal for transforming data in Hadoop: joining it, grouping it, sorting it and filtering it.

Alan explains how Pig takes scripts written in a language called Pig Latin and translates those into MapReduce jobs.

Listen to Alan describe the future of Pig in Hadoop 2.0.…

Cat Miller is an engineer at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

Introduction

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.…

Apache Pig version 0.11 was released last week. An Apache Pig blog post summarized the release. New features include:

  • A DateTime datatype, documentation here.
  • A RANK function, documentation here.
  • A CUBE operator, documentation here.
  • Groovy UDFs, documentation here.

And many improvements. Oink it up for Pig 0.11! Hortonworks’ Daniel Dai gave a talk on Pig 0.11 at Strata NY; check it out:…

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 resolved JIRA issues, with 61 bug fixes, 8 improvements, and 2 new features.

Most of the bug fixes went against the REST server, replication, region assignment, secure client, flaky unit tests, 0.92 compatibility and various stability improvements. Some of the interesting patches in this release are:

  • [HBASE-3996] – Support multiple tables and scanners as input to the mapper in map/reduce jobs
  • [HBASE-5416] – Improve performance of scans with some kind of filters.…

Pig can easily stuff Redis full of data. To do so, we’ll need to convert our data to JSON. We’ve previously talked about pig-to-json in JSONize anything in Pig with ToJson. Once we convert our data to JSON, we can use the pig-redis project to load Redis.

Build the pig-to-json project:

git clone git@github.com:rjurney/pig-to-json.git
cd pig-to-json
ant

Then run our Pig code:

/* Load Avro jars and define shortcut */
register /me/Software/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/Software/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
register /me/Software/pig-to-json/dist/lib/pig-to-json.jar
register /me/Software/pig-redis/dist/pig-redis.jar
-- Enron emails are available at https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro
emails = load '/me/Data/enron.avro' using AvroStorage();
json_test = foreach emails generate message_id, com.hortonworks.pig.udf.ToJson(tos) as bag_json;
store json_test into 'dummy-name' using com.hackdiary.pig.RedisStorer('kv', 'localhost');

Now run our Flask web server:

python server.py

Code for this post is available here: https://github.com/rjurney/enron-pig-tojson-redis-node.…

According to the Transaction Processing Performance Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.…

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.

It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change.…

Go from Zero to Big Data in 15 Minutes!

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with the fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self-contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop.…

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Some of the notable changes include:

  • Source code-only distribution

In the download section for Pig 0.10.1, you will now find a source-only tarball (pig-0.10.1-src.tar.gz) alongside the traditional full tarball, rpm and deb distributions.…
