Hortonworks on Apache Hadoop


Pig Performance and Optimization Analysis

Introduction

In this post, Hortonworks Intern Jie Li talks about his work this summer on performance analysis and optimization of Apache Pig. Jie is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.

Pig Performance Analysis and Optimization

I am proud that I was among the first several interns at Hortonworks, one of the leaders in the Hadoop community. In this post, I want to summarize my project on Pig performance and also share my experience this summer.

I began working on Pig one year ago, when my classmates in CPS216 and I developed the TPC-H benchmark for Pig in order to compare the performance of Pig and Hive. TPC-H (specified here) consists of a set of complex queries and is a well-known benchmark for traditional data warehouses.…

Read More

Apache Hadoop YARN – ResourceManager

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager

Apache Hadoop YARN – ResourceManager

As previously described, ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers (NMs) and the per-application ApplicationMasters (AMs).

  1. NodeManagers take instructions from the ResourceManager and manage resources available on a single node.
  2. ApplicationMasters are responsible for negotiating resources with the ResourceManager and for working with the NodeManagers to start the containers.
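The division of labor described above can be illustrated with a toy model. This is a minimal sketch and not YARN's actual API: the class names mirror the roles in the text, but every method, field, and the first-fit placement rule are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent: manages the resources of a single node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def start_container(self, app_id, mb):
        # Containers are started at the request of an ApplicationMaster.
        self.containers.append((app_id, mb))


class ResourceManager:
    """Toy master: arbitrates memory across all nodes (first-fit, an assumption)."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}

    def allocate(self, app_id, mb):
        # Grant the request on the first node with enough free memory.
        for node in self.nodes.values():
            if node.free_mb >= mb:
                node.free_mb -= mb
                return node.name
        return None  # nothing available right now


class ApplicationMaster:
    """Toy per-application agent: negotiates with the RM, then works with the NM."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_task(self, mb):
        node_name = self.rm.allocate(self.app_id, mb)
        if node_name is None:
            return None
        self.rm.nodes[node_name].start_container(self.app_id, mb)
        return node_name


nodes = [NodeManager("n1", 1024), NodeManager("n2", 2048)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app-1", rm)
placements = [am.run_task(1536), am.run_task(1024), am.run_task(1024)]
```

In this sketch the third request is refused because neither node has 1024 MB left; a real ApplicationMaster would keep the request outstanding rather than give up.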

ResourceManager Components

The ResourceManager has the following components (see the figure above):

  1. Components interfacing RM to the clients:
    • ClientService: The client interface to the ResourceManager. This component handles all the RPC interfaces to the RM from the clients, including operations like application submission, application termination, and obtaining queue information, cluster statistics, etc.

Read More

Recap of the August Pig Hackathon at Hortonworks

The August Pig Hackathon brought Pig users from Hortonworks, Yahoo, Cloudera, Visa, Kaiser Permanente, and LinkedIn to Hortonworks HQ in Sunnyvale, CA to talk and work on Apache Pig.

Jonathan Coveney and Bill Graham from Twitter walked newer Pig users through how Pig translates a Pig Latin script to MapReduce jobs and went over how to read the output of explain. For those interested, Hortonworks founder Alan Gates covers this in Chapter 1 of Programming Pig.

Thejas Nair walked through how to contribute patches to Pig and how to work with committers to get the patches in. You can learn more about this on the Pig Wiki.

The group talked through the proposal for a new EvalFunc interface that would make it much easier to write user-defined functions (UDFs) for Pig. Part of what makes Pig so powerful is its extensibility, and making that even easier would make Pig a better tool.…

Read More

HA Namenode for HDFS with Hadoop 1.0 – Part 1

Introduction

A Highly Available NameNode for HDFS has been in development since last year. That effort focused solely on the automatic failover of the NameNode for Hadoop 2.0. During that time we realized two things.

First, we realized we should use an outside-in approach to the HA problem: start by designing the availability of the Hadoop system as a whole and then focus on the high availability of individual components. That work led to the Full Stack HA Architecture.

Second, we realized that we can build an HA NameNode for Hadoop 1.0 using industry-proven solutions such as Linux HA and vSphere. This is important because HDFS in Hadoop 1 has been proven to be stable and reliable, while HDFS in Hadoop 2 is just beginning beta testing. This blog describes some technical details of HDFS NameNode HA in Hadoop 1.…

Read More

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra

Series Introduction

Apache Pig is a dataflow-oriented scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-jruby-sinatra-hbase-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local.…

Read More

Hadoop: Your Partner in Crime

Pre-crime? Pretty close…

If you have seen the futuristic movie Minority Report, you most likely have an idea of how many factors and decisions go into crime prevention. Yes, Pre-crime is an aspect of the future but even today it is clear that many social, economic, psychological, racial, and geographical circumstances must be thoroughly considered in order to make crime prediction even partially possible and accurate. The predictive analytics made possible with Apache Hadoop can significantly benefit this area of government security.

The essence of crime prevention is to understand and narrow down thousands of “what if” cases to a manageable and plausible handful of scenarios. Crime can happen anywhere and can be categorized as anything from cyber fraud to kidnapping, which provides a lot of combinations for possible misdemeanors or felonies. With the help of big data analytics, government agencies can zero in on certain areas, demographics, and age groups to pick out specific types of crimes and move towards decreasing the one-trillion-dollar annual cost of crime in the United States.…

Read More

UC Irvine Medical Center: Improving Quality of Care with Apache Hadoop

This is the first part of a series written by Charles Boicey from the UC Irvine Medical Center.  The series will demonstrate a real case study for Apache Hadoop in healthcare and also journal the architecture and technical considerations presented during implementation.

The Hadoop strategy at UC Irvine Medical Center started with a single observation in early 2011. While using Twitter, Facebook, LinkedIn and Yahoo, we came to the conclusion that healthcare data, although domain specific, is structurally not much different from a tweet, Facebook posting or LinkedIn profile, and that the environment powering these applications should be able to do the same with healthcare data.

In healthcare, data shares many of the same qualities as that found in the large web properties. Each has a seemingly infinite volume of data to ingest, in all types and formats: structured, unstructured, video and audio.…

Read More

Apache Hadoop, the Energy Softgrid and my Imaginary Tesla

This week, I enjoyed speaking at the Softgrid 2012 conference in San Francisco. It was a great collection of speakers and attendees, and it opened my eyes to some Hadoop-driven possibilities that will not only differentiate utility companies but also transform our day-to-day lives.

The conference focused on software (in this case intelligent analytics) as a competitive advantage to enable value and growth for utilities.  These often large and historically conservative organizations have moved beyond the notion that their sole business is to distribute electric power efficiently, reliably, and cost-effectively to consumers. They now rely on analysis of massive amounts of data they already collect from smart meters and existing networks about distribution and consumption, and are taking progressive action on that data.

As we have seen in other markets, such as Financial Services and Retail, data is becoming the currency for an energy market transformation.…

Read More

Hadoop & Big Data Seminar, Coming to a City Near You

Do you want to understand how Apache Hadoop can benefit your business? Do you understand the relationship between Hadoop and your Big Data initiatives? Are you struggling to explain the benefits of Hadoop to your management team?

At Hortonworks, we are constantly being asked by business and executive audiences to explain use cases, benefits and components of Hadoop. As interest in Big Data and Hadoop grows, so does the urgent demand for a map to creating value and differentiation.

Good news: Hortonworks is hosting a half-day seminar series specifically targeted at IT Managers, Directors, and Executives. The focus of these sessions will be “Big Business Value from Big Data and Hadoop.”

We are thrilled at the reception these events have already garnered and urge you to register before seats fill up. The list of cities and dates includes:

  • Seattle – Sept 19
  • Los Angeles – Sept 20
  • Chicago – Sept 25
  • Dallas – Sept 26
  • San Francisco – Sept 27
  • DC – Oct 9
  • New York – Oct 10
  • Boston – Oct 11

We hope to see you there!…

Read More

Pig as Hadoop Connector, Part One: Pig, MongoDB and Node.js

Series Introduction

Apache Pig is a dataflow-oriented scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems, to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use are available at https://github.com/rjurney/enron-node-mongo and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local.…

Read More

Apache Hadoop YARN – Concepts and Applications

Apache Hadoop YARN – Concepts & Applications

As previously described, YARN is essentially a system for managing distributed applications. It consists of a central ResourceManager, which arbitrates all available cluster resources, and a per-node NodeManager, which takes direction from the ResourceManager and is responsible for managing resources available on a single node.

ResourceManager

In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, it’s strictly limited to arbitrating available resources in the system among the competing applications – a market maker, if you will. It optimizes for cluster utilization (keep all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs.…
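To make the idea of optimizing utilization against capacity guarantees concrete, here is a minimal sketch of a toy scheduler. It is not the algorithm used by any YARN scheduler: the queue dictionaries, the `guarantee` fraction, and the rule "serve the queue furthest below its guaranteed share" are all assumptions chosen for illustration.

```python
def pick_queue(pending, used, guarantees, total):
    """Pick the queue with demand that is furthest below its guaranteed share."""
    candidates = [name for name, count in pending.items() if count > 0]
    if not candidates:
        return None
    # Ratio of current usage to guaranteed capacity; the lowest ratio is served next.
    return min(candidates, key=lambda name: used[name] / (guarantees[name] * total))


def schedule(queues, total):
    """Hand out `total` identical containers, one at a time.

    `queues` is a list of dicts with hypothetical keys:
    name, guarantee (fraction of the cluster), pending (requested containers).
    """
    pending = {q["name"]: q["pending"] for q in queues}
    guarantees = {q["name"]: q["guarantee"] for q in queues}
    used = {q["name"]: 0 for q in queues}
    free = total
    while free > 0:
        name = pick_queue(pending, used, guarantees, total)
        if name is None:
            break  # no demand left; the remaining capacity stays idle
        pending[name] -= 1
        used[name] += 1
        free -= 1
    return used


# Both queues are busy: each receives its guaranteed share.
busy = schedule(
    [{"name": "prod", "guarantee": 0.75, "pending": 10},
     {"name": "dev", "guarantee": 0.25, "pending": 10}],
    total=8,
)

# "dev" asks for little: its unused share goes to "prod" rather than sitting idle.
uneven = schedule(
    [{"name": "prod", "guarantee": 0.75, "pending": 10},
     {"name": "dev", "guarantee": 0.25, "pending": 1}],
    total=8,
)
```

The second call shows the utilization goal from the text: a guarantee acts as a floor under contention, not a cap on a queue when other queues leave capacity unused.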

Read More

City Hall is Getting Schooled

Nothing happens in a vacuum anymore. Cities now have the ability to use information collected from a massive variety of sources in order to help solve common city problems. The information can arise from anywhere – tweets, blog posts, and meter readings all can serve to inform public officials (and citizens as a whole) about how to better interact in a data-drenched world.

Most famously, IBM’s Smart Cities initiative looks at how city governments meet the needs of their expanding populations by using available resources more efficiently. This is in direct contrast to the older practices of extracting ever-greater amounts of natural resources. For example, optimizing how power plants distribute energy to city grids can alleviate power concerns during the summer months when A/C usage creates huge power demands. The insight into how to do this better is always better than blind foresight.…

Read More

Apache Hadoop YARN – Background and an Overview

Apache Hadoop YARN – Background & Overview

To celebrate the significant milestone of Apache Hadoop YARN being promoted to a full-fledged sub-project of Apache Hadoop in the ASF, we present the first blog in a multi-part series on Apache Hadoop YARN – a general-purpose, distributed application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.

MapReduce – The Paradigm

Essentially, the MapReduce model consists of a first, embarrassingly parallel, map phase where input data is split into discrete chunks to be processed. It is followed by the second and final reduce phase where the output of the map phase is aggregated to produce the desired result.…
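The two phases can be sketched with the familiar word-count example. This is a single-process illustration of the model, not Hadoop code: the function names are made up for this sketch, and the grouping step between the phases (usually called the shuffle) is written out explicitly even though the paragraph above does not mention it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one input chunk into (key, value) pairs, independently of other chunks."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all values by key so each key reaches exactly one reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk could be mapped on a different machine; here they run in a loop.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["pig hive pig", "Hive hadoop"])
```

Because `map_phase` looks at one chunk at a time and shares no state, the map calls can run in any order or in parallel, which is what makes the phase embarrassingly parallel.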

Read More

Introducing Apache Hadoop YARN

Introducing Apache Hadoop YARN

I’m thrilled to announce that the Apache Hadoop community has decided to promote the next-generation Hadoop data-processing framework, i.e. YARN, to be a sub-project of Apache Hadoop in the ASF!

Apache Hadoop YARN joins Hadoop Common (core libraries), Hadoop HDFS (storage) and Hadoop MapReduce (the MapReduce implementation) as the sub-projects of Apache Hadoop, which is itself a Top Level Project in the Apache Software Foundation. Until this milestone, YARN was a part of the Hadoop MapReduce project; it is now poised to stand on its own as a sub-project of Hadoop.

In a nutshell, Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data-processing.…

Read More

Healthcare Goes Big

Earlier, in the “Big Data in Genomics and Cancer Treatment” blog post, I explored how the extensive amount of information in DNA analysis mostly comes from the vast array of characteristics associated with people’s DNA makeup and with different cancer variations. The case with today’s healthcare is very similar. Each patient is unique and has thorough medical history records that allow doctors to make evaluations and recommendations for future treatments. These records also contain various drugs, therapies, diets, and regimens that must coincide with the patient’s condition and which, if not followed correctly, could endanger the patient’s life.

“Doctor, can I have some of that Big Data?”

Currently, the medical field is overflowing with big data and there is huge potential for improvement in treatment quality and overall patient experience. With the use of big data analytics, health care and pharmaceutical companies could significantly advance the services that they offer their patients.…

Read More
