Category Archives: Apache Hadoop


Stinger Early Results: 45X Performance Increase for Hive

Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan

A few weeks back we blogged about the Stinger Initiative and set a promise to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing but are so excited about the early results of this work that we thought we’d take the time to share some of this with you.

In order to get a fair assessment we styled our tests after the TPC Benchmark™ DS (TPC-DS). For this initial report, we provide detail around two of the most common use cases and as we execute more queries and make more improvements we will provide more detail.

Performance Tests & Environment

In this report we provide results for two of our performance queries.  In the first, we perform a star schema join where we load all the small tables in memory and do a scan through the fact table independently on all nodes. In the second query, we perform a join between two large fact tables, both of which are too large to fit in memory.

Our test environment comprised of a 10 node ec2 cluster with a total of 100 containers over 40 disks, to obtain query execution times with Hive on raw data and with Hive with all the optimizations enabled on partitioned data stored in RCFile format. We also use a scale factor of 200, which implies a data set of around 200GB.

Results

For the first query, we’ve calculated as much as a 35X improvement over native Hive and have reduced query times from around 1400 seconds to 39! And for second query we calculate a whopping 45X improvement… all in open source Apache Hive.

StingerQuery1  StingerQuery2

This is a preliminary look at test results but it indicates significant improvements already. We are pretty excited about the results, as the work has only just begun.  The test results above do not even include the Tez improvements, nor do they include the new ORCFile format!  And there are still a handful of other Hive specific improvements coming.  Some of the improvements are included in these tests include HIVE-3784, HIVE-3952 and HIVE-2340.  A before and after of the execution looks like this:

BeforeStinger

AfterStinger

In subsequent posts we will deliver more explicit results and provide a more in-depth specifications for hardware/software used for the benchmarks. We’ll also describe few other efforts, which will complete our story of attaining 100x gains we talked about earlier, so stay tuned.

We truly believe that the fastest path to innovation is the open community and this is a great example of how quick the community can prove this true.  These advances are not attributed to Hortonworks alone; they were completed in partnership with the community with resources from Yahoo!, Facebook, Twitter, SAP and Microsoft all contributing.

NOTE: We are talking about all of this and much more at Hadoop Summit Amsterdam. Attend our talk “Innovations In Apache Hadoop MapReduce, Pig and Hive for improving query performance” to learn more about this effort. “Optimizing Hive Queries” by Owen O’Malley and “What’s New and What’s Next in Apache Hive” by Gunther Hagleitner are two other talks that you should attend to learn more about other threads in our stinger initiative.

Touring Ambari

Hot on the heels of the release of the new version of Sandbox, I thought it would be worth a look at Ambari as it is now integrated into the Sandbox VM. You can download the Hortonworks Sandbox and try it out for yourself!

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It greatly simplifies and reduces the complexity of running Apache Hadoop. Ambari is a fully open-source, Apache project and graphical interface to Hadoop.

ambari_dashboard

The Ambari Dashboard serves as a home page for your cluster, defining key metrics and linking you through to particular services on the cluster.

ambari_heatmap

Heatmaps show which parts of your cluster are the least or most active, which can help with capacity and load management.

ambari_services

The Ambari Services interface lets you monitor cluster-wide services on your Hadoop cluster.

ambari_hosts

The Ambari Hosts interface lets you drill down to individual hosts that make up your cluster.

ambari_jobs

The Ambari Jobs interface lets you examine the individual applications and jobs that makeup your Hadoop workload.

ambari_users

The Ambari Users interface helps you administer new users on your Hadoop cluster. You can try it out by downloading the new Hortonworks Sandbox. We hope you enjoyed this post, please let us know by commenting!

Week in Review: From Plastics to Windows

We’re wrapping up another busy week at Hortonworks towers. I say another, but actually this is my first week. So… it’s a hello from me, I’m Marc Holmes, Community Director. What have we been talking about this week?

Plastics and Hadoop: discuss! We started the week with a post from our VP of Products, Bob Page drawing an analogy to the growth of the plastics industry with the disruption to the database market driven by Hadoop, looking at the connections and differences to SQL and pointing out ‘what we don’t know yet’ on the evolution of use cases for Hadoop.

Hadoop and Windows sitting in a tree… Arun and Suresh highlighted the joint effort between Hortonworks and Microsoft to make Apache Hadoop run natively on Windows, and celebrated the community vote to move this work into the mainline trunk. We’re community-driven open source folk and we’re delighted not only by the code, but the spirit of community contribution throughout. Microsoft talked about this work over on their Port 25 blog.

Out there. Meantime, there was a LOT of discussion on a couple of articles including this one - Proprietary Hadoop is a Losing Strategy - and this one - One Hadoop Distribution To Rule Them All as a follow up. We believe, and Arun points out, that ‘ultimately the winners in Hadoop will be those investing most heavily in its success’.

But what do you think at a personal level? Do you want Hadoop skills, or Hadoop-a-like skills? Let us know.

And finally, talking of skills, Russell Jurney explained how to Install Hadoop on Windows. So now you know.

Next week… should be quiet. Only the Hadoop Summit in Amsterdam, and a bunch of exciting stuff we’ll tell you more about then. Stay out of trouble and enjoy the show!

Expanding the Apache Hadoop Community to Windows

This post co-authored by Arun Murthy.

It’s been an exciting time for the Apache Hadoop community with new and innovative projects happening around performance (Apache Tez) — part of the Stinger initiative — and security (Apache Knox). In addition Hortonworks recently announced the availability of the beta version of Hortonworks Data Platform for Windows.

One of the things we believe strongly in here at Hortonworks is community driven open source and, obviously, the bigger the community, the better. The community opens itself up to new members by the developmental choices it makes and last week the Apache Hadoop community voted to significantly expand itself by agreeing to accept enhancements into the core trunk that make Apache Hadoop run natively on the Microsoft Windows platforms including Windows Server and Windows Azure. These enhancements were the result of many, many months of joint engineering work from Microsoft and Hortonworks and we are glad to see the community accept and embrace them. So far, as is common in the Apache Hadoop project, we developed these in a development branch for over a year and once this work was complete, the community voted to incorporate these changes into the mainline trunk.

Here are the highlights of the work done:

  • Command-line scripts for the Hadoop surface area
  • Mapping the HDFS permissions model to Windows
  • Abstracted and reconciled mismatches around differences in path semantics in Java and Windows
  • Native Task Controller for Windows
  • Implementation of a Block Placement Policy to support cloud environments, more specifically Windows Azure.
  • Implementation of Hadoop native libraries for Windows (compression codecs, native I/O)
  • Several reliability issues, including race-conditions, intermittent test failures, resource leaks.
  • Several new unit test cases written for the above changes

This is great news for the Apache Hadoop ecosystem because it enables a whole new swath of organizations using Microsoft Windows and, equally importantly, end-users to work with Apache Hadoop in their preferred environment. There is also the substantial ecosystem of technology vendors who build solutions for the Microsoft Windows platform who can now integrate their solutions on Windows. Additionally the system integrators who have invested and created expertise around the Windows platform will be able to extend their skills to Hadoop on Windows.

Of course it is also a great demonstration of contributing back to the community so that anyone can benefit from this work. It is also notable that our collaborative efforts with Microsoft also extend beyond core Apache Hadoop to projects like Apache Hive, Apache Pig, Apache Sqoop, Apache Oozie, Apache HCatalog and Apache HBase.

We at Hortonworks would like to extend our congratulations to Microsoft for giving back to the Apache Hadoop community and would like to extend a warm welcome; the community can look forward to seeing much more as we work together in the near future.

HOWTO install Hadoop on Windows

Installing the Hortonworks Data Platform for Windows couldn’t be easier. Lets take a look at how to install a one node cluster on your Windows Server 2012 machine. to let us know if you’d like more content like this.

msi_download
To start, download the HDP for Windows MSI at http://hortonworks.com/thankyou-hdp11-win/. It is about 460MB, and will take a moment to download. Documentation for the download is available here.

As indicated in the documentation here, first we must install Microsoft Visual C++ 2010 Redistributable Package (x64), available here.

Download and install .NET from here if you haven’t already.

We need to setup Java, which you can get here. We need to setup JAVA_HOME, which Hadoop requires. Make sure to install Java to somewhere without a space in the path, “Program Files” will not work!

To setup JAVA_HOME, in the file browsers -> right click computer -> Properties -> Advanced System Settings -> Environment variables. Then setup a new System variable called JAVA_HOME that points to your Java install (in this case, C:\java\jdk1.6.0_31).

JAVA_HOME

Finally, we need to download python from here and set the Path environment variable as we did JAVA_HOME. Go to Computer -> Properties -> Advanced System Settings -> Environment variables. Then append the install path to Python, for example C:\Python27, to this path after a ‘;’.

python_path

Verify your path is setup by entering a new shell and typing: python, which should run the python interpreter. Type quit() to exit. Now we’re ready for our configuration.

Next, notepad the file clusterproperties.txt, which we will setup for a simple, one node cluster operation. Note: first we need to discover our hostname, and enter it into our config instead of something generic like ‘localhost.’ Use the hostname command, for example:

hostname
WIN-4VLBRQK8FA8

We then place this hostname in our config. Be sure the replace the example value with your own hostname!

#Log directory
HDP_LOG_DIR=c:\hadoop\logs

#Data directory
HDP_DATA_DIR=c:\hadoop\data

#Hosts
NAMENODE_HOST=WIN-4VLBRQK8FA8
SECONDARY_NAMENODE_HOST=WIN-4VLBRQK8FA8
JOBTRACKER_HOST=WIN-4VLBRQK8FA8
HIVE_SERVER_HOST=WIN-4VLBRQK8FA8
OOZIE_SERVER_HOST=WIN-4VLBRQK8FA8
TEMPLETON_HOST=WIN-4VLBRQK8FA8
SLAVE_HOSTS=WIN-4VLBRQK8FA8

#Database host
DB_FLAVOR=derby
DB_HOSTNAME=WIN-4VLBRQK8FA8

#Hive properties
HIVE_DB_NAME=hive
HIVE_DB_USERNAME=hive
HIVE_DB_PASSWORD=hive

#Oozie properties
OOZIE_DB_NAME=oozie
OOZIE_DB_USERNAME=oozie
OOZIE_DB_PASSWORD=oozie

And finally, install HDP for Windows:

msiexec.exe /i "hdp-1.1.0-160.winpkg.msi" /lv install.log \
HDP_LAYOUT=c:\Users\Administrator\Downloads\clusterproperties.txt HDP_DIR=c:\HDP DESTROY_DATA="yes"

This will bring up an MSI install window. When it is done, to verify your installation, check the HDP_DIR it was installed to:

dir c:\HDP

You should see files, such as ‘start_local_hdp_services.cmd’. Run this file:

.\start_local_hdp_services.cmd

With services up, you’re in good shape to run the SmokeTests.

Run-SmokeTests.cmd

Which will fire off a mapreduce job right away. Congratulations, you’re Hadooping on Windows!

mapreduce

If you’d like to learn more about Hadoop, check out the Hortonworks Sandbox, a fully capable virtual machine for you to learn Hadoop with.

Seamless Reporting & Analytics for Apache Hadoop & Big Data Users

Jaspersoft, a Hortonworks certified technology partner, recently completed a survey on the early use of Apache Hadoop in the enterprise. The company found 38% of respondents require real-time or near real-time analytics for their Big Data with Hadoop. Also, within the enterprise, there is a diverse group of people who use Hadoop for such insights: 63% are application developers, 15% are BI report developers and 10% are BI admins or casual business users. Register for a free webinar to hear more.

So, for Hadoop users, the partnership between Hortonworks and Jaspersoft provides a good combination– Jaspersoft provides the ideal complement for reporting and analysis of Hadoop-based Big Data systems through a full suite of ETL, Apache Hive, and native Apache HBase connectors for low-latency data exploration. Not only does the company have an open source model that empowers users to deploy Big Data reporting and analytics quickly and cost-effectively, pre-defined reports make it easy for a wide group of users to gain and share immediate insight.

Jaspersoft joined the Hortonworks Technology Partner Program in 2012, extending advanced reporting capabilities to Hadoop users. The Hortonworks Technology Partner Program is designed to assist ISVs and other solution providers to integrate and extend their solutions for Hadoop, and includes a variety of technical enablement, technical support and training offerings. According to Hortonworks’ CTO Eric Baldschwieler, “Jaspersoft’s industry-leading reporting, analysis, and dashboard products, together with the Hortonworks Data Platform, make it easy and cost-effective for customers to derive maximum insights and value from their largest data stores.”

Choosing the right analytical approach

As easy as this sounds, there are still several approaches to analyzing and reporting on Big Data and numerous use cases— web analytics, fraud detection, security monitoring and healthcare just to name a few. Choosing the right approach depends on what insights you need and why you need them, and can make all the difference in how much value you extract from your data.

An upcoming webinar hosted by Hortonworks and Jaspersoft on March 13 will delve into the various architectural choices used in Hadoop reporting and analytics, and several use cases will be discussed. Register now.

 

Getting Ready for The Elephant Party in Europe

We are just under two weeks away from start of the first ever Hadoop Summit Europe and with all of the final preparations being made we thought we would highlight some of the not to be missed activities in and around the event. The event is filling fast but you can still register here.

Here are 10 great reasons to attend!

1)   Great track content – there are 35 informative sessions on Apache Hadoop and related technologies for you to choose from selected by the community and delivered by the experts themselves.

2)   Great keynotes – leading industry analyst Matt Aslett will present the opening keynote and we will also hear from open source veteran Shaun Connolly as well as Hortonworks CTO Eric Baldeschwieler

3)   Hadoop in the Enterprise expert panel – We will have a live panel discussion from industry leaders incuding eBay, HSBC and Neustar discussing how and why they use Apache Hadoop.

4)   Meetups – the NLHUG and other communities will be meeting around the event.

5)   Lightening talks – we’ve got rapid fire content coming to you in the form of community selected lightening talks. These 5 minute sessions will give you a taste of a wide range of technologies and initiatives

6)   It’s Amsterdam – historic, edgy and fun!

7)   Ecosystem – The conference has the support of the broader Hadoop ecosystem so you can come and discuss Hadoop and big data in the solutions showcase.

8)   Community – The Apache Hadoop community is big and getting bigger. Come meet and mingle with other community members to learn about the latest goings on and make new connections.

9)   Get Hadoop certified – Calling all Hadoop Experts! We’re bringing certification to you! If you are ready to take the exam to become a Hortonworks Certified Apache Hadoop Developer (HCAHD) or a Hortonworks Certified Apache Hadoop Administrator (HCAHA).

10)   Get trained on Hadoop – we’ve got a host of classes available during the event to help you learn or sharpen your Hadoop skills. This includes a newly added Applying Data Science class. Check out the classes.

11)  BONUS reason – have a beer on us at the Hadoop Summit Party at the Heineken Experience a cool venue at a historic location.

Register now, don’t miss the party hope to see you there!

Putting the Elephant in the Window

 

For several years now Apache Hadoop has been fueling the fast growing big data market and has become the defacto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting but in order to do so with Hadoop they have had to use servers running the Linux operating system. That left a large number of organizations who standardize on Windows (According to IDC, Windows Server owned 73 percent of the market in 2012 – IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

windoweleWe are very pleased to announce the availability of Hortonworks Data Platform for Windows providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows. HDP is the first and only Hadoop-based platform available on both Windows and Linux and provides interoperability across Windows, Linux and Windows Azure. With this release we are enabling a massive expansion of the Hadoop ecosystem. New participants in the community of developers, data scientist, data management professionals and Hadoop fans to build and run applications for Apache Hadoop natively on Windows. This is great news for Windows focused enterprises, service provides, software vendors and developers and in particular they can get going today with Hadoop simply by visiting our download page.

This release would not be possible without a strong partnership and close collaboration with Microsoft. Through the process of creating this release, we have remained true to our approach of community-driven enterprise Apache Hadoop by collecting enterprise requirements, developing them in open source and applying enterprise rigor to produce a 100-precent open source enterprise-grade Hadoop platform.

One of our goals at Hortonworks is to make Hadoop and enterprise viable data platform available on as many platforms as possible. In fact, it is already available today in a range of deployment options including: on-premise, virtual, cloud and an appliance. For organizations looking to leverage Apache Hadoop, they now have even more choices of deployment options between Linux and Windows, giving them more freedom to meet their internal policies and standards. For Microsoft Windows customers, they have complete portability of their Apache Hadoop applications between on premise and cloud deployments, as Hortonworks Data Platform for Windows and HDInsight Service on Windows Azure are built on exactly the same code line.

If you are in the SF Bay Area this week, you can talk to us live about the power of the Hortonworks Data Platform for Windows at booth #316 at the Strata Conference, taking place February 26-28 at the Santa Clara Convention Center in Santa Clara, Calif.

 We will also be conducting the webinar, “Unlocking the Other Half: Introduction to Hortonworks Data Platform for Windows,” on Tuesday, March 12 at 10 a.m. PST / 1 p.m. EST.

To register for the webinar, please visit http://info.hortonworks.com/Hortonworks_HDPonWindows_webcast.html.

 

Pig Eye for the SQL Guy

Cat Miller is an engineer at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

Introduction

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

What’s Similar?

The basic concepts in SQL map pretty well onto Pig. There are analogues for the major SQL keywords, and as a result you can write a query in your head as SQL and then translate it into Pig Latin without undue mental gymnastics.

WHERE → FILTER

The syntax is different, but conceptually this is still putting your data into a funnel to create a smaller dataset.

HAVING → FILTER

Because a FILTER is done in a separate step from a GROUP or an aggregation, the distinction between HAVING and WHERE doesn’t exist in Pig.

ORDER BY → ORDER

This keyword behaves pretty much the same in Pig as in SQL.

JOIN

In Pig, joins can have their execution specified, and they look a little different, but in essence these are the same joins you know from SQL, and you can think about them in the same way. There are INNER and OUTER joins, RIGHT and LEFT specifications, and even CROSS for those rare moments that you actually want a Cartesian product.

Because Pig is most appropriately used for data pipelines, there are often fewer distinct relations or tables than you would expect to see in a traditional normalized relational database.

Control over Execution

SQL performance tuning generally involves some fiddling with indexes, punctuated by the occasional yelling at an explain plan that has inexplicably decided to join the two largest tables first. It can mean getting a different plan the second time you run a query, or having the plan suddenly change after several weeks of use because the statistics have evolved, throwing your query’s performance into the proverbial toilet.

Various SQL implementations offer hints to combat this problem—you can use a hint to tell your SQL optimizer that it should use an index, or to force a given table to be first in the join order. Unfortunately, because hints are dependent on the particular SQL implementation, what you actually have at your disposal varies by platform.

Pig offers a few different ways to control the execution plan. The first is just the explicit ordering of operations. You can write your FILTER before your JOIN (the reverse of SQL’s order) and be clever about eliminating unused fields along the way, and have confidence that the executed order will not be worse.

Secondly, the philosophy of Pig is to allow users to choose implementations where multiple ones are possible. As a result, there are three specialized joins that a can be used when the features of the data are known, and are less appropriate for a regular join. For regular joins, the order of the arguments dictates execution—the larger data set should appear last in this type of join.

As with SQL, in Pig you can pretty much ignore the performance tweaks until you can’t. Because of the explicit control of ordering, it can be useful to have a general sense of the “good” order to do things in, though Pig’s optimizer will also try to push up FILTERs and LIMITs, taking some of the pressure off.

What’s Different?

A Row By Any Other Name

The SQL paradigm is very straightforward—there are tables, and tables contain rows. Every select statement yields a set of rows, and each field in a row is a basic data type. Conceptually, the result of any SQL select statement can be imported into Excel without loss of information.

Pig introduces a mature nesting notion in tuples and data bags that changes the game significantly. Pig consists of data sets called relations (sometimes called aliases for the names they are given), and those contain records that are data tuples, which can in turn recursively contain data bags, data tuples, or data items. There is distinct lack of flatness in Pig, and the best way to see it is to explore how GROUP works.

Not Your Grandmother’s GROUP

In the handling of GROUP, SQL and Pig diverge significantly. SQL’s GROUP doesn’t exist outside of the aggregation performed on it; you would never SELECT * GROUP by field1–it just doesn’t make sense. Because everything in SQL is a row, the grouping created isn’t persistent—only the data produced aggregating over it remains.

Pig’s GROUP is an entirely different beast, albeit used for the same purpose. It is a persistent relation that can be used again and again, independent of what aggregations you might choose to perform on it later.

student_grades = GROUP grades by student;

This student_grades relation has two fields: one called group and populated with the value of student, the other called grades populated with a data bag containing tuples for all the entries with the same value of student.

group	grades
alyssa 	{< hacking101, 95 >, < english, 60 >}
ben	{< math, 90 >}

Now to do an aggregation, perform it on the student_grades alias.

average_grades = FOREACH student_grades GENERATE group, AVG(grades.value);
Procedural Paradise

This is the first thing any Google search on Pig will tell you, and it is the most glaringly obvious change from SQL. After having taught your brain for years how to turn an idea inside-out and mash all of its pieces into one query, Pig makes query writing feel like writing Java or C++. In addition to to obvious potential cognitive benefits, this has some technical ones as well.

Subquery Reuse

Ever write a query with a subselect, and then realize you actually needed to use that subtable twice in the query? Did you feel absolutely awful as you cut and pasted that subtable? Did you wonder whether your SQL plan would successfully manage to not calculate it twice? (Note: the WITH clause mitigates this pain in a lot of cases, but isn’t available in all flavors of SQL.)

Because in Pig Latin every step has a declared alias, reusing “subquery” tables is natural and intuitive, and generally does not involve building them twice.

Getting multiple queries out of one pipeline

In SQL you can find yourself in a place where you want to use the data, do some manipulation on it, and then take it in a few different directions. To do this in one query requires profligate use of JOIN, and enough parens to intimidate a LISP hacker.

In a Pig pipeline, any and all aliases produced along the way can be stored, and all it takes is adding a new STORE statement to the script.

User-Defined Functions

SQL has had decades for people to figure out what analytic functions they need for arbitrary data analysis, and so when you find yourself suddenly interested in extracting the day of the week from a date, that function is ready and waiting.

Pig’s list of built-in functions is growing, but is still dwarfed by what Oracle or MYSQL provides. What turns this into a tolerable constraint is that Pig allows the user to define aggregate or analytic functions in other languages (Java, Python, and others) and then apply them in Pig quickly and without fuss.

REGISTER udf.jar;

new_data = FOREACH my_data GENERATE udf.ImportantFunction(field1);

The Well-Disguised SQLer

In general, if you can think about it in SQL, you can do it in Pig. Be aware of the nested data structures, have a cheat sheet for syntax, and relish the ability to write queries the way your brain thinks them, and not the way SQL demands.

As a final thought, let’s resurrect our old friend the emp table, and take a look at some SQL to Pig Latin examples.

Average Salary by Location
SQL
SELECT loc, AVG(sal) FROM emp JOIN dept USING(deptno) WHERE sal > 3000 GROUP BY loc;
Pig Latin
filtered_emp = FILTER emp BY sal > 3000;

emp_join_dept = JOIN filtered_emp BY deptno, dept BY deptno;

grouped_by_loc = GROUP emp_join_dept BY loc;

avg_salary = FOREACH grouped_by_loc GENERATE group, AVG(emp_join_dept.sal);
Ordered Average Salary by Location

Suppose now that the following is true.

● The ‘loc’ field is a string/varchar field, and we have two pieces of software that automatically populate it. One stores values as lowercase, one as uppercase. (If this seems like a contrived example to you, you have chosen your employers and software vendors well.)

The new parts of the queries appear in bold.

SQL
SELECT standard_loc, AVG(sal) avg_salary FROM
(SELECT UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno) WHERE sal > 3000) std_table GROUP BY standard_loc;
Pig Latin
filtered_emp = FILTER emp BY sal > 3000;

emp_join_dept = JOIN filtered_emp BY deptno, dept BY deptno;

grouped_by_loc = GROUP emp_join_dept BY UPPER(loc);

avg_salary = FOREACH grouped_by_loc GENERATE group, AVG(emp_join_dept.sal);

This kind of change is friendlier to Pig because of the limitations in SQL’s GROUP BY clause; the trade-off of verboseness for clarity increases in value the more complex your query gets.

Above Average Salary for Location

Now suppose instead of arbitrarily selecting 3,000 as a threshold, which is going to overselect people living in large expensive cities, we want to select those employees who make more than twice the average for their location.

SQL

There are several ways to accomplish this, of course, but for illustration purposes the most vanilla version is shown here.

SELECT empno FROM 
(SELECT empno, UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno)) std_table1 JOIN
(SELECT standard_loc, AVG(sal) avg_salary FROM
(SELECT UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno)) std_table2 GROUP BY standard_loc) grp_table2
USING(standard_loc) WHERE sal > (2 * avg_salary);
Pig Latin
emp_join_dept = JOIN emp BY deptno, dept by deptno;

grouped_by_loc = GROUP emp_join_dept BY UPPER(loc);

loc_avg_salary = FOREACH grouped_by_loc GENERATE AVG(emp_join_dept.sal) as avg_salary, FLATTEN(emp_join_dept);

highly_paid = FILTER loc_avg_salary BY sal > (2 * avg_salary);

Here the dreaded subquery reuse rears its unattractive head in SQL and leads to a bit of a frankenquery. In contrast, the Pig script stays the same length, because we can effectively just shift the FILTER from the beginning to the end of our data flow. FLATTEN is a new entry to our function arena, for which there is no SQL analogue. What FLATTEN does is unnest tuples and data bags; in this example it’s taking the data bag ‘emp_join_dept’ created by the GROUP function, and removing the nesting so that the fields within it will be at the same level as avg_salary.

Run It Yourself for the Swine-Curious

If you want to to try out some Pig examples hands-on, you can get a free Mortar account here. We’ve generated a one million row emp table data set, so to run these examples on your own all you need to do is:

    1. Sign up at app.mortardata.com for a free account.
    2. Go to Web Projects
    3. Select My Web Projects -> New Blank project
    4. To load the two data files, you need LOAD statements, and to store the resulting data you need a STORE statement, so in the end your pig script should look like this:
dept = LOAD 's3n://mortar-example-data/employee/dept.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage('SKIP_HEADER')
AS (deptno:int, dname:chararray, loc:chararray);

emp = LOAD 's3://mortar-example-data/employee/emp.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage('SKIP_HEADER')
AS (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, deptno:int);

-- Pig example code here

STORE avg_salary INTO 's3n://mortar-example-output-data/$MORTAR_EMAIL_S3_ESCAPED/emp_example’ USING PigStorage('\t');
  1. Click Run and select a 2-node cluster. Even with one million rows, it runs in under three minutes. Sit back and watch the power of modern computing tackle the problems of the 1980s.

You can also try out Pig in the Hortonworks Sandbox.

Apache Hadoop YARN Meetup II @Hortonworks

Introduction

Hortonworks hosted the second Apache Hadoop YARN meetup at Hortonworks office in Palo Alto on last Friday (22 February 2013). Following the success with the first one, this meetup continues to enjoy a good attendance from the YARN community. About 40 joined the meetup in person and nearly another 30 attended via phone/webex.

Meetup sessions

Update from Yahoo!

The Yahoo! grid team responsible for YARN rollout on their clusters gave an update of the current deployments and their state. Robert Evans and others from their team threw some very impressive numbers about the YARN clusters – 10s of million jobs till now on YARN, averaging ~100,000 jobs on some clusters per day. Please go ahead and read their recent blog on Yahoo! developer network: Hadoop at Yahoo!: More Than Ever Before. They then fielded several questions from the community like any pain-points for the users during the upgrade, big issues that only surfaced at scale. The software is deemed sufficiently stable, churning jobs out impressively and with maximum uptime with downtime mostly happening during upgrades.

Bikas Saha on ResourceManager restart functionality

After the update from Yahoo!, Bikas Saha from Hortonworks talked about the ResoureManager restart functionality. Most of his work is captured on the Apache JIRA issue YARN-128. The effort is divided into phases and the first phase involves:

  • Putting in place mechanisms to save application state and read them back after restart.
  • Upon restart, the NM’s are asked to reboot and the previously running AM’s are restarted.
  • AMs which support recovery automatically read back their own saved state on restart and try to resume work from before RM restart.

This first phase to restart all the running applications on RM recovery is done and shipped with the latest hadoop release 2.0.3-alpha. He discussed the overall design on a whiteboard, explaining the implementation.

YARN at LinkedIn

Chris Riccomini then talked a bit about what he continues to do with YARN (see his notes from last meetup).

Arun C Murthy on CPU scheduling in YARN

Arun talked about the enhancements to YARN resource scheduling to also account for CPU cores in addition to the memory based scheduling we already have. This effort is capture on Apache JIRA at YARN-2. Arun walked us through the DRF algorithm on which this work is based on, described various scheduling scenarios and summed it up with possible future directions.

Alejandro also gave a brief summary about adding support for CPU isolation/monitoring of containers. YARN-3 enhances YARN to use cgroups to control the cpu usage of containers. There is still a little work left to make this feature exposed to the end-users.

CPU scheduling and support for isolation via cgroups in YARN are both available in the most recent hadoop release 2.0.3-alpha. Both these features are big steps for YARN in realizing its goal of becoming the foremost generic resource management layer and making Hadoop the ‘distributed operating system’ on which rest of the data systems build on.

YARN progress and Roadmap

I did a quick recap of what we discussed in the last YARN meetup and what we’ve achieved so far. Few things, the community has delivered on its promises from last time:

Libraries for helping application writers: YARN-418 is the umbrella ticket for tracking this and we made quite some progress. YARN-29 helps application submission, YARN-103 is helpful to simply the usage of the AM RM protocol.
CPU isolation and scheduling: YARN-2 and YARN-3 are checked in as noted above
RM restart: The first phase to just restart running AMs and NMs is already in as part of YARN-128.

I then summed it up with our roadmap. The goal of YARN community for the next version of hadoop is to address some rough corners in YARN that are thwarting its march beyond its alpha use. Some areas of focus include:

  • RM restart: Finish testing RM restart at scale and progress toward the next phases
  • YARN usability: Address the minor and major usability issues with YARN client and web interfaces.
  • YARN API cleanup: Cleanup YARN APIs now itself to make them future proof and help us support backwards compatibility of stable APIs into the future
  • Security: YARN’s mostly been secure from the word get-go, a couple of minor things need to addressed to close this loop.

Conclusion

Thanks to everyone for making YARN meetups a continued success story. All help is welcome from the community to focus on solidifying our next release. Looking forward to meeting you all again at the next meetup!

Apache Pig 0.11 Released!

Apache Pig version 0.11 was released last week. An Apache Pig blog post summarized the release. New features include:

  • A DateTime datatype, documentation here.
  • A RANK function, documentation here.
  • A CUBE operator, documentation here.
  • Groovy UDFs, documentation here.

And many improvements. Oink it up for Pig 0.11! Hortonworks’ Daniel Dai gave a talk on Pig 0.11 at Strata NY, check it out:

Apache HBase 0.94.5 is out!

Last week, the HBase community released 0.94.5, which is the most stable release of HBase so far. The release includes 76 jira issues resolved, with 61 bug fixes, 8 improvements, and 2 new features.

Most of the bug fixes went against the REST server, replication, region assignment, secure client, flaky unit tests, 0.92 compatibility and various stability improvements. Some of the interesting patches in this release are:
[HBASE-3996] – Support multiple tables and scanners as input to the mapper in map/reduce jobs
[HBASE-5416] – Improve performance of scans with some kind of filters.
[HBASE-7757] – Add web UI to REST server and Thrift server
[HBASE-7748] – Add DelimitedKeyPrefixRegionSplitPolicy
[HBASE-6669] – Add BigDecimalColumnInterpreter for doing aggregations using AggregationClient
[HBASE-7728] – Deadlock occurs between hlog roller and blog syncer’

The release candidate has been extensively tested by Hortonworks and many others in the community. You can roll out the 0.94.5 bits using rolling upgrade on top of 0.92 or 0.94 releases. In addition, Apache HBase 0.94.5 will be incorporated into an upcoming update to HDP 1.2.

You can download the new release from here, and find full release notes here.

Last, but not least, we would like to thank Lars Hofhansl, who is the release manager of 0.94 branch for driving the release train, and all 30 individuals, who have contributed to this release.

The Fastest Path to Innovation: Community Driven Open Source

 

blogpicLast week, we outlined our approach for delivering an enterprise viable Apache Hadoop distribution in the open.  Simply put: we believe the fastest way to innovate is to do our work within the open source community, introduce enterprise feature requirements into that public domain, and to work diligently to progress existing open source projects and incubate new projects to meet those needs.

In support of our approach, this week we’ve announced the submission of two new incubation projects to the Apache Software foundation together with the launch of the “Stinger Initiative”, all aimed at enhancing the security and performance of Hadoop applications.  These efforts focus on enterprise requirements that are essential to enable broad adoption across the Hadoop ecosystem.

  • The Stinger initiative aims to dramatically speed up Apache Hive in support of interactive query use cases.
  • The Knox Gateway addresses the need for a single point of authentication and secure access for Apache Hadoop services in a cluster.
  • The Tez framework provides an alternative next-generation runtime built on Hadoop YARN that significantly improves latency and throughput of Hadoop applications.

We feel these efforts are strong examples of our commitment to driving innovation from within the open source community, and as stated in our approach blog, we do this by::

  • identifying and articulating the enterprise requirements within the community,
  • taking an active role in addressing those requirements within the community, and
  • applying enterprise rigor to the build, test and release process to ensure that the open source projects as well as the larger product distribution we provide is enterprise grade and interoperable with other elements in the enterprise.

Since it takes a community to build enterprise-class platforms like Hadoop, if you have interest in helping with Knox, Tez, or Stinger, we encourage you to work with us and the others in the Apache community!

Introducing… Tez: Accelerating processing of data stored in HDFS

 

MapReduce has served us well.  For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created.  While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns.  A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce.  A key step to enabling this new world was Apache YARN and today the community proposes the next step… Tez

What is Tez?

Tez – Hindi for “speed” – (currently under incubation vote within Apache) provides a general-purpose, highly customizable framework that creates simplifies data-processing tasks across both small scale (low-latency) and large-scale (high throughput) workloads in Hadoop. It generalizes the MapReduce paradigm to a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job so that projects in the Apache Hadoop ecosystem such as Apache Hive, Apache Pig and Cascading can meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).

With the emergence of Apache Hadoop YARN as the basis of next generation data-processing architectures, there is a strong need for an application which can execute a complex DAG of tasks which can then be shared by Apache Pig, Apache Hive, Cascading and others.  The constrained DAG expressible in MapReduce (one set of maps followed by one set of reduces) often results in multiple MapReduce jobs which harm latency for short queries (overhead of launching multiple jobs) and throughput for large-scale queries (too much overhead for materializing intermediate job outputs to the filesystem). With Tez, we introduce a more expressive DAG of tasks, within a single application or job, that is better aligned with the required processing task – thus, for e.g., any given SQL query can be expressed as a single job using Tez.

The below graphic illustrates the advantages provided by Tez for complex SQL queries in Apache Hive or complex Apache Pig scripts.

pighivetez

Tez is critical to the Stinger Initiative and goes a long way in helping Hive support both interactive queries and batch queries. Tez provides a single underlying framework to support both latency and throughput sensitive applications, there-by obviating the need for multiple frameworks and systems to be installed, maintained and supported, a key advantage to enterprises looking to rationalize their data architectures. .

Essentially, Tez is the logical next step for Apache Hadoop after Apache Hadoop YARN. With YARN the community generalized Hadoop MapReduce to provide a general-purpose resource management framework (YARN) where-in MapReduce became merely one of the applications that could process data in your Hadoop cluster. With Tez, we build on YARN and our experience with the MapReduce to provide a more general data-processing application to the benefit of the entire ecosystem i.e. Apache Hive, Apache Pig etc.

What has been completed? Where can Tez go?

An early version of the project has been donated to the ASF as part of the initial code grant to establish the Incubation project.   Through the work done in the Stinger initiative, it is already clear that Tez enables and order of magnitude increase in the performance of Apache Hive.

The community is also designing a re-usable set of libraries of data-processing primitives such as sorting, merging, data-shuffling, intermediate data management etc. which are necessary for Tez and may be used directly by other projects.  This is just the beginning.  It is an extensible architecture that will undoubtedly be contributed to widely.

For the community, by the community

At Hortonworks we believe that innovation happens fastest by working with a community of like-minded individuals to address the requirements for Hadoop without being bounded by artificial boundaries such as employment. As such, even though the Hortonworks MapReduce/Hive/Pig team seeded the project, we’ve had the benefit of both positive feedback and constructive criticism from several leading contributors and committers across the Apache Hadoop MapReduce, Apache Hive & Apache Pig projects.  These inventors and peers are employed at Hortonworks, Yahoo, Facebook, Microsoft, Twitter and many others.  The initial committer list has 22 participants with deep domain expertise in these unique challenges and comprises a who’s who in the Hadoop world.  Of course, now that we are nearly in a position where we can co-develop via the Apache Software Foundation where we have proposed Tez as an Incubator project, we expect a very quick acceleration of project development.

When will it be available?

We plan to donate the code from our internal repository to the ASF as part of the Incubator proposal.  Also, Hortonworks will ship Tez in the next alpha release of Hortonworks Data Platform 2 (HDP2), based on Apache Hadoop 2.0, very soon to showcase some of the very exciting advances we have made for Apache Hive via Project Stinger.

We are very excited by the reception Tez has received so far, and we do hope you can join us in this initiative via the Apache Software Foundation project to make Hadoop better!

The Stinger Initiative: Making Apache Hive 100 Times Faster

 

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface has become the de facto SQL interface for Hadoop.  Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers either directly or though a broad ecosystem of existing BI tools that rely on this key proven interface.  The who’s who of business analytics have already adopted Hive.

Apache Hive was originally built for large-scale operational batch processing and it is very effective with reporting, data mining and data preparation use cases.  These usage patterns remain very important but with widespread adoption of Hadoop, the enterprise requirement for Hadoop to become more real time or interactive has increased in importance as well. At Hortonworks, we believe in the power of the open source community to innovate faster than any proprietary offering and the Stinger initiative is proof of this once again as we collaborate with others to improve Hive performance.

So, What is Stinger?

Enabling Hive to answer human-time use cases (i.e. queries in the 5-30 second range) such as big data exploration, visualization, and parameterized reports without needing to resort to yet another tool to install, maintain and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

To this end, we have launched the Stinger Initiative, with input and participation from the broader community, to enhance Hive with more SQL and better performance for these human-time use cases. All the while, HiveQL remains the same before and after these advancements so it just gets better. And in keeping with the ecosystem of existing tools, it is complementary to best-of-breed data warehouses and analytic platforms.

  • stingerRoadFirst, we are making Hive a more suitable tool for the decision support queries people want to perform on Hadoop.  This includes adding analytics features like the OVER clause, support for subqueries in WHERE, and aligning Hive’s type system more with the standard SQL model.
  • Second, we are optimizing Hive’s query execution plans and based on our initial changes, we have already seen query time drop by 90% on some of our test queries. We are also looking at additional changes inside Hive’s execution engine that we believe will significantly increase the number of records per second that a Hive task can process.
  • Third, we have introduced a new columnar file format (i.e. ORCFile) within the Hive community to provide a more modern, efficient, and high performing way to store Hive data.
  • And lastly, we’ve introduced a new runtime framework, called Tez, which aims to eliminate Hive’s latency and throughput constraints that result from its reliance on MapReduce. Tez optimizes Hive job execution by eliminating unnecessary tasks, synchronization barriers, and reads from and write to HDFS.  This optimizes the execution chain within Hadoop and drastically speeds Hive’s workload processing.

All of these modifications to Hive are underway in the open and an initial preview will be available in advance of Hadoop Summit Amsterdam in March.

Embrace the community, Embrace Hive…

A diverse group of individuals within the Hive community are collaborating on these efforts. As part f the community, a wide group of people contributed to this effort, including resources from SAP, Microsoft, Facebook and Hortonworks.

Harish Butani from SAP has led an effort to add analytics and windowing functions to Hive.  This will add the OVER clause for use with existing aggregate functions as well as adding analytics functions like RANK and NTILE and windowing functions like LEAD and LAG; you can see this work at HIVE-896.  Namit Jain from Facebook has been spending a lot of time lately optimizing Hive’s query execution planning so that it performs joins and other operations more efficiently and with less need for hints from the user.  Hortonworks engineers have been collaborating on these and other community efforts to improve Hive.

Owen O’Malley, a Hortonworks co-founder and early Hadoop developer, has been working with Facebook on the new ORCFile in order to greatly improve performance when Hive is reading, writing, and processing data; you can see this work at HIVE-3874. We are also working on farther reaching changes and optimizations such as reworking Hive’s operators to process records in blocks of a thousand or more and thus be much more efficient than it is today.

We believe the performance changes we are making today, along with the work being done in Tez will transform Hive into a single tool that Hadoop users can use to do report generation, ad hoc queries, and large batch jobs spanning 10s or 100s of terabytes.

Why reinvent the wheel?

Go to page:12345...10...Last »