Category Archives: HCatalog


Hadoop SDK and Tutorials for Microsoft .NET Developers

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is especially interesting: it builds on LINQ, the established technology .NET developers use to access most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on Github, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in much greater depth.

You can read more about the partnership between Hortonworks and Microsoft here, and you can download a preview of HDP for Windows here, or sign up for HDInsight over here. And if you’re hungry for more Hadoop tutorials, grab our own Hortonworks Sandbox here.

Hive/HCatalog – Data Geeks & Big Data Glue

Unstructured data, semi-structured data, structured data… it is all very interesting and we are in conversations about big and small versions of each of these data types every day. We love it…  we are data geeks at Hortonworks. We passionately understand that if you want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it.  Within Hadoop, this critical metadata service is provided by HCatalog.

As a key component of Apache Hive, HCatalog is a metadata and table management system for the broader Hadoop platform. It enables the storage of data in any format regardless of structure. Hadoop can then process both structured and unstructured data and it can store and share information about data’s structure in HCatalog. This capability combined with the ‘schema on read’ nature of Hadoop versus traditional EDW ‘schema on write’ reduces cycle time for data scientists seeking insight as it encourages exploration and discovery on a continuous basis.

Similarly, Hive/HCatalog also enables sharing of data structure with external systems including traditional data management tools. It is the glue that enables these systems to interact effectively and efficiently and is a key component in helping Hadoop fit into the enterprise.

SQL Interface for Hadoop? HCatalog as enabler…

Since 2008, Hive has reigned as the de facto SQL interface for Hadoop, as it provides a relational view of data within Hadoop through a SQL-like language. HCatalog publishes this same interface but abstracts it for data beyond Hive.  It also publishes a REST interface for external use so that your existing tools can interact with Hadoop in the way you expect… via ODBC and JDBC into SQL!

Good for the ecosystem is good for you

HCatalog is intended to give the ecosystem a more general way to interact with Hadoop through SQL. Our partners are building dedicated interfaces on top of this key interaction point to drive a Hadoop strategy within their products.  For instance, Teradata has created SQL-H on top of HCatalog as their default interface to Hadoop, enabling their users to query across this big data resource from existing tools. So now, as the performance enhancements to Hive through the Stinger initiative progress, their tools get better and better.

Hadoop Developer productivity and HCatalog

HCatalog also allows developers to share data and metadata across internal Hadoop tools such as Hive, Pig, and MapReduce. It allows them to create applications without being concerned with how or where the data is stored, and insulates users from schema and storage format changes.  It is a repository for schema that can be referred to in these programming models so that you don’t have to explicitly type your structures in each program. It provides a command line tool for users who do not use Hive to operate on the metastore with Hive DDL statements.  It also provides a notification service so that workflow tools, such as Oozie, can be notified when new data becomes available in the warehouse.
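
As a rough sketch of what this looks like from Pig, the script below loads and stores tables by name through HCatalog, with no path, delimiter, or column list in the script itself. The table names raw_events and clean_events and the column event_type are hypothetical, clean_events is assumed to already exist in the metastore with the same schema as raw_events, and the org.apache.hcatalog.pig package name is the one used by early HCatalog releases (later releases moved these classes under org.apache.hive.hcatalog.pig).

```pig
-- Run with: pig -useHCatalog script.pig
-- (the flag puts the HCatalog jars on Pig's classpath)

-- Hypothetical source table: schema, location and storage format
-- all come from the metastore, not from this script.
raw = LOAD 'raw_events' USING org.apache.hcatalog.pig.HCatLoader();

-- Prints the schema Pig received from HCatalog.
DESCRIBE raw;

-- event_type is an assumed column, used only for illustration.
clean = FILTER raw BY event_type IS NOT NULL;

-- Hypothetical target table; HCatStorer expects it to exist already
-- with columns that match the relation being stored.
STORE clean INTO 'clean_events' USING org.apache.hcatalog.pig.HCatStorer();
```

Because the script names tables rather than files, changing where or how raw_events is stored should not require touching it.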

HCatalog in Use

So how might you use HCatalog? Organizations today are using HCatalog in a variety of ways, but the key uses can be summarized as follows:

  • Enabling the Right Tool for the Right Job
    The majority of heavy Hadoop users do not use a single tool for data processing.  Often users and teams will begin with a single tool:  Hive, Pig, MapReduce, or another tool.  As their use of Hadoop deepens they will discover that the tool they chose is not optimal for the new tasks they are taking on.  Users who start with analytics queries using Hive discover they would like to use Pig for ETL processing or constructing their data models.  Users who start with Pig discover they would like to use Hive for analytics-type queries.  While tools such as Pig and MapReduce do not require metadata, they can benefit from it when it is present.  Sharing a metadata store also enables users across tools to share data more easily.  A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common.  When all these tools share one metastore, users of each tool have immediate access to data created with another tool.  No loading or transfer steps are required.
  • Capture Processing States to Enable Sharing
    When Hadoop is used for analytics, users will discover information with it.  Again, they will often use Hive, Pig and MapReduce to uncover information.  The information is valuable but typically only in the context of a larger analysis.  With HCatalog you can publish results so they can be accessed by your analytics platform via REST.  In this case, the schema defines the discovery. These discoveries are also useful to other data scientists.  Often they will want to build on what others have created or use results as input into a subsequent discovery.
  • Integrate Hadoop with everything
    Hadoop as a processing and storage environment opens up a lot of opportunity for the enterprise; however, to fuel adoption it must work with and augment existing tools.  Hadoop should serve as input into your analytics platform or integrate with your operational data stores and web applications.  The organization should enjoy the value of Hadoop without having to learn an entirely new toolset.  REST services open up the platform to the enterprise with a familiar API and SQL-like language.  Enterprise data management systems use HCatalog to more deeply integrate with the Hadoop platform. By tying in more closely they can hide complexity from users and create a better experience. A great example of this is the SQL-H integration from Teradata Aster. SQL-H queries the structure of data stored in HCatalog and exposes that back through to Aster, enabling Aster to access just the relevant data stored within the Hortonworks Data Platform.

HCatalog is just one of many components of Apache Hadoop and the Hortonworks Data Platform. You can find out more here, including further integration points, and how Hortonworks provides the enterprise rigor to Apache Hadoop.

Imperative and Declarative Hadoop: TPC-H in Pig and Hive

According to the Transaction Processing Performance Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.

TPC-H was implemented for Hive in HIVE-600 and for Pig in PIG-2397 by Hortonworks intern Jie Li. In going over this work, I was struck by how it outlined differences between Pig and SQL.

There seems to be a tendency for simple SQL to provide greater clarity than Pig. At some point, as the TPC-H queries become more demanding, complex SQL seems to have less clarity than the comparable Pig. Let's take a look.

Q1, the pricing summary report, is fairly simple, and a SQL GROUP BY is a good fit:

DROP TABLE lineitem;
DROP TABLE q1_pricing_summary_report;

-- create tables and load data
CREATE EXTERNAL TABLE lineitem (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT, 
    L_LINENUMBER INT, 
    L_QUANTITY DOUBLE, 
    L_EXTENDEDPRICE DOUBLE, 
    L_DISCOUNT DOUBLE, 
    L_TAX DOUBLE, 
    L_RETURNFLAG STRING, 
    L_LINESTATUS STRING, 
    L_SHIPDATE STRING, 
    L_COMMITDATE STRING, 
    L_RECEIPTDATE STRING, 
    L_SHIPINSTRUCT STRING, 
    L_SHIPMODE STRING, 
    L_COMMENT STRING) 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/lineitem';

-- create the target table
CREATE TABLE q1_pricing_summary_report ( 
    L_RETURNFLAG STRING, 
    L_LINESTATUS STRING, 
    SUM_QTY DOUBLE, 
    SUM_BASE_PRICE DOUBLE, 
    SUM_DISC_PRICE DOUBLE, 
    SUM_CHARGE DOUBLE, 
    AVE_QTY DOUBLE, 
    AVE_PRICE DOUBLE, 
    AVE_DISC DOUBLE, 
    COUNT_ORDER INT);

set mapred.min.split.size=536870912;

-- the query
INSERT OVERWRITE TABLE q1_pricing_summary_report 
SELECT 
    L_RETURNFLAG, 
    L_LINESTATUS, 
    SUM(L_QUANTITY), 
    SUM(L_EXTENDEDPRICE), 
    SUM(L_EXTENDEDPRICE * (1-L_DISCOUNT)), 
    SUM(L_EXTENDEDPRICE * (1-L_DISCOUNT) * (1+L_TAX)), 
    AVG(L_QUANTITY),
    AVG(L_EXTENDEDPRICE), 
    AVG(L_DISCOUNT), 
    COUNT(1) 
FROM 
  lineitem 
WHERE 
  L_SHIPDATE<='1998-09-02' 
GROUP BY L_RETURNFLAG, L_LINESTATUS 
ORDER BY L_RETURNFLAG, L_LINESTATUS;

One thing to notice, though, is that compared to Pig we have to specify schemas twice: once for the load, and again for the result. Compare that to Pig, where we specify the schema once upon load, and then implicitly in the Pig code itself:

SET default_parallel $reducers;

LineItems = LOAD '$input/lineitem' USING PigStorage('|') AS (
    orderkey:long, 
    partkey:long, 
    suppkey:long, 
    linenumber:long, 
    quantity:double, 
    extendedprice:double, 
    discount:double, 
    tax:double, 
    returnflag, 
    linestatus, 
    shipdate, 
    commitdate, 
    receiptdate, 
    shipinstruct, 
    shipmode, 
    comment);

SubLineItems = FILTER LineItems BY shipdate <= '1998-09-02';

SubLine = FOREACH SubLineItems GENERATE 
    returnflag, 
    linestatus, 
    quantity, 
    extendedprice, 
    extendedprice * (1-discount) AS disc_price, 
    extendedprice * (1-discount) * (1+tax) AS charge, 
    discount;

StatusGroup = GROUP SubLine BY (returnflag, linestatus);

PriceSummary = FOREACH StatusGroup GENERATE 
    group.returnflag AS returnflag, 
    group.linestatus AS linestatus, 
    SUM(SubLine.quantity) AS sum_qty, 
    SUM(SubLine.extendedprice) AS sum_base_price, 
    SUM(SubLine.disc_price) as sum_disc_price, 
    SUM(SubLine.charge) as sum_charge, 
    AVG(SubLine.quantity) as avg_qty, 
    AVG(SubLine.extendedprice) as avg_price, 
    AVG(SubLine.discount) as avg_disc, 
    COUNT(SubLine) as count_order;

SortedSummary = ORDER PriceSummary BY returnflag, linestatus;

STORE SortedSummary INTO '$output/Q1out';

Things change as the queries get more complex. With the use of temporary tables, the schema creation overhead starts to dominate, and the SQL becomes quite complex. Take a look at Q22, the Global Sales Opportunity Report:

DROP TABLE customer;
DROP TABLE orders;
DROP TABLE q22_customer_tmp;
DROP TABLE q22_customer_tmp1;
DROP TABLE q22_orders_tmp;
DROP TABLE q22_global_sales_opportunity;

-- create tables and load data
create external table customer (
    C_CUSTKEY INT, 
    C_NAME STRING, 
    C_ADDRESS STRING, 
    C_NATIONKEY INT, 
    C_PHONE STRING, 
    C_ACCTBAL DOUBLE, 
    C_MKTSEGMENT STRING, 
    C_COMMENT STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/customer';

create external table orders (
    O_ORDERKEY INT, 
    O_CUSTKEY INT, 
    O_ORDERSTATUS STRING, 
    O_TOTALPRICE DOUBLE, 
    O_ORDERDATE STRING, 
    O_ORDERPRIORITY STRING, 
    O_CLERK STRING, 
    O_SHIPPRIORITY INT, 
    O_COMMENT STRING) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION '/tpch/orders';

-- create target tables
create table q22_customer_tmp(c_acctbal double, c_custkey int, cntrycode string);
create table q22_customer_tmp1(avg_acctbal double);
create table q22_orders_tmp(o_custkey int);
create table q22_global_sales_opportunity(cntrycode string, numcust int, totacctbal double);

-- the query
insert overwrite table q22_customer_tmp
select 
  c_acctbal, c_custkey, substr(c_phone, 1, 2) as cntrycode
from 
  customer
where 
  substr(c_phone, 1, 2) = '13' or
  substr(c_phone, 1, 2) = '31' or
  substr(c_phone, 1, 2) = '23' or
  substr(c_phone, 1, 2) = '29' or
  substr(c_phone, 1, 2) = '30' or
  substr(c_phone, 1, 2) = '18' or
  substr(c_phone, 1, 2) = '17';

insert overwrite table q22_customer_tmp1
select
  avg(c_acctbal)
from
  q22_customer_tmp
where
  c_acctbal > 0.00;

insert overwrite table q22_orders_tmp
select 
  o_custkey 
from 
  orders
group by 
  o_custkey;

insert overwrite table q22_global_sales_opportunity
select
  cntrycode, count(1) as numcust, sum(c_acctbal) as totacctbal
from
(
  select cntrycode, c_acctbal, avg_acctbal from
  q22_customer_tmp1 ct1 join
  (
    select cntrycode, c_acctbal from
      q22_orders_tmp ot 
      right outer join q22_customer_tmp ct 
      on
        ct.c_custkey = ot.o_custkey
    where
      o_custkey is null
  ) ct2
) a
where
  c_acctbal > avg_acctbal
group by cntrycode
order by cntrycode;

The Pig version is comparatively simple:

SET default_parallel $reducers;

customer = load '$input/customer' USING PigStorage('|') as (
    c_custkey:long,
    c_name:chararray, 
    c_address:chararray, 
    c_nationkey:int, 
    c_phone:chararray, 
    c_acctbal:double, 
    c_mktsegment:chararray, 
    c_comment:chararray);
orders = load '$input/orders' USING PigStorage('|') as (
    o_orderkey:long, 
    o_custkey:long, 
    o_orderstatus:chararray, 
    o_totalprice:double, 
    o_orderdate:chararray, 
    o_orderpriority:chararray, 
    o_clerk:chararray, 
    o_shippriority:long, 
    o_comment:chararray);

customer_filter = filter customer by c_acctbal>0.00 and SUBSTRING(c_phone, 0, 2) MATCHES '13|31|23|29|30|18|17';
customer_filter_group = group customer_filter all;
avg_customer_filter = foreach customer_filter_group generate AVG(customer_filter.c_acctbal) as avg_c_acctbal;

customer_sec_filter = filter customer by c_acctbal > avg_customer_filter.avg_c_acctbal and SUBSTRING(c_phone, 0, 2) MATCHES '13|31|23|29|30|18|17';
customer_orders_left = join customer_sec_filter by c_custkey left, orders by o_custkey;

customer_trd_filter = filter customer_orders_left by o_custkey is null;
customer_rows = foreach customer_trd_filter generate SUBSTRING(c_phone, 0, 2) as cntrycode, c_acctbal;

customer_result_group = group customer_rows by cntrycode;
customer_result = foreach customer_result_group generate group, COUNT(customer_rows) as numcust, SUM(customer_rows.c_acctbal) as totacctbal;
customer_result_inorder = order customer_result by group;

store customer_result_inorder into '$output/Q22out' USING PigStorage('|');

Both Pig and Hive have a place, and their own strengths. It is instructive to compare these identical queries in the two systems, to see where you might want to hand off queries from Hive to Pig. HCatalog facilitates this handoff, as Pig can read from and write to Hive tables directly via HCatLoader and HCatStorer.
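
To make the handoff concrete, here is a sketch of the Q1 Pig script reading the lineitem table that the Hive script above defined, rather than re-declaring its sixteen columns. It assumes that table is registered in the shared metastore, that the script is launched with pig -useHCatalog, and that the loader class lives in the org.apache.hcatalog.pig package (newer releases use org.apache.hive.hcatalog.pig). Field names follow the Hive column names, which the metastore keeps in lower case.

```pig
SET default_parallel $reducers;

-- Schema comes from the Hive metastore: no AS (...) clause,
-- file path or delimiter is needed here.
LineItems = LOAD 'lineitem' USING org.apache.hcatalog.pig.HCatLoader();

SubLineItems = FILTER LineItems BY l_shipdate <= '1998-09-02';

-- Rename the Hive columns so the rest of the script matches the original Q1 code.
SubLine = FOREACH SubLineItems GENERATE
    l_returnflag AS returnflag,
    l_linestatus AS linestatus,
    l_quantity AS quantity,
    l_extendedprice AS extendedprice,
    l_extendedprice * (1 - l_discount) AS disc_price,
    l_extendedprice * (1 - l_discount) * (1 + l_tax) AS charge,
    l_discount AS discount;

StatusGroup = GROUP SubLine BY (returnflag, linestatus);

PriceSummary = FOREACH StatusGroup GENERATE
    group.returnflag AS returnflag,
    group.linestatus AS linestatus,
    SUM(SubLine.quantity) AS sum_qty,
    SUM(SubLine.extendedprice) AS sum_base_price,
    SUM(SubLine.disc_price) AS sum_disc_price,
    SUM(SubLine.charge) AS sum_charge,
    AVG(SubLine.quantity) AS avg_qty,
    AVG(SubLine.extendedprice) AS avg_price,
    AVG(SubLine.discount) AS avg_disc,
    COUNT(SubLine) AS count_order;

SortedSummary = ORDER PriceSummary BY returnflag, linestatus;

-- Still written to an HDFS path, as in the original script.
STORE SortedSummary INTO '$output/Q1out';
```

The only structural change from the original Pig script is the LOAD statement; everything from the GROUP onward is unchanged.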

Hortonworks Sandbox — the Fastest On Ramp to Apache Hadoop

Go from Zero to Big Data in 15 Minutes!

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with the fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self-contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop. The Hortonworks Sandbox provides a powerful combination of hands-on, step-by-step tutorials paired with an easy to use Web interface designed to lower the learning curve for people who just want to explore and evaluate Hadoop, as quickly as possible.

One of our key focus areas is enabling Hadoop as an enterprise-viable platform that is easy to use and consume by our customers and the broader ecosystem. Over the past year or so, we have seen the complex and disjointed experience people face trying to learn Hadoop; the Sandbox is meant to give you the fastest on-ramp to Apache Hadoop. We want the Sandbox to deliver an integrated, easy-to-use, easily updateable learning environment. Ongoing updates to the tutorials are planned, delivering new, interesting hands-on exercises, exploring different features and use cases.

These tutorials are built based on the experience gained training thousands of people in our Hortonworks University Training classes. As we continue to build out the Sandbox, we will provide additional levels of sophistication – think of it as the Hadoop 101, 201 and 301 levels of learning. And, the process of updating the tutorials is easy through the click of the “Update” button, initiating a lightweight download of just the tutorial content.

The Sandbox is a single node implementation of the Hortonworks Data Platform (HDP) 1.2 that behaves just like a normal Hadoop environment, which allows you to add your own datasets in an isolated protected environment to evaluate the use of Hadoop in your own data architectures.

Use the Sandbox to:

  • Explore Hadoop on your own
  • Plan out the integration points of your proof of concept project
  • Prepare for a more complex pilot deployment

When you are ready, you can download and deploy the Hortonworks Data Platform with the confidence that you have thought through exactly how and where Hadoop can help.

What can you expect from us in the coming months with the Hortonworks Sandbox?

  1. Join us for a special launch webinar on February 5, “Go from Zero to Big Data in 15 Minutes”. I will be hosting this webinar with one of our awesome Solution Engineers who will give you a sneak peek at some cool use cases for the Sandbox.
  2. New tutorials released on roughly a monthly basis.
  3. Demos and exercises of the integration with the tools and applications from our ecosystem partners like Teradata, Alteryx, Datameer, and Microsoft. How cool would it be to run Excel on top of a personal Hadoop environment? Well, that’s coming, so check back often.

I’m excited that you will be able to go from Zero to Big Data in 15 Minutes in a simple, easy-to-use fashion. And, I’m eager to hear your feedback – please let me know what you think of the Sandbox, what kinds of tutorials you would like to see and I would love to hear about your creative uses of the Sandbox. Leave your comments on this blog, Tweet out using #hwsandbox, comment in the Sandbox Forum, or email. The Hortonworks Sandbox is free and available for download here.

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and bring big data value in hours.

There is more to this appliance than meets the eye…  it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with.  It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this
This is an engineered solution.  Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL.  This is a great approach but it lacks integration of metadata and metadata exchange.  With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product.  SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata.  Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze.  All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration
In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics.  With this package, these valuable functions can now be applied to big data in Hadoop.  This shortens the time it takes for an analyst to explore and discover value in big data.  And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Lighting up operations
Often overlooked when an organization considers Hadoop is the impact on IT operations.  The operations team is tasked with making sure a cluster is functional.  Well, these guys have countless tools to perform their job, and for Teradata they use Viewpoint and Teradata Vital Infrastructure.  In this release, we have integrated the management and monitoring communications used by Ambari with these monitoring tools. Now, the ops guy has a true single pane of glass to monitor the Teradata environment AND the Hadoop cluster used to provide the big data analytics.

Some details on the appliance
The Teradata Aster Big Analytics Appliance runs on proven Teradata hardware, leverages the most current Intel® processor chip technology, SUSE® Linux operating system, and market-leading enterprise-class storage. It can be configured to store a maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of uncompressed user data for Hadoop.

“The Teradata Aster Big Analytics Appliance offers the faster path from diverse big data acquisition to big insights, and seamlessly delivers these insights to the business owners. Unmatched by any other stack in the industry, it enables organizations to overcome the barriers to big data analytics and provides a high-definition view of the business to optimize operations.”– Scott Gnau, president, Teradata Labs.

This is unique and it ushers in a new approach to big data analytics.

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group)

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & Hortonworks team return in the future!” – Mark Slusar

Thanks Mark, and anytime you would like us to come to the windy city, let us know! For those of you who couldn’t be there, I have a treat for you, the recording!

Thanks Chicago Hadoop Community! Stay Classy!

HCatalog Meetup at Twitter

Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

A central theme was using HCatalog to enable sharing and use of legacy data and diverse formats like TSV, JSON, RCFile, Protobuf, Thrift and Avro, among diverse tools like Pig, Hive, Cascading, SQL-H and JAQL.

A key issue discussed was the mechanics of HCatalog’s integration with Hive as the project develops and matures. Some HCatalog users use Hive, and some do not, but HCatalog relies on the Hive metastore regardless. As usual in open source, each organization has its own set of problems, perspectives and priorities, and the discussion centered on identifying commonalities and finding a shared path forward.

One thing was clear: HCatalog is HOT! An increasing number of organizations are adopting HCatalog for managing data and systems integration around Hadoop.

Meet the Committer, Part One: Alan Gates

Series Introduction

Alan Gates, Founder & Architect, Collectible Trading Card

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communications of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS) we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache Zookeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment, a four-webinar series, on September 12 with “Pig Out to Hadoop” with Alan Gates (twitter:@alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

Get to know Alan in this first installment of our “Meet the Committer” series.

Kim: Tell us about your current role and how you interact with Apache Hadoop projects?

Alan: I wear a number of different hats.  I lead the team at Hortonworks that works on Pig, Hive, and HCatalog.  I was one of the original committers on the Pig project when it started in Apache 5 years ago, and am still an active member of the community.  I am also an active member of the HCatalog project.  As an Apache member and part of the Apache Incubator I mentor HCatalog, Bigtop, and Oozie.  This means I help those projects grow into top-level projects in Apache, mentoring them in the Apache way.

Kim: How did the Pig project come about?

Alan: Pig was started as a project in Yahoo! research.  It was originally referred to simply as “the language”.  One day one of the researchers said, “We need a name for this” and someone said, “How about Pig?”  It stuck.  After Yahoo! users began using Pig it was clear it was valuable.  Yahoo! decided to invest in making it a production quality project.  That’s when Olga Natkovich and I were brought into the project. We open sourced the project via the Apache Incubator, beefed it up to production quality, and started adding new features.

Kim: Can you provide a sneak peek of your presentation, and what do you expect will be the key takeaway for folks attending this webinar?

Alan: I want to focus on a couple of things in the presentation.  One, Pig 0.10 has added some exciting features like UDFs in JRuby and a Boolean data type as well as many language enhancements and performance improvements.  A lot of work is going into Pig now, especially with our six Google Summer of Code students pouring in new features.  I will also talk some about changes we would like to make in Pig to take advantage of new features available in Hadoop 2.0.  I hope the key takeaway will be different for each listener; hopefully it will be something new they did not know about Pig that will help them use it more effectively.

Kim: Who would win in a fight? Piglet or Miss Piggy?

Alan: This one’s easy.  While Piglet was busy trying to explain that he was a very small animal and hence not given to fighting, Miss Piggy would give him one of her feared karate chops and it would all be over.

I hope you will join us on September 12, 2012 @10am PDT / 1pm EDT to “Pig Out to Hadoop” with Alan Gates.

In the next few weeks we will be joined by other committers and Hadoop experts, including: Matt Foley, Mahadev Konar, and Arun C. Murthy. For more information and to register, go here: http://info.hortonworks.com/FutureofHadoopSeries.html

The Data Lifecycle, Part Three: Booting HCatalog on Elastic MapReduce

Series Introduction

This is part three of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re exploring the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

  • Series Part One: Avroizing the Enron Emails. In that post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. The Enron emails are available in Avro format here.
  • Series Part Two: Mining Avros with Pig, Consuming Data with Hive. In part two of the series, we extracted new and interesting properties from our data for consumption by analysts and users, using Pig, EC2 and Hive. Code examples for this post are available here: https://github.com/rjurney/enron-hcatalog.
  • Series Part Three: Booting HCatalog on Elastic MapReduce. Here we will use HCatalog to streamline the sharing of data between Pig and Hive, and to aid data discovery for consumers of processed data.

Hortonworks Data Platform v1.0 Download Now Available

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop 1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and ZooKeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify installation and configuration, to make getting started with Apache Hadoop easier.

Introducing Hortonworks Data Platform v1.0

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore Hortonworks Data Platform, will become the foundation for the next generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering and flowing through their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step toward achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with Hive

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in Hive, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to extract, transform and load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as Hive local mode requires Hadoop local mode, which can be tricky to get working.

Apache HCatalog 0.4.0 Released

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you who are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

The highlights of the 0.4.0 release include:

- Full support for reading from and writing to Hive.
- Support for deeply nested maps, arrays, and structs.
- Switch from StorageDrivers to SerDes. HCatalog no longer supports its own StorageDriver classes for data (de)serialization. Instead it uses Hive’s SerDe classes.
- Addition of JsonSerDe to support reading and writing JSON data.
- The HCatalog binary distribution no longer includes Apache Hive. We now require that Hive first be installed.
- The HCatalog source distribution no longer includes Apache Hive source. It now pulls the required jars via Maven.
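
If the SerDe idea in the highlights above is new to you, here is a minimal conceptual sketch in Python of what a JSON SerDe does: it deserializes a line of JSON text into a row laid out by the table's columns, and serializes a row back to JSON. This is not HCatalog's or Hive's actual API. The schema, the column names and the deserialize and serialize functions are all hypothetical and exist only to illustrate the concept.

```python
import json

# Hypothetical table schema: ordered (column name, Python type) pairs.
# A real SerDe would get this from the table's metadata.
SCHEMA = [("message_id", str), ("subject", str), ("recipients", list)]


def deserialize(line, schema=SCHEMA):
    """Turn one line of JSON text into a row tuple ordered by the schema."""
    record = json.loads(line)
    row = []
    for name, col_type in schema:
        value = record.get(name)  # a missing field becomes None (null)
        if value is not None and not isinstance(value, col_type):
            raise TypeError(f"column {name!r} expected {col_type.__name__}")
        row.append(value)
    return tuple(row)


def serialize(row, schema=SCHEMA):
    """Turn a row tuple back into one line of JSON text."""
    record = {name: value for (name, _), value in zip(schema, row)}
    return json.dumps(record, sort_keys=True)


line = '{"message_id": "1", "subject": "hi", "recipients": ["a@example.com"]}'
row = deserialize(line)
round_tripped = deserialize(serialize(row))
```

The point of the sketch is that the storage format (JSON text here) and the table's row layout are separate concerns, and the SerDe is the piece that translates between them.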

The details of the release can be found here.

~ Alan Gates

Executive Video Series: Introduction to HCatalog

We just added a video to the Hortonworks Executive Video library that features Alan Gates, Hortonworks co-founder and Apache PMC member. In this video, Alan discusses HCatalog, one of the most compelling projects in the Apache Hadoop ecosystem.

HCatalog is a metadata and table management system that provides a consistent data model and schema for users of tools such as MapReduce, Hive and Pig. When you consider that there are often users accessing Hadoop clusters using different tools that independently don’t agree on schema, data types, how and where data is stored, etc., then you can understand the value of having a tool such as HCatalog.
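
To make the shared-metadata idea concrete, here is a toy sketch in Python. It is not HCatalog's API. The Catalog class, the TableInfo record, the two "tool" functions, the table name and the path are all hypothetical. It only shows the shape of the idea: tools ask one catalog for schema, location and storage format instead of each keeping its own copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """Metadata that every tool agrees on for one table (hypothetical)."""
    columns: tuple        # ordered (name, type) pairs
    location: str         # where the data lives
    storage_format: str   # how the data is stored


class Catalog:
    """A toy in-memory stand-in for a shared metadata service."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns, location, storage_format):
        self._tables[name] = TableInfo(tuple(columns), location, storage_format)

    def get_table(self, name):
        return self._tables[name]


def dataflow_load(catalog, table):
    """Stand-in for a data-flow tool: reads path and fields from the catalog."""
    info = catalog.get_table(table)
    return {"path": info.location, "fields": [c for c, _ in info.columns]}


def query_plan(catalog, table, wanted):
    """Stand-in for a query tool: validates columns against the same catalog."""
    info = catalog.get_table(table)
    known = {c for c, _ in info.columns}
    missing = [c for c in wanted if c not in known]
    if missing:
        raise KeyError(f"unknown columns: {missing}")
    return {"scan": info.location, "project": list(wanted)}


catalog = Catalog()
catalog.create_table(
    "enron_emails",
    [("message_id", "string"), ("subject", "string")],
    "/data/enron/emails",   # hypothetical path
    "avro",
)
load = dataflow_load(catalog, "enron_emails")
plan = query_plan(catalog, "enron_emails", ["subject"])
```

Because both functions read from the same catalog entry, changing the table's location or columns in one place changes what every tool sees.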

In this video, Alan does a good job of not only explaining the role of HCatalog, but also laying out the future direction of the project. He talks about improving the integration with HBase, improving information lifecycle management and expanding the HCatalog data model to address the challenges of unstructured data.