Category Archives: Hadoop Ecosystem


An Advance Look at Hadoop Summit

Hadoop Summit is just around the corner and by that, I mean next week! There is still time to register for the conference but please do it soon as the conference is filling up quickly. Today is also the last day that online registration will remain open. After today, you will need to register on-site at the conference itself.

This year’s Hadoop Summit conference, now in its fifth year, promises to be the biggest and best yet. In fact, there are already more people registered for Hadoop Summit 2012 than any other Hadoop conference ever!

I wanted to take this opportunity to share some of the highlights for next week’s conference:

Geoffrey Moore and Other Compelling Keynote Speakers:

Geoffrey Moore, author of “Crossing the Chasm” and “Escape Velocity”, will share his views on “Digitizing the World, the Driving Force Behind Apache Hadoop’s Adoption Life Cycle”. You will also hear from other industry luminaries, who will share their vision for where Apache Hadoop is going and how it is destined to become the foundation for the next generation enterprise data platform.

Read More

Balancing Community Innovation and Enterprise Stability

Having worked at JBoss and Red Hat from 2004 to 2008 and SpringSource and VMware from 2008 to 2011, I’ve been focused on the world of open source software for a long while. I’ve been blessed to be able to serve enterprise customer needs with high quality open source software such as JBoss Application Server, Hibernate, Drools, Apache Web Server, Apache Tomcat, Spring … and now Apache Hadoop.

As specific open source technologies mature and their use becomes mainstream, it becomes increasingly important to understand and communicate the balancing act that needs to happen between community innovation and enterprise stability.

Community innovation needs to have a fast pace, where “ship early and often” is a key tenet. Open source projects need to visibly improve and keep innovating if they are to attract a vibrant following. As an open source project’s community grows, its members will expect big improvements and will be fine with early, buggy releases. After all, that’s part of the process.

Read More

The Data Lifecycle, Part Two: Mining Avros with Pig, Consuming Data with HIVE

Series Introduction

This is part two of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

Part one of this series is available here.

Code examples for this post are available here: https://github.com/rjurney/enron-hive.

In the last post, we used Pig to Extract-Transform-Load a MySQL database of the Enron emails to document format and serialize them in Avro. Now that we’ve done this, we’re ready to get to the business of data science: extracting new and interesting properties from our data for consumption by analysts and users. We’re also going to use Amazon EC2, as HIVE local mode requires Hadoop local mode, which can be tricky to get working.

Read More

The Data Lifecycle, Part One: Avroizing the Enron Emails

Series Introduction

This is part one of a series of blog posts covering new developments in the Hadoop pantheon that enable productivity throughout the lifecycle of big data.  In a series of posts, we’re going to explore the full lifecycle of data in the enterprise: Introducing new data sources to the Hadoop filesystem via ETL, processing this data in data-flows with Pig and Python to expose new and interesting properties, consuming this data as an analyst in HIVE, and discovering and accessing these resources as analysts and application developers using HCatalog and Templeton.

The Berkeley Enron Emails

In this project we will convert a MySQL database of Enron emails into Avro document format for analysis on Hadoop with Pig. Complete code for this example is available here on GitHub.

Email is a rich source of information for analysis by many means. During the investigation of the Enron scandal of 2001, 517,431 messages from 114 inboxes of key Enron executives were collected. These emails were published and have become a common dataset for academics to analyze document collections and social networks. Andrew Fiore and Jeff Heer at UC Berkeley have cleaned this email set and provided it as a MySQL archive.

Read More

Big Data Refinery Fuels Next-Generation Data Architecture

Since joining Hortonworks at the beginning of the year, a question I’ve heard over and over again is “What is Apache Hadoop and what is it used for?”

There’s clearly a lot of hype [and confusion] in this emerging Big Data market, and it feels as if each new technology, as well as each existing one, is pushing the meme of “all your data are belong to us”. It is great to see the wave of innovation occurring across the landscape of SQL, NoSQL, NewSQL, EDW, MPP DBMS, Data Marts, and Apache Hadoop (to name just a few), but enterprises and the market in general can use a healthy dose of clarity on just how to use and interconnect these various technologies in ways that benefit the business.

In my post entitled 7 Key Drivers for the Big Data Market, I asserted that the Big Data movement is not only about the classic world of transactions, but it factors in the new(er) worlds of interactions and observations. This new world brings with it a wide range of multi-structured data sources that are forcing a new way of looking at things.

Read More

Record Support for Hadoop Summit

In case you didn’t see the news today, Hadoop Summit announced record ecosystem support for this year’s conference. The original and world’s largest Apache Hadoop conference, now in its fifth year, is being sponsored this year by more than 40 traditional and open source software and services companies.

Hortonworks and our co-host Yahoo! would like to thank the following companies for helping to make Hadoop Summit possible:

Read More

Hadoop Observations from the U.K.

As part of Big Data Week, Dan Harvey of the London Hadoop User Group organised an afternoon session for the user group, which we were glad to sponsor, along with Canonical and Facegroup. I had the pleasure of presenting my view of the current and future status of Apache Hadoop to an audience that ranged from those curious about Hadoop to heavy users.

Every talk of the day was excellent, from the use cases by DataSift, Mendeley and MusicMetric, to the talk by Francine Bennett of MastodonC on the CO2 footprint of different cloud computing infrastructures, including a live dashboard on the current CO2/hour of many cloud infrastructure sites.

In my discussions with attendees, I was impressed by how broadly Hadoop is starting to be adopted in the U.K. There is adoption from “pure data” companies like Mendeley, DataSift, MusicMetric and Last.fm, as well as media companies and financial organisations. London is a centre of finance and data and as such, from a Hadoop perspective, it is a source of data waiting to be stored and mined.

Read More

Hadoop Summit Community Choice

As I first mentioned when we announced Hadoop Summit 2012, we are focused on making Hadoop Summit the preeminent conference for the Apache Hadoop community. Today I’m pleased to tell you about Community Choice, a public online voting system that enables the entire Apache Hadoop community to have a say in the sessions chosen for Hadoop Summit. Anybody can vote and the top vote getters in each track will automatically be included in the Hadoop Summit agenda.

One of the things you will notice when you vote is the large number of abstracts that were submitted for the conference. In fact, there were 267 submissions for Hadoop Summit, more than double the number of submissions from last year’s highly successful event. There are six tracks, each of which has a wide selection of compelling topics. Another interesting fact is that there were submissions from 120 different organizations (companies, universities and government agencies). It’s becoming even clearer that Apache Hadoop is having a significant impact in the data industry.

In addition to Community Choice, there is also a content selection committee in place that will identify the other sessions for Hadoop Summit. This is also a community effort. The content selection committee is made up of 36 leaders from the ecosystem representing 27 different organizations (vendors, end users and universities). The committee is hard at work reviewing sessions and we expect to be able to publish the final agenda before the end of March.

Please remember to vote in the Community Choice process. If you ever wanted to have input into a conference, this is your chance. Voting ends March 20th, so please vote today.

~E14

Open Source Data Integration for Apache Hadoop

Today we announced an important strategic partnership with Talend, provider of the world’s most popular open source data integration platform. This is another win for both Hortonworks customers and the larger Apache Hadoop community. There were two key aspects of the announcement that I wanted to highlight:

Talend releases Talend Open Studio for Big Data

Based upon Talend’s very popular open source data integration platform, Talend Open Studio for Big Data adds connectors for HDFS, HBase, Pig, Sqoop and Hive. It allows organizations to move data into and out of Hadoop much more easily. It also leverages the MapReduce architecture to generate native Hadoop code and run data transformations directly inside Hadoop, in a highly scalable fashion. Talend Open Studio for Big Data will also be released with Apache licensing, which is a good match for the Apache Hadoop community.

Read More

Extending Apache Hadoop to Millions of New Microsoft Users

Today we announced that we were delivering on our earlier promise to help Microsoft bring Apache Hadoop to Windows. I’m pleased to share that Microsoft, with our collaboration and guidance, has now submitted a series of patches to Apache aimed at overcoming the challenges of running Apache Hadoop in Windows Server environments.

These patches, once vetted and approved by the community, will become part of the core Hadoop code base. They will also become available in the two major Apache Hadoop branches: hadoop-1.0 (the current stable branch, which is available as part of Hortonworks Data Platform v1.0) and hadoop-0.23 (the next generation of Apache Hadoop, which will be available as part of Hortonworks Data Platform v2.0).

Read More

The Importance of the Teradata & Hortonworks Partnership

Hortonworks and Teradata announced a strategic relationship today that includes joint go-to-market and development work to more closely integrate Hortonworks Data Platform with the Teradata Analytical Ecosystem. I wanted to take the opportunity to highlight this important partnership and share my thoughts on why this is an important milestone for Hortonworks and the larger Apache Hadoop community.

As somebody who has been heavily involved in the development of Apache Hadoop for six years and counting, it’s personally exciting to see Hadoop entering a new phase of adoption. Hadoop has been heavily used in organizations such as Yahoo!, Facebook, LinkedIn and other large web properties since 2006. Over the past couple of years, we’ve seen a surge in the number of organizations testing Hadoop in proof-of-concept or pilot projects but it hasn’t yet reached massive adoption in production in the enterprise.

Read More

Hadoop Summit 2012 is Coming

Hi Folks,

I’m happy to report that Hadoop Summit will be back for its 5th year. This year, Hortonworks and Yahoo! are jointly hosting the conference, which will take place on June 13th and 14th at the San Jose Convention Center.

This year’s event promises to be bigger and better than ever. We have extended the conference to a second day, added session tracks and expect to showcase even more compelling and useful presentations. You will be really impressed when you see what we have planned.

Read More

Apache Hadoop Meets Informatica Data Parsing

As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale processing of data. We want users and organizations to spend their time on analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard.

For those new to Apache Hadoop, MapReduce is a parallel computing framework for processing large volumes of data. It deals with the four V’s of big data (as Forrester described) that present challenges to existing data systems, namely: volume, velocity, variety and variability. Together with the Hadoop Distributed File System (HDFS) and a handful of other important Apache Hadoop projects, it provides a massively scalable and highly reliable platform for storing, processing, managing and ultimately analyzing the ever-increasing data coming not only from transactional systems but also unstructured data in the form of server logs, customer interaction records, social media updates, email, PDFs, CDRs and so forth.
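For readers who want a concrete picture of the programming model, here is a minimal single-process sketch of the map, shuffle and reduce steps in plain Python. It is an illustration only and does not use Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are assumptions made for this example, and a real job would run these steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (token, 1) pair for every token in one input line."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence within a single process."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical server-log lines, used only to exercise the sketch.
records = ["GET /index.html", "GET /about.html", "POST /login"]
counts = run_job(records)
```

Because each call to `map_phase` looks at one record and each call to `reduce_phase` looks at one key, the work can be split across machines without the functions changing, which is the property that lets the model scale.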

Read More

The Why’s Behind the Microsoft and Hortonworks Partnership

If, when we started building an Apache Hadoop team at Yahoo!, someone had told me that in the future we would partner with Microsoft to improve Hadoop’s performance on Windows, I would have found the prediction hard to believe. The first time a Microsoft executive suggested that they would like to work with us to improve Apache Hadoop, I told them I found their proposal “mind-bending”. I also told them that if we could do it the right way, I liked the idea. Our core mission is to bring Apache Hadoop to the widest possible user base, and Windows and SQL Server have very large user bases.

Why is adding a fraction of the Microsoft Windows, Azure and SQL Server user bases to the Hadoop community a good thing for Apache Hadoop? Microsoft technology is used broadly across enterprises today. Ultimately, open source is all about community building. A growing user community feeds a virtuous cycle. More users mean more visibility for the project. Their successes fuel the adoption of the project by more users. More users mean more folks who will ultimately become contributors or committers. This makes the code evolve more quickly, which allows it to satisfy more use cases and hence attract more users, which further drives the project forward. As the number of users and developers grows, more companies will decide that they can build hardware, tools, applications and services for Apache Hadoop users. Growth of the ecosystem allows more users to solve more problems with Apache Hadoop, driving further growth. Feeding this virtuous cycle is what Hortonworks is all about.

Read More

Bringing Apache Hadoop to Windows

We are very excited to enter into a strategic relationship with Microsoft to help bring Apache Hadoop to Windows customers. We are equally pleased that Microsoft will also work closely with the Hadoop community and propose contributions back to the Apache Software Foundation and the Hadoop project.

Hortonworks will provide Microsoft with important Hadoop support and training that will help accelerate the delivery of Apache Hadoop for Windows Server and Windows Azure, including insight into feature roadmap and designs, feedback on code reviews and regression and acceptance testing.

As stated today by Ted Kummert, Microsoft Corporate Vice President, in the Microsoft press release from the SQL PASS Conference:

“Microsoft is committed to helping customers manage any data, any size, anywhere with the SQL Server data platform, Windows Server and Windows Azure.  Hortonworks has a rich history in leading the design and development of Apache Hadoop. Their experience and expertise in this space helps us accelerate our delivery of our Hadoop based distribution on Windows Server and Windows Azure while maintaining compatibility and interoperability with the broader ecosystem.”

Hortonworks and Microsoft share a common vision of making Apache Hadoop easier to use and consume. Microsoft’s commitment to Apache Hadoop further broadens the Apache Hadoop ecosystem, which is essential to accelerating its adoption in the enterprise.

Our partnership with Microsoft is the first of many to come. If you are interested in partnering with us, be sure to contact us.
