Category Archives: Hortonworks Topics


Meet the Committer: Mahadev Konar

We had another amazing turnout for our Ambari webinar with Matt Foley a couple of weeks back. This series is meant to educate Hadoop enthusiasts and help them gain a better understanding of the value of Hadoop, and I think we’re on the right track. If you missed our last two webinars (Pig and Ambari) or would like a refresher, you can find the recordings here: https://hortonworks.com/webinars/

We’re starting the third installment of the “Future of Apache Hadoop” series next Wednesday on “Scaling Apache Zookeeper to the Next Generation Applications” with Mahadev Konar (@mahadevkonar), Hortonworks co-founder and a core contributor and PMC member of Apache ZooKeeper.

Get to know Mahadev in this third installment of our “Meet the Committer” series.

Kim: Tell us about your current role and how you interact with Apache Hadoop?

Mahadev: Currently I am leading the effort on Apache Ambari. I have spent the last 5 to 6 years of my life working on Apache Hadoop and its ecosystem.

Kim: How did the ZooKeeper project come about?

Mahadev: Apache ZooKeeper was started by a couple of my colleagues in research, Flavio and Ben, both brilliant researchers from Yahoo! (Ben has since moved on to a different opportunity). I worked with them from the early days of ZooKeeper. We first open sourced ZooKeeper on SourceForge and later moved it to become a subproject of Hadoop.

Kim: Can you provide a sneak peek of your presentation and tell us what you expect the key takeaway will be for folks attending this webinar?

Mahadev: I’ll be going through a couple of use cases for Apache ZooKeeper and a basic tutorial on what ZooKeeper is. The talk will also focus on the upcoming features in Apache ZooKeeper.

If you haven’t already, register now and join us next Wednesday (October 17, 2012) at 10am PDT / 1pm EDT to discuss Apache ZooKeeper: http://info.hortonworks.com/FutureofHadoopSeries.html

Insights from DataWeek: San Francisco

I spent some time at the first ever DataWeek in San Francisco last week.  It is a brand new show and it was very well-run, spread across a few cool spaces with an interesting mix of novice to experienced data professionals.  They had a good blend of labs, speakers, panels and great networking opportunities.  In all, it was great and a big thanks and kudos to the organizers.

I took part in a panel and also presented a three-hour overview of Hadoop.  There were some good questions thrown at the panel, but more interesting was the discussion over the three sessions.  Before each presentation, I ran an informal survey of the room to get a sense of the audience, and there was an even mix of complete novices, those new to Hadoop and experienced practitioners.

Each session had lively discussion and great engagement.  There were three segments to the presentation: Hadoop market overview, Intro to Hadoop, and Hadoop usage patterns.  In general, there were also three key points that the audience really seemed to focus on.

Forest/Trees :: Distribution/Project
There are Hadoop distributions and there is the Apache Hadoop project.  When you are new to this world and learning through all the media, you can get lost in this terminology, and clarifying this point seemed important to some of the DataWeek crowd.

The conversation went a little like this… the Apache Hadoop project comprises MapReduce and HDFS.  Sometimes we refer to this as “core Hadoop” as it is the central focus of a Hadoop project. It provides redundant and reliable storage and distributed processing or compute. In order for Hadoop, the project, to become a more complete data platform, we, the community, have created several related projects that make Hadoop more useful and dependable. When we package these projects (Hive, HBase, Pig, HCatalog, Ambari, ZooKeeper, Oozie, etc…) with core Hadoop, this becomes a “distribution”.

A distribution came about because each project has its own release cycle and getting the right versions together is sometimes difficult.  Also, a distribution will package the projects and provide an installer to make deployment much easier.

Insatiable Thirst for Use Cases
Design Patterns by Gamma et al. is and always will be one of the best developer books written. I like design patterns because they take a lot of data and boil it down to a naturally occurring state.  They make sense of chaos.

In the third hour of our overview, we presented some reusable patterns of use for Hadoop, namely Refine, Explore and Enrich.  With refine, we apply a known process to a set of big data to extract results and use them in a business process.  With explore, we use Hadoop to discover new information that was not attainable before.  Often with explore, we will operationalize findings to be used in the refine pattern.  Finally, with enrich, we use big data to supplement and improve a user experience for an online application.

This session was scheduled for 45 minutes and went the full hour and beyond.  There were a LOT of questions and interactions.  The material was well received by the experienced professionals as it made sense of their projects and for those new to Hadoop it provided a good sense of where to start or how to approach this big data thing.

We Face Challenges
It seemed everyone wants to get started but faces challenges.  There were really three areas of focus in this discussion: acquiring skills, managing a cluster and building a business case. The discussion of the business case and validation of a project was interesting, as some said you should just start with a project and run with it, while others advocated careful planning and a formal process. I guess in the end both sides were right.

It really depends on your org and what it can stomach. I will add my two cents, however…  Hadoop is open source and available to you today, so use it and start addressing all three of the challenges in the immediate future.

As noted, DataWeek was a huge success and I am honored to have taken part in what surely will be a regular event.  Congrats to the organizers on the birth of a new show.

YARN Meetup at Hortonworks on Friday, Oct 12

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve covered YARN before in a four-part series: parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open source and otherwise, such as Storm and S4, are being ported to run on YARN, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status, and its medium-term and long-term roadmap.

Agenda includes:

  • YARN committers from Yahoo will present on current YARN deployments at Yahoo, including lessons learned, stability, etc.
  • Hortonworks YARN committers will talk about upcoming features such as RM Restart, Container Re-use for MR, Multi-resource scheduling etc.
  • Chris Riccomini from LinkedIn will talk about his experiences building new applications on top of YARN.

A WebEx session will be available, so attendees from all over the world can participate. Follow the meetup page for more information and updates to the agenda.

If you would like to add to the agenda, please get in touch with Arun, or leave a comment in the meetup page.

More information is available on meetup.com here: http://www.meetup.com/Hadoop-Contributors/events/85353562/.

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group)

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & the Hortonworks team return in the future!” – Mark Slusar

Thanks Mark, and anytime you would like us to come to the Windy City, let us know! For those of you who couldn’t be there, I have a treat for you: the recording!

Thanks Chicago Hadoop Community! Stay Classy!

InfoQ: Hadoop and Metadata (Removing the Impedance Mis-match)

InfoQ has an article out today on HCatalog by Hortonworks’ own Alan Gates and Russell Jurney.

Apache Hadoop enables a revolution in how organizations process data, with the freedom and scale Hadoop provides enabling new kinds of applications building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has however posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make up the data center as computer?

Check out the article at InfoQ: http://www.infoq.com/articles/HadoopMetadata

Meet the Committer, Part One: Alan Gates

Series Introduction

Alan Gates, Founder & Architect, Collectible Trading Card

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communication of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS), we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache ZooKeeper.

Alan Gates is a leader in our Hadoop education programs. That is why I’m incredibly excited to kick off the next phase of our “Future of Apache Hadoop” webinar series. We’re starting off this segment, a four-webinar series, on September 12 with “Pig out to Hadoop” with Alan Gates (twitter:@alanfgates). Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. Alan is also a member of the Apache Software Foundation and a co-founder of Hortonworks.

Get to know Alan in this first installment of our “Meet the Committer” series.

Kim: Tell us about your current role and how you interact with Apache Hadoop projects?

Alan: I wear a number of different hats.  I lead the team at Hortonworks that works on Pig, Hive, and HCatalog.  I was one of the original committers on the Pig project when it started in Apache 5 years ago, and am still an active member of the community.  I am also an active member of the HCatalog project.  As an Apache member and part of the Apache Incubator I mentor HCatalog, Bigtop, and Oozie.  This means I help those projects grow into top-level projects in Apache, mentoring them in the Apache way.

Kim: How did the Pig project come about?

Alan: Pig was started as a project in Yahoo! research.  It was originally referred to simply as “the language”.  One day one of the researchers said, “We need a name for this” and someone said, “How about Pig?”  It stuck.  After Yahoo! users began using Pig it was clear it was valuable.  Yahoo! decided to invest in making it a production quality project.  That’s when Olga Natkovich and I were brought into the project. We open sourced the project via the Apache Incubator, beefed it up to production quality, and started adding new features.

Kim: Can you provide a sneak peek of your presentation and tell us what you expect the key takeaway will be for folks attending this webinar?

Alan: I want to focus on a couple of things in the presentation.  One, Pig 0.10 has added some exciting features like UDFs in JRuby and a Boolean data type, as well as many language enhancements and performance improvements.  A lot of work is going into Pig now, especially with our six Google Summer of Code students pouring in new features.  I will also talk about changes we would like to make in Pig to take advantage of new features available in Hadoop 2.0.  I hope the key takeaway will be different for each listener; hopefully it will be something new they did not know about Pig that will help them use it more effectively.

Kim: Who would win in a fight? Piglet or Miss Piggy?

Alan: This one’s easy.  While Piglet was busy trying to explain that he was a very small animal and hence not given to fighting, Miss Piggy would give him one of her feared karate chops and it would all be over.

I hope you will join us on September 12, 2012 at 10am PDT / 1pm EDT to “Pig Out to Hadoop” with Alan Gates.

In the next few weeks we will be joined by other committers and Hadoop experts, including: Matt Foley, Mahadev Konar, and Arun C. Murthy. For more information and to register, go here: http://info.hortonworks.com/FutureofHadoopSeries.html

Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache ZooKeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently there is a thirst for knowledge on the future direction for Hadoop related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop Platform, current advances in Apache Hadoop, relevant use-cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

This four-part webinar series is now open for registration, and the schedule will include:

  • Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
    Pig Out on Hadoop
    With: Alan Gates, Hortonworks founder and contributor to Apache Pig and HCatalog projects.
    Register here.

  • Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
    Deployment and Management of Hadoop Clusters with Ambari
    With: Matt Foley, committer and PMC member of the Apache Hadoop Project and member of Technical Staff at Hortonworks.
    Register here.

  • Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
    Scaling Apache Zookeeper for the Next Generation of Hadoop Applications
    With: Mahadev Konar, Hortonworks founder and core contributor and PMC member of Apache ZooKeeper.
    Register here.

  • Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET
    YARN: The Future of Data Processing with Apache Hadoop
    With: Arun C. Murthy, Hortonworks founder and VP of Apache Hadoop at Apache Software Foundation, the lead of the MapReduce project and YARN.
    Register here.

For more information, please register.

Previous webinars on “The Future of Apache Hadoop” are available here.

A press release is available here.


Pig Performance and Optimization Analysis

Introduction

In this post, Hortonworks Intern Jie Li talks about his work this summer on performance analysis and optimization of Apache Pig. Jie is a PhD candidate in the Department of Computer Science at Duke University. His research interests are in the area of database systems and big data computing. He is currently working with Associate Professor Shivnath Babu.

Pig Performance Analysis and Optimization

I am proud that I was among the first several interns at Hortonworks, one of the leaders in the Hadoop community. In this post, I want to summarize my project on Pig performance and also share my experience this summer.

I began working on Pig one year ago, when my classmates in CPS216 and I developed the TPC-H benchmark for Pig in order to compare the performance of Pig and Hive. TPC-H (specified here) consists of a set of complex queries and is a well-known benchmark for traditional data warehouses. Hive has used it to develop new features and optimize performance for some time. Our work is available in a paper here.

Although Pig is designed as a data flow language, it supports all the functionalities required by TPC-H; thus it makes sense to use TPC-H to benchmark Pig’s performance. Below is the final result.

You can see that the performance of Hive vs Pig depends on the query. During the process of comparison, we came up with a few best practice rules for writing Pig queries, recorded in PIG-2397. After several iterations of rewriting Pig scripts according to these rules, we managed to make Pig competitive with Hive. However, there are still a few queries for which Pig is apparently slower than Hive (such as Q1) which need further investigation.

I wanted to incorporate these best practice rules into Pig itself, so I continued this project this summer as an intern at Hortonworks. I was excited to do so because Hortonworks is a major contributor to Apache Pig development. With the help of Pig and Hive committers here, I successfully identified some of the bottlenecks that contributed to the performance gap in the benchmark, and was able to implement initial solutions for them.

Pig TPC-H Bottlenecks

  1. Map Aggregation vs. Combiner

    When the Pig TPC-H benchmark was developed, Map Aggregation (PIG-2228) was not available yet. As the first step I applied Map Aggregation to Q1, which is dominated by a group-by clause with only four different groups. This turns out to be very effective: simply enable Map Aggregation and we see more than a 20% speedup. The improvement comes from the advantages of Map Aggregation, which are as follows:

    Combiner                         Map Aggregation
    -----------------------------    --------------------------------
    sort-based                       hash-based
    serialization/deserialization    no serialization/deserialization
    always on                        auto-disable
    blocking                         accumulative
    multiple invocation              one invocation

    However, the current implementation of Map Aggregation is not aggressive enough.

    First, it requires the combiner to be turned on in the hope that if the Map Aggregation is not effective enough, the combiner can further help. But as the combiner hasn’t been able to auto-disable itself, it makes sense to provide separate options for turning on/off Map Aggregation and combiners independently.

    Second, the thresholds for auto-disable are too conservative such that Map Aggregation might easily get disabled. Given that Hive has used Map Aggregation since the very beginning, we can also be confident in its efficacy. I proposed and implemented these changes in PIG-2829. Below is the benchmark result comparing Map Aggregation and combiners for several queries.

    The first query is TPC-H Q1, for which Map Aggregation improves performance by more than 20%. For the other three queries, the group-by keys are varied to achieve different record reduction rates (the number of groups over the number of input records). For example, S-1 means the reduction rate is 1, i.e. the number of groups is the same as the number of input records, so the combiner doesn’t help at all and should be turned off. We can observe the overhead of the combiner in S-1, where Map Aggregation is auto-disabled. For queries with enough reduction, Map Aggregation can achieve better performance.
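The hash-based, auto-disabling behaviour described above can be illustrated with a small sketch. This is not Pig's implementation: the function name `map_side_aggregate`, the `check_after` and `max_ratio` parameters, and their default values are all assumptions made for illustration. The ratio follows the definition used in this post (number of groups over number of input records), so a value near 1 means aggregation is not helping.

```python
def map_side_aggregate(records, check_after=1000, max_ratio=0.5):
    """Sum values per key in a hash table inside the map task.

    Illustrative only: after `check_after` records, the ratio
    groups / records is checked once. If it exceeds `max_ratio`
    (hypothetical threshold), aggregation is switched off and the
    remaining records pass through unchanged.
    """
    partial = {}       # hash table: key -> running sum
    seen = 0
    enabled = True
    output = []
    for key, value in records:
        if not enabled:
            # Aggregation was auto-disabled: emit the record as-is.
            output.append((key, value))
            continue
        partial[key] = partial.get(key, 0) + value
        seen += 1
        if seen == check_after and len(partial) / seen > max_ratio:
            # Too many distinct keys: flush partial sums and stop aggregating.
            enabled = False
            output.extend(partial.items())
            partial.clear()
    # Emit whatever is still held in the hash table.
    output.extend(partial.items())
    return output, enabled


if __name__ == "__main__":
    q1_like = [(i % 4, 1) for i in range(10_000)]   # four groups, like Q1
    s1_like = [(i, 1) for i in range(2_000)]        # every key distinct, like S-1
    for name, data in (("Q1-like", q1_like), ("S-1-like", s1_like)):
        out, enabled = map_side_aggregate(data)
        print(name, "records out:", len(out), "aggregation still on:", enabled)
```

With only four groups the map task emits four partial sums instead of every input record, while the all-distinct input trips the threshold and falls back to pass-through, which is the case where a separate switch for the combiner matters.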

  2. Type Conversion

    Pig has a simple type conversion mechanism: if we declare types in the schema, Pig will immediately do type conversion for all columns that are used in the script. Otherwise, Pig will guess the types and do type conversion as late as possible. It’s clear that we want to take advantage of lazy type conversion. We can easily verify this with a simple query which loads the biggest table in TPC-H and then filters out all records by an always-false condition so we can avoid writing data and focus on the type conversion overhead.

    We can observe from the above result that with lazy type conversion, Pig can save a lot of time loading the data. As a result, one of the best practices we recommended was removing all types in the schema.

    However, when benchmarking TPC-H Q1, we observed that even with Map Aggregation turned on, Pig still took four times as long as Hive. After a bit of profiling, we identified the bottleneck: type conversion, which took half the time of the entire query. The explanation was that when I removed all the types in the schema, Pig guessed some columns should be Integer though they were actually Double. When converting raw data to Integer, Pig has a backup solution: if the conversion throws an exception, it will retry by converting to Double first and then converting back to Integer.

    Therefore, for each such column, Pig went through exception handling for every tuple, which took 10x as much time as a successful conversion, thus dominating the whole query running time. The easiest solution for users is to explicitly declare types for those columns. For Pig itself, we can either change the default type Pig will guess, or replace the exception handling with a lightweight check as implemented in PIG-2835. Below is the benchmark result.
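To make the two conversion strategies concrete, here is a minimal sketch contrasting the exception-driven fallback with a cheap pre-check. It only mirrors the idea described above and is not the code in PIG-2835; the function names `to_int_with_exception` and `to_int_with_check` are hypothetical, and the digit test is just one assumed form such a check could take.

```python
def to_int_with_exception(raw):
    """Try integer parsing first; on failure, retry via float.

    For a column that really holds decimals, every value takes the
    exception path, which is the costly case described in the post.
    """
    try:
        return int(raw)
    except ValueError:
        try:
            return int(float(raw))
        except (ValueError, OverflowError):
            return None   # not a number at all


def to_int_with_check(raw):
    """Inspect the text first and pick the parser directly.

    Values that do not look like plain integers go straight to float
    parsing, so well-formed decimals never raise an exception.
    """
    text = raw.strip()
    looks_integral = text.lstrip("+-").isdigit()
    try:
        return int(text) if looks_integral else int(float(text))
    except (ValueError, OverflowError):
        return None       # malformed input


if __name__ == "__main__":
    for value in ["42", "-7", "12.50", "abc"]:
        print(repr(value), to_int_with_exception(value), to_int_with_check(value))
```

Both functions return the same results; the difference is only in how often the exception path is taken, which is the cost the profiling above pointed at.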

  3. Extra Jobs

    Reducing the number of MR jobs for a given Pig script is always effective for performance optimization. There are still many types of jobs compiled by Pig that can be removed. An extreme example is the Order-By query, which is implemented by three jobs in Pig while Hive only requires one job.

    Note that Hive achieves one job by limiting the Order By to use a single reducer. This can become a bottleneck if you are sorting large data. But even without this limitation, we can also optimize Pig to use fewer jobs. Skew-Join, implemented in a similar way, can also benefit from the same optimization.

    First, the map-only job can be merged into the sample job and the sort job. Pig used to do this in SampleOptimizer but it was broken unintentionally. PIG-2661 tries not only to re-enable this optimization, but also to make it more aggressive so it still works if the map-only job contains operations such as filters.

    Second, the sample job can be safely removed if only one reducer is finally used for the sort job, as the partition file generated by the sample job is not useful at all. It looks like this:

    However, there are some challenges for this second optimization. It needs to be done dynamically, as we need the final number of reducers, which is available only just before submitting the sort job. In addition, it needs to modify the runtime query plan, which is a completely new challenge for Pig. As a first step, I implemented a lightweight solution in PIG-483, which introduces the notion of a SkipJob: a job that will be skipped at runtime. Of course, eventually we need a general framework for dynamic query optimization, which will open a lot of new opportunities for optimizing Pig, such as automatically suggesting the type of join, automatic fail-over, etc. PIG-2784 can serve as a place to discuss more details.
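The reason the sample job is redundant with a single reducer can be seen in a small simulation of a sampled, range-partitioned sort. This is a sketch under stated assumptions rather than Pig's job compiler: `sample_boundaries`, `order_by`, the sample size and the job labels are all hypothetical names chosen for illustration.

```python
import bisect
import random


def sample_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Stand-in for the sample job: pick num_reducers - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    step = len(sample) / num_reducers
    return [sample[int(step * i)] for i in range(1, num_reducers)]


def order_by(keys, num_reducers):
    """Sort keys with range partitioning; return (result, jobs that ran)."""
    jobs = []
    if num_reducers > 1:
        # Several reducers need split points so each gets a key range.
        boundaries = sample_boundaries(keys, num_reducers)
        jobs.append("sample")
    else:
        # One reducer receives every key, so no split points are needed
        # and the sample job can be skipped.
        boundaries = []
    partitions = [[] for _ in range(num_reducers)]
    for key in keys:
        partitions[bisect.bisect_right(boundaries, key)].append(key)
    jobs.append("sort")
    # Each "reducer" sorts its own range; concatenation is globally sorted.
    result = [k for part in partitions for k in sorted(part)]
    return result, jobs


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 2, 8, 6, 4, 0] * 10
    for reducers in (4, 1):
        _, jobs = order_by(data, reducers)
        print("reducers:", reducers, "jobs:", jobs)
```

Note that the decision depends on `num_reducers`, which in the sketch is passed in up front; in Pig that number is only known at runtime, which is why the post describes the optimization as dynamic.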

My Hortonworks Experience

Besides this project, I also developed some bug fixes requested by Pig users such as PIG-2780 or required by the optimization such as PIG-2779. Also I had a chance to experience the exciting moment of releasing our first product, Hortonworks Data Platform, and contributed some tests for that release!

I want to say Hortonworks is definitely a great place to work. There are lots of smart and knowledgeable people here who are all easily approachable, and I enjoyed the time spent with them during lunches, games and parties. I was also amazed by the flexible working environment where we can customize our working schedule and even work from home. I really enjoyed myself this summer, and I’m looking forward to working with them again in the near future.

Hadoop & Big Data Seminar, Coming to a City Near You

Do you want to understand how Apache Hadoop can benefit your business? Do you understand the relationship between Hadoop and your Big Data initiatives? Are you struggling to explain the benefits of Hadoop to your management team?

At Hortonworks, we are constantly being asked by business and executive audiences to explain the use cases, benefits and components of Hadoop. As interest in Big Data and Hadoop grows, so does the pressing demand for a map to creating value and differentiation.

Good news: Hortonworks is hosting a half-day seminar series specifically targeted at IT Managers, Directors, and Executives. The focus of these sessions will be “Big Business Value from Big Data and Hadoop.”

We are thrilled at the reception these events have already garnered and urge you to register before seats are full. The list of cities and dates includes:

  • Seattle – Sept 19
  • Los Angeles – Sept 20
  • Chicago – Sept 25
  • Dallas – Sept 26
  • San Francisco – Sept 27
  • DC – Oct 9
  • New York – Oct 10
  • Boston – Oct 11

REGISTER

We hope to see you there!

The Coming Majority: Mainstream Adoption and Entrepreneurship

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential.   In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate.  Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation.  Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility).  Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

The current situation is similar to that of data processing servers before cloud-based solutions like Amazon’s S3 and Elastic MapReduce (EMR).   In the early 2000s, entrepreneurs had to spend a great deal of time running and maintaining the in-house servers that ran their business.  When cloud-based solutions arrived, they allowed developers to focus on using servers to enhance their business rather than be bogged down by their limitations.  This revolution allowed a small 10-person startup to focus 100% of its attention on innovation and bringing value to its customers rather than on the limitations of the technology. Making the platform simpler and easier to use will have the same effect for big data.

Greater Adoption through Innovation

Enterprise Software

Buoyed by the efforts of the Apache Hadoop community, key enterprise software players have improved access to the platform.  Hadoop platforms like HDP democratize big data by providing easy-to-use and widespread access for the greater community.  Efforts like these help to push the technology past the early adopters to mass adoption markets.  However, companies at this level focus on the invention of the platform.  Sustainable technological growth arises only when companies use that invention in new, unexpected ways.

Business-to-Business (B2B) Applications

Beyond the large players like Yahoo! and Netflix, smaller (often non-Hadoop) operations have sprung up all across the country around the idea of big data.  One well-known example is Splunk, which created its own proprietary platform to process and analyze big data on a large scale for companies that need it.  The benefit of companies like Splunk is their ability to identify desired elements from a variety of sources – machine data, cloud architectures, visual dashboards, and Hadoop – and package their offerings into a single product.

Another, more recent entry is a Durham, NC-based company named EvoApp.  The company has built a big data platform called Bermuda specializing in customer and market intelligence.  Continuing the trend begun by Splunk, it focuses primarily on analytics, while betting that its speedy and accurate runtimes will be a significant differentiator in the marketplace.

Business-to-Consumer Applications

Startups are also working toward using big data to solve difficult problems for the everyday consumer.

One innovative use of big data is with a mobile app called Parker by Streetline. In major cities, locating empty parking spaces can be a major concern for commuters.  City governments and app developers alike are using big data to help car drivers locate available parking spaces more effectively by having modified parking meters broadcast their availability to the targeted servers that are paired with a notification system.

Another, The Climate Corporation, tailors its insurance policies based on weather-related risk factors that could negatively affect or potentially destroy entire crop yields.  The company uses big data to make weather and soil predictions to more intelligently bet against crop failure and issue policies accordingly.  The customer may not know (nor care) how the system works, but recognizes the value in being issued tailored insurance policies based on their personal risk factors.

Limits to Widespread Adoption

Imagine the possibilities of every high school student dreaming of the software possible with Hadoop in much the same way they now do for smartphone apps.  While technology champions are necessary to invent and evangelize young technologies, the real technological boom occurs when mainstream developers get involved and begin to push the limits of the platform.  As more startups innovate using big data technologies, we can look forward to seeing a new majority.

Happy Birthday Hortonworks!

Last week was an important milestone for Hortonworks: our one-year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. I returned to this point often throughout the manifesto. We committed to working with the community to accelerate the development and adoption of Apache Hadoop and we absolutely delivered on this promise. Over the past year, Apache Hadoop released Hadoop-1.0, the most stable line of Apache Hadoop ever. Hadoop-2.0, including the next-generation architectures for both MapReduce and HDFS, was also released in alpha form. Apache Hadoop continues to gain momentum as proven by every important metric (downloads, web traffic, press & analyst coverage, conference and Meetup attendance, etc.). It was a banner year for Apache Hadoop and we are proud to have played an important role in making it happen.

We are “committed to open source” and commit that “all core code will remain open source”. This commitment is as solid today as it was a year ago. All code developed by Hortonworks has been contributed back to open source. In addition to our significant contributions to core Hadoop projects (MapReduce and HDFS), we have also made significant contributions to other Hadoop ecosystem projects including Ambari, HCatalog, Pig and ZooKeeper. We will continue to be a leader in the Hadoop community process and will continue to contribute all of our Hadoop development efforts back into the Apache community development process.

We will “make Apache Hadoop easier to install, manage and use”. This was a key focus for Hortonworks over the past year. We quickly learned that it would be beneficial to the market to offer a Hortonworks distribution of Apache Hadoop that delivered on this promise. Hortonworks Data Platform, which we recently made available to the entire ecosystem, addresses each of these areas. We have included an installer that greatly simplifies the installation process for Apache Hadoop. We included, for the first time, Apache Ambari, which allows organizations to manage and monitor their Hadoop clusters. We also tightly integrated Hortonworks Data Platform with Talend Open Studio for Big Data, which provides a visual design environment for connecting Hadoop with hundreds of enterprise data systems in order to make Hadoop easier to use. The result is a greatly simplified process for organizations that are looking for a pure Apache Hadoop distribution.

We will “make Apache Hadoop more robust”. Again, I’m pleased that we delivered on this promise. We were instrumental in the re-architectures of MapReduce and HDFS to address the enterprise needs of each of these core components. Our team has written a number of blogs and presentations on these topics that I strongly recommend you read if you haven’t already. Among the most significant are the following: NextGen MapReduce presentation, NextGen MapReduce Hits Mainline, Delivering on Hadoop .NEXT, Benchmarking Performance, Apache Hadoop 2.0 (Alpha) Released, Data Integrity and Availability in Apache Hadoop HDFS, An Introduction to HDFS Federation, NameNode HA Reaches an Important Milestone, Snapshots for HDFS, and High Availability and Hadoop 1.0 – Perfect Together. The last post covers the ability to add new HA capabilities to the stable and proven Hadoop-1.0 line.

We will “make Apache Hadoop easier to integrate and extend”. We have made some important advancements in this area that may have gone unnoticed. Much of this work is related to HCatalog, an Apache project that provides a metadata and table management system for Hadoop. We feel strongly that HCatalog is the preferred path for simplifying data sharing between Hadoop and other enterprise data systems and have invested heavily in advancing this project and related APIs for HCatalog. By tightly integrating Talend Open Studio for Big Data, we have also made it much easier for a broader audience to integrate Hadoop with hundreds of existing data systems. We have also formed important partnerships with leaders such as Microsoft and Teradata to ensure that their platforms and applications are tightly integrated and optimized to work with Apache Hadoop.

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. Over the past year, we have made available Hortonworks University, an exceptional Hadoop training program for developers, administrators and analysts; and Hortonworks Services, which leverages the deep domain experience of the Hortonworks technical staff to provide technical support to enterprises, systems integrators and technology vendors. Our training courses, in particular, have been very well received by students who have continually praised our hands-on lab exercises as the best in the industry. We have recently expanded our training schedule, so check it out.

There were certainly many other notable achievements over the past year, including:

  • The Hortonworks team grew significantly and now numbers around 90 people. We are hiring too!
  • We established partnerships with major enterprise software vendors including Microsoft and Teradata that are changing the way Hadoop will be consumed.
  • We hosted the 5th annual Hadoop Summit with great success, rave reviews and more than 2,250 attendees.

As you can see, we are very proud of our accomplishments in our first year. We were also glad to be recognized by Forrester as a leader in the Forrester Wave on Enterprise Hadoop Solutions. Really, how often do companies get recognized as leaders by Forrester in their very first year of existence?

While this blog took a look back at last year, stay tuned for another blog that looks forward to what we have planned for year two.

~ E14

Recap of Hadoop Summit 2012

I wanted to take this opportunity to say thanks to the more than 2,200 attendees, speakers and sponsors that helped to make Hadoop Summit 2012 a great success. There was tremendous buzz throughout the conference, exceeding the excitement levels of all past Hadoop conferences. It’s a great indicator for the future of Apache Hadoop and the broader big data ecosystem.

The content from this conference was outstanding, from the opening keynotes to the last round of breakout sessions. I wanted to thank the track chairs (Abhishek Mehta, Ashish Thusoo, Avik Dey, Ben Reed, Peter Sirota and Val Bercovici) for making the hard decisions that led to such an outstanding agenda. I thought the group did a great job selecting the right mix of technical, use case and best practices sessions for developers, operators and analysts. I would also like to thank the more than 110 speakers for putting in the time and effort to share their Apache Hadoop experiences.

All of the sessions at this year’s conference were recorded and we are in the process of editing these videos for placement on the Hadoop Summit website. We have also posted most of the slides. Simply visit the Sessions page to access the slides and recordings.

I am pleased to announce that all of the keynote session recordings are now available. These include compelling presentations from the following speakers:

Geoffrey Moore (author of “Crossing the Chasm” and “Escape Velocity”)

Scott Burke (SVP, Advertising & Data, Yahoo!)

Dr. Philip Shelley (CTO, Sears)

Scott Gnau (VP and GM of R&D, Teradata)

Shaun Connolly (VP of Corporate Strategy, Hortonworks)

Eric Baldeschwieler (CTO, Hortonworks)

Also, if you have not yet seen the introductory video from Hadoop Summit, I strongly encourage you to watch it now (below). I have heard from quite a few folks that this video got them even more excited about the role they have played in the Apache Hadoop ecosystem.

(click HERE for a full screen version on Vimeo)

On behalf of this year’s co-hosts Hortonworks and Yahoo!, let me again thank everyone for their role in making Hadoop Summit 2012 such a success. Because of the emergence of Apache Hadoop as the foundation of the next generation enterprise data architecture, I have no doubt that next year’s conference will be even bigger and better. I can’t wait.

~ John Kreisa

Hortonworks Data Platform v1.0 Download Now Available

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was only available for evaluation for members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Moving forward and effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop-1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and ZooKeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

Hortonworks @ TheCUBE

By any measure, last week’s Hadoop Summit was a tremendous success. It brought together more than 2,200 people from throughout the Apache Hadoop ecosystem to share Hadoop knowledge, ideas, best practices, and interesting use cases. It was also a great chance for big data vendors to make announcements and demonstrate new and exciting solutions.

For those of you who missed the conference, or missed a particularly interesting presentation, we have some good news. Each of the 90+ keynotes and breakout sessions was recorded, and we will be posting these sessions online at hadoopsummit.org over the coming days once the editing is completed.

In the meantime, I would like to draw your attention to TheCUBE videos featured on SiliconAngle TV. As conference organizers, we were very fortunate to be able to support the team from TheCUBE, including John Furrier (@furrier) and Jeff Kelly (@jeffreyfkelly). They did an outstanding job of streaming interviews with many of the industry thought leaders and providing some excellent insight into the conference happenings for those who could not attend. These sessions are all now available via their website.

My Review of Hadoop Summit 2012

The fifth annual Hadoop Summit drew to a close last week, with over 2200 Hadoopniks in attendance. While there were many innovations demonstrated, for me the best action was about Pig, HCatalog and Hive from Hortonworks and Twitter.

At the Hadoop Summit Pig Meetup, Twitter announced Ambrose, which now includes an excellent graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand your Pig scripts.
