The Hortonworks Blog

Posts categorized by : Performance

This blog post originally appeared here and is reproduced in its entirety here. Part 1 can be found here.

The HBase BlockCache is an important structure for enabling low latency reads. As of HBase 0.96.0, there are no less than three different BlockCache implementations to choose from. But how to know when to use one over the other? There’s a little bit of guidance floating around out there, but nothing concrete.…

The Apache Tez community has voted to release 0.3 of the software.

Apache™ Tez is a replacement of MapReduce that provides a powerful framework for executing a complex topology of tasks. Tez 0.3.0 is an important release towards making the software ready for wider adoption by focussing on fundamentals and ironing out several key functions. The major action areas in this release were

  • Security. Apache Tez now works on secure Hadoop 2.x clusters using the built-in security mechanisms of the Hadoop ecosystem.…
  • This guest post from Eric Hanson, Principal Software Development Engineer on Microsoft HDInsight, and Apache Hive committer.

    Hive has a substantial community of developers behind it, including a few from the Microsoft HDInsight team. We’ve been contributing to the Stinger initiative since it was started early in 2013, and have been contributing to Hadoop since October of 2011. It’s a good time to step back and see the progress that’s been made on Apache Hive since fall of 2012, and ponder what’s ahead.…

    I recently sat down with Owen O’Malley and Carter Shanklin to discuss the dramatic improvements delivered by the Stinger Initiative to version 0.12 of Apache Hive, which is well on its way to being 100x faster than pre-Stinger versions of Hive. That means interactive queries on petabytes of data.

    Owen is one of the original architects of Apache Hadoop and Carter is the Hortonworks product manager focused on Hive. Together, they explain the speed, scale and SQL semantics delivered in Apache Hive v0.12, which is included in Hortonworks Data Platform v2.0.…

    Whether you were busy finishing up last minute Christmas shopping or just taking time off for the holidays, you might have missed that Hortonworks released the Stinger Phase 3 Technical Preview back in December. The Stinger Initiative is Hortonworks’ open roadmap to making Hive 100x faster while adding standard SQL. Here we’ll discuss 3 great reasons to give Stinger Phase 3 Preview a try to start off the new year.

    Reason 1: It’s The Fastest Hive Yet

    Whether you want to process more data or lower your time-to-insight, the benefits of a faster Hive speak for themselves.…

    This post is the seventh in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    In Tez, we recently introduced the support of a feature that we call “Tez Sessions”.…

    With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it’s seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem for not just extremely large-scale, batch queries but also for low-latency, human-interactive queries.

    You can catch me at our session ‘Apache Hive & Stinger: Petabyte Scale SQL, IN Hadoop’ along with Owen and Alan where we’ll be happy to dive into more of the details.…

    I’d like to take a quick moment to welcome Julian Hyde as the latest addition to the Hortonworks engineering team. Julian has a long history of working on data platforms, including development of SQL engines at Oracle, Broadbase, and SQLstream. He was also the architect and primary developer of the Mondrian OLAP engine, part of the Pentaho BI suite.

    Julian’s latest role has been as the author and architect of the Optiq project – an Apache licensed open source framework.…

    This post is the sixth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    Motivation

    Tez follows the traditional Hadoop model of dividing a job into individual tasks, all of which are run as processes via YARN, on the users’ behalf – for isolation, among other reasons.…

    This post is the fifth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    Case Study: Automatic Reduce Parallelism
    Motivation

    Distributed data processing is dynamic by nature and it is extremely difficult to statically determine optimal concurrency and data movement methods a priori.…

    This post is the fourth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    The previous couple of blogs covered Tez concepts and APIs.…

    This post is the third in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing processing of data and edges representing movement of data between the processing.…

    As the original architect of MapReduce, I’ve been fortunate to see Apache Hadoop and its ecosystem projects grow by leaps and bounds over the past seven years.

    Today, most of my time is spent as an architect and committer on Apache Hive. Hive is the gateway for doing advanced work on Hadoop Distributed File System (HDFS) and the MapReduce framework. We are on the verge of releasing major improvements to Apache Hive, in coordination with work going on in Apache Tez and YARN.…

    This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    Overview

    Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing processing of data and edges representing movement of data between the processing.…

    This post is the first in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

    In this post we introduce the motivation behind Apache Tez (http://incubator.apache.org/projects/tez.html) and provide some background around the basic design principles for the project.…

    Go to page:12

    Thank you for subscribing!