July 24, 2015

Introducing Availability of HDP 2.3 – Part 2

On July 22nd, we introduced the general availability of HDP 2.3. In part 2 of this blog series, we explore notable improvements and features related to Data Access.

We are especially excited about what these data access improvements mean for our Hortonworks subscribers.

Russell Foltz-Smith, Vice President of Data Platform at TrueCar, summed up the impact of data access on his business with earlier versions of HDP, and his enthusiasm for the innovation in this latest release:

“TrueCar is in the business of providing truth and transparency to all the parties in the car-buying process,” said Foltz-Smith. “With Hortonworks Data Platform, we went from being able to report on 20 terabytes of vehicle data once a day to doing the same every thirty minutes, even as the data grew to more than 600 terabytes. We’re excited about HDP 2.3.”

Register for Webinar on HDP 2.3

SQL on Hadoop

SQL is the Hadoop user community’s most popular way to access data, and Apache Hive is the de facto standard for SQL on Hadoop. I spoke with many of our customers at Hadoop Summit in San Jose, and a recurring theme emerged: they asked us to push harder toward compliance with the SQL:2011 analytics standard.

While we started with HiveQL, a subset of the functions available in ANSI standard SQL, these requests clearly highlight the need to broaden the SQL semantics available in Hive.

In fact, one of the more satisfying, if not surprising, comments we heard had to do with performance: the improvements made over the past few years through the Stinger Initiative have made such a significant difference that additional performance boosts can wait until SQL breadth is improved.

As organizations move to Hive and Hadoop, they do not want to perform a “SQL rewrite” of existing applications being ported onto Hadoop; the effort to reshape queries and re-test them is expensive. With that in mind, Apache Hive 1.2 was released in late May and, with HDP 2.3, it further simplifies SQL development on Hadoop with new SQL features.
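To make “no SQL rewrite” concrete, here is a minimal sketch of submitting an analytic window-function query to HiveServer2 unchanged from Python. The third-party PyHive client, the connection details, and the vehicle_sales table are assumptions for illustration, not part of the release:

```python
# A minimal sketch, assuming the third-party PyHive client
# (pip install pyhive) and a hypothetical "vehicle_sales" table.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000)
cursor = conn.cursor()

# A window-function query of the kind that analytic applications
# expect to run as-is, without a "SQL rewrite".
cursor.execute("""
    SELECT make, model, sale_price,
           AVG(sale_price) OVER (PARTITION BY make) AS avg_price_for_make
    FROM vehicle_sales
""")
for row in cursor.fetchall():
    print(row)
```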

In addition, Hive continues to become more reliable and more scalable with:

  • Cross-Datacenter Replication powered by a direct integration between Hive and Falcon (FALCON-1188)
  • Grace Hash Join (HIVE-9277), which lets you use high-performance, memory-intensive hash joins without worrying about queries failing because they run out of memory
  • Vectorized Hash Join (HIVE-10072), which improves hash join performance by up to 5x
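For readers who want to experiment, these join improvements are governed by session-level Hive settings. Here is a hedged sketch, reusing the PyHive connection style from above; the property names are the Hive 1.2-era ones and should be verified against your version’s documentation:

```python
# A hedged sketch: property names should be verified for your Hive version.
from pyhive import hive

cursor = hive.connect(host="hiveserver2.example.com", port=10000).cursor()
cursor.execute("SET hive.vectorized.execution.enabled=true")   # vectorized execution (HIVE-10072 builds on this)
cursor.execute("SET hive.mapjoin.hybridgrace.hashtable=true")  # grace hash join (HIVE-9277)
```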

But, most importantly, we’ve added a number of new tools to make Hive easier to use, deploy and administer:

  • Hive Guided Configs in Ambari 2.1 simplify Hive setup and tuning.
  • The Hive View lets you develop and run queries directly in your web browser.
  • The integrated Tez Debugging View gives you detailed insight into jobs, helping you optimize and tune queries.

We will continue our focus on SQL breadth to help customers ease the transition of their existing analytic applications onto HDP and to make that transition as simple as possible.

Spark 1.3.1

HDP 2.3 includes support for Apache Spark 1.3.1. The Spark community continues to innovate at an extraordinarily rapid pace. Given our leadership in Open Enterprise Hadoop, we are eager to provide our customers with the latest and most stable versions of the various Apache projects that make up HDP.

We focused the bulk of our testing on Spark 1.3.1 to ensure that its features and capabilities provide the best experience on Apache Hadoop YARN. The Spark community released Spark 1.4.1 just last week. While it provides additional capabilities and improvements, we plan to test and harden 1.4.1, fixing any issues, before we graduate the technical preview version of Spark to general availability in HDP.

Some of the new features of the Spark 1.3.1 release are:

  • DataFrame API (Tech Preview)
  • ML Pipeline API in Python
  • Direct Kafka support in Spark Streaming
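As a quick taste of the first item, here is a minimal PySpark sketch of the DataFrame API; the data and column names are toy examples:

```python
# A minimal sketch of the Spark 1.3 DataFrame API from PySpark.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameSketch")
sqlContext = SQLContext(sc)

# Build a DataFrame from an RDD of (make, price) tuples.
rdd = sc.parallelize([("Honda", 21500), ("Ford", 27300), ("Honda", 23900)])
df = sqlContext.createDataFrame(rdd, ["make", "price"])

# Declarative, optimizer-friendly operations instead of hand-written RDD code.
df.groupBy("make").avg("price").show()
```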

Spark is a great tool for data science. It provides data-parallel machine learning (ML) libraries and an ML Pipeline API that make machine learning across all the data easier and deliver insights faster.
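Here is a condensed sketch of that Pipeline API in Python, following the tokenize, hash, and classify pattern from the Spark documentation; the training data is a toy example:

```python
# A condensed sketch of the ML Pipeline API (new in Spark 1.3).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

sc = SparkContext(appName="PipelineSketch")
sqlContext = SQLContext(sc)

# Toy training data: (id, text, label).
training = sqlContext.createDataFrame(
    sc.parallelize([(0, "spark storm kafka", 1.0), (1, "cats and dogs", 0.0)]),
    ["id", "text", "label"])

# Chain feature extraction and a classifier into one reusable estimator.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
model = Pipeline(stages=[tokenizer, hashingTF, lr]).fit(training)
```

The pipeline abstraction is what lets the same preprocessing steps run identically at training time and at scoring time.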

We also plan to provide a Notebook experience to make data science easier and more intuitive.

Recently we worked with Databricks to deliver full ORC support in Spark 1.4, and for the foreseeable future we plan to focus our contributions within the Spark community on enhancing its YARN integration, security, operational experience, and machine learning capabilities. It is certainly a very exciting time for Spark and the community as a whole!

Stream Processing

As more devices and sensors join the Internet of Things (IoT), they emit growing streams of data in real time. The need to analyze this data drives adoption of Apache Storm as the distributed stream processing engine. HDP is an excellent platform for IoT — for storing, analyzing and enriching real-time data. Hortonworks is eager to help customers adopt HDP for their IoT use cases, and we made a big effort in this release to increase the enterprise readiness of both Apache Storm and Apache Kafka.

Further, we simplified the developer experience by expanding connectivity to other sources of data, including support for data arriving from Apache Flume. Storm 0.10.0 is a significant step forward.

Here is a brief summary of all the stream processing improvements:

  • Enterprise Readiness: Security & Operations
    • Security
      • Addressing Authentication and Authorization for Kafka — including integration with Apache Ranger (KAFKA-1682)
      • User Impersonation when submitting a Storm topology (STORM-741)
      • SSL support for Storm user interface, log viewer, and DRPC (Distributed Remote Procedure Call) (STORM-721)
    • Operations
      • Foundation for rolling upgrades with Storm (STORM-634)
      • Easier deployment of Storm topologies with Flux (STORM-561)
  • Simplification
    • Declarative Storm topology wiring with Flux (STORM-561)
    • Reduced dependency conflicts when submitting a Storm topology (STORM-848)
    • Partial Key Groupings (STORM-637)
    • Expanded connectivity:
      • Microsoft Azure Event Hubs Integration — working in conjunction with Microsoft and a solid demonstration of our continued partnership (STORM-583)
      • Redis Support (STORM-609, STORM-849)
      • JDBC/RDBMS integration (STORM-616)
      • Kafka-Flume integration (FLUME-2242)
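To ground the IoT pipeline sketched above, here is a hedged illustration of the ingest side: a sensor gateway publishing readings to a Kafka topic that a Storm topology would then consume. The kafka-python client, broker address, and topic name are all assumptions for illustration:

```python
# A hedged sketch of publishing IoT sensor readings to Kafka.
import json
import time

from kafka import KafkaProducer  # third-party kafka-python client (an assumption)

producer = KafkaProducer(bootstrap_servers="kafka.example.com:6667")

reading = {"device_id": "thermostat-42", "temp_c": 21.5, "ts": time.time()}
producer.send("sensor-readings", json.dumps(reading).encode("utf-8"))
producer.flush()
```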

Twitter recently announced the Heron project, which claims to provide substantial performance improvements while maintaining 100% API compatibility with Storm. Heron is based on Twitter’s private fork of Storm, taken before Storm was contributed to Apache and before Storm’s underlying Netty-based transport was introduced.

The key point here is that the new transport layer has delivered dramatic performance improvements over the previous ZeroMQ-based transport. The corresponding Heron research paper provides additional details on the other architectural improvements, but the fact that Twitter chose to maintain API compatibility with Storm is a testament to the power and flexibility of that API. Twitter has also expressed a desire to share their experiences and work with the Apache Storm community.

A number of concepts expressed in the Heron paper were already in the implementation stage within the Storm community even before it was published, and we look forward to working with Twitter to bring those and other improvements to Storm. We are also eager to continue our collaboration with Yahoo! for Storm at extreme scale.

While the 0.10.0 release of Storm is an important milestone in the evolution of Apache Storm, the Storm community is actively working on new improvements, both near and long term, continuously exploring the realm of the possible, and helping to accelerate a wide variety of IoT use cases being requested by our customers.

Systems of Engagement that Scale

The concept of Systems of Engagement has been attributed to author Geoffrey Moore. Traditional IT systems have mostly been Systems of Record that log transactions and provide the authoritative source for information. In these kinds of systems, the primary focus is on the business process, not the people involved. As a result, analytics becomes an afterthought that describes and summarizes the transactions and processes into neat reports labeled “Business Intelligence”.

In contrast to Systems of Record, Systems of Engagement are focused on people and their goal is to bring the analytics to the forefront — moving business intelligence from the back-office & descriptive mode into proactive, predictive, and ultimately prescriptive models.


The constantly-connected world powered by the web, mobile and social data has changed how customers expect to interact with businesses. Now they demand interactions that are relevant and personal. To meet this expectation, IT must move beyond the classic Systems of Record that store only business transactions and evolve into the emerging Systems of Engagement that understand users and are capable of delivering a context-rich and personalized experience.

Successful Systems of Engagement are those that manage to combine the massive volumes of customer interaction data with deep and diverse analytics. This allows Systems of Engagement to build customer profiles and give users an experience tailored to their needs through personalized recommendations. Of course, that means that Systems of Engagement must scale!

Hortonworks Data Platform gives developers the power to build scalable Systems of Engagement by combining limitless storage, deep analytics and real-time access in one integrated whole, rather than forcing developers to stitch these pieces together by hand.


Of course, all of this starts with HDFS as a massively-scalable data store. On this foundation a wide diversity of analytical solutions has been built, from Hive to Spark to Storm and many more.

Finally, applications need a way to get data out of Hadoop in real time in a highly available way. For this, we have Apache HBase and Apache Phoenix, which allow data to be read from Hadoop in milliseconds using a choice of NoSQL or SQL interfaces.
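As a small illustration of the NoSQL read path, here is a sketch of a millisecond point lookup. The third-party HappyBase client, the Thrift gateway host, and the user_profiles table are assumptions:

```python
# A minimal sketch of a low-latency point read from HBase.
import happybase  # third-party client that talks to the HBase Thrift gateway

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_profiles")

# Point lookups like this return in milliseconds, which is what makes
# HBase a fit for serving Systems of Engagement.
row = table.row(b"user:12345")
print(row.get(b"profile:last_seen"))
```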

HBase development continues to focus on the key attributes of scalability, reliability and performance. Notable new additions in HDP 2.3 include:

  • Upgraded to Apache HBase 1.1.1.
  • API Stability: HBase 1.0+ stabilizes APIs and guarantees compatibility with future releases.
  • Performance: Multi-WAL substantially improves HBase write performance.
  • Multi-Tenancy: Provision one cluster for multiple apps with multiple queues and IPC throttling controls.
  • More reliable cluster scale-out.

Apache Phoenix is an ANSI SQL layer on HBase that makes developing big data applications much easier. With Phoenix, complex logic like joins is handled for you, and performance is improved by pushing processing to the servers. Having a real SQL interface is a key advantage that HBase has over other scalable database options.

Apache Phoenix continues to improve rapidly:

  • Upgraded to Apache Phoenix 4.4.
  • Increased SQL Support: UNION ALL, Correlated Subqueries, Date/Time Functions further simplify application development.
  • Phoenix / Spark connector: Lets you seamlessly integrate advanced analytics with data stored in Phoenix.
  • Custom UDFs: So you can embed your custom business logic in SQL queries.
  • Phoenix Query Server: Lets you query Phoenix from non-Java environments like .NET using a simple web-based protocol (see the sketch after this list).
  • Query Tracing
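To illustrate the Query Server’s promise of non-Java access, here is a hedged Python sketch using the phoenixdb client; the endpoint, tables, and columns are assumptions for illustration:

```python
# A hedged sketch of querying Phoenix through the Phoenix Query Server.
import phoenixdb  # third-party Python client for the Query Server (an assumption)

conn = phoenixdb.connect("http://phoenix-qs.example.com:8765/", autocommit=True)
cursor = conn.cursor()

# Phoenix plans the join and pushes processing down to the HBase servers.
cursor.execute("""
    SELECT u.user_id, u.name, COUNT(*) AS events
    FROM users u JOIN events e ON e.user_id = u.user_id
    GROUP BY u.user_id, u.name
""")
print(cursor.fetchall())
```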

HBase is also unique in that it is a true community-driven open source database, and in 2015 we continue to see vibrant and robust community innovation in both HBase and Phoenix. In addition to strong contributions from Hadoop vendors, we have seen tremendous community contributions from companies such as:

  • Bloomberg
  • Cask
  • eBay
  • Facebook
  • Intel
  • Interset
  • Salesforce
  • Xiaomi
  • Yahoo!

We at Hortonworks thank everyone who contributes to making HBase and Phoenix great.

HDP Search

More and more customers are asking about search with Hadoop, and search is becoming a critical part of a number of our customer deployments. We see HDP Search being deployed in conjunction with HBase and Storm with increasing frequency. In HDP 2.3, HDP Search is powered by Solr 5.2.

Recent security authorization work allows Ranger to protect Solr collections, and authentication enhancements let Solr work seamlessly on a Kerberized cluster. Other critically important optimization work was completed as well, including letting administrators define the HDFS replication factor: previously the index stored on HDFS was roughly twice its original size, but rules can now be defined to control replica creation per shard and collection as desired. In addition, queries return results nearly twice as fast as Solr 4.x.
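For a sense of how applications consume HDP Search, here is a minimal Python sketch, assuming the third-party pysolr client and a hypothetical collection:

```python
# A minimal sketch of querying a Solr collection from Python.
import pysolr  # third-party client (an assumption)

solr = pysolr.Solr("http://solr.example.com:8983/solr/vehicle_listings")

# Full-text query; on a Kerberized, Ranger-protected cluster the request
# would also need to carry authentication credentials.
for result in solr.search("make:Honda", rows=10):
    print(result)
```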

As customer demand for HDP Search increases, it also requires ease of use, enterprise readiness, and simplification. This release has pushed forward on all these fronts. We want to thank our partners at Lucidworks for the close collaboration and engagement on these innovations.

Final Thoughts on Data Access

As you can see, a tremendous amount of work has gone into each of these areas over the past six to eight months. The arrival of all these capabilities broadens organizations’ ability to build new, unique and compelling applications on top of HDP, with YARN at its core. We are truly excited by the possibilities and very thankful for all the contributions from the Apache community that fuel this innovation.
