Hortonworks Data Platform Version 2.2 represents yet another major step forward for Hadoop as the foundation of a Modern Data Architecture. This release incorporates the last six months of innovation, includes more than a hundred new features, and closes thousands of issues across Apache Hadoop and its related projects.
Our approach at Hortonworks is to enable a Modern Data Architecture with YARN as the architectural center, supported by key capabilities required of an enterprise data platform — spanning Governance, Security and Operations. To this end, we work within the governance model of the Apache Software Foundation contributing to and progressing the individual components from the Hadoop ecosystem and ultimately integrating them into the Hortonworks Data Platform (HDP).
Our investment across all these technologies follows the same pattern.
- VERTICAL: We integrate the projects within our Hadoop distribution with YARN and HDFS so that HDP can span batch, interactive, and real-time workloads, across both open source and other data access technologies. The work we deliver in this release to deeply integrate Apache Storm and Apache Spark within Hadoop is representative of this approach.
- HORIZONTAL: We also ensure the key enterprise requirements of governance, security, and operations can be applied consistently and reliably across all the components within the platform. This allows HDP to meet the same requirements as any other technology in the data center. In HDP 2.2, our work within the Apache Ambari community helped extend integrated operations, and we contributed Apache Ranger (Argus) to drive consistent security across Hadoop.
- AT DEPTH: We deeply integrate HDP with the existing technologies within the data center to augment and enhance existing technologies and capabilities so you can reuse existing skills and resources.
A Comprehensive Data Platform
With YARN as its architectural center, Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. They want SQL, streaming, machine learning, along with traditional batch and more… all in the same cluster. To this end, HDP 2.2 packages many new features: every component is updated, and we have added several key technologies and capabilities.
HDP 2.2 Release Highlights
NEW: Enterprise SQL at Scale in Hadoop
While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL, and Apache Hive is still the de facto standard. While many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate. This release delivers phase 1 of the Stinger.next initiative, a broad, open, community-based effort to improve speed, scale, and SQL semantics.
- Updated SQL Semantics: Hive Transactions with Update and Delete
ACID transactions provide atomicity, consistency, isolation, and durability. This helps with streaming and baseline update scenarios for Hive, such as modifying dimension or fact tables.
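As a sketch of what these semantics look like in practice, the statements below create an ACID-enabled table and run an update and a delete against it. The table and column names are hypothetical; in this release, ACID tables must be bucketed, stored as ORC, and flagged transactional. The HiveQL is composed here as Python strings:

```python
# Hypothetical dimension table set up for Hive ACID operations:
# bucketed, stored as ORC, and marked transactional.
ddl = (
    "CREATE TABLE customer_dim (id INT, name STRING, state STRING) "
    "CLUSTERED BY (id) INTO 4 BUCKETS "
    "STORED AS ORC "
    "TBLPROPERTIES ('transactional'='true')"
)

# The new UPDATE and DELETE statements against that table.
update_stmt = "UPDATE customer_dim SET state = 'CA' WHERE id = 42"
delete_stmt = "DELETE FROM customer_dim WHERE id = 17"
```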
- Improved Performance of Hive with a Cost Based Optimizer
The cost-based optimizer for Hive uses statistics to generate several execution plans and then chooses the most efficient one, based on the system resources required to complete the operation. This represents a major performance increase for Hive.
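In practice, the optimizer is fed by statistics you collect with ANALYZE TABLE, plus a handful of configuration switches. A minimal sketch (the table name is hypothetical):

```python
# Switches that enable the cost-based optimizer and let Hive use
# previously collected statistics when planning queries.
enable_cbo = [
    "SET hive.cbo.enable=true",
    "SET hive.compute.query.using.stats=true",
    "SET hive.stats.fetch.column.stats=true",
    "SET hive.stats.fetch.partition.stats=true",
]

# Collect column-level statistics for a (hypothetical) table so the
# optimizer has costs to compare between candidate plans.
collect_stats = "ANALYZE TABLE customer_dim COMPUTE STATISTICS FOR COLUMNS"
```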
NEW: Data Science within Hadoop with Spark on YARN
Apache Spark has emerged as an elegant, attractive development API that allows developers to rapidly iterate over data via machine learning and other data science techniques. While we have supported Spark as a tech preview for the past few months, in this release we plan to deliver an integrated Spark on YARN, with improved Hive 0.13 integration and ORCFile support, by year-end. These improvements allow Spark applications to easily share and exchange data with the rest of the platform.
NEW: Kafka for processing the Internet of Things
Apache Kafka has quickly become the standard high-scale, fault-tolerant, publish-subscribe messaging system for Hadoop. It is often used with Storm and Spark to stream events into Hadoop in real time, and its applicability to “internet of things” use cases is tremendous.
NEW: Apache Ranger (Argus) for comprehensive cluster security policy
With increased adoption of Hadoop, a heightened requirement for a centralized approach to security policy definition and coordinated enforcement has surfaced. As part of HDP 2.2, Apache Ranger (formerly known as Argus) delivers a comprehensive approach to central security policy administration addressing authorization and auditing. Some of the work we have delivered extends Ranger to integrate with Storm and Knox while deepening existing policy enforcement capabilities with Hive and HBase.
NEW: Extensive improvements to manage & monitor Hadoop
Managing and monitoring a cluster continues to be a high priority for organizations adopting Hadoop. Our completely open approach via Apache Ambari is unique, and we are excited to have Pivotal and HP join Microsoft, Teradata, and other data center leaders in supporting Ambari. HDP 2.2 adds over a dozen new features to help enterprises manage Hadoop; some of the biggest include:
- Extend Ambari with Custom Views
Ambari Views Framework offers a systematic way to plug in UI capabilities that surface custom visualization, management, and monitoring features in the Ambari Web console. A “view” extends Ambari so that third parties can plug in new resource types along with the APIs, providers, and UI to support them. In other words, a view is an application that is deployed into the Ambari container.
- Ambari Blueprints deliver a template approach to cluster deployment
Ambari Blueprints are a declarative definition of a cluster. With a Blueprint, you specify a Stack, the Component layout and the Configurations to materialize a Hadoop cluster instance (via a REST API) without having to use the Ambari Cluster Install Wizard. You can define any stack to be deployed.
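A minimal Blueprint and its companion cluster-creation template might look like the sketch below. The blueprint name, host names, and component layout are hypothetical; the Blueprint is registered via a POST to Ambari's REST API (/api/v1/blueprints/<name>), and the cluster is then materialized with a POST to /api/v1/clusters/<name>:

```python
import json

# Hypothetical Blueprint: a stack definition plus the component layout
# for each host group.
blueprint = {
    "Blueprints": {"blueprint_name": "small-hdp",
                   "stack_name": "HDP", "stack_version": "2.2"},
    "host_groups": [
        {"name": "master", "cardinality": "1",
         "components": [{"name": "NAMENODE"}, {"name": "RESOURCEMANAGER"}]},
        {"name": "workers", "cardinality": "3",
         "components": [{"name": "DATANODE"}, {"name": "NODEMANAGER"}]},
    ],
}

# Cluster-creation template: maps concrete (hypothetical) hosts onto the
# host groups declared in the Blueprint.
cluster_template = {
    "blueprint": "small-hdp",
    "host_groups": [
        {"name": "master", "hosts": [{"fqdn": "master1.example.com"}]},
        {"name": "workers",
         "hosts": [{"fqdn": "worker%d.example.com" % i} for i in (1, 2, 3)]},
    ],
}

payload = json.dumps(blueprint, indent=2)
```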
NEW: Ensure uptime with Rolling Upgrades
In HDP 2.2, the rolling upgrade feature takes advantage of versioned packages, investments in the core of many of the projects, and the underlying HDFS High Availability configuration to let you upgrade your cluster software and restart upgraded services without taking the entire cluster down.
NEW: Automated cloud backup for Microsoft Azure and Amazon S3
Data architects require Hadoop to act like other systems in the data center, and business continuity through replication across on-premises and cloud-based storage targets is a critical requirement. In HDP 2.2 we extend the capabilities of Apache Falcon to establish an automated policy for cloud backup to Microsoft Azure or Amazon S3. This is the first step in a broader vision to enable extensive heterogeneous deployment models for Hadoop.
Value in a Completely Open Approach
Hortonworks is 100% committed to open source and the value provided by an active and open community of developers. HDP is the ONLY 100% open source Hadoop distribution, and our code goes back into open, ASF-governed projects with live and broad communities.
Hortonworks' leadership lies not just in the number of committers but in the depth and diversity of involvement across the numerous open source projects that comprise our distribution. We are architects and builders, and many of our developers are involved across multiple projects, either directly as committers or by partnering with developers across cube walls and across the Apache community. Our investment in Enterprise Hadoop starts with YARN, which allows us to integrate applications vertically within the stack, tying them to the data operating system, and also to apply consistent capabilities for the key enterprise requirements of governance, security, and operations.
A tech preview of HDP 2.2 is available today at hortonworks.com/hdp
Complete List of HDP 2.2 New Features
Apache Hadoop YARN
- Slide existing services onto YARN through ‘Slider’
- GA release of HBase, Accumulo, and Storm on YARN
- Support long running services: handling of logs, containers not killed when AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
- Support for CPU Scheduling and CPU Resource Isolation through CGroups
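As a hedged sketch of the CPU scheduling and isolation features above, the properties below are the relevant configuration knobs; the vcore count is illustrative:

```python
# Configuration properties for CPU scheduling and CGroup-based isolation.
# The vcore value is illustrative; adjust to the node's hardware.
props = {
    # capacity-scheduler.xml: schedule on memory AND CPU
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DominantResourceCalculator",
    # yarn-site.xml: advertise vcores and enforce limits with CGroups
    "yarn.nodemanager.resource.cpu-vcores": "8",
    "yarn.nodemanager.container-executor.class":
        "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor",
    "yarn.nodemanager.linux-container-executor.resources-handler.class":
        "org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler",
}

def to_hadoop_xml(properties):
    """Render a dict as a Hadoop-style configuration file body."""
    items = "".join(
        "<property><name>%s</name><value>%s</value></property>" % (k, v)
        for k, v in sorted(properties.items())
    )
    return "<configuration>%s</configuration>" % items

xml = to_hadoop_xml(props)
```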
Apache Hadoop HDFS
- Heterogeneous storage: Support for archival tier
- Rolling Upgrade (this applies to the entire HDP Stack, including YARN, Hive, and HBase; we now support comprehensive Rolling Upgrade across the HDP Stack)
- Multi-NIC Support
- Heterogeneous storage: Support memory as a storage tier (Tech Preview)
- HDFS Transparent Data Encryption (Tech Preview)
Apache Hive, Apache Pig, and Apache Tez
- Hive Cost Based Optimizer: Function Pushdown & Join re-ordering support for other join types: star & bushy.
- Hive SQL Enhancements including:
- ACID Support: Insert, Update, Delete
- Temporary Tables
- Metadata-only queries return instantly
- Pig on Tez
- Including DataFu for use with Pig
- Vectorized shuffle
- Tez Debug Tooling & UI
Apache HBase, Apache Phoenix, & Apache Accumulo
- HBase & Accumulo on YARN via Slider
- HBase HA
- Replicas update in real-time
- Fully supports region split/merge
- Scan API now supports standby RegionServers
- HBase Block cache compression
- HBase optimizations for low latency
- Phoenix Robust Secondary Indexes
- Performance enhancements for bulk import into Phoenix
- Hive over HBase Snapshots
- Hive Connector to Accumulo
- HBase & Accumulo wire-level encryption
- Accumulo multi-datacenter replication
Apache Storm
- Storm-on-YARN via Slider
- Ingest & notification for JMS (IBM MQ not supported)
- Kafka bolt for Storm – supports sophisticated chaining of topologies through Kafka
- Kerberos support
- Hive update support – Streaming Ingest
- Connector improvements for HBase and HDFS
Apache Kafka
- Deliver Kafka as a companion component
- Kafka install, start/stop via Ambari
- Security Authorization Integration with Ranger
Apache Spark
- Refreshed Tech Preview to Spark 1.1.0 (available now)
- ORC File support & Hive 0.13 integration
- Planned for GA of Spark 1.2.0
- Operations integration via YARN ATS and Ambari
- Security: Authentication
Apache Solr
- Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr
Cascading
- Cascading 3.0 on Tez distributed with HDP — coming soon
Hue
- Support for HiveServer 2
- Support for Resource Manager HA
- Authentication Integration
Apache Falcon
- Lineage – now GA (previously a tech preview feature)
- Improve UI for pipeline management & editing: list, detail, and create new (from existing elements)
- Replicate to Cloud – Azure & S3
Apache Sqoop, Apache Flume & Apache Oozie
- Sqoop import support for Hive types via HCatalog
- Secure Windows cluster support: Sqoop, Flume, Oozie
- Flume streaming support: sink to HCat on secure cluster
- Oozie HA now supports secure clusters
- Oozie Rolling Upgrade
- Operational improvements for Oozie to better support Falcon
- Capture workflow job logs in HDFS
- Don’t start new workflows for re-run
- Allow job property updates on running jobs
Apache Knox & Apache Ranger (Argus) & HDP Security
- Apache Ranger – Support authorization and auditing for Storm and Knox
- Introducing REST APIs for managing policies in Apache Ranger
- Apache Ranger – Support native grant/revoke permissions in Hive and HBase
- Apache Ranger – Support Oracle DB and storing of audit logs in HDFS
- Apache Ranger to run on Windows environment
- Apache Knox to protect YARN RM
- Apache Knox support for HDFS HA
- Apache Ambari install, start/stop of Knox
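The new policy-management REST APIs accept JSON policy definitions. The sketch below shows the general shape of such a payload; the field names and values are illustrative assumptions, not the exact Ranger schema:

```python
import json

# Illustrative Ranger-style policy: grant the "analysts" group SELECT on a
# hypothetical Hive table, with auditing enabled. Field names are
# assumptions for illustration, not the exact Ranger REST schema.
policy = {
    "policyName": "hive-sales-read",
    "repositoryType": "hive",
    "databases": "sales",
    "tables": "orders",
    "columns": "*",
    "isEnabled": True,
    "isAuditEnabled": True,
    "permMapList": [{"groupList": ["analysts"], "permList": ["select"]}],
}

# Serialized request body, as it would be POSTed to the policy endpoint.
body = json.dumps(policy)
```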
Apache Slider
- Allow on-demand creation and running of different versions of heterogeneous applications
- Allow users to configure different application instances differently
- Manage operational lifecycle of application instances
- Expand / shrink application instances
- Provide application registry for publish and discovery
Apache Ambari
- Support for HDP 2.2 Stack, including support for Kafka, Knox and Slider
- Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
- Launch and monitor HDFS rebalance
- Perform Capacity Scheduler queue refresh
- Configure High Availability for ResourceManager
- Ambari Administration framework for managing user and group access to Ambari
- Ambari Views development framework for customizing the Ambari Web user experience
- Ambari Stacks for extending Ambari to bring custom Services under Ambari management
- Ambari Blueprints for automating cluster deployments
- Performance improvements and enterprise usability guardrails