We’re excited to announce the long-awaited release of Hortonworks Data Platform 3.0.0, our first major HDP version change since 2013. HDP 3.0 is faster, smarter, hybrid, bigger, and trusted, and it includes a real-time database. We encourage you to read our blog post that announced HDP 3.0. You can also view our keynote presentation from DataWorks Summit, where we demoed an autonomous car (1/10 scale) trained with HDP 3.0 technologies. We also want to thank our current and prospective customers and partners for signing up for the HDP 3.0 Early Access; we are encouraged by the huge interest. You can now use the HDP 3.0 Generally Available repositories and documentation.
A Giant Leap for Big Data Ecosystem
As we head into the fourth industrial revolution, HDP 3.0 is a giant leap for the Big Data ecosystem, with major changes across the stack and an expanded ecosystem (Deep Learning and 3rd Party Dockerized Apps). HDP 3.0 can be deployed both on-premises and on the major cloud platforms: AWS, Microsoft Azure, and Google Cloud. Many of the new HDP 3.0 features are based on Apache Hadoop 3.1 and include containerization, GPU support, Erasure Coding and Namenode Federation. To provide a Trusted Data Lake, we install Apache Ranger and Apache Atlas by default with HDP 3.0. To streamline the stack, we have removed components such as Apache Falcon, Apache Mahout, Apache Flume, and Apache Hue, and absorbed Apache Slider functionality into Apache YARN.
Highlighted features of Apache Hadoop HDFS include:
- Erasure Coding for Cold Data
- Reduce raw storage consumption by 50% (1.5x the data size instead of 3x) with Reed-Solomon encoding using 6 data shards and 3 parity shards, while maintaining the same resiliency as the 3-replica approach (optional Intel Storage Acceleration library in HDP Utilities for hardware offload).
- Namenode Federation
- Scale HDFS namespace linearly with Namenode federation, enabled by Ambari UI wizard, with support for Apache Hive, Apache Spark, Apache Ranger.
- Cloud Storage & Enterprise Hardening
- Google Cloud Storage connector
- View Filesystem to enable a unified global view, with NFS gateway support
- Multiple standby Namenodes per namespace to increase availability (no Ambari UI support)
- Intra-node disk balancing of disks with varied capacities inside a datanode
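The storage saving quoted for Erasure Coding follows from simple arithmetic. The sketch below is illustrative only: the function names are made up for this example and are not part of any HDFS API. It compares the raw bytes stored per byte of user data under 3-way replication and under Reed-Solomon with 6 data shards and 3 parity shards.

```python
def replication_multiplier(replicas: int) -> float:
    """Raw bytes stored per byte of user data with plain replication."""
    return float(replicas)


def erasure_coding_multiplier(data_shards: int, parity_shards: int) -> float:
    """Raw bytes stored per byte of user data with a Reed-Solomon layout."""
    return (data_shards + parity_shards) / data_shards


three_replicas = replication_multiplier(3)    # 3 full copies of every block
rs_6_3 = erasure_coding_multiplier(6, 3)      # 9 shards written for every 6 shards of data
saving = 1 - rs_6_3 / three_replicas          # fraction of raw storage saved

print(f"3x replication: {three_replicas:.1f}x raw storage")
print(f"RS(6,3):        {rs_6_3:.1f}x raw storage")
print(f"Saving:         {saving:.0%}")
```

Put differently, the extra space beyond the data itself drops from 200% of the data size to 50%.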
Data Operating System
Highlighted features of Apache Hadoop YARN include:
- Containerized Services on Apache YARN
- Support Docker Containers running on Apache YARN
- Support Dockerized Spark jobs on Apache YARN
- Support Slider functionality, a simplified REST API, and simplified DNS-based service discovery on Apache YARN
- Next Generation Features and Enterprise Hardening
- Support GPU pooling and isolation on Apache YARN
- Support generalized resource placement on Apache YARN: affinity/anti-affinity
- Support Intra-queue preemption to support load balancing between different applications (batch, real-time) in the same queue
- Enhanced Reliability, Availability and Serviceability
- User and developer friendly Apache YARN UI
- Scalable Application Timeline Services version 2.0 to enable flow based application performance management (APM)
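To make the containerized services item more concrete, here is a minimal sketch of a service definition for a Docker-based component, built as a Python dictionary and serialized to JSON. The field names follow the service specification format published with Apache Hadoop 3.1 as we understand it; the service name, image, launch command, and sizes are placeholders, so check everything against the documentation for your release before relying on it.

```python
import json


def docker_service_spec(name, image, containers, cpus, memory_mb, launch_command):
    """Build a YARN service definition with a single Docker-based component."""
    return {
        "name": name,
        "version": "1.0.0",
        "components": [
            {
                "name": "worker",                      # placeholder component name
                "number_of_containers": containers,
                "artifact": {"id": image, "type": "DOCKER"},
                "launch_command": launch_command,
                "resource": {"cpus": cpus, "memory": str(memory_mb)},
            }
        ],
    }


# All values below are hypothetical and only illustrate the shape of the document.
spec = docker_service_spec(
    name="httpd-service",
    image="centos/httpd-24-centos7:latest",
    containers=2,
    cpus=1,
    memory_mb=1024,
    launch_command="/usr/bin/run-httpd",
)
print(json.dumps(spec, indent=2))
```

A definition like this is typically submitted to the ResourceManager's services REST endpoint or through the YARN command line; the exact endpoint and authentication depend on how the cluster is configured.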
Highlighted Apache Hive features include:
- Workload management for LLAP: You can define resource pools within the LLAP pool and allocate resources on a per-user or per-group basis. This enables support for large multi-tenant deployments.
- ACID v2 and ACID on by default: We are releasing ACID v2. With performance improvements in both the storage format and the execution engine, we are seeing equal or better performance compared to non-ACID tables. We are therefore turning ACID on by default and enabling full support for data updates.
- Hive Warehouse Connector for Spark: The Hive Warehouse Connector allows you to connect Spark applications with Hive data warehouses. The connector automatically handles ACID tables. This enables data science workloads to work well with data in Hive.
- Materialized view navigation: Materialized views allow you to pre-aggregate and pre-compute tables used in queries. They typically work best on sub-queries or intermediate tables. The cost-based optimizer automatically rewrites a query to use those intermediate results when they are available, drastically speeding up your queries.
- Information schema: Hive now exposes the metadata of the database (tables, columns etc.) directly via the Hive SQL interface.
- JDBC storage connector: You can now map any JDBC database into Hive’s catalog. This means you can join data across Hive and other databases using the Hive query engine.
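The idea behind materialized views can be shown with a small, self-contained sketch. Note the substitution: Python's built-in sqlite3 module stands in for a SQL engine here, and it has no materialized views, so the pre-aggregated table is created and queried by hand. In Hive 3 you would declare the summary with CREATE MATERIALIZED VIEW and the optimizer would perform the substitution for you. Table and column names are invented for the example.

```python
import sqlite3

# sqlite3 is only a stand-in; it does not rewrite queries automatically.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 15), ("west", 7), ("west", 3), ("north", 20)],
)

# Expensive form: aggregate the base table every time the question is asked.
direct = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Pre-computed form: aggregate once and store the result ...
conn.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
# ... then answer the same question from the much smaller summary table.
rewritten = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()

print(direct == rewritten)
```

Both queries return the same rows; the second one reads far fewer of them, which is where the speed-up comes from on large tables.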
Highlighted Druid features include:
- Kafka-Druid ingest: You can now map a Kafka topic into a Druid table. The events are automatically ingested and available for querying in near real-time. This differs from Kafka-Hive ingest, where data is loaded into a Hive table periodically using SQL merge; the latter has a data latency of 5-10 minutes.
Machine Learning & Deep Learning Platform
Highlighted features of Apache Spark, Apache Zeppelin, Livy include:
- Support Apache Spark 2.3.1 GA
- Structured Streaming support for ORC
- Enable Security and ACLs in History Server
- Support running Spark jobs in a Docker Container
- Upgrade Spark/Zeppelin/Livy from HDP 2.6 to HDP 3.0
- Cloud: Spark testing with S3Guard/S3A Committers
- Certification for the Staging Committer with Spark
- Integrate with new Metastore Catalog feature
- Beeline support for Spark thrift server
- Configure LLAP mode in Ambari
- Support per notebook interpreter configuration
- Livy to support ACLs
- Knox to proxy Spark History Server UI
- Structured Streaming support for Hive Streaming library
- Transparent write to Hive warehouse
- Spark-LLAP connector GA for Ranger
- TensorFlow 1.8 (tech preview only)
Stream Processing Engine
Highlighted features of Apache Kafka and Apache Storm include:
- Support Kafka 1.0.1
- Critical updates
- KAFKA-6172 – Cache lastEntry in TimeIndex to avoid unnecessary disk access
- KAFKA-6175 – AbstractIndex should cache index file to avoid unnecessary disk access during resize()
- KAFKA-6258 – SSLTransportLayer should keep reading from socket until either the buffer is full or the socket has no more data
- Support Storm 1.2.1, with support for all HDP 3.0 components including Hadoop/HDFS 3.0, HBase 2.0 and Hive 3.
- Capture producer and topic partition level metrics without instrumenting or configuring interceptors on the clients. This provides a non-invasive approach to capturing important producer metrics without refactoring or modifying your existing Kafka clients.
Operational Data Store
Highlighted Apache HBase features include:
- Backup and restore: HBase now has native support for backup and restore, both full and incremental. This is an important tool in the admin’s toolkit. Support for DLM integration (i.e. UI) is coming in the next release.
- Procedure V2: You can use Procedure V2 (known as Proc-v2 in the community), an updated framework for executing multi-step HBase administrative operations reliably when there is a failure. The goal of this capability is to implement all master operations using proc-v2 and remove the need for tools like hbck in the future. Use proc-v2 for creating, modifying and deleting tables. Other systems, such as the new AssignmentManager, are also implemented using proc-v2.
- Fully off-heap read/write path: When you write data into HBase through Put operations, the cell objects do not enter the JVM heap until the data is flushed to disk in an HFile. This reduces the total heap usage of a RegionServer and copies less data, making it more efficient.
- Use of Netty for RPC layer and Async API: This replaces the old Java NIO RPC server with a Netty RPC server. Netty also makes it easy to provide an asynchronous Java client API.
- In-memory compactions (Accordion): Periodic reorganization of the data in the Memstore can reduce overall I/O, that is, data written to and read from HDFS. Net performance increases when we keep more data in memory for a longer period of time.
- Better dependency management: HBase now internally shades commonly-incompatible dependencies to prevent issues for downstream users. You can use shaded client jars that will reduce the burden on the existing applications.
- Coprocessor and Observer API rewrite: There are minor changes made to the API to remove ambiguous, misleading, and dangerous calls.
Highlighted Apache Phoenix features include:
- HBase 2.0 support
- Python driver for Phoenix Query Server: This is a community driver that has been brought into the Apache Phoenix project. It provides a Python DB API 2.0 implementation.
- Query log: This is a new system table “SYSTEM.LOG” that captures information about queries that are being run against the cluster (client-driven).
- Column encoding: This is new to HDP. You can use a custom encoding scheme for data in the HBase table to reduce the amount of space taken. Less data to read means better performance as well as lower storage use. The performance gain is 30% and above for sparse tables.
- Hive 3.0 support for Phoenix: It provides updated phoenix-hive StorageHandler for the new Hive version. (Tech-preview)
- Spark 2.3 support for Phoenix: It provides an updated phoenix-spark driver for the new Spark version.
- Supports GRANT and REVOKE commands: Index ACLs are updated automatically if access changes for the data table or view.
- This version introduces support for sampling tables.
- Supports atomic update (ON DUPLICATE KEY).
- Supports snapshot scanners for MR-based queries.
- Hardening of both Local and Global secondary indexes.
Security & Governance
Highlighted features include:
- Core policy engine and audit enhancements
- Schedulable Policies: Policy effective dates to support time-bound authorization policies and temporary policies
- Override policies to support temporary resource access and override masking/row filtering for specific users
- Auditor and KMS Auditor roles to support read-only access to services, policies, users/groups, audits, and reports
- Show Hive query in access audits UI
- Auditing of usersync operations in Ranger Admin UI
- Policy labels to group and organize policies and filter/search by labels
- Users membership in groups shown in Ranger Admin UI
- Ecosystem Coverage & Enhancements
- Metadata Security via fine-grained authorization for Atlas
- Performance improvements in Atlas Tag Sync service for
- Hive UDF execution/usage authorization
- Hive workload management authorization
- Support for entitlement mapping via Hive Information Schema
- HDFS namenode federation support
- Improved indexing infrastructure with Solr 7 support
- Ranger plugins HDP3 ecosystem compatibility (Hive, HDFS, Storm, HBase, YARN, Kafka)
- Enterprise Readiness
- Ability to specify passwords for admin accounts during Ranger install
- Consolidated DB schema script for all supported DB flavors
- Ranger and Atlas installed and configured and turned ON by default in HDP3
- Admin UI along with service discovery and topology generation feature for simplifying and accelerating Knox configuration
- Added SSO support for Zeppelin, YARN, MR2, HDFS, and Oozie
- Added Knox Proxy support for YARN, Oozie, SHS (Spark History Server), HDFS, MR2, Livy, and SmartSense
- Core Metadata Capabilities
- New Glossary & Business Catalog enabling business users to capture natural business terminology and provide business vocabulary management (term categorization, business term-asset association, semantic term relationships, hierarchies)
- Classification (tag) Propagation: Improved chain of custody through classification (tag) propagation to related or derived assets with fine-grained control over propagation
- Metadata Security: Fine grained authorization to metadata in data catalog (authorize metadata operations at a specific tag, data asset or type or admin operations such as metadata import/export)
- Time-bound classification or business catalog mapping
- Ecosystem Coverage & Enhancements
- New Spark Hook (Technical Preview) to capture Spark SQL, Dataframe, and model metadata and lineage in Atlas
- New HBase hook to capture metadata and lineage
- Improved indexing infrastructure with Solr 7 support
- New graph backend infrastructure with JanusGraph DB offering Tinkerpop 3 standards compatibility, improved scale, and performance
- Updated Atlas hooks for HDP3 ecosystem compatibility (Hive, Storm/Kafka, Sqoop)
- Improved metadata load performance with new v2 style notifications
- Improved search performance through extensive refactoring of DSL
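Classification (tag) propagation, listed above under Core Metadata Capabilities, can be pictured as a walk over the lineage graph. The sketch below is a conceptual illustration only: it does not use the Atlas API, and the asset names, the `lineage` mapping, and the `propagate_tags` function are invented for the example. A tag applied to a source asset is carried forward to every asset derived from it.

```python
from collections import deque


def propagate_tags(lineage, direct_tags):
    """Return the effective tags per asset after propagation along lineage.

    lineage:     maps an asset to the assets derived from it
    direct_tags: maps an asset to the tags applied to it directly
    """
    effective = {asset: set(tags) for asset, tags in direct_tags.items()}
    queue = deque(direct_tags)
    while queue:
        asset = queue.popleft()
        for derived in lineage.get(asset, []):
            current = effective.setdefault(derived, set())
            missing = effective[asset] - current
            if missing:
                # New tags reached this asset, so its own descendants need a visit too.
                current |= missing
                queue.append(derived)
    return effective


# Hypothetical lineage: a raw table feeds a staging table, which feeds a summary.
lineage = {
    "raw.customers": ["stg.customers"],
    "stg.customers": ["mart.customer_summary"],
}
tags = propagate_tags(lineage, {"raw.customers": {"PII"}})
print(tags)
```

The fine-grained control mentioned above would correspond to switching propagation off for particular tags or lineage edges, which this sketch leaves out.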
Operations and Management with Ambari 2.7 & SmartSense 1.5.0
HDP & HDF are installed, managed, and monitored by Apache Ambari, and in this release the Ambari community has worked hard to improve:
- Usability – The Ambari UI has been completely overhauled and is easier to navigate and use, and it performs well at scale.
- Management @ Scale – Clusters are growing, and that means that Ambari has to keep up. Ambari 2.7 has been significantly refactored to allow our large operation teams to manage 5,000 node clusters.
- Simplify Security Configuration – Single Sign-On is a must for security, and for integrating with Data Plane Services (DPS), so we’ve simplified SSO setup for DPS services. FreeIPA is a wildly popular IDM tool, and we now officially support integrating with FreeIPA when enabling Kerberos.
- Automation – Ambari has a robust API, and our new REST API explorer helps teams discover and understand all that it has to offer.
- Extensibility – We’ve worked closely with EMC to improve our Isilon OneFS integration with Ambari and HDP. It’s now effortless to configure your cluster to work with OneFS.
- Papercuts – New features are great, but sweating the small stuff and working on improving existing functionality is greater. This release is packed with improvements to help your day-to-day life with Ambari.
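As a small illustration of the automation point, the sketch below prepares (but does not send) an authenticated request against the Ambari REST API using only the Python standard library. The host name and credentials are placeholders, and the `ambari_request` helper is invented for this example; the `/api/v1/clusters` path and HTTP basic authentication reflect how the Ambari API is commonly used, but verify the details against your own installation.

```python
import base64
import urllib.request


def ambari_request(base_url, path, user, password):
    """Prepare a GET request for an Ambari REST resource using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/{path.lstrip('/')}",
        headers={
            "Authorization": f"Basic {token}",
            # Ambari expects this header on modifying calls; it is harmless on reads.
            "X-Requested-By": "ambari",
        },
        method="GET",
    )


# Placeholder host and credentials; replace them with real values before sending.
req = ambari_request("http://ambari.example.com:8080", "clusters", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would return a JSON document listing the clusters the server manages.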
Customers use Hortonworks SmartSense to improve their cluster performance and resolve support cases faster. SmartSense 1.5.0 includes the following improvements:
- Diagnostics Capture: SmartSense now captures NiFi Registry, Schema Registry, Streaming Analytics Manager, Ambari Infra, and Data Analytics Studio diagnostics.
- Activity Analysis: For users of the new HDFS NameNode Federation feature, activity data is now available per namespace. Commonly used filters have also been made global to simplify filtering and data exploration, and LLAP queries are now visible. Additionally, three new activity explorer dashboards have been added: Job Comparison, User Summary, and Workload Trends.
- Ambari View: The SmartSense view includes a full description of what is captured, to enhance transparency and ease conversations with your security team.
What’s Next? HDP 3.0 Blog Series
The HDP product and engineering teams are excited to share more details on the new features in the HDP 3.0 release. Over the next few weeks, we will be publishing additional blogs as part of the HDP 3.0 blog series. Please also check out the eight blogs we have already published in our Hadoop 3 blog series.