From the Dev Team

Follow the latest developments from our technical team

The architecture of Hortonworks Data Platform (HDP) matches the blueprint for Enterprise Apache Hadoop, with data management, data access, governance, operations and security. This post focuses on one of those core components: security. Specifically, we will focus on Apache Knox Gateway for securing access to the Hadoop REST APIs.

Pseudo Federation Provider

This blog will walk through the process of adding a new provider for establishing the identity of a user.…

With Apache Hadoop YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it for batch, interactive and real-time streaming use cases. More and more independent software vendors (ISVs) are developing applications to run in Hadoop via YARN. This increases the number of users and processing engines that operate simultaneously across a Hadoop cluster, on the same data, at the same time.…

Introduction

In this 2nd part of the blog post and its accompanying IPython Notebook in our series on Data Science and Apache Hadoop, we continue to demonstrate how to build a predictive model with Apache Hadoop, using existing modeling tools. And this time we’ll use Apache Spark and ML-Lib.

Apache Spark is a relatively new entrant to the Hadoop ecosystem. Now running natively on Apache Hadoop YARN, the architectural center of Hadoop, Apache Spark is an in-memory data processing API and execution engine that is effective for machine learning and data science use cases.…

Hadoop Operations for provisioning, managing and monitoring a cluster are critical to the success of a Hadoop project and having an intuitive and effective set of tooling has become a foundational element of a Hadoop distribution. Within HDP, we provide completely open source Apache Ambari to help you be successful with Hadoop operations.

The rate of innovation in the Ambari community is astonishing and this pace continues with the 7th release of the project this year alone, Apache Ambari 1.7.0.…

Our customers have many choices of infrastructure to deploy HDP: on premise, cloud, virtualized and even as an appliance. Further, our customers have a choice of deploying on Linux and Windows operating systems. You can easily see this creates a complex matrix. At Hortonworks, we believe you should not be limited to just one option but have the option to choose the best combination of infrastructure and operating system based on the usage scenario.…

It gives me great pleasure to announce that the Apache Hadoop community has released Apache Hadoop 2.6.0 !

In particular, we are excited about three major pieces in this release: heterogeneous storage in HDFS with SSD & Memory tiers, support for long-running services in YARN and rolling upgrades—the ability to upgrade your cluster software and restart upgraded nodes without taking the cluster down or losing work in progress. With YARN as its architectural center, Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways.…

With YARN as its architectural center, Apache Hadoop continues to attract new engines to run within the data platform, as organizations want to efficiently store their data in a single repository and interact with it simultaneously in different ways. Apache Tez supports YARN-based, high performance batch and interactive data processing applications in Hadoop that need to handle datasets scaling to terabytes or petabytes.

The Apache community just released Apache Pig 0.14.0,and the main feature is Pig on Tez.…

While YARN has allowed new engines to emerge for Hadoop, the most popular integration point with Hadoop continues to be SQL and Apache Hive is still the defacto standard. Although many SQL engines for Hadoop have emerged, their differentiation is being rendered obsolete as the open source community surrounds and advances this key engine at an accelerated rate.

Last week, the Apache Hive community released Apache Hive 0.14, which includes the results of the first phase in the Stinger.next initiative and takes Hive beyond its read-only roots and extends it with ACID transactions.…

A Cosmopolitan Metropolis

Brussels, Belgium, conjures images of a cosmopolitan metropolis, where geopolitical summits are held, where world economic forums are debated, where global European institutions are headquartered, and where citizens and diplomats fluently converse in more than three languages—English, French, Dutch or German, along with other non-official local flavors.

To this colorful collage, add the image of a Hadoop Summit Europe 2015 for big data developers, practitioners, industry experts, and entrepreneurs, who make a difference in the digital world, who fluently code in multiple programming languages—Java, Python, Scala, C++, Pig, SQL, or R—and innovate and incubate Apache projects.…

Two weeks ago Hortonworks presented the third in series of 8 Discover HDP 2.2 webinars: Discover HDP 2.2: Discover HDP 2.2: Apache Falcon for Hadoop Data Governance. Andrew Ahn, Venkatesh Seetharam, and Justin Sears hosted this 3rd webinar in the series.

After Justin Sears set the stage for the webinar by explaining the drivers behind Modern Data Architecture (MDA), Andrew Ahn and Venkatesh Seetharam introduced and discussed how to use Apache Falcon for central management of data lifecycle, business continuity and disaster recovery, and audit and compliance requirement.…

Introduction

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN as the architectural center of Modern Data Architecture (MDA) allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.…

Last week Hortonworks presented the second of our eight Discover HDP 2.2 webinars. Alan Gates and Raj Bains discussed the Stinger.next initiative and new Apache Hive features for speed, scale and SQL that are included in Hortonworks Data Platform 2.2.

After an overview of HDP 2.2, Alan discussed what the Apache community accomplished with the original Stinger initiative and how that momentum continues in Stinger.next.

Alan and Raj then discussed details on three areas of innovation currently underway in the Apache Hive project:

  • For SQL – transaction with ACID semantics
  • For Speed – the cost based optimizer
  • For Scale – dynamic query optimization

Here is the complete recording of the webinar

Here is the presentation deck.…

A few weeks back, we outlined a broad initiative to invest in Spark in the context of the Hadoop ecosystem. We intend to facilitate a more efficient utilization of Hadoop cluster resources for ETL and/or Data Pipeline workloads when using Spark. Many of the lessons learned while building out MapReduce, Apache Tez and other YARN data-processing frameworks can be applied to the Spark project in order to optimize its resource utilization and to make it a good multi-tenant citizen within a YARN-based Hadoop cluster.…

Last week Hortonworks presented the first of 8 Discover HDP 2.2 webinars: Comprehensive Hadoop Security with Apache Ranger and Apache Knox. Vinay Shukla and Balaji Ganesan hosted this first webinar in the series.

Balaji discussed how to use Apache Ranger (for centralized security administration, to set up authorization policies, and to monitor user activity with auditing. He also covered Ranger innovations now included in HDP 2.2:

  • Support for Apache Knox and Apache Storm, for centralized authorization and auditing
  • Deeper integration of Ranger with the Apache Hadoop stack with support for local grant/revoke in HDFS and HBase
  • Ranger’s enterprise readiness, with the introduction of REST APIs for policy management, and scalable storage of audit in HDFS

Vinay presented Apache Knox and API security for Apache Hadoop.…

We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.

You can listen to the entire webinar recording here.…