From the Dev Team

Follow the latest developments from our technical team

Drink from Elephant’s Well Of Knowledge

Developer success starts with open and reusable code, and a community that allows for both consumption of code and contribution of updates to the code base. This success engenders a thriving and evolving community.

To that end, today we are announcing the Hortonworks Gallery for developers. Located on GitHub, the Gallery brings together Hortonworks’ Apache Hadoop code, Apache Ambari Views and extensions, and related resources into a single view for developers to use within the familiar context of Git and open source software.…

Early this year, Apache™ Falcon™ became a Top Level Project (TLP) in the Apache Software Foundation.

The project continues to mature as a framework for simplifying and orchestrating data lifecycle management in Hadoop by offering out-of-the-box data management policies. The Apache Falcon 0.6.1 release builds on this foundation by providing simplified mirroring functionality and a new user interface (UI).

The community worked very diligently to offer more than 150 product enhancements, and over 30 new features and improvements.…

Hortonworks is always pleased to see new contributions come into the open-source community. We worked with our customer, Hotels.com, to help them develop libraries and utilities around Apache Hive, the Apache ORC format and Cascading. It’s great to see the results released for the community. In this guest blog, Adrian Woodhead, Big Data Engineering Team Lead at Hotels.com, discusses the Corc project.

Hotels.com is pleased to announce the open source release of Corc, a library for reading and writing files in the Apache ORC file format using Cascading.…

As YARN drives Hadoop’s emergence as a business-critical data platform, the enterprise requires more stringent data security capabilities. Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a platform for centralized security policy administration across the core enterprise security requirements of authorization, audit and data protection.

On June 10th, the community announced the release of Apache Ranger 0.5.0. With this release, the community took major steps to extend security coverage for the Hadoop platform and deepen its existing security capabilities.…

In his blog, Tim Hall wrote, “Enterprises are embracing Apache Hadoop to enable their modern data architectures and power new analytic applications. The freedom to choose the on-premises or cloud environments for Hadoop that best meets the business needs is a critical requirement.”

One option for deploying Hadoop in the cloud is Microsoft Azure, using Cloudbreak. Other choices include Google Cloud Platform, OpenStack, and AWS.

But in this blog, I’ll show how you can deploy Hadoop in Azure with a few clicks by running an HDP multi-node cluster on Azure’s Linux VMs using Cloudbreak.…

Mayank Bansal, of eBay, is a guest contributing author of this collaborative blog.

This is the fourth post in a series that explores the theme of enabling diverse workloads in YARN. See the introductory post to understand the context around all the new features for diverse workloads as part of Apache Hadoop YARN in HDP.

Background

 In Hadoop YARN’s Capacity Scheduler, resources are shared by setting capacities on a hierarchy of queues.…
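As a rough illustration of capacities set on a hierarchy of queues, the sketch below treats each queue's capacity as a percentage of its parent and multiplies the percentages along the path to get a queue's share of the whole cluster. The queue names and numbers are hypothetical, and the helper functions are illustrative only; they are not part of any YARN API.

```python
from typing import Dict

# Hypothetical queue hierarchy: each value is the queue's configured capacity,
# expressed as a percentage of its parent queue.
QUEUE_CAPACITY: Dict[str, float] = {
    "root": 100.0,
    "root.engineering": 60.0,
    "root.marketing": 40.0,
    "root.engineering.dev": 25.0,
    "root.engineering.prod": 75.0,
}


def absolute_capacity(queue_path: str, capacities: Dict[str, float]) -> float:
    """Return the queue's share of the whole cluster, as a percentage."""
    parts = queue_path.split(".")
    share = 1.0
    # Walk from the root down to the queue, multiplying the fractions.
    for depth in range(1, len(parts) + 1):
        prefix = ".".join(parts[:depth])
        share *= capacities[prefix] / 100.0
    return share * 100.0


def children_sum_to_100(parent: str, capacities: Dict[str, float]) -> bool:
    """Check that the direct children of a parent queue add up to 100%."""
    child_depth = parent.count(".") + 1
    total = sum(
        value
        for path, value in capacities.items()
        if path.startswith(parent + ".") and path.count(".") == child_depth
    )
    return abs(total - 100.0) < 1e-9


if __name__ == "__main__":
    for queue in sorted(QUEUE_CAPACITY):
        print(f"{queue}: {absolute_capacity(queue, QUEUE_CAPACITY):.1f}% of cluster")
```

In this made-up layout, a leaf queue's cluster-wide share follows from its own percentage and that of every ancestor, which is why the percentages of sibling queues are checked to add up to 100 at each level.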

Introduction

Multihoming is the practice of connecting a host to more than one network. It is frequently used to provide network-level fault tolerance: if hosts are able to communicate on more than one network, the failure of one network will not render the hosts inaccessible. There are other use cases for multihoming as well, including traffic segregation to isolate congestion and support for different network media optimized for different use cases.…

The Apache community released Apache Pig 0.15.0 last week. Although there are many new features in Apache Pig 0.15.0, we would like to highlight two major improvements:

  • Pig on Tez enhancements
  • Using Hive UDFs inside Pig

Below are some details about these important features. For the complete list of features, improvements, and bug fixes, please see the release notes.

Notable Changes

1. Pig on Tez enhancements

Scalability of Pig on Tez

Yahoo!…

The components in a modern data architecture vary from one enterprise to the next and the mix changes over time. Many of our Hortonworks subscribers need support ensuring that their Hortonworks Data Platform (HDP) clusters are optimally configured. This means that they need proactive, intelligent cluster analysis.

Onboarding new workloads to the platform taxes the resources of Hadoop operators. Our customers have therefore asked Hortonworks for guidance and best practices to reduce their operational risk and efficiently resource their staff for Hadoop operations.…

Apache Hadoop has emerged as a critical data platform to deliver business insights hidden in big data. Because Hadoop is a relatively new technology, system administrators hold it to higher security standards. There are several reasons for this scrutiny:

  • The external ecosystem of data repositories and operational systems that feeds Hadoop deployments is highly dynamic and can introduce new security threats on a regular basis.
  • A Hadoop deployment contains a large volume of diverse data stored over longer periods of time.

Last week, the Apache Slider community released Apache Slider 0.80.0. Although there are many new features in Slider 0.80.0, a few innovations are particularly notable:

  • Containerized application onboarding
  • Seamless zero-downtime application upgrade
  • Adding co-processors to app packages without reinstallation
  • Simplified application onboarding without any packaging requirement

Below are some details about these important features. For the complete list of features, improvements, and bug fixes, see the release notes.

Notable Changes: Containerized application onboarding

This release of Apache Slider provides a way to deploy containerized applications on YARN and leverage YARN’s resource management capabilities.…

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark.

At a Memorial Day BBQ, an old friend proclaimed: “Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.”

Spark, as a distributed data processing and computing platform, offers much of what developers desire and delight in, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data; to data scientists, it offers machine learning libraries through its MLlib component; and to data analysts, it offers SQL capabilities for inquiry.…

Apache Spark provides a lot of valuable tools for data science. With our release of the Apache Spark 1.3.1 Technical Preview, the powerful DataFrame API is available on HDP.

Data scientists use data exploration and visualization to help frame the question and fine tune the learning. Apache Zeppelin helps with this.

Based on the concept of an interpreter that can be bound to any language or data processing backend, Zeppelin is a web-based notebook server.…

The Apache Accumulo community has announced its 1.7.0 release. As the community’s first major release of 2015, the release represents the culmination of a year of effort from many Accumulo committers and contributors. Apart from the many notable changes enumerated below, Accumulo is now well integrated with Apache Ambari.

In this release, 43 different individuals fixed 691 JIRA issues, and we thank everyone who helped in any way to make Apache Accumulo 1.7.0 a reality.…

SQL is the most popular use case for the Hadoop user community, and Apache Hive is still the de facto standard. Early this week, the Apache Hive community released Apache Hive 1.2.0.

Already the third release this year, the Hive developer community continues to improve the release and grow its team, with 11 Hive contributors promoted to committers in the last three months. Dedicated to making Hive enterprise-ready, the community has made improvements in the following areas:

  • Additional SQL functionality
  • Security enhancements
  • Performance gains
  • Stability and usability

For the complete list of features, improvements, and bug fixes, see the release notes.…