From the Dev Team

Follow the latest developments from our technical team

We at Hortonworks live by a few core principles:

  • Innovate at the core of Hadoop
  • Make Hadoop be an Enterprise Class Data Platform
  • Do it all in open source
  • Enable the ecosystem

Our vision of “Hadoop Everywhere” is shared by our partner community who bring their industry expertise, unique software value-add and passion for customer success to enable transformational change across our joint customers. We as a Hadoop community are succeeding everyday in transforming enterprises into a data-first organization.…

The Apache Hadoop community is happy to announce the release of Apache Hadoop 2.7.0! We want to express our gratitude to every contributor, reviewer and committer.

The Hadoop community fixed 923 JIRAs in total as part of the 2.7.0 release. Of the 923 fixes:

  • 259 were in Hadoop Common
  • 350 were in HDFS
  • 253 were in YARN
  • 61 were in MapReduce

Hadoop 2.7.0 is the first Hadoop release in 2015, following late last year’s 2.6.0.…

Introduction

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, and Python that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets. Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.

In this blog, we will introduce the basic concepts of Apache Spark and the first few necessary steps to get started with Spark on Hortonworks Sandbox.…

Enterprises across all major industries adopt Apache Hadoop for its ability to store and process an abundance of new types of data in a modern data architecture. This “Any Data” capability has always been a hallmark feature of Hadoop, opening insight from new data sources such as clickstream, web and social, geo-location, IoT, server logs, or traditional data sets from ERP, CRM, SCM or other existing data systems.…

Hortonworks is pleased to announce the general availability of Apache Spark in Hortonworks Data Platform (HDP)— now available on our downloads page. With HDP 2.2.4 Hortonworks now offers support for your developers and data scientists using Apache Spark 1.2.1.

HDP’s YARN-based architecture enables multiple applications to share a common cluster and dataset while ensuring consistent levels of service and response. Now Spark is one of the many data access engines that works with YARN and that is supported in an HDP enterprise data lake.…

Hortonworks Data Platform (HDP) provides centralized enterprise services for comprehensive security to enable end-to-end protection, access, compliance and auditing of data in motion and at rest. HDP’s centralized architecture—with Apache Hadoop YARN at its core—also enables consistent operations to enable provisioning, management, monitoring and deployment of Hadoop clusters for a reliable enterprise-ready data lake.

But comprehensive security and consistent operations go together, and neither is possible in isolation.

We published two blogs recently announcing Ambari 2.0 and its new ability to manage rolling upgrades.…

The recent post by Jayush Luniya announced the community release of Apache Ambari 2.0. One of the three key Ambari features that Jayush discussed was Rolling Upgrades, enabling Hadoop operators to upgrade from one version of HDP to the next, with minimal disruption to the cluster.

The Hortonworks development team worked long and hard to make the Hadoop platform “rolling upgradeable”. That groundwork was available in Hortonworks Data Platform 2.2 as described in this previous post.…

This is the third post in a series exploring recent innovations in the Hadoop ecosystem that are included in Hortonworks Data Platform (HDP) 2.2. In this post, we introduce the theme of supporting rolling upgrades and downgrades of a HDFS cluster. See this previous post for an introduction on enterprise-grade rolling upgrades in HDP 2.2.

Hortonworks Data Platform provides centralized enterprise services for consistent operations of Hadoop clusters for a reliable enterprise-ready data lake.…

Advances in Hadoop security, governance and operations have accelerated adoption of the platform by enterprises everywhere. Apache Ambari is the open source operational platform for provisioning, managing and monitoring Hadoop clusters from a single pane of glass, and with the Apache Ambari 1.7.0 release last year, Ambari made it far easier for enterprises to adopt Hadoop.

Today, we are excited to announce the community release of Apache Ambari 2.0, which will further accelerate enterprise Hadoop usage by simplifying the technical challenges that slow adoption the most.…

Hortonworks is excited to announce that our first hands-on, performance based certification exam is now available! The HDP Certified Developer (HDPCD) exam is designed for Hadoop developers working with frameworks like Pig, Hive, Sqoop and Flume. This new approach to Hadoop certification is designed to allow individuals an opportunity to prove their Hadoop skills in a way that is recognized in the industry as meaningful and relevant to on-the-job performance.…

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

This is the third part of the blog-post series about anomaly detection from healthcare data.

In part 1, we described the dataset, the business use-case and our general approach of applying graph algorithms (specifically the personalized-PageRank algorithm) to detect anomalies in the Medicare-B dataset.…

Apache Ambari is the only 100% open source management and provisioning tool for Apache Hadoop. Recent innovations of Apache Ambari have focused on opening Apache Ambari into a pluggable management platform that can automate cluster provisioning, deploy 3rd party software and provide custom operational and developers’ views to the end user.

Join us Thursday March 26 at 10am PT, for an online technical workshop where we will cover 3 key integration points of Apache Ambari including Stacks, Views and Blueprints and deliver working examples of each.…

We hosted an Apache Slider Meetup at our Hortonworks Santa Clara office on March 4th, where committers, contributors, and community members interested in the Apache Slider congregated to hear what’s happening.

There were two presenters. To set the context for the audience, Steve Loughran, member of technical staff at Hortonworks, delivered an extemporaneous high-level overview of Apache Slider within Apache Hadoop YARN framework.

Running Dockerized Applications on YARN via Slider

Yu “Thomas” Liu gave a demo of his hot-off-the-IDE docker deployment work.…

Introduction

Today, organizations use the Apache Hadoop™ stack in the form of a central data lake to store their critical datasets and power their analytical processing workloads. A key requirement for the Hadoop cluster and the services running on it is to be highly available and flawlessly continue to function while software is being upgraded. In the past, the Hadoop community has added enterprise features such as High Availability (HA) to various components of the stack, snapshots, improved disaster recovery etc.…

This three part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

This is the second part of our blog-post series about anomaly detection from healthcare data. As described in part 1, our goal is to apply the personalized-PageRank algorithm to detect anomaly in healthcare payment records, specifically the publicly available Medicare-B dataset.

In this blog post, we demonstrate the technical steps to compute the similarity graph between medical providers at scale, using HDP and Apache Pig.…