From the Dev Team

Follow the latest developments from our technical team

A Cosmopolitan Metropolis

Brussels, Belgium, conjures images of a cosmopolitan metropolis, where geopolitical summits are held, where world economic forums are debated, where global European institutions are headquartered, and where citizens and diplomats fluently converse in more than three languages—English, French, Dutch or German, along with other non-official local flavors.

To this colorful collage, add the image of Hadoop Summit Europe 2015: a gathering of big data developers, practitioners, industry experts, and entrepreneurs who make a difference in the digital world, who fluently code in multiple programming languages (Java, Python, Scala, C++, Pig, SQL, or R), and who innovate and incubate Apache projects.…

Two weeks ago Hortonworks presented the third in a series of eight Discover HDP 2.2 webinars: Discover HDP 2.2: Apache Falcon for Hadoop Data Governance. Andrew Ahn, Venkatesh Seetharam, and Justin Sears hosted this third webinar in the series.

After Justin Sears set the stage for the webinar by explaining the drivers behind Modern Data Architecture (MDA), Andrew Ahn and Venkatesh Seetharam introduced Apache Falcon and discussed how to use it for central management of the data lifecycle, business continuity and disaster recovery, and audit and compliance requirements.…

Introduction

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN, as the architectural center of Modern Data Architecture (MDA), allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.…

Last week Hortonworks presented the second of our eight Discover HDP 2.2 webinars. Alan Gates and Raj Bains discussed the Stinger.next initiative and new Apache Hive features for speed, scale and SQL that are included in Hortonworks Data Platform 2.2.

After an overview of HDP 2.2, Alan discussed what the Apache community accomplished with the original Stinger initiative and how that momentum continues in Stinger.next.

Alan and Raj then discussed details on three areas of innovation currently underway in the Apache Hive project:

  • For SQL – transactions with ACID semantics
  • For Speed – the cost-based optimizer
  • For Scale – dynamic query optimization

Here is the complete recording of the webinar.

Here is the presentation deck.…

Last week Hortonworks presented the first of eight Discover HDP 2.2 webinars: Comprehensive Hadoop Security with Apache Ranger and Apache Knox. Vinay Shukla and Balaji Ganesan hosted this first webinar in the series.

Balaji discussed how to use Apache Ranger for centralized security administration, to set up authorization policies, and to monitor user activity with auditing. He also covered Ranger innovations now included in HDP 2.2:

  • Support for Apache Knox and Apache Storm, for centralized authorization and auditing
  • Deeper integration of Ranger with the Apache Hadoop stack with support for local grant/revoke in HDFS and HBase
  • Ranger’s enterprise readiness, with the introduction of REST APIs for policy management, and scalable storage of audit data in HDFS

Vinay presented Apache Knox and API security for Apache Hadoop.…

We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.

You can listen to the entire webinar recording here.…

Merv Adrian, the widely respected Gartner analyst, recently remarked on the continuing evolution of Apache Hadoop:

YARN is the one that really matters because it doesn’t just mean the list of components will change, but because in its wake the list of components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads.…

HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I’ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files. All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.

WARNING: Do not attempt to modify metadata directories or files.…

Enterprise Apache Hadoop provides the fundamental data services required to deploy into existing architectures. These include security, governance and operations services, in addition to Hadoop’s original core capabilities for data management and data access. This post focuses on recent work completed in the open source community to enhance the Hadoop security component, with encryption and SSL certificates.

Last year I wrote a blog summarizing wire encryption options in Hortonworks Data Platform (HDP).…

Introduction

Hortonworks University announces a new operationally focused course for Apache Hadoop administrators. This two-day training course is designed for Hadoop administrators who are familiar with administering other Hadoop distributions and are migrating to the Hortonworks Data Platform (HDP). Through a combination of lecture and hands-on exercises, you will learn how to install, configure, maintain and scale an HDP cluster.

Target Audience

This course is designed for experienced Hadoop administrators and operators who will be responsible for installing, configuring and supporting the Hortonworks Data Platform.…

Since its first deployment at Yahoo in 2006, HDFS has established itself as the de facto scalable, reliable and robust file system for Big Data. It has addressed several fundamental problems of distributed storage at unparalleled scales and with enterprise-grade robustness.

As more and more enterprises adopt Apache Hadoop, it is becoming a unified central store, also known as a Data Lake, for all kinds of enterprise data. Many of these storage use cases involve file storage for classic big data applications, where HDFS is the perfect fit.…

Computers are getting smarter and we are not.

–Tim Berners-Lee, Web Developer

Google, Amazon and Netflix have conditioned us. As consumers, we expect intelligent applications that predict, suggest and anticipate our every move. We want them to sift through the millions of possibilities and suggest just a few that suit our needs. We want applications that take us on a personalized journey through a world of endless possibilities.

These personalized journeys require systems to store and make sense of huge data volumes in an acceptable amount of time.…

A panel of reviewers made up of InfoWorld Test Center editors and industry experts selected Apache Storm as a winner for 2014’s InfoWorld Bossie award. The “Bossies” identify the Best of Open Source Software every year. These Bossie awards celebrate game-changing open source software projects in different domains, and the panel selected Apache Storm in the Big Data Tools category.

This is the first year that a streaming computation framework has been selected in the Big Data category, which is a tribute to Apache Storm’s broad industry adoption and versatility. …

Apache Tez has been selected as a winner for 2014’s InfoWorld Bossie award. The “Bossies” identify the Best of Open Source Software every year and are awarded by a panel of InfoWorld Test Center editors and industry expert reviewers. The Bossie awards celebrate game-changing open source software projects in different domains, and Apache Tez was selected in the Big Data Tools category.

Last year, Apache Hadoop with YARN as its architectural center was awarded a Bossie.…

Internet of Things (IoT) Potential and Process

It may seem obvious (or inevitable), but many companies are embracing the Internet of Things (IoT), and for good reasons, notes Forbes’ Mike Kavis. First, McKinsey Global Institute reports that IoT business will reach $6.2 trillion in revenue by 2025. Second, more and more objects are becoming embedded with sensors that communicate real-time data to data centers’ networks for processing, explain McKinsey’s Chui, Loffler, and Roberts.…