The Hortonworks Blog

Posts categorized by: Hadoop
A Cosmopolitan Metropolis

Brussels, Belgium, conjures images of a cosmopolitan metropolis, where geopolitical summits are held, where world economic forums are debated, where global European institutions are headquartered, and where citizens and diplomats fluently converse in more than three languages—English, French, Dutch or German, along with other non-official local flavors.

To this colorful collage, add the image of Hadoop Summit Europe 2015 for big data developers, practitioners, industry experts, and entrepreneurs who make a difference in the digital world, who fluently code in multiple programming languages (Java, Python, Scala, C++, Pig, SQL, or R), and who innovate and incubate Apache projects.…

Two weeks ago Hortonworks presented the third in a series of eight Discover HDP 2.2 webinars: Discover HDP 2.2: Apache Falcon for Hadoop Data Governance. Andrew Ahn, Venkatesh Seetharam, and Justin Sears hosted this third webinar in the series.

After Justin Sears set the stage for the webinar by explaining the drivers behind Modern Data Architecture (MDA), Andrew Ahn and Venkatesh Seetharam introduced Apache Falcon and discussed how to use it for central management of the data lifecycle, business continuity and disaster recovery, and audit and compliance requirements.…

Increasingly, companies around the world are adopting Apache Hadoop as a core component of their Modern Data Architecture (MDA) in order to collect, store, analyze and manipulate massive quantities of data on their own terms—regardless of the source of that data, how old it is, where it is stored, or under what format. Once they build their Modern Data Architecture, what is the best way for them to manage and monitor their Hadoop clusters?…

Introduction

With the rapid adoption of Apache Hadoop, enterprises use machine learning as a key technology to extract tangible business value from their massive data assets. This derivation of business value is possible because Apache Hadoop YARN, as the architectural center of Modern Data Architecture (MDA), allows purpose-built data engines such as Apache Tez and Apache Spark to process and iterate over multiple datasets for data science techniques within the same cluster.…

News of customer data breaches seems to hit the headlines every week and we know that attackers have become more sophisticated in their tactics. Organizations too must step up their capabilities and build robust, data-driven defense systems. Join us for a webinar on Nov. 12 to learn about the current threats against enterprises like yours, and how a Modern Data Architecture (MDA) with Hortonworks Data Platform (HDP) and Sqrrl Enterprise can enable intuitive exploration, discovery and pattern recognition over your big cyberdata.…

In part 1, Kenneth Peeples, JBoss technology evangelist and principal marketing manager for Data Virtualization and Fuse Service Works at Red Hat, gave us an overview of the Red Hat and Hortonworks webinar series and offered insights into JBoss Data Virtualization and HDP. He started with an overview of data virtualization with the Hortonworks Data Platform and went over the first use case, Sentiment and Sales Analysis. Today, he describes the three other use cases.…

Recently the Oracle Data Integrator products were certified on the Hortonworks Data Platform version 2.1, and we’re delighted to be working more closely with Oracle engineering on these kinds of efforts. We’re happy to bring you this guest blog today, written by Alex Kotopoulis, Product Manager, Oracle Data Integration for Big Data at Oracle, who discusses the recent integration and certification initiatives. You can learn more by joining our webinar on November 11; register here.…

Last week Hortonworks presented the second of our eight Discover HDP 2.2 webinars. Alan Gates and Raj Bains discussed the Stinger.next initiative and new Apache Hive features for speed, scale and SQL that are included in Hortonworks Data Platform 2.2.

After an overview of HDP 2.2, Alan discussed what the Apache community accomplished with the original Stinger initiative and how that momentum continues in Stinger.next.

Alan and Raj then discussed details on three areas of innovation currently underway in the Apache Hive project:

  • For SQL – transactions with ACID semantics
  • For Speed – the cost based optimizer
  • For Scale – dynamic query optimization

Here is the complete recording of the webinar.

Here is the presentation deck.…

On October 15 we announced that we would support Apache Hadoop as an Infrastructure as a Service (IaaS) on Microsoft Azure. This made us the first Hadoop vendor to give customers and prospects access to that flexible and scalable cloud infrastructure for their big data deployments.

This guide walks you through using the Azure Gallery to quickly deploy Hortonworks Data Platform (HDP) clusters on Microsoft Azure IaaS.

What you need is:

  • A Microsoft Azure account
  • That’s it!

Arsalan Tavakoli-Shiraji, customer engagement lead overseeing business development activities at Databricks, is our guest blogger today. In this blog, he discusses our expanded partnership built around Apache Spark on Apache Hadoop in three areas: customers, engineering, and open source.

Today Databricks and Hortonworks are announcing an expanded partnership built around Apache Spark; allow me to explain why we’re thrilled to be embarking on this journey with them.

When we started Databricks last summer, Apache Spark was in the early stages of enterprise adoption.…

A few weeks back, we outlined a broad initiative to invest in Spark in the context of the Hadoop ecosystem. We intend to facilitate a more efficient utilization of Hadoop cluster resources for ETL and/or Data Pipeline workloads when using Spark. Many of the lessons learned while building out MapReduce, Apache Tez and other YARN data-processing frameworks can be applied to the Spark project in order to optimize its resource utilization and to make it a good multi-tenant citizen within a YARN-based Hadoop cluster.…

Last week Hortonworks presented the first of eight Discover HDP 2.2 webinars: Comprehensive Hadoop Security with Apache Ranger and Apache Knox. Vinay Shukla and Balaji Ganesan hosted this first webinar in the series.

Balaji discussed how to use Apache Ranger for centralized security administration, to set up authorization policies, and to monitor user activity with auditing. He also covered Ranger innovations now included in HDP 2.2:

  • Support for Apache Knox and Apache Storm, for centralized authorization and auditing
  • Deeper integration of Ranger with the Apache Hadoop stack with support for local grant/revoke in HDFS and HBase
  • Ranger’s enterprise readiness, with the introduction of REST APIs for policy management, and scalable storage of audit in HDFS

Vinay presented Apache Knox and API security for Apache Hadoop.…

We recently hosted a Spark webinar as part of the YARN Ready series, aimed at a technical audience including developers of applications for Apache Hadoop and Apache Hadoop YARN. During the event, a number of good questions surfaced that we wanted to share with our broader audience in this blog. Take a look at the video and slides along with these questions and answers below.

You can listen to the entire webinar recording here.…

Merv Adrian, the widely respected Gartner analyst, recently remarked on the continuing evolution of Apache Hadoop:

YARN is the one that really matters because it doesn’t just mean the list of components will change, but because in its wake the list of components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads.…

HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I’ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files. All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.
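To make the idea of a namespace tree with per-entry attributes concrete, here is a minimal Python sketch. It is only an illustration of the concept described above: the `Node` class, its field names, and the sample paths are hypothetical and do not reflect HDFS's actual classes or its on-disk metadata format.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, Optional, Tuple


@dataclass
class Node:
    """Hypothetical namespace entry (illustration only, not an HDFS class)."""
    name: str
    owner: str
    permission: int                     # POSIX-style mode bits, e.g. 0o755
    is_dir: bool = False
    replication: Optional[int] = None   # meaningful for files only
    quota: Optional[int] = None         # meaningful for directories only
    children: Dict[str, "Node"] = field(default_factory=dict)

    def add(self, child: "Node") -> "Node":
        """Attach a child entry; only directories may hold children."""
        if not self.is_dir:
            raise ValueError(f"{self.name!r} is a file and cannot hold children")
        self.children[child.name] = child
        return child


def walk(node: Node, path: str = "/") -> Iterator[Tuple[str, Node]]:
    """Yield (absolute path, node) pairs depth-first, children sorted by name."""
    yield path, node
    for name in sorted(node.children):
        child_path = path.rstrip("/") + "/" + name
        yield from walk(node.children[name], child_path)


# Build a tiny sample tree: / -> /user -> /user/data.csv (all names made up).
root = Node("", owner="hdfs", permission=0o755, is_dir=True)
user = root.add(Node("user", owner="hdfs", permission=0o755, is_dir=True, quota=1000))
user.add(Node("data.csv", owner="alice", permission=0o644, replication=3))

paths = [p for p, _ in walk(root)]
```

Walking the tree from `root` visits every directory and file along with its attributes, which is the kind of information the metadata has to capture for the whole namespace.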

WARNING: Do not attempt to modify metadata directories or files.…