The Hortonworks Blog

An important tool in the Hadoop developer toolkit is the ability to look at key metrics for a MapReduce job – to understand the performance of each job and to optimize future job runs.

In this blog article, we’ll explore how HDP 2.0 stores and provides insight into the performance of a MapReduce job on YARN.

Change from MapReduce v1 and HDP 1.x

In MapReduce-v2 on YARN in HDP 2.0, the JobTracker no longer exists.…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week – as Hadoop 2 goes GA – Arun Murthy discusses his journey with Hadoop. The journey has taken Arun from developing Hadoop, to founding Hortonworks, to this week’s release of Hadoop 2, with its YARN-based architecture.

Arun describes the difference between MapReduce and YARN, and how they are related in Hadoop 2 (and by extension in Hortonworks Data Platform v2).…

I’m thrilled to note that the Apache Hadoop community has declared Apache Hadoop 2.x as Generally Available with the release of hadoop-2.2.0!

This represents the realization of a massive effort by the entire Apache Hadoop community, which started nearly 4 years ago, and we’re sure you’ll agree it’s cause for a big celebration. Equally, it’s a great credit to the Apache Software Foundation, which provides an environment where contributors from various places and organizations can collaborate to achieve a goal as significant as Apache Hadoop v2.…

I’ve been sitting on this post for a while as Apache Hadoop 2 GA work was keeping me extremely busy. As they say, better late than never, so here we go – the slides are at the end of the post.

Three weeks ago, we had an Apache Hadoop YARN meetup at LinkedIn. The kind folks at LinkedIn had offered to host us, in addition to talking about exciting projects like the usage of YARN at LinkedIn and applications on YARN like Apache Samza, Apache Giraph and Apache Helix.…

Apache Storm and YARN extend Hadoop to handle real-time processing of data and provide the ability to process and respond to events as they happen. Our customers have told us about many use cases for this technology combination, and below we present a demo example complete with code so you can try it yourself.

For the demo below, we used our Sandbox VM which is a full implementation of the Hortonworks Data Platform.…

Hortonworks will be making a preview of Apache Storm integration available in Q4 of this year and will be including Apache Storm in the Hortonworks Data Platform in the first half of 2014.

Any time now, the Apache Hadoop community will declare the General Availability of Hadoop 2.0, which includes the much anticipated Apache Hadoop YARN. The YARN-based architecture of Hadoop 2 is the most significant change to Hadoop introduced in the past six years and enables Hadoop to expand from a single-purpose, batch-oriented data platform based on MapReduce into a truly multi-purpose platform supporting a wide range of data processing approaches.…

As part of a modern data architecture, Hadoop needs to be a good citizen and trusted as part of the heart of the business. This means it must provide for all the platform services and features that are expected of an enterprise data platform.

The Hadoop Distributed File System is the rock at the core of HDP and provides reliable, scalable access to data for all analytical processing needs. With HDP 2.0, built into the platform itself, HDFS now has automated failover with a hot standby, with full stack resiliency.…

Security is one of the biggest topics in Hadoop right now. Historically Hadoop has been a back-end system accessed only by a few specialists, but the clear trend is for companies to put data from Hadoop clusters in the hands of analysts, marketers, product managers or call center employees whose numbers could be in the hundreds or thousands. Data security and privacy controls are necessary before this transformation can occur. HDP2, through the next release of Apache Hive, introduces a very important new security feature that allows you to encrypt the traffic that flows between Hadoop and popular analytics tools like MicroStrategy, Tableau, Excel and others.…

I’ve been working on MapReduce frameworks since mid 2005 (Hadoop’s since the start of 2006) and a fundamental feature has always been incredible throughput to access data, but no ACID transactions. That is changing.

Recently, a customer that is using Apache Hive to process terabytes (and growing quickly) of sales data asked us how to handle a business requirement to update millions of records in their sales table each day.…

This post is the fifth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

Case Study: Automatic Reduce Parallelism
Motivation

Distributed data processing is dynamic by nature and it is extremely difficult to statically determine optimal concurrency and data movement methods a priori.…

We’ve been invited to join our partner SAP on October 16 to talk about Big Data and how the integrated SAP HANA + Hadoop approach can solve your big data challenges. This chat will be a live Google Hangout with:

  • Irfan Khan, SVP & GM SAP Global Big Data at SAP (@i_kHANA)
  • Ari Zilka, CTO at Hortonworks (@ikarzali)
  • Timo Elliott, Innovation Evangelist at SAP (@timoelliott)

When: Wednesday, October 16, 8am PT / 11am ET / 5pm CET…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Mahadev Konar discusses Apache ZooKeeper, the open source Apache project that is used to coordinate various processes on a Hadoop cluster (such as electing a leader between two processes).

Mahadev was on the team at Yahoo! in 2006 that started developing what became Apache Hadoop. He has been involved with Apache ZooKeeper since 2008, when the project was open sourced.…

This post is from Vinod Kumar Vavilapalli of Hortonworks and Chris Douglas and Carlo Curino of Microsoft Research.

Great news from the Apache Hadoop YARN community! A paper describing Apache Hadoop YARN was accepted at the 2013 ACM Symposium on Cloud Computing (SoCC 2013), where it won the award for best paper! Here’s the title and abstract:

Title

Apache Hadoop YARN: Yet Another Resource Negotiator [Industrial Paper]

Abstract

The initial design of Apache Hadoop was tightly focused on running massive, MapReduce jobs to process a web crawl.…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Enis Soztutar discusses Apache HBase, built for random read/write access to data in billions of rows and millions of columns.

Enis began using Apache Hadoop in 2006. Now, Enis is a Hortonworks engineer and Apache HBase project management chair. He has also been a committer to Apache Hadoop since 2007 and to HBase since 2012.…

This post is the fourth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

The previous couple of blogs covered Tez concepts and APIs.…
