From the Dev Team

Follow the latest developments from our technical team

This post’s principal author is Ming Ma, Software Development Manager, eBay, with contributions from Mayank Bansal (eBay), Devaraj Das (Hortonworks), Nicolas Liochon (Scaled Risk), Michael Weng (eBay), Ted Yu (Hortonworks), and John Zhao (eBay).

eBay runs Apache Hadoop at extreme scale, with tens of petabytes of data. Hadoop was created for computing challenges like ours, and eBay runs some of the largest Hadoop clusters in existence.

Our business uses Apache HBase to deliver value to our customers in real time, and we are sensitive to any failure because prolonged recovery times significantly degrade site performance and result in material loss of revenue. …

Stinger is not a product. Stinger is a broad, community-based initiative to bring interactive query at petabyte scale to Hadoop. And today, as representatives of this open, community-led effort, we are very proud to announce delivery of Apache Hive 0.12, which represents the critical second phase of this project!

Only five months in the making, Apache Hive 0.12 comprises over 420 closed JIRA tickets contributed by ten companies, with nearly 150 thousand lines of code! …

An important tool in the Hadoop developer toolkit is the ability to look at key metrics for a MapReduce job – to understand the performance of each job and to optimize future job runs.

In this blog article, we’ll explore how HDP 2.0 stores and provides insight into the performance of a MapReduce job on YARN.

Changes from MapReduce v1 and HDP 1.x

In MapReduce v2 on YARN in HDP 2.0, the JobTracker no longer exists.…
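In Hadoop 2, completed-job metrics are served by the JobHistory Server, which exposes them over a REST API. Below is a minimal sketch of pulling a finished job’s counters from that API; the history server host, port and job id are placeholders for your own cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Minimal sketch: fetch the counters of a finished MapReduce job from the
 * Hadoop 2 JobHistory Server REST API. Host, port and job id are placeholders.
 */
public class JobCountersFetcher {
    public static void main(String[] args) throws Exception {
        // Default JobHistory Server web port is 19888; adjust for your cluster.
        String historyServer = "http://historyserver.example.com:19888";
        String jobId = "job_1381441276958_0001";   // placeholder job id

        URL url = new URL(historyServer
                + "/ws/v1/history/mapreduce/jobs/" + jobId + "/counters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            // Print the raw JSON counter groups; a real tool would parse them.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```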

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week, as Hadoop 2 goes GA, Arun Murthy discusses his journey with Hadoop. The journey has taken Arun from developing Hadoop, to founding Hortonworks, to this week’s release of Hadoop 2, with its YARN-based architecture.

Arun describes the difference between MapReduce and YARN, and how they are related in Hadoop 2 (and by extension in Hortonworks Data Platform v2).…

As part of a modern data architecture, Hadoop needs to be a good citizen and trusted as part of the heart of the business. This means it must provide for all the platform services and features that are expected of an enterprise data platform.

The Hadoop Distributed File System is the rock at the core of HDP and provides reliable, scalable access to data for all analytical processing needs. With HDP 2.0, HDFS now has automated failover with a hot standby built into the platform itself, with full-stack resiliency.…
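On the client side, HA is largely transparent: applications address a logical nameservice rather than a single NameNode host, and the client library fails over between NameNodes for them. The sketch below shows that pattern; the nameservice name, NameNode ids and hostnames are illustrative values that would normally come from the cluster’s hdfs-site.xml and core-site.xml.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of an HDFS client talking to an HA-enabled cluster.
 * "mycluster", the NameNode ids and hostnames are illustrative values;
 * on a real cluster they come from the deployed configuration files.
 */
public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Logical nameservice instead of a single NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // Proxy provider that fails over between the two NameNodes.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```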

Security is one of the biggest topics in Hadoop right now. Historically, Hadoop has been a back-end system accessed only by a few specialists, but the clear trend is for companies to put data from Hadoop clusters in the hands of analysts, marketers, product managers or call center employees whose numbers could be in the hundreds or thousands. Data security and privacy controls are necessary before this transformation can occur. HDP 2, through the next release of Apache Hive, introduces a very important new security feature that allows you to encrypt the traffic that flows between Hadoop and popular analytics tools like MicroStrategy, Tableau, Excel and others.…
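For a sense of what this looks like from a tool’s point of view: once SSL is enabled on HiveServer2, a JDBC client typically only needs a couple of extra URL parameters to encrypt the wire. The sketch below assumes such a setup; the host, truststore path, credentials and table name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Minimal sketch of a JDBC client connecting to HiveServer2 over SSL,
 * assuming the encrypted-transport support described above is enabled
 * on the server. Host, truststore path, password and table are placeholders.
 */
public class SecureHiveClient {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // ssl=true plus a truststore tells the driver to encrypt the connection.
        String url = "jdbc:hive2://hive.example.com:10000/default"
                + ";ssl=true"
                + ";sslTrustStore=/etc/security/hive-truststore.jks"
                + ";trustStorePassword=changeit";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sample_table")) {
            while (rs.next()) {
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```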

I’ve been working on MapReduce frameworks since mid-2005 (and on Hadoop since the start of 2006), and a fundamental feature has always been incredible throughput for accessing data, but no ACID transactions. That is changing.

Recently, a customer using Apache Hive to process terabytes (and growing quickly) of sales data asked us how to handle a business requirement to update millions of records in their sales table each day.…
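To make the requirement concrete, here is the kind of row-level update this work aims to enable, issued through the Hive JDBC driver. This is an illustrative sketch only: the table, columns and connection details are made up, and running such a statement requires a Hive release with transactional tables enabled.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Illustrative sketch only: the kind of row-level update the ACID work aims
 * to enable in Hive. Table name, columns and the connection URL are made up;
 * this depends on a Hive release with transactional tables enabled.
 */
public class SalesUpdateSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive.example.com:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "etl", "");
             Statement stmt = conn.createStatement()) {
            // Hypothetical daily correction of millions of rows in place,
            // instead of rewriting whole partitions.
            stmt.execute(
                "UPDATE sales SET status = 'RETURNED' "
              + "WHERE sale_date = '2013-10-14' AND return_flag = true");
        }
    }
}
```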

This post is the fifth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

Case Study: Automatic Reduce Parallelism
Motivation

Distributed data processing is dynamic by nature and it is extremely difficult to statically determine optimal concurrency and data movement methods a priori.…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Mahadev Konar discusses Apache ZooKeeper, the open source Apache project that is used to coordinate various processes on a Hadoop cluster (such as electing a leader between two processes).

Mahadev was on the team at Yahoo! in 2006 that started developing what became Apache Hadoop. He has been involved with Apache ZooKeeper since 2008, when the project was open sourced.…
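As a concrete example of that coordination, the classic ZooKeeper leader-election recipe has each process create an ephemeral sequential znode and treat the lowest sequence number as the leader. A minimal sketch follows; the connection string and the /election path are placeholders, the parent znode is assumed to exist, and production code would also watch its predecessor node to detect failover.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Minimal sketch of the leader-election recipe ZooKeeper is often used for.
 * Assumes the /election parent znode already exists; real code would also
 * watch the predecessor znode so a follower notices when the leader dies.
 */
public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 10000, event -> { });

        // Each candidate registers itself; ZooKeeper appends a sequence number.
        String me = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the smallest sequence is the leader.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);
        boolean leader = me.endsWith(candidates.get(0));
        System.out.println(leader ? "I am the leader" : "I am a follower");

        zk.close();
    }
}
```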

This post is from Vinod Kumar Vavilapalli of Hortonworks, and Chris Douglas and Carlo Curino of Microsoft Research.

Great news from the Apache Hadoop YARN community! A paper describing Apache Hadoop YARN was accepted at the 2013 ACM Symposium on Cloud Computing (SoCC 2013), where it won the award for best paper! Here’s the title and abstract:

Title

Apache Hadoop YARN: Yet Another Resource Negotiator [Industrial Paper]

Abstract

The initial design of Apache Hadoop was tightly focused on running massive MapReduce jobs to process a web crawl.…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Enis Soztutar discusses Apache HBase, built for random read/write access to data in billions of rows and millions of columns.

Enis began using Apache Hadoop in 2006. Now, Enis is a Hortonworks engineer and chair of the Apache HBase Project Management Committee. He has also been a committer to Apache Hadoop since 2007 and to HBase since 2012.…
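For a flavor of that access pattern, the sketch below writes and reads a single cell by row key using the Hadoop-2-era HBase client API. The table name, row key and column family are illustrative; the cluster settings come from hbase-site.xml on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Minimal sketch of the random read/write pattern HBase is built for.
 * "user_events", row key "user42" and column family "d" are illustrative.
 */
public class HBaseRandomAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_events");
        try {
            // Write a single cell keyed by row.
            Put put = new Put(Bytes.toBytes("user42"));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("last_login"),
                    Bytes.toBytes("2013-10-15"));
            table.put(put);

            // Read it back by key: no scan over the billions of other rows.
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_login"))));
        } finally {
            table.close();
        }
    }
}
```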

This post is the fourth in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

The previous couple of blogs covered Tez concepts and APIs.…

Thanks to all those who joined in person and virtually for the Apache Ambari Meetup at Hortonworks this week. We talked tech, we saw demos, we laughed, we cried, we ate pizza.

The central theme of the night was the newly added support for Hadoop 2. Ambari now has:

  • Hadoop 2 Stack: Ambari adds support for installing, managing and monitoring a Hadoop 2 Stack.
  • NameNode HA: Configure NameNode High Availability based on QJM support built into HDFS 2.
  • YARN: Ambari manages YARN Service lifecycle and automatically deploys the MapReduce2 framework.
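Everything Ambari manages is also reachable over its REST API. As a rough illustration, the sketch below reads the YARN service entry for a cluster; the Ambari host, cluster name and credentials are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

/**
 * Minimal sketch: reading the YARN service state from the Ambari REST API.
 * The Ambari host, cluster name and credentials are placeholders; the same
 * pattern works for any service Ambari manages in the Hadoop 2 stack.
 */
public class AmbariServiceState {
    public static void main(String[] args) throws Exception {
        String ambari = "http://ambari.example.com:8080";
        String cluster = "mycluster";

        URL url = new URL(ambari + "/api/v1/clusters/" + cluster + "/services/YARN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Ambari uses HTTP Basic auth; replace the placeholder credentials.
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // JSON response includes the service "state"
            }
        } finally {
            conn.disconnect();
        }
    }
}
```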

Personally, I’ve followed the Go Programming Language (golang) with increasing interest for a while and have been itching to really sink my teeth into it. I’ve always felt you never learn any programming language for real unless you use it to build a fairly large, real-world solution. It’s the only way to tackle real issues and gain some confidence for future battles with destiny… FTR, my first real project in Java was Hadoop, circa 2006.…

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Alan Gates, Hortonworks Co-Founder and Apache Pig Committer, discusses using Apache Pig for efficiently managing MapReduce workloads. Pig is ideal for transforming data in Hadoop: joining it, grouping it, sorting it and filtering it.

Alan explains how Pig takes scripts written in a language called Pig Latin and translates those into MapReduce jobs.

Listen to Alan describe the future of Pig in Hadoop 2.0.…
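As a small illustration of that translation, Pig Latin can also be embedded in a Java driver through PigServer; storing the final alias is what triggers planning and launches the MapReduce jobs. The paths, field names and delimiter below are illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

/**
 * Minimal sketch of driving Pig from Java: the embedded Pig Latin below is
 * translated into MapReduce jobs when the final alias is stored.
 * Paths, field names and the delimiter are illustrative assumptions.
 */
public class PigLatinSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load, filter, group and count: classic transform steps in Pig Latin.
        pig.registerQuery(
            "clicks = LOAD '/data/clicks' USING PigStorage('\\t') "
          + "AS (user:chararray, url:chararray, ts:long);");
        pig.registerQuery("recent = FILTER clicks BY ts > 1380000000L;");
        pig.registerQuery("by_user = GROUP recent BY user;");
        pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(recent);");

        // Storing an alias triggers planning and launches the MapReduce jobs.
        pig.store("counts", "/output/click_counts");
        pig.shutdown();
    }
}
```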
