Hortonworks on Apache Hadoop


Why not RAID-0? It’s about Time and Snowflakes

A recurrent question on the various Hadoop mailing lists is “why does Hadoop prefer a set of separate disks to the same set managed as a RAID-0 disk array?”

It’s about time and snowflakes.

JBOD and the Allure of RAID-0

In Hadoop clusters, we recommend treating each disk separately, in a configuration that is known, somewhat disparagingly, as “JBOD”: Just a Bunch of Disks.

In comparison, RAID-0 (a bit of a misnomer, there being no redundancy) stripes data across all the disks in the array. This promises some advantages:

  • Higher IO rates on small accesses
  • Higher bandwidth on larger accesses, especially write operations
  • Elimination of the hot-spot where a single disk is overloaded because its data is more in demand

In RAID-0, data is striped across disks. When data needs to be written, it is divided up into small blocks (64KB or more).…
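The striping described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular RAID controller or driver is implemented: the 64KB stripe unit comes from the text, while the round-robin layout, the four-disk array in the usage note and the names `locate` and `disks_touched` are assumptions made for the example.

```python
from typing import NamedTuple, Set

STRIPE_UNIT = 64 * 1024  # 64KB, the smallest block size mentioned in the text


class Location(NamedTuple):
    disk: int    # index of the disk that holds the byte
    offset: int  # byte offset within that disk


def locate(logical_offset: int, num_disks: int,
           stripe_unit: int = STRIPE_UNIT) -> Location:
    """Map a byte offset in the array to a (disk, offset) pair.

    Assumes a plain round-robin layout: block 0 on disk 0, block 1 on
    disk 1, and so on, wrapping around after the last disk.
    """
    if logical_offset < 0 or num_disks < 1 or stripe_unit < 1:
        raise ValueError("offset must be >= 0; disks and stripe unit >= 1")
    block, within = divmod(logical_offset, stripe_unit)
    row, disk = divmod(block, num_disks)
    return Location(disk, row * stripe_unit + within)


def disks_touched(start: int, length: int, num_disks: int,
                  stripe_unit: int = STRIPE_UNIT) -> Set[int]:
    """Return the set of disks involved in one contiguous access."""
    if length <= 0:
        return set()
    first_block = start // stripe_unit
    last_block = (start + length - 1) // stripe_unit
    return {block % num_disks for block in range(first_block, last_block + 1)}
```

Under these assumptions, `disks_touched(0, 4096, 4)` involves a single disk, while `disks_touched(0, 256 * 1024, 4)` spreads the access over all four, which is where the promise of higher bandwidth on larger accesses comes from.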

ApacheCon EU Day One Roundup – Part 1

Hackathon and Aeromuseum Reception

ApacheCon Europe kicked off yesterday with an all-day Hackathon followed by a committers’ reception at the Sinsheim Technik Museum, which has, among other large aircraft, a Concorde in Air France livery. My favorite was the diesel engine from a U-Boat, with its enormous drive-shaft and pistons.

Taking the Guesswork out of Hadoop Infrastructure

Winding a rented Opel through its gears along village roads for half an hour from my hotel-out-of-a-Black-Forest-fairy-tale, I made it to ApacheCon EU’s first day of sessions mid-way through the first talk, Steve Watt’s ‘Taking the Guesswork out of Hadoop Infrastructure.’ Steve talked about the harsh reality of fitting hardware to a given workload using Hadoop, with the quote: “We’ve profiled our Hadoop applications so we know what type of infrastructure we need.” – Said No One, Ever.…

Agile Data European Megatour, then Home to Atlanta!

Agile Data hits the road this month, crossing Europe with the good news about Hadoop and teaching Hadoop users how to build value from data by building analytics applications with Hadoop.

We’ll be giving out discount coupons to Hadoop Summit Europe, which is on March 20-21 in Amsterdam!

  1. 11/3 – Agile Data @ The Warsaw Hadoop Users Group
  2. 11/5 to 11/6 – Attending ApacheCon Europe 2012 in Sinsheim, Germany. Say hello!
  3. 11/7 – Agile Data @ The France Hadoop Users Group in Paris
  4. 11/8 – Agile Data @ Netherlands Hadoop Users Group in Utrecht
  5. 11/12 – Agile Data @ Hadoop Users Group UK in London
  6. 11/13 – Agile Data @ HP Labs in Bristol, England
  7. 11/15 – Agile Data @ The combined Data Science ATL / Atlanta Hadoop Users Group
  8. 11/16 – Agile Data @ The Emory Library
  9. 11/19 – Agile Data @ The Atlanta MongoDB Users Group

I’m writing this from Warsaw, the first stop on my tour.…

DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside)

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third-party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out, but to give you an idea of what the first proper interface on Hadoop feels like…

There was a comedian who had a bit about… remember when you saw Jurassic Park for the first time?…

Why Microsoft is committed to Hadoop and Hortonworks

This guest blog post is from Microsoft’s Dave Campbell, providing more details on why they chose Hortonworks for HDInsight.

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure.  To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.…

Rackspace and Hortonworks, a Match Made in the Clouds

As we speed towards widespread enterprise adoption of Apache Hadoop, it has become readily apparent that this new data platform must not only capture, process and distribute data, but must also be deployable in a variety of ways, be it on premises, in a VM, as an appliance or, better yet, in the cloud…

Today we announced a new relationship with Rackspace in which we will develop an OpenStack-based Hadoop solution for the public and private cloud. This is not just a paper relationship. It is a joint effort to produce and make available Hortonworks Data Platform for OpenStack in early 2013.

There are customers today that deploy Hadoop clusters using HDP on dedicated hardware at Rackspace and this is now available as a turn-key, on-demand service running on the Rackspace open cloud and in clusters on private cloud infrastructure in data centers or a customer’s data center.…

Strata NYC Reporting: Monday @ Big Data Camp, Tuesday @ Strata Retrospective

This is Russell Jurney, your Big Data reporter on the ground here at Strata NYC/Hadoop World at the New York Hilton. Monday night’s main event was Big Data Camp. As in any unconference, the best action was in the hallway, meeting people you only know by reputation or from Twitter. Highlights were:

  • Microsoft’s demonstration of Excel -> Power Pivot -> Hortonworks Data Platform
      • In light of today’s announcement – the Hadoop market just got MUCH bigger

  • Druid: Real-Time Analytics at a Billion Rows Per Second by Eric Tschetter, Co-founder of Metamarkets
      • In-RAM stores are an interesting new development as RAM becomes cheaper and cheaper, and can augment a Hadoop-centric workload.

  • The Horrors Hidden in Your Models by Steven Hillion
      • This talk stressed the importance of unit testing your statistical models and finding areas where they fall over, then working with customers to understand the problem.

Enabling Big Data Insight for Millions of Windows Developers

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.…

HBase Futures

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Significant work has happened in each of the areas outlined above in the 0.94 and 0.96 (currently trunk) branches. For example, the MTTR (mean time to recover) work happening in HBASE-5843 will improve the data availability significantly.…

HBase at Hortonworks: An Update

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low-latency Hadoop use cases. As a publishing platform, HBase exposes data refined in Hadoop to outside systems. As an online column store, HBase supports the blending of random-access data reads and writes with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements, and will be wire compatible with HBase 0.92 (in HDP 1.0). Here are some notable changes:

  1. HBASE-4128 – Data Block Encoding of KeyValues (aka delta encoding / prefix compression) [PERFORMANCE]
  2. HBASE-4465 – Lazy-seek optimization for StoreFile scanners [PERFORMANCE]
  3. HBASE-5074 – support checksums in HBase block cache [PERFORMANCE]
  4. HBASE-5128 – [uber hbck] Online automated repair of table integrity and region consistency problems [OPERABILITY]
  5. HBASE-3584 – Allow atomic put/delete in one call [FEATURE]
  6. HBASE-5229 – Provide basic building blocks for “multi-row” local transactions [FEATURE]

And 0.94 is only the start.  …

Full Stack HA in Hadoop 1: HBase’s Resilience to NameNode Failover

In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadoop 1, with Hadoop and HBase as components of the stack.

Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down. Note that HBase would remain functional while the NameNode was down. The outage would only affect those operations that require a trip to the NameNode (for example, rolling the WAL, compaction, or flush), and those would affect only the relevant end users (a user of the HBase get API may not be affected if that get didn’t require a new file open, for example).…

Hortonworks at Strata Conference 2012 in New York City!

Visit Hortonworks at Strata New York!

We are so excited to attend O’Reilly Strata Conference in New York next week! If you are going to be there, please come by booth 16 to meet members of the Hortonworks team, who will be happy to discuss any questions you have about Hortonworks Data Platform and its business benefits. You can also see a nice demo and walk away with cool swag!

Hortonworks will also be participating in an array of sessions and meet-ups at the conference, and we hope you can join us.

Attend our sessions!

Hadoop’s Role in the Big Data Architecture  (part of Bridge to Big Data)
Jim Walker @jaymce, Director Product Marketing
Tuesday, October 23, 3:30pm, Nassau

Future of Data Processing with Apache Hadoop 
Arun Murthy @acmurthy, Co-founder and Architect and VP, Apache Hadoop at the ASF
Wednesday, October 24, 1:40pm, Grand East (NY Hilton)

Drive Smarter Decisions with Microsoft Big Data
Wednesday, October 24, 1:40pm, Regent Parlor

HDFS: What is new and future
Sanjay Radia @ssr, Co-founder of Hortonworks and Apache Hadoop Committer, and Todd Lipcon @tlipcon
Wednesday, October 24, 4:10pm

Making Pig Fly: Optimizing Data Processing on Hadoop
Thejas Madhavan Nair @thejasn and Jianyong Dai, both PMC members and committers of the Apache Pig project
Thursday, October 25, 5pm, Murray West (NY Hilton)

Let’s “meet-up”!

Hadoop Summit Expands to Europe in 2013!

This will be the first and the largest European conference focused exclusively on accelerating the enterprise adoption of Apache Hadoop. The event will be a gathering for the vibrant Apache Hadoop community of developers, data scientists, data professionals and solution providers and will be held at the historic Beurs van Berlage in Amsterdam on March 20-21, 2013.

Call for papers now open!

Apache Hadoop practitioners, enthusiasts and solution providers with an idea for a talk at the event can submit their ideas now on the call for papers page. All accepted speakers will receive complimentary admission to the event.

For more information on Hadoop Summit Europe, go to: http://hadoopsummit.org/amsterdam.

Remember to follow us on Twitter and Facebook for future updates!

We hope to see you there!…

Apache Hadoop YARN Meetup at Hortonworks – ReCap!

Introduction

The Apache Hadoop YARN meetup at Hortonworks on October 12, 2012, which we previously announced, was a resounding success. We had a very good turnout of around seventy people from the community.

Meetup sessions

Deployments at Yahoo!

The meetup kicked off with YARN committers from Yahoo! presenting on current Hadoop 2.0 deployments there. As part of the presentation, they:

  • described areas where YARN has advanced the state of the art, such as scalability, its current stability, the power of the YARN web services, and its superior performance compared to previous versions;
  • covered the efforts undertaken to battle-test YARN, including application validation and performance benchmarking;
  • summed up with suggestions for improving issues such as UI loading and the lack of a generic history server.

Chris Riccomini on “Building Applications on YARN”


Chris Riccomini from LinkedIn then presented his experience in “Building Applications on YARN”.…

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and deliver big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months, tying our solutions together and optimizing them for an appliance. This solution gives analysts the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis while still using the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this
This is an engineered solution. …
