The Hortonworks Blog

Are you a Hadoop hot shot?  Are you the one everyone looks to for help on their Hadoop projects? Are you looking to showcase your talent to the world?

Then just maybe we have a great option for you. We recently published the Hortonworks Sandbox tutorials on GitHub. Now it’s your turn. We invite you to add your own Hadoop tutorials or to improve on the ones that we’ve published.…

YARN and the Hortonworks Data Platform 2.0 enable one Hadoop cluster to share data and analytical processing capabilities across the enterprise. Organizations can use the Hortonworks Data Platform 2.0 to:

  • Pool all enterprise data into one scalable and reliable storage platform
  • Enable all analytical processing IN the data platform
  • Provide access to this data and processing across all business units

The Capacity Scheduler (CS) ensures that groups of users and applications will get a guaranteed share of the cluster, while maximizing overall utilization of the cluster.…

There’s an old proverb you’ve likely heard about blind men trying to identify an elephant. Depending on the version of the proverb you’ve heard, the elephant is misidentified variously as rope, walls, pillars, baskets, brushes and more. Oddly, no one identified it as a next-generation enterprise data platform, but I guess it is an old proverb.

The Hadoop elephant is a platform though, and as such the proverb holds true. Depending on your perspective, it has different capabilities, components and integration points to meet your requirements.…

We’ve been hosting a series of webinars focusing on how to make Apache Hadoop a viable enterprise platform that powers modern data architectures.

Implementing a modern data architecture with Hadoop means that it must deeply integrate with existing technologies, leverage existing skills and investments and provide key services. This guest post from David Smith, Vice President of Marketing and Community at Revolution Analytics, shares his perspective on the role of data scientists in a Big Data world.…

This is a guest blog post from Gary Nakamura, CEO at our partner Concurrent, Inc., discussing Cascading Pattern and the new Hadoop tutorial they have written for the Hortonworks Sandbox. This is one of the first tutorials aimed at a more experienced crowd. Enjoy!

Cascading Pattern: Deploy Predictive Models on Hadoop in minutes.

Cascading Pattern signifies an important milestone for Cascading as we continue our mission of driving innovation and simplifying Big Data application development.…

In this post we’ll cover some new scheduling options available via Apache Oozie in HDP 2. You can try out these capabilities in HDP 2 Beta and HDP 2 Beta Sandbox.

What Is Oozie Again?

Apache Oozie is a workflow engine and scheduler for Hadoop. Oozie allows you to run jobs in Hadoop at pre-defined intervals. The jobs can be simple ones that execute single Hive or Pig commands or can be full directed acyclic graphs representing complex workflows.…

Albert Einstein is credited with saying that he didn’t worry about the future because it would arrive soon enough. We don’t worry about the future either; we focus on building it. And today, we are delighted to release the Hortonworks Data Platform 2.0 Beta Sandbox. This is the single-node VM based on the HDP 2.0 Beta release. This release is in the easy-to-use Sandbox form factor and allows you to easily work with a stable, reliable v2 of Hadoop.…

In March of 2013 we announced our plans to enter the European market, and just six months later we have not only landed but are also expanding and operating across Europe, with field teams in the UK, France and Germany. Those teams are growing and, more importantly, our customer base is expanding.

What would expansion be without customers?

European customers are actively looking for solutions that enable the processing and analysis of large quantities of data, and Apache Hadoop is meeting those needs.  …

It’s not an easy task to find the right hardware configuration for Hadoop. Thanks to our partner Dell, we’ve detailed a configuration for Hortonworks Data Platform (HDP) on the Dell PowerEdge R720XD. This reference configuration introduces a server set-up that can run HDP and is intended for organizations looking to configure Apache Hadoop clusters within their information technology environment for big data analytics.

Download the reference here.

How big is big anyway? What sort of size and shape does a Hadoop cluster take?

These are great questions as you begin to plan a Hadoop implementation. Designing and sizing a cluster is complex and something our technical teams spend a lot of time working on with customers: from storage size to growth rates, from compression rates to cooling, there are many factors to take into account.

To make that a little more fun, we’ve built a cluster-size-o-tron which performs a more simplistic calculation based on some assumptions on node sizes and data payloads to give an indication of how big your particular big is.…
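To give a feel for the kind of arithmetic such a calculator might perform, here is a minimal sketch. It is not the cluster-size-o-tron's actual logic: the function name, every parameter and every default value (replication factor, compression ratio, growth rate, usable disk per node, headroom) are assumptions chosen purely for illustration.

```python
import math


def estimate_data_nodes(raw_data_tb, replication=3, compression_ratio=1.0,
                        yearly_growth=0.0, years=1,
                        usable_tb_per_node=24.0, headroom=0.25):
    """Rough number of data nodes needed to store a data set.

    All parameters are illustrative assumptions, not vendor guidance.
    """
    # Let the data volume grow at a compound rate over the planning horizon.
    projected_tb = raw_data_tb * (1 + yearly_growth) ** years
    # Compression shrinks the payload; replication multiplies what is stored.
    stored_tb = projected_tb / compression_ratio * replication
    # Keep a fraction of each node's disk free for intermediate data.
    capacity_per_node_tb = usable_tb_per_node * (1 - headroom)
    # Round up: a partially filled node is still a whole node.
    return math.ceil(stored_tb / capacity_per_node_tb)


if __name__ == "__main__":
    # 100 TB of raw data, 2:1 compression, 3x replication, no growth.
    print(estimate_data_nodes(100, compression_ratio=2.0))
```

A real sizing exercise also weighs processing load, memory, network and cooling, which this storage-only estimate ignores.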

Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. It has proven immensely popular with folks looking to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive: User Defined Functions (UDF). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.…

As the original architect of MapReduce, I’ve been fortunate to see Apache Hadoop and its ecosystem projects grow by leaps and bounds over the past seven years.

Today, most of my time is spent as an architect and committer on Apache Hive. Hive is the gateway for doing advanced work on the Hadoop Distributed File System (HDFS) and the MapReduce framework. We are on the verge of releasing major improvements to Apache Hive, in coordination with work going on in Apache Tez and YARN.…

This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

Overview

Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing the processing of data and the edges representing the movement of data between processing steps.…
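The dataflow-graph idea can be illustrated with a small, self-contained sketch. This is not the Tez API: the `DataflowGraph` class, its methods and the vertex names are hypothetical, and the code only shows how vertices (processing steps) and edges (data movement) determine a valid execution order.

```python
from collections import defaultdict, deque


class DataflowGraph:
    """Toy dataflow graph; hypothetical, not the Apache Tez API."""

    def __init__(self):
        self.vertices = []              # processing steps, in insertion order
        self.edges = defaultdict(list)  # step -> steps that consume its output

    def add_vertex(self, name):
        if name not in self.vertices:
            self.vertices.append(name)

    def add_edge(self, src, dst):
        # An edge means data produced by `src` moves to `dst`.
        self.add_vertex(src)
        self.add_vertex(dst)
        self.edges[src].append(dst)

    def execution_order(self):
        # Kahn's algorithm: a step is ready once all of its inputs are done.
        pending_inputs = {v: 0 for v in self.vertices}
        for src in self.vertices:
            for dst in self.edges[src]:
                pending_inputs[dst] += 1
        ready = deque(v for v in self.vertices if pending_inputs[v] == 0)
        order = []
        while ready:
            step = ready.popleft()
            order.append(step)
            for dst in self.edges[step]:
                pending_inputs[dst] -= 1
                if pending_inputs[dst] == 0:
                    ready.append(dst)
        if len(order) != len(self.vertices):
            raise ValueError("graph contains a cycle")
        return order


if __name__ == "__main__":
    graph = DataflowGraph()
    graph.add_edge("scan_orders", "join")
    graph.add_edge("scan_customers", "join")
    graph.add_edge("join", "aggregate")
    print(graph.execution_order())
```

In this toy example the two scans have no inputs and can start immediately, the join waits for both, and the aggregation runs last.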

With HDP 1.3 and HDP 2.0 Beta, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors.

HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system and are:

  • Performant and Reliable: Snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree
  • Scalable: Snapshots do not create extra copies of blocks on the file system.

We are excited to announce that the call for abstracts for Hadoop Summit Europe 2014 (April 2-3, 2014) is now open and closes on October 31st. New this year are updated tracks that give attendees new options. Last year’s event was wildly successful, and we received a lot of feedback on how to make things better… and we listened.

Providing high-value content is what the conference is all about, and we received some great suggestions from the community on how to improve the sessions. …
