The Hortonworks Blog

Albert Einstein is credited with saying that he never worried about the future because it would arrive soon enough. We don’t worry about the future either; we focus on building it. And today, we are delighted to release the Hortonworks Data Platform 2.0 Beta Sandbox, a single-node VM based on the HDP 2.0 Beta release. It comes in the easy-to-use Sandbox form factor and allows you to easily work with a stable, reliable v2 of Hadoop.…

In March of 2013 we announced our plans to enter the European market, and just six months later we have not only landed but are also expanding, with field teams operating across Europe in the UK, France and Germany. Those teams are growing and, more importantly, our customer base is expanding.

What would expansion be without customers?

European customers are actively looking for solutions that enable the processing and analysis of large quantities of data, and Apache Hadoop is meeting those needs.  …

It’s not an easy task to find the right hardware configuration for Hadoop. Thanks to our partner Dell, we’ve detailed a configuration for the Hortonworks Data Platform (HDP) on the Dell PowerEdge R720XD. This reference configuration describes a server set-up that can run HDP and is intended for organizations looking to configure Apache Hadoop clusters within their information technology environment for big data analytics.

Download the reference configuration here.

How big is big anyway? What sort of size and shape does a Hadoop cluster take?

These are great questions as you begin to plan a Hadoop implementation. Designing and sizing a cluster is complex, and it is something our technical teams spend a lot of time working on with customers: from storage size to growth rates, and from compression rates to cooling, there are many factors to take into account.

To make that a little more fun, we’ve built a cluster-size-o-tron which performs a simpler calculation, based on some assumptions about node sizes and data payloads, to give an indication of how big your particular big is.…
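For a rough sense of the arithmetic involved, here is a minimal back-of-the-envelope sketch in Java of the kind of calculation such a tool performs. Apart from the HDFS default replication factor of 3, every constant here is an illustrative assumption, not sizing guidance:

```java
// Back-of-the-envelope cluster sizing, in the spirit of the
// cluster-size-o-tron. All constants are assumptions for illustration.
public class ClusterSizer {

    static final int REPLICATION = 3;          // HDFS default replication factor
    static final double TEMP_OVERHEAD = 0.25;  // assumed scratch/intermediate space
    static final double NODE_RAW_TB = 24.0;    // assumed 12 x 2 TB disks per node

    /** Estimate node count for a starting data set plus monthly growth. */
    static int nodesNeeded(double initialTb, double monthlyGrowthTb, int months) {
        double logicalTb = initialTb + monthlyGrowthTb * months;
        double rawTb = logicalTb * REPLICATION * (1 + TEMP_OVERHEAD);
        return (int) Math.ceil(rawTb / NODE_RAW_TB);
    }

    public static void main(String[] args) {
        // e.g. 50 TB today, growing 5 TB per month, planned for 12 months
        System.out.println(nodesNeeded(50, 5, 12) + " nodes");
    }
}
```

The real exercise layers on compression ratios, growth curves and operational factors such as power and cooling, which is why our teams still work through it with customers.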

Just a couple of weeks ago we published our simple SQL to Hive Cheat Sheet. It has proven immensely popular with folks looking to understand the basics of querying with Hive. Our friends at Qubole were kind enough to work with us to extend and enhance the original cheat sheet with more advanced features of Hive, such as User Defined Functions (UDFs). In this post, Gil Allouche of Qubole takes us from the basics of Hive through to getting started with more advanced uses, which we’ve compiled into another cheat sheet you can download here.…
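For a flavor of the territory the cheat sheet covers, here is a minimal sketch of a UDF in the classic Hive style: subclass UDF and supply an evaluate() method, which Hive resolves against column types at query time. The class and function below are our own hypothetical example, not taken from the cheat sheet:

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A hypothetical UDF that lower-cases and trims a string column.
public final class LowerTrim extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null; // pass NULLs through, as built-in functions do
        }
        return new Text(input.toString().trim().toLowerCase());
    }
}
```

Once compiled into a JAR, a function like this is registered in a Hive session with ADD JAR and CREATE TEMPORARY FUNCTION, after which it can be called like any built-in.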

As the original architect of MapReduce, I’ve been fortunate to see Apache Hadoop and its ecosystem projects grow by leaps and bounds over the past seven years.

Today, most of my time is spent as an architect and committer on Apache Hive. Hive is the gateway for doing advanced work on Hadoop Distributed File System (HDFS) and the MapReduce framework. We are on the verge of releasing major improvements to Apache Hive, in coordination with work going on in Apache Tez and YARN.…

This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

Overview

Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing the processing of data and the edges representing the movement of data between processing steps.…
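To make the model concrete, here is a small conceptual sketch in Java; it illustrates the vertices-and-edges idea only and is not the Apache Tez API:

```java
import java.util.*;

// A conceptual model of a Tez-style dataflow graph: vertices are
// processing steps, edges are movements of data between them.
class DataflowGraph {
    private final Map<String, List<String>> adjacency = new LinkedHashMap<>();

    DataflowGraph vertex(String name) {           // a data-processing step
        adjacency.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    DataflowGraph edge(String from, String to) {  // data movement between steps
        adjacency.get(from).add(to);
        return this;
    }

    public static void main(String[] args) {
        // A join-then-aggregate plan expressed as one graph, rather than
        // as a chain of separate MapReduce jobs with HDFS writes between.
        new DataflowGraph()
            .vertex("scan-orders").vertex("scan-users")
            .vertex("join").vertex("aggregate")
            .edge("scan-orders", "join")
            .edge("scan-users", "join")
            .edge("join", "aggregate");
    }
}
```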

With HDP 1.3 and HDP 2.0 Beta, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors.

HDFS Snapshots are read-only, point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or on the entire file system (see the example after this list) and are:

  • Performant and Reliable: Snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree.
  • Scalable: Snapshots do not create extra copies of blocks on the file system.
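Here is a minimal sketch of taking a snapshot through the Hadoop FileSystem Java API. The path and snapshot name are hypothetical, and the directory must first be made snapshottable by an administrator:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SnapshotExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical directory; an admin has already run:
        //   hdfs dfsadmin -allowSnapshot /data/important
        Path dir = new Path("/data/important");

        // Creation is atomic; the snapshot appears as a read-only view
        // under /data/important/.snapshot/before-cleanup
        Path snapshot = fs.createSnapshot(dir, "before-cleanup");
        System.out.println("Created snapshot at " + snapshot);
    }
}
```

The same operation is available from the shell with hdfs dfs -createSnapshot.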

We are excited to announce that the call for abstracts for Hadoop Summit Europe 2014 (April 2-3, 2014) is now open; it closes on October 31st. One of the new things for this year is a set of updated tracks providing attendees with new options. Last year’s event was wildly successful, and we received a lot of feedback on how to make things better… and we listened.

Providing high-value content is what the conference is all about, and we received some great suggestions from the community on how to improve the sessions. …

Syncsort, a technology partner with Hortonworks, helps organizations propel Hadoop projects with a tool that makes it easy to “Collect, Process and Distribute” data with Hadoop. This process, often called ETL (Extract, Transform, Load), is one of the key drivers for Hadoop initiatives; but why is this technology a key enabler of Hadoop? To find out, we talked with Syncsort’s Director of Strategy, Steve Totman, a 15-year veteran of data integration and warehousing, who provided his perspective on Data Warehouse Staging Areas.…

He loves me, he loves me not… using daisies to figure out someone’s feelings is so last century. A much better way to determine whether someone likes you, your product or your company is to do some analysis on Twitter feeds to get better data on what the public is saying. But how do you take thousands of tweets and process them? In our video, Understand your customers’ sentiments with Social Media Data, we show you how to capture a Twitter stream and run Sentiment Analysis on it.…
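As a taste of the underlying idea, here is a toy word-list scorer in Java. The video shows how to do the equivalent at scale on Hadoop; the word lists below are made up for illustration:

```java
import java.util.*;

// A toy dictionary-based sentiment scorer for a single tweet.
public class TweetSentiment {
    static final Set<String> GOOD = new HashSet<>(Arrays.asList("love", "great", "awesome"));
    static final Set<String> BAD  = new HashSet<>(Arrays.asList("hate", "awful", "broken"));

    // Positive result means positive sentiment, negative means negative.
    static int score(String tweet) {
        int s = 0;
        for (String word : tweet.toLowerCase().split("\\W+")) {
            if (GOOD.contains(word)) s++;
            if (BAD.contains(word))  s--;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(score("I love this, it is awesome"));     // prints 2
        System.out.println(score("the update is awful and broken")); // prints -2
    }
}
```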

We’re continuing our series of quick interviews with Apache Hadoop project committers at Hortonworks.

This week Venkat Ranganathan discusses using Apache Sqoop for bulk data movement between Hadoop and enterprise data stores. Sqoop can also move data the other way, from Hadoop into an EDW.

Venkat is a Hortonworks engineer and Apache Sqoop committer who wrote the connector between Sqoop and the Netezza data warehousing platform. He also worked with colleagues at Hortonworks and in the Apache community to improve integration between Sqoop and Apache HCatalog, delivered in Sqoop 1.4.4.…
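Sqoop is normally driven from the command line, but the same tool can also be invoked from Java. Here is a minimal sketch, with a hypothetical JDBC URL, table and target directory:

```java
import org.apache.sqoop.Sqoop;

public class BulkImportExample {
    public static void main(String[] args) {
        // Equivalent to running `sqoop import ...` from the shell;
        // the connection details below are hypothetical.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:postgresql://dbhost/sales",
            "--table", "ORDERS",
            "--target-dir", "/data/orders",
            "--num-mappers", "4"  // degree of parallelism for the transfer
        };
        System.exit(Sqoop.runTool(sqoopArgs));
    }
}
```

A corresponding export tool moves data the other way, out of HDFS and into the database.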

If you are an enterprise, chances are you use SAP.  And you are also more than likely using – or planning to use – Hadoop in your data architecture.

Today, we are delighted to announce the next step in our strategic relationship with SAP: a reseller agreement with Hortonworks. Under this agreement, SAP will resell the Hortonworks Data Platform and provide enterprise support for its global customer base. This will enable SAP customers to implement a data architecture that includes SAP HANA and the Hortonworks Data Platform and, in so doing, leverage existing skills to take advantage of the massive scalability and performance offered by Apache Hadoop.…

This post is the first in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

In this post we introduce the motivation behind Apache Tez (http://incubator.apache.org/projects/tez.html) and provide some background around the basic design principles for the project.…

As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best: process data. With YARN, you can now run multiple applications in Hadoop, all sharing common resource management.

In this blog post we’ll walk through how to plan for and configure processing capacity in your enterprise HDP 2.0 cluster deployment.…
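As a preview of the arithmetic, here is an illustrative sketch of the memory side of that planning; every number below is an assumption chosen for the example, not an HDP recommendation:

```java
// Illustrative YARN memory planning for one worker node.
public class YarnCapacitySketch {
    public static void main(String[] args) {
        int nodeMemGb = 48;   // assumed physical RAM on the node
        int reservedGb = 8;   // assumed reservation for OS and other services
        int containerGb = 2;  // chosen minimum container size

        int yarnMemGb = nodeMemGb - reservedGb;       // memory YARN may allocate
        int containersPerNode = yarnMemGb / containerGb;

        // The first value would feed yarn.nodemanager.resource.memory-mb
        System.out.println("YARN memory per node: " + yarnMemGb * 1024 + " MB");
        System.out.println("Containers per node:  " + containersPerNode);
    }
}
```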

