The Hortonworks Blog

More from Carter Shanklin

In this post we’ll cover some new scheduling options available via Apache Oozie in HDP 2. You can try out these capabilities in HDP 2 Beta and HDP 2 Beta Sandbox.

What Is Oozie Again?

Apache Oozie is a workflow engine and scheduler for Hadoop. Oozie allows you to run jobs in Hadoop at pre-defined intervals. The jobs can be simple ones that execute single Hive or Pig commands or can be full directed acyclic graphs representing complex workflows.…

The upcoming Hive 0.12 is set to bring some great advancements in the storage layer, in the form of higher compression and better query performance.

Higher Compression

ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.
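Two of the techniques named above, dictionary encoding and run-length encoding, can be illustrated with a minimal Python sketch. This is a toy model of the general ideas only, not ORCFile's actual on-disk format or Hive's implementation; the function names and the sample column are hypothetical and chosen for illustration.

```python
from itertools import groupby


def dictionary_encode(strings):
    """Replace each string with an integer code into a table of distinct values.

    Codes are assigned in order of first appearance (an assumption of this
    sketch, not a statement about how ORCFile orders its dictionary).
    """
    dictionary = {}
    codes = []
    for s in strings:
        # setdefault returns the existing code, or stores the next free one.
        codes.append(dictionary.setdefault(s, len(dictionary)))
    # Python dicts keep insertion order, so the key list is ordered by code.
    return list(dictionary), codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical low-cardinality string column.
states = ["CA", "CA", "CA", "NY", "NY", "CA"]
table, codes = dictionary_encode(states)
runs = run_length_encode(codes)
```

Here the six strings reduce to a two-entry table plus three runs, which is why columns with few distinct values and long repeats tend to compress well under these schemes.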

This focus on efficiency leads to some impressive compression ratios. This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings.…

The Stinger Initiative is Hortonworks’ community-facing roadmap. It lays out the investments Hortonworks is making to improve Hive performance 100x and to evolve Hive toward SQL compliance, which simplifies migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query and petabyte-scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs than to move traditional architectures into the era of big data.…

The Hortonworks Sandbox is a great tool not only for learning Hadoop but also for experimentation and application development. Deployment in a type 2 hypervisor such as Oracle VirtualBox or VMware Workstation is straightforward and serves the needs of a single user. Sandbox can also be deployed to IaaS environments, and in this case, we walk through the steps of deploying Hortonworks Sandbox on OpenStack. For the purposes of this article, the author has used the OpenStack Grizzly release running QEMU-KVM as the underlying hypervisor.…

One of the big opportunities that Hadoop provides is the processing power to unlock value in big datasets of varying types, from the ‘old’, such as web clickstream and server logs, to the new, such as sensor data and geolocation data.

The explosion of smartphones in the consumer space (and smart devices of all kinds more generally) has continued to accelerate the next generation of apps, such as Foursquare and Uber, which depend on processing huge volumes of incoming data and drawing insight from it.…