The Hortonworks Blog

Posts categorized by : Apache Hadoop

 

UPDATE: Since this article was posted, the Stinger initiative has continued to drive to the goal of 100x Faster Hive. You can read the latest information at http://hortonworks.com/stinger

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface have become the de facto SQL interface for Hadoop.  Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organization or customers, either directly or through a broad ecosystem of existing BI tools that rely on this proven interface. …

 

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos.  Around 2008, Owen O’Malley and a team of committers led this first foray into security and today, Kerberos is still the primary way to secure a Hadoop cluster.…

 

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers significant enhancements and improved stability over previous releases in the hadoop-2.x series. Notably, it includes:

  • QJM for HDFS HA for NameNode (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from folks at Yahoo!)

Pig can easily stuff Redis full of data. To do so, we’ll need to convert our data to JSON. We’ve previously talked about pig-to-json in JSONize anything in Pig with ToJson. Once we convert our data to JSON, we can use the pig-redis project to load Redis.

Build the pig-to-json project:

git clone git@github.com:rjurney/pig-to-json.git
cd pig-to-json
ant

Then run our Pig code:

/* Load Avro jars and define shortcut */
register /me/Software/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/Software/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

register /me/Software/pig-to-json/dist/lib/pig-to-json.jar
register /me/Software/pig-redis/dist/pig-redis.jar

-- Enron emails are available at https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro
emails = load '/me/Data/enron.avro' using AvroStorage();

json_test = foreach emails generate message_id, com.hortonworks.pig.udf.ToJson(tos) as bag_json;

store json_test into 'dummy-name' using com.hackdiary.pig.RedisStorer('kv', 'localhost');
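To make the shape of the result concrete, here is a rough Python sketch of the key/value layout the Pig script above is aiming for: one entry per email, keyed by message_id, with the recipients serialized as a JSON string. The sample records, the "address" field name, and the plain dict standing in for Redis are all assumptions for illustration; this is not the pig-redis implementation.

```python
import json

# Hypothetical stand-in for a Redis server: a plain dict of key -> string value.
fake_redis = {}

# Hypothetical sample records; the real Enron Avro schema may differ.
emails = [
    {"message_id": "<1@enron.com>",
     "tos": [{"address": "a@enron.com"}, {"address": "b@enron.com"}]},
    {"message_id": "<2@enron.com>", "tos": []},
]


def store_kv(records, store):
    """Store each record as message_id -> JSON string of its recipients."""
    for rec in records:
        store[rec["message_id"]] = json.dumps(rec["tos"])


store_kv(emails, fake_redis)

# A web server can now fetch one key and hand the JSON straight to a client.
print(fake_redis["<1@enron.com>"])
```

Because each value is already JSON, whatever serves the data only needs a single key lookup per request and no further transformation.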

Now run our Flask web server:

python server.py

Code for this post is available here: https://github.com/rjurney/enron-pig-tojson-redis-node.…

 

At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software. In the spirit of openness, we think it’s important to share our perspectives around the broader context of how Apache Hadoop and Hortonworks came to be, what we are doing now, and why we believe our unique focus is good for Apache Hadoop, the ecosystem of Hadoop users, and for Hortonworks as well.…

According to the Transaction Processing Performance Council, TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.…

For this post, we take a technical deep-dive into one of the core areas of HBase. Specifically, we will look at how Apache HBase distributes load through regions, and manages region splitting. HBase stores rows of data in tables. Tables are split into chunks of rows called “regions”. Those regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process. A region is a continuous range within the key space, meaning all rows in the table that sort between the region’s start key and end key are stored in the same region.…
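The idea of a region as a key range can be sketched in a few lines of Python. This is only an illustration of range lookup over sorted start keys, not HBase client code; the start keys, server names, and the find_region helper are hypothetical, and the sketch assumes each region includes its start key and excludes its end key.

```python
from bisect import bisect_right

# Hypothetical region table: sorted start keys, one per region.
# Each region covers [its start key, the next region's start key).
# An empty start key means "beginning of the key space".
region_start_keys = [b"", b"g", b"n", b"t"]

# Hypothetical assignment of each region to a RegionServer.
region_servers = ["rs1", "rs2", "rs3", "rs1"]


def find_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right counts the start keys <= row_key; the last of those
    # is the start key of the region that owns the row.
    return bisect_right(region_start_keys, row_key) - 1


for key in (b"apple", b"g", b"zebra"):
    idx = find_region(key)
    print(key, "-> region", idx, "on", region_servers[idx])
```

Since rows are sorted by key, locating the owning region is a binary search, and splitting a region amounts to inserting a new start key into the sorted list.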

I recently delivered a webinar entitled “Hortonworks State of the Union”. For those new to Apache Hadoop, I covered a brief history of Hadoop and Hortonworks’ role within the open source community. We also covered how the platform services, data services, and operational services required to enable Hadoop as an enterprise-viable platform evolved in 2012.

Finally, we discussed the important progress made on deeply integrating Hadoop within next-generation data architectures in a way that makes sense for the enterprise.…

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank, set operations, and bag operations.

It’s helpful to understand the history of the library. Over the years, we developed several routines that were used across LinkedIn and were thrown together into an internal package we affectionately called “littlepiggy.” The unfortunate part, and this is true of many such efforts, is that the UDFs were ill-documented, ill-organized, and easily got broken when someone made a change.…

Go from Zero to Big Data in 15 Minutes!

Today Hortonworks announced the availability of the Hortonworks Sandbox, an easy-to-use, flexible and comprehensive learning environment that will provide you with the fastest on-ramp to learning and exploring enterprise Apache Hadoop.

The Hortonworks Sandbox is:

  • A free download
  • A complete, self-contained virtual machine with Apache Hadoop pre-configured
  • A personal, portable and standalone Hadoop environment
  • A set of hands-on, step-by-step tutorials that allow you to learn and explore Hadoop on your own

The Hortonworks Sandbox is designed to help close the gap between people wanting to learn and evaluate Hadoop, and the complexities of spinning up an evaluation cluster of Hadoop.…

Happy New Year, everyone!

I’m excited to kick-off our first webinar series for 2013: The True Value of Apache Hadoop.

Get all your friends and co-workers together and be prepared to geek out on Hadoop!

This 4-part series will feature a mix of guest speakers covering topics such as Hortonworks’ 2013 vision and roadmaps for Apache Hadoop and Big Data, what’s new with Hortonworks Data Platform v1.2, how Luminar (an Entravision company) adopted Apache Hadoop, and a use case on Hadoop, R and GoogleVis.…

When the term scientific computing comes up in a conversation, it’s usually just the occasional science geek who shows signs of recognition. But although most people have little or no knowledge of the field’s existence, it has been around since the second half of the twentieth century and has played an increasingly important role in many technological and scientific developments. Internet search engines, DNA analysis, weather forecasting, seismic analysis, renewable energy, and aircraft modeling are just a few examples of areas where scientific computing is now indispensable.…

We are pleased to announce the release of Apache Hive version 0.10.0. More than 350 JIRA issues have been fixed with this release. A few of the most important fixes include:

Cube and Rollup: Hive now has support for creating cubes with rollups. Thanks to Namit!

List Bucketing: This is an optimization that lets you better handle skew in your tables. Thanks to Gang!

Better Windows Support: Several Hive 0.10.0 fixes support running Hive natively on Windows.…

We are pleased to announce that Apache Pig 0.10.1 was recently released. This is primarily a maintenance release focused on stability and bug fixes. In fact, Pig 0.10.1 includes 42 new JIRA fixes since the Pig 0.10.0 release.

Some of the notable changes include:

  • Source code-only distribution

In the download section for Pig 0.10.1, you will now find a source-only tarball (pig-0.10.1-src.tar.gz) alongside the traditional full tarball, rpm and deb distributions.…

Introduction

This is part three of a Big Data Security blog series. You can read the previous two posts here: Part One / Part Two.

When Russell Jurney and I first teamed up to write these posts we wanted to do something that no one had done before to demonstrate the power of Big Data, the simplicity of Pig and the kind of Big Data Security Analytics we perform at Packetloop.…
