The Hortonworks Blog

 

MapReduce has served us well.  For years it has been THE processing engine for Hadoop and has been the backbone upon which a huge amount of value has been created.  While it is here to stay, new paradigms are also needed in order to enable Hadoop to serve an even greater number of usage patterns.  A key and emerging example is the need for interactive query, which today is challenged by the batch-oriented nature of MapReduce. …

 

UPDATE: Since this article was posted, the Stinger initiative has continued to drive to the goal of 100x Faster Hive. You can read the latest information at http://hortonworks.com/stinger

Introduced by Facebook in 2007, Apache Hive and its HiveQL interface have become the de facto SQL interface for Hadoop.  Today, companies of all types and sizes use Hive to access Hadoop data in a familiar way and to extend value to their organizations or customers, either directly or through a broad ecosystem of existing BI tools that rely on this key proven interface. …

 

Back in the day, in order to secure a Hadoop cluster all you needed was a firewall that restricted network access to only authorized users. This eventually evolved into a more robust security layer in Hadoop… a layer that could augment firewall access with strong authentication. Enter Kerberos.  Around 2008, Owen O’Malley and a team of committers led this first foray into security, and today Kerberos is still the primary way to secure a Hadoop cluster.…

 

As the Release Manager for hadoop-2.x, I’m very pleased to announce the next major milestone for the Apache Hadoop community, the release of hadoop-2.0.3-alpha!

2.0 Enhancements in this Alpha Release

This release delivers major enhancements and improved stability over previous releases in the hadoop-2.x series. Notably, it includes:

  • QJM (Quorum Journal Manager) for HDFS NameNode HA (HDFS-3077) and related stability fixes to HDFS HA
  • Multi-resource scheduling (CPU and memory) for YARN (YARN-2, YARN-3 & friends)
  • YARN ResourceManager Restart (YARN-230)
  • Significant stability at scale for YARN (over 30,000 nodes and 14 million applications so far, at time of release – see more details from the folks at Yahoo!)

Pig can easily stuff Redis full of data. To do so, we’ll need to convert our data to JSON. We’ve previously talked about Pig-to-JSON conversion in “JSONize anything in Pig with ToJson.” Once we convert our data to JSON, we can use the pig-redis project to load Redis.

Build the pig-to-json project:

git clone git@github.com:rjurney/pig-to-json.git
cd pig-to-json
ant

Then run our Pig code:

/* Load Avro jars and define shortcut */
register /me/Software/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/Software/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/Software/pig/contrib/piggybank/java/piggybank.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

register /me/Software/pig-to-json/dist/lib/pig-to-json.jar
register /me/Software/pig-redis/dist/pig-redis.jar

-- Enron emails are available at https://s3.amazonaws.com/rjurney_public_web/hadoop/enron.avro
emails = load '/me/Data/enron.avro' using AvroStorage();

json_test = foreach emails generate message_id, com.hortonworks.pig.udf.ToJson(tos) as bag_json;

store json_test into 'dummy-name' using com.hackdiary.pig.RedisStorer('kv', 'localhost');

Now run our Flask web server:

python server.py

Code for this post is available here: https://github.com/rjurney/enron-pig-tojson-redis-node.…

 

At Hortonworks, our strategy is founded on the unwavering belief in the power of community driven open source software. In the spirit of openness, we think it’s important to share our perspectives around the broader context of how Apache Hadoop and Hortonworks came to be, what we are doing now, and why we believe our unique focus is good for Apache Hadoop, the ecosystem of Hadoop users, and for Hortonworks as well.…

Hadoop Summit Europe 2013, the European extension of the original and world’s largest Apache Hadoop community conference, today announced its official program, featuring a keynote address from 451 Group Analyst and Research Manager for Data Management and Analytics Matt Aslett and 40 use cases and educational sessions from leading industry and community experts. In addition, Hadoop Summit Europe 2013 boasts an impressive list of Platinum, Gold and Silver sponsors, demonstrating ecosystem support for Apache Hadoop from leading producers of software and services for the enterprise.…

According to the Transaction Processing Performance Council (TPC), TPC-H is:

The TPC Benchmark™H (TPC-H) is a decision support benchmark. It consists of a suite of business oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions.…

Big data analytics is becoming increasingly useful to professionals in digital media, gaming, healthcare, security, finance and government, and nearly every industry you can name. Companies are analyzing vast amounts of data from various sources to shed light on customer behaviors, accelerate lead conversion, pinpoint security threats and enrich social media marketing efforts. In fact, new tools and technologies are making it easier to harness the power of Big Data and put it to use, and businesses are quickly uncovering valuable insights that were previously unavailable.…

Please join Hortonworks and Appnovation for a webinar titled “Bigger Data on Your Budget” taking place on Wednesday, February 13th at 2pm EST, 11am PST.

Register Now

Appnovation is a new Hortonworks Systems Integrator partner that is focused on cutting edge open source technologies. They are experts in Drupal, Alfresco, SproutCore and now Apache Hadoop.

In advance of this webinar, I interviewed Dave Porter, Appnovation & SproutCore Lead Developer, about the technologies they support and how Appnovation and Hortonworks are working together to provide big insights without breaking the bank.…

The Hortonworks Sandbox was recently introduced, garnering an incredibly positive response and feedback. We are as excited as you, and gratified that our goal of providing the fastest on-ramp to Apache Hadoop has come to fruition. By providing a free, integrated learning environment along with a personal Hadoop environment, we are helping you gain those big data skills faster. Because of your feedback and demand for new tutorials, we are accelerating the release schedule for upcoming tutorials.…

For this post, we take a technical deep dive into one of the core areas of HBase. Specifically, we will look at how Apache HBase distributes load through regions and manages region splitting. HBase stores rows of data in tables. Tables are split into chunks of rows called “regions”. Those regions are distributed across the cluster, hosted and made available to client processes by the RegionServer process. A region is a contiguous range within the key space, meaning all rows in the table that sort between the region’s start key and end key are stored in the same region.…
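The start-key/end-key lookup described above can be sketched in a few lines. This is a simplified illustration with made-up region boundaries, not HBase’s actual client code (which caches region locations fetched from the META table); it only shows how sorting by start key routes each row key to exactly one region:

```python
import bisect

# Simplified sketch: a table's regions as sorted (start_key, end_key) ranges.
# An empty start key means "from the beginning of the key space";
# an empty end key means "to the end of the key space".
regions = [
    ("",  "g"),   # region 1: row keys before "g"
    ("g", "q"),   # region 2: "g" <= row keys < "q"
    ("q", ""),    # region 3: row keys >= "q"
]
start_keys = [start for start, _ in regions]

def region_for_row(row_key):
    """Return the region whose [start_key, end_key) range contains row_key."""
    # Rightmost region whose start key is <= row_key hosts the row.
    idx = bisect.bisect_right(start_keys, row_key) - 1
    return regions[idx]

print(region_for_row("hadoop"))  # ("g", "q")
```

Because every row key falls into exactly one such half-open range, splitting a region is just a matter of picking a midpoint key and replacing one range with two, which is the mechanism the rest of the post walks through.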

The customer data that companies collect from websites, social media, blogs, digital advertising and mobile is exploding. And as big data gets bigger, the amount of untapped insight available from analyzing that data is also growing exponentially. Marketers covet those insights as a way to better understand and engage with their customers and ultimately drive revenue. But how do they get to it?

According to Gartner, organizations that successfully integrate high-value, diverse new information types and sources into a coherent information management infrastructure will outperform their industry peers financially by more than 20 percent.* Fortunately, a new solution that combines Hortonworks Data Platform (HDP) with the expertise of eSage Group allows marketing professionals to extract value from Big Data, quickly and with relative ease.…

Today we announced Hortonworks Data Platform certification for Rackspace Private Cloud. In fact, we are the only Apache Hadoop distribution certified with Rackspace Private Cloud. The result of combining the power of enterprise-class Apache Hadoop in Hortonworks Data Platform (HDP) with Rackspace Private Cloud is that organizations now have a secure, scalable environment to refine, explore and enrich their data using Hadoop in the cloud. With HDP, data can be processed from applications that are hosted on Rackspace Private Cloud environments, allowing you to quickly and easily obtain additional business insights from this information.…

By contributing to the OpenStack ecosystem, Hortonworks is supporting the open source community and facilitating adoption of 100-percent open source Apache Hadoop-based solutions in the cloud.  Now customers will be able to access an enterprise-ready Hortonworks Data Platform built for the cloud that alleviates the time and complexities of manually deploying a big data solution.…
