User logs of Hadoop jobs serve multiple purposes. First and foremost, they can be used to debug issues while running a MapReduce application – correctness problems with the application itself, race conditions when running on a cluster, and task/job failures due to hardware or platform bugs. Second, one can do historical analyses of the logs to see how individual tasks in a job or workflow perform over time. One can even analyze the Hadoop MapReduce user logs using Hadoop MapReduce(!) to determine any performance issues.…
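The kind of historical analysis described above can be sketched in a few lines. This is only an illustration: the log lines below use a simplified, hypothetical format, not the exact layout of real Hadoop user logs.

```python
# Sketch: tallying task-attempt outcomes from simplified, hypothetical
# user-log lines, the kind of analysis one might run over job history.
from collections import Counter

def count_attempt_outcomes(lines):
    """Tally SUCCEEDED/FAILED/KILLED markers found in log lines."""
    outcomes = Counter()
    for line in lines:
        for status in ("SUCCEEDED", "FAILED", "KILLED"):
            if status in line:
                outcomes[status] += 1
                break
    return outcomes

logs = [
    "attempt_201311051200_0001_m_000000_0 SUCCEEDED",
    "attempt_201311051200_0001_m_000001_0 FAILED",
    "attempt_201311051200_0001_m_000001_1 SUCCEEDED",
]
print(count_attempt_outcomes(logs))  # Counter({'SUCCEEDED': 2, 'FAILED': 1})
```

Pointed at a day's worth of attempt logs, a tally like this quickly surfaces tasks that fail repeatedly on particular nodes.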
The Hortonworks Blog
This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.
“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”
This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive.…
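The heart of such an example is a Hive table mapped onto HBase via the documented `HBaseStorageHandler` DDL. The sketch below just assembles that statement as a string; the table name, column family, and column mapping are placeholders, not the ones the post uses.

```python
# Build the (documented) Hive DDL for an HBase-backed table. The names
# "pagecounts_hbase", "pagecounts", and the "f:c1" mapping are placeholders.
def hbase_backed_table_ddl(hive_table, hbase_table, key_type="string"):
    """Return a CREATE TABLE statement mapping a Hive table onto HBase."""
    return (
        f"CREATE TABLE {hive_table} (rowkey {key_type}, pageviews string)\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )

print(hbase_backed_table_ddl("pagecounts_hbase", "pagecounts"))
```

Once a table like this exists, ordinary `SELECT` queries in Hive read through to the live HBase table.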
The closing date for submitting Speaking Abstracts for Hadoop Summit Europe is NOVEMBER 22. If you are interested in speaking this year, we encourage you to submit your topic today. The Call for Abstracts will close at midnight on November 22, 2013 (the final midnight on the planet… Kiribati time?).
Some notes about Summit
We will return to Amsterdam for Hadoop Summit Europe this year (April 2-3, 2014). Also, we have restructured the content for the event to make sure we, as a community, continue to deliver high value technical content.…
Join Hortonworks and Pactera for a Webinar on Unlocking Big Data’s Potential in Financial Services Thursday, November 21st at 12:00 EST.
Have you ever had your debit or credit card declined for seemingly no reason? Turns out, the rejections are not so random. Banks are increasingly turning to analytics to predict and prevent fraud in real time. That can sometimes be an inconvenience for customers who are traveling or making large purchases, but it’s a necessary inconvenience today if banks are to reduce the billions in losses due to fraud.…
This is the first of two posts examining the use of Hive for interaction with HBase tables. The second post is here.
One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself.…
I teach for Hortonworks and in class just this week I was asked to provide an example of using the R statistics language with Hadoop and Hive. The good news was that it can easily be done. The even better news is that it is actually possible to use a variety of tools: Python, Ruby, shell scripts and R to perform distributed fault tolerant processing of your data on a Hadoop cluster.…
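What makes that language flexibility possible is the Hadoop Streaming model: any executable that reads lines on stdin and writes tab-separated key/value pairs on stdout can serve as a mapper or reducer, whether it's written in R, Python, Ruby, or shell. A minimal word-count pair in Python, exercised here on plain lists rather than stdin, might look like:

```python
# A minimal mapper/reducer pair in the Hadoop Streaming style. In a real
# job these would read stdin and write stdout; here they work on lists
# so the logic can be run and inspected locally.
from collections import Counter

def map_words(lines):
    """Mapper: emit a (word, 1) pair for every token."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_counts(pairs):
    """Reducer: sum the counts per key (Streaming delivers keys sorted)."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)

counts = reduce_counts(map_words(["Hadoop and Hive", "hive on Hadoop"]))
print(counts)  # {'hadoop': 2, 'and': 1, 'hive': 2, 'on': 1}
```

The same two functions, wrapped to read stdin and print tab-separated pairs, could be passed to the streaming jar as `-mapper` and `-reducer` scripts.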
This post is authored by Omkar Vinit Joshi with Vinod Kumar Vavilapalli and is the ninth post in the multi-part blog series on Apache Hadoop YARN – a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters. Other posts in this series:
- Introducing Apache Hadoop YARN
- Apache Hadoop YARN – Background and an Overview
- Apache Hadoop YARN – Concepts and Applications
- Apache Hadoop YARN – ResourceManager
- Apache Hadoop YARN – NodeManager
- Running existing applications on Hadoop 2 YARN
- Stabilizing YARN APIs for Apache Hadoop 2
- Management of Application Dependencies
- Resource Localization in YARN: Deep Dive
In the previous post, we explained the basic concepts of LocalResources and resource localization in YARN.…
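One detail worth restating from that post is how localization scopes its cache by a LocalResource's visibility: PUBLIC resources are shared by all users on a node, PRIVATE resources by all containers of one user, and APPLICATION resources by a single application's containers. The sketch below is an illustrative model of that scoping, not YARN's actual implementation.

```python
# Illustrative model (not YARN's real code) of how LocalResource
# visibility determines the cache scope a localized file lands in.
def cache_key(resource, visibility, user=None, app_id=None):
    """Pick the cache scope a localized resource is stored under."""
    if visibility == "PUBLIC":
        return ("public", resource)          # shared by every user on the node
    if visibility == "PRIVATE":
        return ("private", user, resource)   # shared by one user's containers
    if visibility == "APPLICATION":
        return ("app", user, app_id, resource)  # one application only
    raise ValueError(f"unknown visibility: {visibility}")

# Two apps of the same user share a PRIVATE resource...
print(cache_key("lib.jar", "PRIVATE", user="alice")
      == cache_key("lib.jar", "PRIVATE", user="alice"))  # True
# ...but not an APPLICATION-scoped one.
print(cache_key("lib.jar", "APPLICATION", user="alice", app_id="app_1")
      == cache_key("lib.jar", "APPLICATION", user="alice", app_id="app_2"))  # False
```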
This post is the seventh in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:
- Apache Tez: A New Chapter in Hadoop Data Processing
- Data Processing API in Apache Tez
- Runtime API in Apache Tez
- Writing a Tez Input/Processor/Output
- Apache Tez: Dynamic Graph Reconfiguration
- Reusing containers in Apache Tez
- Introducing Tez Sessions
In Tez, we recently introduced support for a feature we call “Tez Sessions”.…
When I first started to understand what YARN is, I wanted to build an application to understand its core. There was already a great example YARN application called Distributed Shell that I could use as the shell (pun intended) for my experiment. Now I just needed an existing application that could provide massive reuse value to other applications. I looked around and I decided on MemcacheD.
This brief guide shows how to get MemcacheD up and running on YARN – MOYA if you will…
You’re going to need a few things to get the sample application operational.…
As we have known for some time, Hortonworks customers are already building a modern data architecture with Hadoop as the technology of choice for handling the data they have streaming in from all directions. They care that it matches their needs, integrates with their existing infrastructure and solves real problems with flexibility.…
One of the great things about working in open source development is working with other experts around the world on big projects – and then having the results of that work in the hands of users within a short period of time.
This is why I’m really excited about the Rackspace announcement of their HDP-based Big Data offerings, both “on-prem” and in the cloud. Not just because it’s a partner of ours offering a service based on Hadoop, but because it shows how Hadoop integration with OpenStack has reached a point where it’s ready for production use.…
The Apache Knox community announced the release of the Apache Knox Gateway (Incubator) 0.3.0. We at Hortonworks are excited about this announcement.
The Apache Knox Gateway is a REST API Gateway for Hadoop with a focus on enterprise security integration. It provides a simple and extensible model for securing access to Hadoop core and ecosystem REST APIs.
Apache Knox provides pluggable authentication to LDAP and trusted identity providers as well as service level authorization and more. …
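In practice, using the gateway mostly changes the URL a client calls: REST requests go to the gateway's URL space rather than directly to the cluster service. The sketch below builds a WebHDFS URL following Knox's documented `/gateway/{topology}/...` pattern and default port; the host name and topology are placeholders.

```python
# Build a WebHDFS URL routed through the Knox gateway. Knox exposes
# cluster REST APIs under /gateway/{topology}/... on its own port
# (8443 by default). "knox.example.com" and "sandbox" are placeholders.
def knox_webhdfs_url(gateway_host, topology, hdfs_path, op="LISTSTATUS"):
    """Return the gateway URL for a WebHDFS operation on hdfs_path."""
    return (f"https://{gateway_host}:8443/gateway/{topology}"
            f"/webhdfs/v1{hdfs_path}?op={op}")

print(knox_webhdfs_url("knox.example.com", "sandbox", "/tmp"))
# https://knox.example.com:8443/gateway/sandbox/webhdfs/v1/tmp?op=LISTSTATUS
```

The client authenticates to the gateway (e.g. via HTTP Basic against LDAP), and Knox handles authorization and dispatch to the cluster behind it.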
With the attention of the Hadoop community on Strata/Hadoop World in New York this week, it seems an appropriate time to give everyone an early update on continued community development of Apache Hive. This progress well and truly cements Hive as the standard open-source SQL solution for the Apache Hadoop ecosystem, not just for extremely large-scale batch queries but also for low-latency, human-interactive queries.
You can catch me at our session ‘Apache Hive & Stinger: Petabyte Scale SQL, IN Hadoop’ along with Owen and Alan where we’ll be happy to dive into more of the details.…
- Based on HDP 2.0
- Easy enablement of Ambari and HBase
- Updated tutorial navigation
This version of the Sandbox provides you with a complete HDP 2.0 environment: your own personal single-node Hadoop cluster where you can explore the new features and enhancements of HDP 2.0, including YARN, the improvements to Hive delivered by the Stinger initiative, and the updates to HBase, Pig, and Ambari. In fact, our Sandbox has all of the most current releases of the various Apache projects — like Hive 12, HBase 96, and Hadoop 2.2.…
You’re a Java developer, you use Spring, and you’re just itching to get your arms around some big data. Well, now you can do that even more easily than before, as we announced this morning that Spring is now certified for the Hortonworks Data Platform.
To celebrate this development, we have a community tutorial for Sandbox (1.3 currently) that shows you how to use Spring XD to collect data streamed from Twitter, load it into HDFS, and then run simple sentiment analysis with Apache Hive.…
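The "simple sentiment analysis" step in tutorials like this usually amounts to a word-list score per tweet. Here is a toy version of that idea; the positive/negative word lists are illustrative, not the ones the tutorial actually uses.

```python
# Toy word-list sentiment scorer, the kind of "simple sentiment analysis"
# a Hive UDF or query might implement. The word lists are illustrative.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "bad", "awful", "broken"}

def sentiment(text):
    """Return positive-minus-negative word count for one tweet."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment("I love this great platform"))   # 2
print(sentiment("this build is bad and broken")) # -2
```

In the Hive version, the same scoring would run as a query (or UDF) over the tweets landed in HDFS by Spring XD.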