Hortonworks on Apache Hadoop


InfoQ: Hadoop and Metadata (Removing the Impedance Mis-match)

InfoQ has an article out today on HCatalog by Hortonworks’ own Alan Gates and Russell Jurney.

Apache Hadoop enables a revolution in how organizations process data: the freedom and scale Hadoop provides enable new kinds of applications that build new kinds of value and deliver results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make up the data center as a computer?

Check out the article at InfoQ: http://www.infoq.com/articles/HadoopMetadata


Search Hadoop with Search-Hadoop.com

As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search lucene!

Search-Hadoop.com searches across projects – JIRAs, source code, mailing lists, wikis, etc. so you can see design and API docs, as well as questions, answers and general documentation. Filtering by project is a big help – but search-hadoop also lets you see the similarities between projects.

Search-Hadoop.com runs on Solr 3.6.1, but will be moving to Solr 4.0 this fall. Solr 4.0 introduces SolrCloud, a fully distributed mode of Solr (indices are sharded and replicated) that uses ZooKeeper for coordination.
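To make "sharded and replicated" concrete, here is a minimal toy sketch in Python. It is not Solr's actual routing algorithm: the function names, the MD5-based hash, and the node names are all assumptions chosen for illustration. It only shows the general idea of mapping each document to one shard and keeping copies of that shard on more than one node.

```python
import hashlib
from typing import List


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard with a stable hash (illustrative only)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replicas_for(shard: int, nodes: List[str], replication_factor: int) -> List[str]:
    """Pick consecutive nodes, starting at the shard's index, to hold copies."""
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical four-node cluster with four shards and two copies of each shard.
nodes = ["node-a", "node-b", "node-c", "node-d"]
shard = shard_for("HADOOP-1234", num_shards=4)
placement = replicas_for(shard, nodes, replication_factor=2)
print(f"document goes to shard {shard}, stored on {placement}")
```

In a real deployment the cluster layout (which node holds which shard) is the kind of shared state a coordination service such as ZooKeeper would track; this sketch hard-codes it instead.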

The autocomplete feature is particularly cool. It offers several groups of suggestions separated by a lovely thin pink line, so one can easily pick the suggestion to follow.…


ZooKeeper 3.4.4 is Now Available

Apache ZooKeeper release 3.4.4 is now available. This is a bug-fix release containing 50 fixes. The following is a summary of the critical issues fixed in the release.

ZOOKEEPER-1419 Leader Election never settles for a 5 node cluster

ZOOKEEPER-1489 Data loss after truncate on transaction log

ZOOKEEPER-1412 java client watches inconsistently triggered on reconnect

ZOOKEEPER-1344 ZooKeeper client multi-update command is not considering the Chroot request

ZOOKEEPER-1496 Ephemeral node not getting cleared even after client has exited

ZOOKEEPER-1437 Client uses session before SASL authentication complete

Stability of 3.4.4

As you might have noticed, we have been marking all the previous 3.4.* releases as alpha or beta. After having several releases out (3.4.0, 3.4.1, 3.4.2, 3.4.3), we in the ZooKeeper community have decided to designate 3.4.4 as a stable release.

Acknowledgements

Thanks to everyone who contributed towards the release including our users who reported the bugs in 3.4.4.…


Meet the Committer, Part Two: Matt Foley

I hope you had fun pigging out to Hadoop with Alan Gates. We had interesting questions during the webinar and as always, your participation in these discussions will help us understand different use cases of Apache Pig and the growing community around this project. The recording is now available on our webinar site.

For the next installment of the “Future of Apache Hadoop” webinar series, I would like to introduce to you Matt Foley and Ambari. Matt is a member of the Hortonworks technical staff and a committer and PMC member for the Apache Hadoop core project. He will be our guest speaker for the September 26, 2012 webinar at 10am PDT / 1pm EDT: Deployment and Management of Hadoop Clusters with AMBARI.

Get to know Matt in this second installment of our “Meet the Committer” series.

Kim: Tell us about your role with Apache Hadoop.


HCatalog Meetup at Twitter

Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

A central theme was using HCatalog to enable sharing and use of legacy data and diverse formats like TSV, JSON, RCFile, Protobuf, Thrift and Avro, among diverse tools like Pig, Hive, Cascading, SQL-H and JAQL.

A key issue discussed was the mechanics of HCatalog’s integration with Hive as the project develops and matures. Some HCatalog users use Hive, and some do not – but HCatalog relies on the Hive metastore regardless. As usual in open source, each organization has its own set of problems, perspectives and priorities, and the discussion centered on finding a common path forward.…


Pig as Duct Tape, Part Three: TF-IDF Topics with Cassandra, Python Streaming and Flask

Series Introduction

Apache Pig is a dataflow-oriented scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.

Working code for this post as well as setup instructions for the tools we use and their environment variables are available at https://github.com/rjurney/enron-python-flask-cassandra-pig and you can download the Enron emails we use in the example in Avro format at http://s3.amazonaws.com/rjurney.public/enron.avro. You can run our example Pig scripts in local mode (without Hadoop) with the -x local flag: pig -x local.…


Hadoop Features Large at Stanford XLDB

Hadoop featured prominently at Stanford’s annual XLDB conference last week, as representatives from academia and industry gathered to discuss Extremely Large Databases. The conference program, with slides, is available: http://www-conf.slac.stanford.edu/xldb2012/ProgramC.asp. A highly technical lineup presented on Big Data in biology and physics, and cloud computing and Hive in particular were topic areas.

Hortonworks’ own Ashutosh Chauhan @ashutoshchauhan, an Apache Pig, Hive and HCatalog committer, presented ‘Hive vs Pig: Similarities and Differences’ (slides).…


Answer Big Questions with Big Data

Partner Webinar Series

On September 18 at 10am PT/1pm ET we join our partner Datameer in a webcast aimed at providing answers to some common questions we hear in the industry. Specifically, what are some of the use cases that big data analytics is perfect for?

By looking at some common uses we are seeing, you’ll be able to envision how you can leverage the analytics results from your own data. Ultimately these analytics will lead to uncovering ideas for new business approaches you can use for a huge competitive advantage.

Obviously you need to weigh the costs required so you can determine if the payoff is worth the investment for your business. What should you be considering when you are trying to decide if Hadoop and big data analytics are going to pay off?

These questions will be the topic for our webinar on September 18 at 10am PT.…


My Summer Internship at Hortonworks

Hortonworks Summer Internship 2012

As a first-time intern, I can undoubtedly say that Hortonworks was the perfect place for me to gain real world work experience and have the chance to team up with many incredibly talented, driven people. Of course, I didn’t get to fully interact with everyone in the company in the three months that I was here, but even after such a short time it is clear to me that it is the welcoming atmosphere and the determined team here that have allowed Hortonworks to achieve so many goals in just over a year.

During this summer, I was awarded the opportunity to be part of something big, something that is gaining impressive momentum in the world of technology and will not be slowing down any time soon. I have received insightful information from people who are overflowing with innovative ideas for how to utilize the big data of today’s world and this has provided me with knowledge that I did not expect to gain from a big data company.…


Welcome Hortonworks Data Platform 1.1

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day since Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers has been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, enhance the monitoring APIs and make significant performance improvements to the core platform.…


How To Take Big Data to the Cloud

Partner Webinar Series

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One practical and economical option is to use the cloud. The cloud can also provide extra capacity for an existing cluster or a place to test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform.…


Apache Hadoop YARN – NodeManager

Other posts in this series:
Introducing Apache Hadoop YARN
Apache Hadoop YARN – Background and an Overview
Apache Hadoop YARN – Concepts and Applications
Apache Hadoop YARN – ResourceManager
Apache Hadoop YARN – NodeManager


The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing containers’ life-cycle management, monitoring resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services which may be exploited by different YARN applications.

NodeManager Components

  1. NodeStatusUpdater

On startup, this component registers with the RM and sends information about the resources available on the node. Subsequent NM-RM communication provides updates on container statuses – new containers running on the node, completed containers, etc.

In addition, the RM may signal the NodeStatusUpdater to kill already-running containers.…
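The register-then-update exchange described above can be sketched as a small toy model in Python. This is not YARN code: the class names (ToyResourceManager, ToyNodeStatusUpdater), the method names, the status strings and the resource fields are all hypothetical, chosen only to illustrate the flow of registering once, reporting container statuses on each update, and acting on kill requests that come back in the reply.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for the RM: tracks nodes and may ask for kills."""
    nodes: dict = field(default_factory=dict)
    kill_requests: set = field(default_factory=set)

    def register(self, node_id, memory_mb, vcores):
        # Record the resources the node advertises at startup.
        self.nodes[node_id] = {"memory_mb": memory_mb, "vcores": vcores, "containers": {}}

    def heartbeat(self, node_id, container_statuses):
        # Store the latest statuses, then reply with running containers to kill.
        self.nodes[node_id]["containers"] = dict(container_statuses)
        return sorted(
            cid for cid, status in container_statuses.items()
            if cid in self.kill_requests and status == "RUNNING"
        )


@dataclass
class ToyNodeStatusUpdater:
    """Hypothetical stand-in for the NM-side component that talks to the RM."""
    node_id: str
    rm: ToyResourceManager
    containers: dict = field(default_factory=dict)

    def start(self, memory_mb, vcores):
        self.rm.register(self.node_id, memory_mb, vcores)

    def send_update(self):
        # Report statuses; mark any container the RM asked us to kill.
        to_kill = self.rm.heartbeat(self.node_id, self.containers)
        for cid in to_kill:
            self.containers[cid] = "KILLED"
        return to_kill


rm = ToyResourceManager()
updater = ToyNodeStatusUpdater("node-1", rm)
updater.start(memory_mb=8192, vcores=8)
updater.containers = {"container_1": "RUNNING", "container_2": "COMPLETE"}

first = updater.send_update()        # nothing to kill yet
rm.kill_requests.add("container_1")  # the RM decides to kill a container
second = updater.send_update()       # the kill request arrives with the reply
```

The design point the sketch tries to capture is that the node initiates every exchange, so the RM's kill signal travels back as part of the response to a status update rather than as a separate call into the node.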


Twitter Analytics Presents Hadoop and Pig at UC Berkeley

Twitter Analytics presented their distributed infrastructure, including Hadoop and Pig, at a UC Berkeley iSchool special course called INFO 290: Analyzing Big Data with Twitter. Twitter is a major contributor to many Apache projects. The course was over-subscribed and was a great success, as students got to learn from practicing data scientists using Hadoop on truly massive datasets. The entire lecture series is available here.

Bill Graham @billgraham, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented an Introduction to Hadoop. His slides are available here. His presentation gives a comprehensive introduction to Apache Hadoop including its history, motivation, practice and operation.

Jonathan Coveney @jco, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented Pig at Twitter. Slides for this presentation are available here. His presentation gives a comprehensive explanation of Apache Pig’s philosophy, use and intricacies.…


Meet the Committer, Part One: Alan Gates

Series Introduction

Hortonworks is on a mission to accelerate the development and adoption of Apache Hadoop. Through engineering open source Hadoop, our efforts with our distribution, Hortonworks Data Platform (HDP), a 100% open source data management platform, and partnerships with the likes of Microsoft, Teradata, Talend and others, we will accomplish this, one installation at a time.

What makes this mission possible is our all-star team of Hadoop committers. In this series, we’re going to profile those committers, to show you the face of Hadoop.

Alan Gates, Apache Pig and HCatalog Committer

Education is a key component of this mission. Helping companies gain a better understanding of the value of Hadoop through transparent communication of the work we’re doing is paramount. In addition to explaining core Hadoop projects (MapReduce and HDFS), we also highlight significant contributions to other ecosystem projects including Apache Ambari, Apache HCatalog, Apache Pig and Apache ZooKeeper.…


Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects including Apache Pig, Apache Ambari, Apache ZooKeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently, there is a thirst for knowledge on the future direction of Hadoop-related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop platform, current advances in Apache Hadoop, relevant use cases and best practices on how to get started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

This four-part webinar series is now open for registration, and the schedule will include:

  • Wednesday, September 12 at 10:00 a.m.

