From the Dev Team

Follow the latest developments from our technical team

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  • Reliability and High Availability (all data always available, and recovery from failures is quick)
  • Autonomous operation (minimum operator intervention)
  • Wire compatibility (to support rolling upgrades across a couple of versions at least)
  • Cross data-center replication (for disaster recovery)
  • Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  • Monitoring and Diagnostics (which regionserver is hot or what caused an outage)
  • Significant work has happened in each of the areas outlined above in the 0.94 and 0.96 (currently trunk) branches.…

    HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform.  HBase enables a host of low latency Hadoop use-cases; As a publishing platform, HBase exposes data refined in Hadoop to outside systems; As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

    The HBase community is moving forward aggressively, improving HBase in many ways.  …

    In this blog, I’ll cover how we tested Full Stack HA with NameNode HA in Hadooop 1 with Hadoop and HBase as components of the stack.

    Yes, NameNode HA is finally available in the Hadoop 1 line. The test was done with Hadoop branch-1 and HBase-0.92.x on a cluster of roughly ten nodes. The aim was to try to keep a really busy HBase cluster up in the face of the cluster’s NameNode repeatedly going up and down.…

    Introduction

    The Apache Hadoop YARN meetup at Hortonworks on October 12, 2012 we previously announced was a resounding success. We had a very good turnout of around seventy people from the community.

    Meetup sessions Deployments at Yahoo!

    The meetup kicked off with YARN committers from Yahoo presenting on current Hadoop 2.0 deployments at Yahoo. As part of the presentation, the following were covered.

    • described scenarios where YARN positively advanced the state of the art like scalability, its current stability, the power of the YARN web-services, and its superlative performance compared to the previous versions.
    Series Introduction

    Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

    In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks.…

    There will be a Pig meetup at Strata NYC/Hadoop World, at 6:30PM on Wed, Oct 24th in the Bryant Room of the Hilton New York. This will also be the inaugural meeting of the NYC Pig User Group, which Doug Daniels of Pig contributor Mortar Data was good enough to organize. We look forward to future Pig meetups in NYC!

    Hortonworks’ own Daniel Dai @daijy, VP of Apache Pig, will present on new features in Pig 0.11.…

    Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

    YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo.…

    In a recent post we used Pig to summarize documents via the Term-Frequency, Inverse Document Frequency (TF-IDF) algorithm.

    In this post, we’re going to turn that code into a Pig macro that can be called in one line of code:

    1 2 import 'tfidf.macro'; my_tf_idf_scores = tf_idf(id_body, 'message_id', 'body');

    Our macro, in filename tfidf.macro looks just like our pig script, with a couple of new lines. Note the use of macro variables for input and output preceded with the ‘$’ character: $in_relation, $out_relation, $id_field and $text_field.…

    Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

    On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups.…

    The need for a ToJson EvalFunc

    When integrating Pig with different NoSQL ‘databases,’ or when publishing data from Hadoop, it can be convenient to JSONize your data. Although Pig has JsonStorage, there hasn’t been a ToJson EvalFunc. This has been inconvenient, as in our post about Pig and ElasticSearch, such that for creating JSON for ElasticSearch to index, tricks like this were necessary:

    1 2 3 4 5 6 store enron_emails into '/tmp/enron_emails_elastic' using JsonStorage(); json_emails = load '/tmp/enron_emails_elastic' AS (json_record:chararray);   /* Now we can store our email json data to elasticsearch for indexing with message_id.…

    InfoQ has an article out today on HCatalog by Hortonworks’ own Alan Gates and Russell Jurney.

    Apache Hadoop enables a revolution in how organization’s process data, with the freedom and scale Hadoop provides enabling new kinds of applications building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has however posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us?…

    Apache ZooKeeper release 3.4.4 is now available. This is a bug fix release including 50 bug fixes. Following is a summary of the critical issues fixed in the release.

    ZOOKEEPER-1419 Leader Election never settles for a 5 node cluster

    ZOOKEEPER-1489 Data loss after truncate on transaction log

    ZOOKEEPER-1412 java client watches inconsistently triggered on reconnect

    ZOOKEEPER-1344 ZooKeeper client multi-update command is not considering the Chroot request

    ZOOKEEPER-1496 Ephemeral node not getting cleared even after client has exited

    ZOOKEEPER-1437 Client uses session before SASL authentication complete

    Stability of 3.4.4

    As you might have noticed we have been marking all the previous 3.4.* releases as Alpha and beta.…

    Representatives from Twitter, Yahoo, LinkedIn, Hortonworks and IBM met at Twitter HQ on Thursday to talk HCatalog. Committers from HCatalog, Pig and Hive were on hand to discuss the state of HCatalog and its future.

    Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

    A central theme was using HCatalog to enable sharing and use of legacy data and diverse formats like TSV, JSON, RCFile, Protobuf, Thrift and Avro, among diverse tools like Pig, Hive, Cascading, SQL-H and JAQL.…

    Series Introduction

    Apache Pig is a dataflow oriented, scripting interface to Hadoop. Pig enables you to manipulate data as tuples in simple pipelines without thinking about the complexities of MapReduce.

    But Pig is more than that. Pig has emerged as the ‘duct tape’ of Big Data, enabling you to send data between distributed systems in a few lines of code. In this series, we’re going to show you how to use Hadoop and Pig to connect different distributed systems to enable you to process data from wherever and to wherever you like.…

    Other posts in this series: Introducing Apache Hadoop YARN Apache Hadoop YARN – Background and an Overview Apache Hadoop YARN – Concepts and Applications Apache Hadoop YARN – ResourceManager Apache Hadoop YARN – NodeManager

    Apache Hadoop YARN – NodeManager

    The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to date with the ResourceManager (RM), overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers, tracking node-health, log’s management and auxiliary services which may be exploited by different YARN applications.…

    Go to page:« First...10...1314151617