The Hortonworks Blog

More from Jules S. Damji
Internet of Things (IoT) Potential and Process

It may seem obvious (or inevitable), but many companies are embracing the Internet of Things (IoT), and for good reasons, notes Forbes’ Mike Kavis. First, McKinsey Global Institute reports that IoT business will reach $6.2 trillion in revenue by 2025. Second, more and more objects are becoming embedded with sensors that communicate real-time data to data centers’ networks for processing, explain McKinsey’s Chui, Loffler, and Roberts.…

Speed, Scale, and SQL Semantics

Since its inception and graduation as a Top-Level Project (TLP) of the Apache Software Foundation (ASF) in September 2010, Apache Hive has been steadily improving in speed, scale, and SQL semantics to meet enterprise requirements for both interactive and batch queries at Hadoop scale.

It has become a de facto standard for SQL queries over petabytes of data stored in Hadoop. It is a compliant SQL engine that offers developers a comprehensive and familiar set of SQL semantics for Apache Hadoop.…

Haohui Mai is a member of technical staff at Hortonworks in the HDFS group and a core Hadoop committer. In this blog, he explains how to set up HTTPS for HDFS in a Hadoop cluster.

1. Introduction

The HTTP protocol is one of the most widely used protocols on the Internet. Today, Hadoop clusters exchange internal data such as file system images, the quorum journals, and the user data through the HTTP protocol.…

Chaos Before The Storm … and a Brief History

For its name and the metaphoric image it evokes, Apache Storm lives up to its purpose and promise: to ingest, absorb, and digest an avalanche of real-time data as a stream of unbounded discrete events at scale, speed, and success.

Before Storm, developers used a set of queues and workers to process a stream of real-time events. That is, events were placed on worker queues, and worker threads plucked events and processed them: transforming, persisting, or forwarding them to another queue for further processing.…
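The queue-and-worker pattern described above can be sketched roughly as follows. This is only an illustrative sketch in Python using the standard library, not code from the original post; the function name `run_pipeline`, the uppercase "transformation," and the `None` sentinel are all assumptions made for the example.

```python
import queue
import threading


def run_pipeline(events, num_workers=2):
    """Hypothetical sketch: workers pull events from one queue,
    transform them, and forward the results to a second queue."""
    inbound = queue.Queue()
    outbound = queue.Queue()

    def worker():
        while True:
            event = inbound.get()
            if event is None:  # sentinel (an assumption): no more events
                break
            # "Transform" the event, then forward it for further processing.
            outbound.put(event.upper())

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for event in events:
        inbound.put(event)
    for _ in threads:
        inbound.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    results = []
    while not outbound.empty():
        results.append(outbound.get())
    return results
```

Even in this toy version, the developer has to hand-manage the queues, the threads, and the shutdown signal, which hints at the bookkeeping this approach required before frameworks like Storm.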

Sheetal Dolas is a Principal Architect at Hortonworks. As part of the Apache Storm design patterns blog series, he explores three options for micro-batching using Apache Storm’s core APIs. This is the first blog in the series.

What is Micro-batching?

Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data. For incoming streams, the events can be packaged into small batches and delivered to a batch system for processing [1].
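As a rough illustration of the idea, and not of Storm’s own APIs, the following Python sketch groups an incoming stream into fixed-size batches. The helper name `micro_batches` and the count-based batching policy are assumptions for this example; real systems often batch by time window as well.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Hypothetical helper: yield successive lists of up to
    batch_size events taken from an iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch
```

Each yielded list could then be handed to a batch-oriented sink, for example a bulk write, instead of issuing one call per event.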

Micro-batching in Apache Storm

In Apache Storm, micro-batching in core Storm topologies makes sense for performance or for integration with external systems (like ElasticSearch, Solr, HBase or a database).…

Hortonworks Software Engineers Vinod Kumar Vavilapalli (Apache Hadoop YARN committer) and Jian He (Apache Hadoop YARN committer) discuss Apache Hadoop YARN’s Resource Manager resiliency upon restart in this blog. This is their third blog post in our series on the motivations and architecture for improvements to Apache Hadoop YARN’s Resource Manager (RM) resiliency. Others in the series are:

Introduction Phase II – Preserving work-in-progress of running applications

ResourceManager-restart is a critical feature that allows YARN applications to continue functioning even when the ResourceManager (RM) crashes and restarts for any reason.…

“Data is to information society what fuel was to the industrial economy: the critical resource powering the innovations that people rely on,” write Viktor Mayer-Schönberger and Kenneth Cukier in Big Data. Today, big data fuels and engenders innovation of new products and services, according to Forrester.

Just as countries’ fuel repositories need protection and security because they can come under attack, so do companies’ big data repositories. “Companies, markets, and countries are increasingly under attack from cyber-criminals.…

Although the Hadoop Summit San Jose 2014 has come and gone, the invaluable content (keynotes, sessions, and tracks) is available here. We’ve selected a few sessions for Hadoop developers, practitioners, and architects, curating them under Apache Hadoop YARN, the architectural center and the data operating system.

In most of the keynotes and tracks three themes resonated:

  • Enterprises are transitioning from traditional Hadoop to modern Hadoop 2.
  • YARN is an enabler, the central orchestrator that facilitates multiple workloads, runs multiple data engines, and supports multiple access patterns (batch, interactive, streaming, and real-time) in Apache Hadoop 2.

    Tresata, a Hortonworks Certified Technology Partner, is a next-generation predictive analytics software company that helps enterprises monetize big data once they have moved to Hadoop. In this blog, Tresata’s Director of Marketing, Katie Levans (@katie_levans), describes the value of HDP 2.1 certification and the benefit of their solution.

    Last month Tresata announced the release of the third generation of their hugely successful software application TREE 3.3 and its subsequent certification on HDP 2.1.…

    Hadoop Summit Content Curation

    Although the Hadoop Summit San Jose 2014 has come and gone, the invaluable content—keynotes, sessions, and tracks—is available here. I’ve selected a few sessions below for Hadoop system administrators and dev-ops, curating them under a general Hadoop operations theme.

    Dev-ops engineers and system administrators know best that ease of operations and deployments can make or break a large Hadoop production cluster, which is why they care about all of the following:

    • how rapidly they can create or replicate a cluster;
    • how efficiently they can manage or monitor at scale;
    • how easily and programmatically they can extend or customize their operational scripts; and
    • how accurately they can foresee, forestall, or forecast resource starvation or capacity stipulation.

    Last Thursday we hosted the last of our seven Discover HDP 2.1 webinars, Using Apache Ambari to Manage Hadoop Clusters. Over 140 people attended and joined in the conversation.

    The speakers gave an overview of Apache Ambari, discussed new features, and showed an end-to-end demo.

    Thanks to our presenters Justin Sears (Hortonworks’ Product Marketing Manager), Jeff Sposetti (Hortonworks’ Senior Director of Product Management), and Mahadev Konar (Hortonworks’ Co-founder, Committer, and PMC Member for Apache Hadoop, Apache Ambari, and Apache Zookeeper) who presented the webinar.…

    This week we hosted a webinar entitled HDP Advanced Security: Comprehensive Security for Enterprise Hadoop. Over 135 people attended, prompting an informative discourse and a series of questions.

    The speakers outlined the HDP Advanced Security features and benefits in Hortonworks Data Platform and gave a demo. Thanks to our presenters Justin Sears (Hortonworks’ Product Marketing Manager), Balaji Ganesan (Hortonworks’ Senior Director, Enterprise Security Strategy), and Don Bosco Durai (Hortonworks’ Enterprise Security Architect).…

    We recently hosted the sixth of our seven Discover HDP 2.1 webinars, entitled Apache Storm for Stream Data Processing in Hadoop. Over 200 people attended the webinar and joined in the conversation.

    Thanks to our presenters Justin Sears (Hortonworks’ Product Marketing Manager), Himanshu Bari (Hortonworks’ Senior Product Manager for Storm), and Taylor Goetz (Hortonworks’ Software Engineer and Apache Storm Committer) who presented the webinar. The speakers covered:

    • Why use Apache Storm?

    We recently hosted the fifth of our seven Discover HDP 2.1 webinars, entitled Apache Solr for Hadoop Search. Over 200 people attended the webinar, prompting an informative discourse.

    The speakers outlined the Apache Solr overview and features, followed by a practical demo of how to process, index, search, and visualize server log data.

    Thanks to our presenters Justin Sears (Hortonworks’ Product Marketing Manager), Rohit Bakhshi (Hortonworks’ senior product manager), and Paul Codding (Hortonworks’ Solution Engineer) who presented the webinar.…

    This is the second in the series of blogs exploring how to write data-driven applications in Java using the Cascading SDK. The posts in the series are:

  • WordCount
  • Log Parsing

    Historically, programming languages and software frameworks have evolved in a singular direction, with a singular purpose: to achieve simplicity, hide complexity, improve developer productivity, and make coding easier. In the process, they have fostered innovation to the degree we have seen, and benefited from, today.

    Is anyone among you “young” enough to admit to writing code in microcode and assembly language?…
