The Hortonworks Blog

Posts categorized by: Hadoop

Paul Boal, Director of Data Management & Analytics at Mercy, is our guest blogger. He shares his thoughts and insights about Apache Hadoop, Hortonworks Data Platform and Mercy’s journey to the Data Lake.

Technology at Mercy

Mercy has long been committed to using technology to improve medical outcomes for patients. We were among the first health care organizations in the U.S. to have a comprehensive, integrated electronic health record (EHR) providing real-time, paperless access to patient information.…

HDP 2.2 brings substantial innovations in Apache Hadoop YARN, enabling users of Apache Hadoop to efficiently store their data in a single repository and interact with it simultaneously using a wide variety of engines. This functionality makes YARN particularly attractive for the integration of many distributed long-running services.

In this release, we also introduced a new framework, Apache™ Slider, for easy onboarding of long-running services on top of YARN.…

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

PageRank [1] is the poster child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led to increased interest and research in the field of graph algorithms, resulting in innovative extensions such as personalized PageRank [2].…
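As a rough illustration of how PageRank scores pages by influence, here is a minimal sketch of the textbook power-iteration formulation in plain Python. It is not the implementation used in the series; the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy graph `toy_web` are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start from a uniform distribution over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph, page "c" ends up with the highest rank because every other page links to it, while "d", which nothing links to, keeps only the baseline share from random jumps.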

Cisco and Hortonworks established their official alliance back in 2013. Together, they have been bringing to life the vision of a single big data platform for the enterprise. As every industry witnesses unprecedented quantities of data and a variety of new data types (e.g., clickstream and behavior, machine and sensor, geographic data, server logs, sentiment and web…), Cisco and Hortonworks have been collaborating to empower companies to make use of their data. Often, organizations need to optimize their IT infrastructure and free up their Enterprise Data Warehouse (EDW) to make the most of all of their data, building new analytic applications and moving towards the vision of the Data Lake.

This is the second post in a series exploring the theme of long-running service workloads in YARN. See the introductory post for background.

Long-running services deployed on YARN are by definition expected to run for a long period of time—in many cases forever. Services such as Apache™ HBase, Apache Accumulo and Apache Storm can be run on YARN to provide a layer of services to end users, and they usually have a central master running in conjunction with an ApplicationMaster (AM).…

An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure for big data. We find that organizations are looking for an open and flexible platform that enables them to deploy big data and Hadoop solutions on-premises, in the cloud and in a hybrid environment.

Microsoft and Hortonworks have joined forces to help simplify and ease the transformation of your current Apache Hadoop deployment to hybrid cloud architecture.…

Analysts and data scientists, not to mention business executives, want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system.

The good news is that Apache™ Hive 0.14—the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets—is now powered by Apache Calcite.…

Managing online security for companies is a big task. In a world of increasing cyber threats, the risks to financial organizations are greater than they have ever been. Data breaches result not only in financial loss from data theft and misuse, but in significant reputation damage to the organizations that experience them. How can such organizations quickly and accurately identify risks to protect their data, their assets, and their customers? Threats to your network and vital data sets are constantly evolving to be more sophisticated, which makes them more difficult to detect, especially when you are relying on traditional tools.…

Leading enterprise organizations have concluded that YARN-enabled Hadoop is foundational to their modern data architectures. These companies subscribe to Hortonworks (and implement Hortonworks Data Platform) to bring additional types of data under management, merge those with legacy datasets, and unlock new business insight.

But don’t take our word for it.

Watch these brief videos and hear our customers describe how a data-first approach is transforming their businesses.

Advertising

Luminar is the leading big data analytics and modeling provider uniquely focused on delivering actionable insights on U.S.…

This is the third post in a series exploring recent innovations in the Hadoop ecosystem that are included in Hortonworks Data Platform (HDP) 2.2. In this post, we introduce the theme of supporting rolling upgrades and downgrades of a Hadoop YARN cluster.

HDP 2.2 offers substantial innovations in Apache™ Hadoop YARN, enabling Hadoop users to efficiently store and interact with their data in a single repository, simultaneously using a wide variety of engines.…

Hortonworks provides enterprise Hadoop for the telecommunications service provider, and Hortonworks Data Platform (HDP) is architected from the ground up with the centralized YARN-based architecture and core enterprise services for data governance, security and cluster operations that can revolutionize your telecommunications business.

As the originators of Hadoop, leaders in the developer community, and partners for your success, nobody is better positioned to help you become a data-centric telecommunications enterprise.

Hortonworks supports most of the largest North American carriers.…

As a data scientist working with Hadoop, I often use Apache Hive to explore data, make ad-hoc queries or build data pipelines.

Until recently, optimizing Hive queries focused mostly on data layout techniques such as partitioning and bucketing or using custom file formats.

In the last couple of years, driven largely by the innovation of the Hive community around the Stinger initiative, Hive query time has improved dramatically, enabling Hive to support both batch and interactive workloads at speed and at scale.…

The Apache HBase community has released Apache HBase 1.0.0. Seven years in the making, it marks a major milestone in the Apache HBase project’s development, offers some exciting features and new APIs without sacrificing stability, and is both on-wire and on-disk compatible with HBase 0.98.x.

In this blog, which is a cross-post from the Apache HBase Blog, we look at the past, present and future of the Apache HBase project.…

In this guest blog, Kumar Srivastava, senior director of product management at ClearStory Data, shares his thoughts on ClearStory’s integration with Hortonworks Data Platform (HDP).

We are excited to announce ClearStory Data’s integration with Hortonworks Data Platform (HDP) during Strata + Hadoop World 2015. This partnership with Hortonworks is significant as it brings ClearStory’s business-ready, fast-cycle, scalable analysis to Hadoop Data Lakes, and specifically to the Hortonworks Data Platform (HDP).…

This is a unique moment in time. Fueled by open source, Apache Hadoop has become an essential part of the modern enterprise data architecture and the Hadoop market is accelerating at an amazing rate.

The impressive thing about successful open source projects is the pace of the “release early, release often” development cycle, also known as upstream innovation. The process moves through major and minor releases at a regular clip and the downstream users get to pick the releases and versions they want to consume for their specific needs.…