The Hortonworks Blog

Introduction

Today, organizations use the Apache Hadoop™ stack in the form of a central data lake to store their critical datasets and power their analytical processing workloads. A key requirement for the Hadoop cluster and the services running on it is to be highly available and flawlessly continue to function while software is being upgraded. In the past, the Hadoop community has added enterprise features such as High Availability (HA) to various components of the stack, snapshots, improved disaster recovery etc.…

Today, we’re delighted to have a guest blog post from Cameron Peek, who leads Partnership and Strategic Sales at CSC, one of our Global System Integration and Hortonworks Data Platform Resellers.

In the spirit of providing our clients with innovative, thought-provoking, actionable content for consideration, Computer Sciences Corporation (CSC) is thrilled to present a two-part webinar series with our Global partner, Hortonworks. In 2015, we find most of our clients have moved beyond exploring “What is big data?” and “How can I use big data?” and instead are now focused on “How can I ensure I am successful in my big data projects and see results quickly?”

The first webinar in the series will be presented this Thursday, March 19th at 10 am PST (register here) and will help attendees identify actionable next steps in their data analytics projects, no matter where they are today.…

Forrester recently called Apache Hadoop adoption “mandatory” for the enterprise. For most organizations, moving forward with Hadoop is no longer a question of if, but when. Hadoop-powered insight into big data is enabling market disruption in every industry and the market winners are those who handle that data most effectively and at the lowest cost.

As with any new platform, making decisions on how best to implement and for what purpose can be challenging.…

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

This is the second part of our blog-post series about anomaly detection in healthcare data. As described in part 1, our goal is to apply the personalized-PageRank algorithm to detect anomalies in healthcare payment records, specifically the publicly available Medicare-B dataset.

In this blog post, we demonstrate the technical steps to compute the similarity graph between medical providers at scale, using HDP and Apache Pig.…

On March 25th, Josh Lee, Global Director for Insurance Marketing at Informatica, and Cindy Maike, General Manager, Insurance at Hortonworks, will be joining the Insurance Journal in a webinar on “How to Become an Analytics-Ready Insurer.”

Register for the Webinar on March 25th at 10am Pacific/1pm Eastern time

Josh and Cindy exchange perspectives on what “analytics ready” really means for insurers, and today we are sharing some of our views (join the webinar to learn more).…

Changes in technology and customer expectations create new challenges for how insurers engage their customers, manage risk information and control the rising frequency and severity of claims.

Carriers need to rethink traditional models for customer engagement. Advances in technology and the adoption of retail engagement models drive fundamental changes in how customers shop for and purchase insurance coverage. To engage with their customers, our insurance customers seek “omni-channel” insight and the ability to confidently recommend the next best action (NBA) to their customers.…

Apache Hive is the de facto standard for SQL in Hadoop with more enterprises relying on this open source project than any other alternative. Stinger.next, a community-based effort, is delivering true enterprise SQL at Hadoop scale and speed.

With Hive’s prominence in the enterprise, security within Hive has come under greater focus from enterprise users. They have come to expect fine-grained access control and auditing within Hive. Apache Ranger provides centralized security administration for Hadoop, and it enables fine-grained access control and deep auditing for Apache components such as Hive, HBase, HDFS, Storm and Knox.…

Paul Boal, Director of Data Management & Analytics at Mercy, is our guest blogger. He shares his thoughts and insights about Apache Hadoop, Hortonworks Data Platform and Mercy’s journey to the Data Lake.

Technology at Mercy

Mercy has long been committed to using technology to improve medical outcomes for patients. We were among the first health care organizations in the U.S. to have a comprehensive, integrated electronic health record (EHR) providing real-time, paperless access to patient information.…

“Start with the business problem!” That’s Sanjay’s advice when it comes to building a successful Big Data solution. For those of you who have missed the first part of this video series, Sanjay Krishnamurthi, SVP and Chief Technology Officer at Informatica, and Shaun Connolly, Vice President Corporate Strategy at Hortonworks, address a number of hot Big Data topics throughout a series of nine videos.

Today, they talk about how Big Data projects need to be driven by the business and how IT solutions and frameworks such as Hadoop have to be integrated with the rest of the data systems.…

HDP 2.2 brings substantial innovations in Apache Hadoop YARN, enabling users of Apache Hadoop to efficiently store their data in a single repository and interact with it simultaneously using a wide variety of engines. This functionality makes YARN particularly attractive for the integration of many distributed long-running services.

In this release, we also introduced a new framework, Apache™ Slider, for easy onboarding of long-running services on top of YARN.…

This three-part series is co-authored by Ofer Mendelevitch, director of data science at Hortonworks, and Jiwon Seo, Ph.D. and research assistant at Stanford University.

Introduction

PageRank [1] is the poster child of graph algorithms, used by Google in its original search engine system to determine which web pages are most influential. The incredible success of PageRank led to increased interest and research in the field of graph algorithms, resulting in innovative extensions such as personalized PageRank [2].…

Cisco and Hortonworks established their official alliance back in 2013. Together, they have been bringing to life the vision of a single big data platform for the enterprise. As every industry is witnessing unprecedented quantities of data and a variety of new data types (e.g. clickstream and behavior, machine and sensor, geographic data, server logs, sentiment and web…), Cisco and Hortonworks have been collaborating to empower companies with their data. Oftentimes, organizations need to optimize their IT infrastructure and free up their Enterprise Data Warehouse (EDW) to make the most of all of their data, building new analytic applications and moving towards the vision of the Data Lake.

This is the second post in a series exploring the theme of long-running service workloads in YARN. See the introductory post in the series.

Long-running services deployed on YARN are by definition expected to run for a long period of time—in many cases forever. Services such as Apache™ HBase, Apache Accumulo and Apache Storm can be run on YARN to provide a layer of services to end users, and they usually have a central master running in conjunction with an ApplicationMaster (AM).…

An increasing number of enterprises are either currently using or are planning to use cloud deployment models to expand their IT infrastructure for big data. We find that organizations are looking for an open and flexible platform that enables them to deploy big data and Hadoop solutions on-premises, in the cloud and in a hybrid environment.

Microsoft and Hortonworks have joined forces to help simplify and ease the transformation of your current Apache Hadoop deployment to a hybrid cloud architecture.…

Analysts and data scientists, not to mention business executives, want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system.

The good news is that Apache™ Hive 0.14—the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets—is now powered by Apache Calcite.…