The Hortonworks Blog

Posts categorized by : HDP

Are you a Hadoop hot shot?  Are you the one everyone looks to for help on their Hadoop projects? Are you looking to showcase your talent to the world?

Then just maybe we have a great option for you. We recently published the Hortonworks Sandbox tutorials on GitHub. Now it’s your turn. We invite you to add your own Hadoop tutorials or to improve on the ones that we’ve published.…

YARN and the Hortonworks Data Platform 2.0 enables one Hadoop cluster to share data and analytical processing capabilities across the Enterprise organization. Organizations can use the Hortonworks Data Platform 2.0 to:

  • Pool all enterprise data into one scalable and reliable storage platform
  • Enable all analytical processing IN the data platform
  • Provide access to this data and processing across all business units

The Capacity Scheduler (CS) ensures that groups of users and applications will get a guaranteed share of the cluster, while maximizing overall utilization of the cluster.…

There’s an old proverb you’ve likely heard about blind men trying to identify an elephant. Depending on the version of the proverb you’ve heard the elephant is misidentified variously as rope, walls, pillars, baskets, brushes and more. Oddly, no-one identified it as a next-generation enterprise data platform but I guess it is an old proverb.

The Hadoop elephant is a platform though, and as such the proverb holds true. Depending on your perspective, it has different capabilities, components and integration points to meet your requirements.…

This is a guest blog post from Gary Nakamura, CEO at our partner Concurrent, Inc. discussing Cascading Pattern and the new Hadoop tutorial they have written for the Hortonworks Sandbox. This is one of the first tutorials aimed at more experienced crowd. Enjoy!

Cascading Pattern: Deploy Predictive Models on Hadoop in minutes.

Cascading Pattern signifies an important milestone for Cascading as we continue our mission of driving innovation and to simplify Big Data application development.…

Albert Einstein is credited with saying that he doesn’t worry about the future because it would arrive soon enough. We don’t worry the future either — we focus on building it. And today, we are delighted to release the Hortonworks Data Platform 2.0 Beta Sandbox. This is the single-node VM based on the HDP 2.0 Beta release. This release is in the easy-to-use Sandbox form factor and allow you to easily work with a stable, reliable v2 of Hadoop.…

This post is the second in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

Overview

Apache Tez models data processing as a dataflow graph, with the vertices in the graph representing processing of data and edges representing movement of data between the processing.…

With HDP 1.3 and HDP 2.0 Beta, we introduced the ability to create snapshots to protect important enterprise data sets from user or application errors.

HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system and are:

  • Performant and Reliable: Snapshot creation is atomic and instantaneous, no matter the size or depth of the directory subtree
  • Scalable: Snapshots do not create extra copies of blocks on the file system.

Syncsort, a technology partner with Hortonworks, helps organizations propel Hadoop projects with a tool that makes it easy to “Collect, Process and Distribute” data with Hadoop. This process, often called ETL (Exchange, Transform, Load), is one of the key drivers for Hadoop initiatives; but why is this technology a key enabler of Hadoop? To find out the answer we talked with Syncsort’s Director Of Strategy, Steve Totman, a 15 year veteran of data integration and warehousing, provided his perspective on Data Warehouse Staging Areas.…

He loves me, he loves me not… using daisies to figure out someone’s feelings is so last century. A much better way to determine whether someone likes you, your product or your company is to do some analysis on Twitter feeds to get better data on what the public is saying. But how do you take thousands of tweets and process them?  We show you how in our video – Understand your customers’ sentiments with Social Media Data – that you can capture a Twitter stream to do Sentiment Analysis.…

This post is the first in our series on the motivations, architecture and performance gains of Apache Tez for data processing in Hadoop. The series has the following posts:

In this post we introduce the motivation behind Apache Tez (http://incubator.apache.org/projects/tez.html) and provide some background around the basic design principles for the project.…

As part of HDP 2.0 Beta, YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines.  This also streamlines MapReduce to do what it does best, process data.  With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management.

In this blog post we’ll walk through how to plan for and configure processing capacity in your enterprise HDP 2.0 cluster deployment.…

The upcoming Hive 0.12 is set to bring some great new advancements in the storage layer in the forms of higher compression and better query performance.

Higher Compression

ORCFile was introduced in Hive 0.11 and offered excellent compression, delivered through a number of techniques including run-length encoding, dictionary encoding for strings and bitmap encoding.

This focus on efficiency leads to some impressive compression ratios. This picture shows the sizes of the TPC-DS dataset at Scale 500 in various encodings.…

The Stinger Initiative is Hortonworks’ community-facing roadmap laying out the investments Hortonworks is making to improve Hive performance 100x and evolve Hive to SQL compliance to simplify migrating SQL workloads to Hive.

We launched the Stinger Initiative along with Apache Tez to evolve Hadoop beyond its MapReduce roots into a data processing platform that satisfies the need for both interactive query AND petabyte scale processing. We believe it’s more feasible to evolve Hadoop to cover interactive needs rather than move traditional architectures into the era of big data.…

We hosted a webinar on YARN a couple of weeks ago (see the slides and playback here). As you might expect, there was a lot of great questions and here is a set of answers to those questions.

Our next YARN-oriented Office Hours online on Sept 11th at 2pm PST. Join us on Meetup!

Who is using YARN and what benefits have they received from it?

On great public example of in production use of YARN, is at Yahoo!.…

This guest post from John Haddad, Director of Product Marketing at Informatica Corporation. He has over 25 years’ experience designing, building, integrating and marketing enterprise applications. His current focus is helping organizations get the most business value from Big Data by delivering timely, trusted, and relevant data across the extended enterprise.

Why is it so important for companies today to adopt a modern data architecture and why is next generation data integration on Apache Hadoop such a critical component?…

Go to page:« First...910111213...Last »