Hortonworks on Apache Hadoop


Hortonworks Data Platform 2.0 Alpha 2 now available: focus on performance

We are very pleased to announce that the Alpha 2 release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha 2) is now available for download!

A key focus of HDP 2.0 Alpha 2 is performance, as announced in the Stinger initiative, and the release includes a series of enhancements to the performance of Apache Hive for interactive SQL queries. In fact, HDP 2.0 Alpha 2 was used to perform the tests announced yesterday, showing a 45X performance increase using Hive. There is much more to come, but we are pleased with the early results and encourage Hive users to take a look and continue to give us feedback.

Consistent with HDP 2.0 Alpha 1, this version is built from the developmental Apache Hadoop 2.0 line and includes Apache YARN, a next-generation resource-management and application framework that enables Hadoop to support an ever-expanding range of use cases. …


HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery

Note: Continued in part two

Your Twitter Archive

Twitter has a new feature, Your Twitter Archive, that enables any user to download their tweets as an archive. To find this feature, look at the bottom of your account settings page. There should be an option for ‘Your Twitter archive,’ which will generate your tweets as a JSON/JavaScript web application and email them to you as a zip file.

Be patient: this process can take several days, particularly if you have lots of tweets (I personally have 24K tweets, and it took 4-5 days to get my tweets).

After a few hours or days, you’ll receive an email with a download link. Download your tweets, and unzip them to reveal their contents.
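The unzip step can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function names and the `tweets.zip` file name in the comments are assumptions for illustration, since the actual name of the downloaded archive is not given here.

```python
import zipfile


def list_archive(zip_path):
    """Return the names of the files stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


def extract_archive(zip_path, target_dir):
    """Unzip every file in the archive into target_dir."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)


# Hypothetical usage, assuming the download was saved as "tweets.zip":
#   for name in list_archive("tweets.zip"):
#       print(name)
#   extract_archive("tweets.zip", "tweets")
```

Listing the contents first makes it easy to see which files the archive holds before extracting anything.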

Digging In: ETL

There is a file called tweets.csv, but that is not the file we are interested in.…


Stinger Early Results: 45X Performance Increase for Hive

Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan

A few weeks back we blogged about the Stinger Initiative and made a promise to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing, but we are so excited about the early results of this work that we thought we’d take the time to share some of them with you.

In order to get a fair assessment we styled our tests after the TPC Benchmark™ DS (TPC-DS). For this initial report, we provide detail around two of the most common use cases and as we execute more queries and make more improvements we will provide more detail.

Performance Tests & Environment

In this report we provide results for two of our performance queries.  In the first, we perform a star schema join where we load all the small tables in memory and do a scan through the fact table independently on all nodes.…
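The general idea behind this kind of join (small tables held in memory, one pass over the fact table) can be sketched in a few lines of Python. This is an illustration only, not Hive's implementation: the table names, columns, and data below are hypothetical, and a real cluster would run the scan on each node over its own slice of the fact table.

```python
from collections import defaultdict

# Hypothetical star schema: two small dimension tables and one fact table.
date_dim = [
    {"date_id": 1, "year": 2012},
    {"date_id": 2, "year": 2013},
]
item_dim = [
    {"item_id": 10, "category": "Books"},
    {"item_id": 11, "category": "Music"},
]
sales_fact = [
    {"date_id": 1, "item_id": 10, "amount": 5.0},
    {"date_id": 2, "item_id": 10, "amount": 7.5},
    {"date_id": 2, "item_id": 11, "amount": 2.0},
    {"date_id": 2, "item_id": 99, "amount": 1.0},  # no matching item row
]


def build_lookup(rows, key):
    """Load a small dimension table into an in-memory hash table."""
    return {row[key]: row for row in rows}


def star_join_totals(fact_rows, dates, items):
    """Scan the fact rows once, probing the in-memory dimension tables."""
    totals = defaultdict(float)
    for fact in fact_rows:
        date = dates.get(fact["date_id"])
        item = items.get(fact["item_id"])
        if date is None or item is None:
            continue  # inner join: skip fact rows without a dimension match
        totals[(date["year"], item["category"])] += fact["amount"]
    return dict(totals)


dates = build_lookup(date_dim, "date_id")
items = build_lookup(item_dim, "item_id")
result = star_join_totals(sales_fact, dates, items)
print(result)
```

Because the dimension lookups are plain hash probes, the fact table is read only once and no shuffle of the large table is needed, which is why this pattern suits star schemas with small dimension tables.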


Touring Ambari

Hot on the heels of the release of the new version of Sandbox, I thought it would be worth a look at Ambari as it is now integrated into the Sandbox VM. You can download the Hortonworks Sandbox and try it out for yourself!

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It greatly reduces the complexity of running Apache Hadoop. Ambari is a fully open-source Apache project that provides a graphical interface to Hadoop.

The Ambari Dashboard serves as a home page for your cluster, displaying key metrics and linking you through to particular services on the cluster.

Heatmaps show which parts of your cluster are the least or most active, which can help with capacity and load management.

The Ambari Services interface lets you monitor cluster-wide services on your Hadoop cluster.

The Ambari Hosts interface lets you drill down to individual hosts that make up your cluster.…


Sandbox – Your Personal Hadoop Environment Gets Better!

We are excited to tell you about the newest release of the Hortonworks Sandbox.

The Hortonworks Sandbox provides the fastest onramp to Apache Hadoop with an easy-to-use, integrated learning environment and a functional personal Hadoop environment. The Sandbox takes the complexity out of Hadoop installation and setup by providing a fully functional virtual image. If you are evaluating Apache Hadoop or need an easy way to prove out use cases, then the Sandbox is for you. With the Sandbox, you don’t have to go through the work required to set up a Hadoop cluster or to configure Hadoop. Simply download the virtual machine. Zero to Big Data in 15 minutes!

Here are the key enhancements available now:

Apache Hadoop Essentials Classroom Material

Are you new to Hadoop and need the answer to “What is Apache Hadoop?” Then the Hadoop Essentials material is for you.…


Week in Review: From Plastics to Windows

We’re wrapping up another busy week at Hortonworks towers. I say another, but actually this is my first week. So… it’s a hello from me, I’m Marc Holmes, Community Director. What have we been talking about this week?

Plastics and Hadoop: discuss! We started the week with a post from our VP of Products, Bob Page, drawing an analogy between the growth of the plastics industry and the disruption of the database market driven by Hadoop, looking at the connections and differences to SQL and pointing out ‘what we don’t know yet’ about the evolution of use cases for Hadoop.

Hadoop and Windows sitting in a tree… Arun and Suresh highlighted the joint effort between Hortonworks and Microsoft to make Apache Hadoop run natively on Windows, and celebrated the community vote to move this work into the mainline trunk. We’re community-driven open source folk and we’re delighted not only by the code, but the spirit of community contribution throughout.…


Expanding the Apache Hadoop Community to Windows

This post was co-authored by Arun Murthy.

It’s been an exciting time for the Apache Hadoop community, with new and innovative projects happening around performance (Apache Tez, part of the Stinger initiative) and security (Apache Knox). In addition, Hortonworks recently announced the availability of the beta version of Hortonworks Data Platform for Windows.

One of the things we believe strongly in here at Hortonworks is community-driven open source and, obviously, the bigger the community, the better. The community opens itself up to new members through the development choices it makes, and last week the Apache Hadoop community voted to significantly expand itself by agreeing to accept enhancements into the core trunk that make Apache Hadoop run natively on Microsoft Windows platforms, including Windows Server and Windows Azure. These enhancements were the result of many, many months of joint engineering work from Microsoft and Hortonworks, and we are glad to see the community accept and embrace them.…


HOWTO install Hadoop on Windows

Installing the Hortonworks Data Platform for Windows couldn’t be easier. Let’s take a look at how to install a one-node cluster on your Windows Server 2012 machine. Let us know if you’d like more content like this.

To start, download the HDP for Windows MSI at http://hortonworks.com/thankyou-hdp11-win/. It is about 460MB, and will take a moment to download. Documentation for the download is available here.

As indicated in the documentation here, first we must install Microsoft Visual C++ 2010 Redistributable Package (x64), available here.

Download and install .NET from here if you haven’t already.

We need to set up Java, which you can get here. We also need to set JAVA_HOME, which Hadoop requires. Make sure to install Java somewhere without a space in the path; “Program Files” will not work!

To set JAVA_HOME, open the file browser -> right-click Computer -> Properties -> Advanced System Settings -> Environment variables.…
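Once the variable is set, a quick check can confirm that it is visible and free of spaces. The sketch below is in Python and is not part of the HDP installer; the function name `java_home_problems` is hypothetical, and the check only covers the two requirements mentioned above (JAVA_HOME is set, and the path has no space) plus the existence of the directory.

```python
import os


def java_home_problems(java_home):
    """Return a list of human-readable problems with a JAVA_HOME value."""
    problems = []
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems
    if " " in java_home:
        problems.append("JAVA_HOME contains a space: " + java_home)
    if not os.path.isdir(java_home):
        problems.append("JAVA_HOME is not an existing directory: " + java_home)
    return problems


# Check the value from the current environment and report anything wrong.
for problem in java_home_problems(os.environ.get("JAVA_HOME")):
    print(problem)
```

Run it from a new command prompt so that the freshly set environment variable is picked up; no output means none of these checks failed.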


Plastics, SQL and the Extensible Future of Hadoop


Mr. McGuire: I just want to say one word to you. Just one word.

Benjamin: Yes, sir.


Mr. McGuire: Are you listening?

Benjamin: Yes, I am.

Mr. McGuire: Plastics.

 

The advice given by Mr. McGuire in 1967’s The Graduate was certainly prophetic — plastics has become one of the largest manufacturing industries in the U.S. (Today, Mr. McGuire would probably say “Data.” But this post isn’t about career choices.)

Plastics initially took on familiar roles, providing rough equivalents for materials that were more expensive, in short supply, or otherwise limited in ways that made plastics a viable alternative. Glass, wood and metal were commonly imitated. But plastics were often seen as a poor replacement. Eventually, two things happened: new uses were found that went far beyond existing use cases, and the technology got better at resembling the materials it mimicked.…


Seamless Reporting & Analytics for Apache Hadoop & Big Data Users

Jaspersoft, a Hortonworks certified technology partner, recently completed a survey on the early use of Apache Hadoop in the enterprise. The company found 38% of respondents require real-time or near real-time analytics for their Big Data with Hadoop. Also, within the enterprise, there is a diverse group of people who use Hadoop for such insights: 63% are application developers, 15% are BI report developers and 10% are BI admins or casual business users. Register for a free webinar to hear more.

So, for Hadoop users, the partnership between Hortonworks and Jaspersoft provides a good combination: Jaspersoft complements Hadoop-based Big Data systems with reporting and analysis through a full suite of ETL, Apache Hive, and native Apache HBase connectors for low-latency data exploration. Not only does the company have an open source model that empowers users to deploy Big Data reporting and analytics quickly and cost-effectively, but its pre-defined reports also make it easy for a wide group of users to gain and share immediate insight.…


Getting Ready for The Elephant Party in Europe

We are just under two weeks away from the start of the first-ever Hadoop Summit Europe, and with all of the final preparations being made we thought we would highlight some of the not-to-be-missed activities in and around the event. The event is filling fast, but you can still register here.

Here are 10 great reasons to attend!

1)   Great track content – there are 35 informative sessions on Apache Hadoop and related technologies for you to choose from, selected by the community and delivered by the experts themselves.

2)   Great keynotes – leading industry analyst Matt Aslett will present the opening keynote, and we will also hear from open source veteran Shaun Connolly as well as Hortonworks CTO Eric Baldeschwieler.

3)   Hadoop in the Enterprise expert panel – We will have a live panel discussion with industry leaders including eBay, HSBC and Neustar discussing how and why they use Apache Hadoop.…


Separating Open Source Signal from Enterprise Hadoop Noise

There have been many Apache Hadoop-related announcements in the past few weeks, making it difficult to separate the signal from the marketing noise. One thing is crystal clear, however… there is a large and growing appetite for Enterprise Hadoop because it helps unlock new insights and business opportunities in a way that was not previously technologically or economically feasible.

Enterprise and Open Source are NOT Mutually Exclusive

Dan Woods from Forbes recently penned an article entitled “Why SQL Matters, the Limits of Open Source, and Other Lessons of EMC Greenplum’s Pivotal HD” in which he paints a picture of enterprise and open source in opposite corners. As an example, he closes his article with:

 “If you are a CIO what do you choose? Open source ideology or products that are made to solve enterprise problems by enterprise companies?”

I take issue with that either/or stance; just look at Red Hat, JBoss, SpringSource, MySQL as well as the broad enterprise use of Apache Web Server and Apache Tomcat for examples of enterprise-class open source software.…


Putting the Elephant in the Window

 

For several years now Apache Hadoop has been fueling the fast-growing big data market and has become the de facto platform for Big Data deployments and the technology foundation for an explosion of new analytic applications. Many organizations turn to Hadoop to help tame the vast amounts of new data they are collecting, but to do so they have had to use servers running the Linux operating system. That left a large number of organizations that standardize on Windows (according to IDC, Windows Server owned 73 percent of the market in 2012; see IDC, Worldwide and Regional Server 2012–2016 Forecast, Doc # 234339, May 2012) without the ability to run Hadoop natively, until today.

We are very pleased to announce the availability of Hortonworks Data Platform for Windows, providing organizations with an enterprise-grade, production-tested platform for big data deployments on Windows.…


Pig Eye for the SQL Guy

Cat Miller is an engineer at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

Introduction

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, the amounts of data are often so large that working with them on a traditional SQL database is too slow to be practical.…


Apache Hadoop YARN Meetup II @Hortonworks

Introduction

Hortonworks hosted the second Apache Hadoop YARN meetup at the Hortonworks office in Palo Alto last Friday (22 February 2013). Following the success of the first one, this meetup again enjoyed good attendance from the YARN community. About 40 people joined the meetup in person and nearly another 30 attended via phone/WebEx.

Meetup sessions
Update from Yahoo!

The Yahoo! grid team responsible for the YARN rollout on their clusters gave an update on the current deployments and their state. Robert Evans and others from the team shared some very impressive numbers about the YARN clusters: tens of millions of jobs run on YARN so far, averaging ~100,000 jobs per day on some clusters. Please go ahead and read their recent blog post on the Yahoo! developer network: Hadoop at Yahoo!: More Than Ever Before. They then fielded several questions from the community, such as pain points for users during the upgrade and big issues that only surfaced at scale.…

