Hortonworks on Apache Hadoop


Apache Hadoop Patterns of Use: Refine, Enrich and Explore

“OK, Hadoop is pretty cool, but exactly where does it fit and how are other people using it?”  Here at Hortonworks, this has got to be the most common question we get from the community… well that and “what is the airspeed velocity of an unladen swallow?”

We think about this (where Hadoop fits) a lot and have gathered a fair amount of expertise on the topic.  The core team at Hortonworks includes the original architects, developers and operators of Apache Hadoop and its use at Yahoo, and through this experience and working within the larger community they have been privileged to see Hadoop emerge as the technological underpinning for so many big data projects. That has allowed us to observe certain patterns that we’ve found greatly simplify the concepts associated with Hadoop, and our aim is to share some of those patterns here.

Read More

Where are Hortonworkers? Events and Meetups 8th April to 22nd April

Hortonworkers are out there – here is a rundown of events and meet ups we’ll be at in the next couple of weeks and we hope we’ll see you there. Did we miss any? Want us to attend your event? Let us know!

Big Data Innovation Summit

April 10-11, 2013, San Francisco, CA

http://theinnovationenterprise.com/summits/big-data-innovation-summit-april-2013-san-francisco

Spring into April and jump into Big Data! Be sure to meet us at Big Data Innovation Summit by the bay. We’re excited to have Alan Gates, co-founder of Hortonworks, presents on a couple of really exciting talks and we hope you can join us.

  •  April 11 @9:30am: Coordinating the Many Tools of Big Data in Hadoop
  •  April 11 @ 12:30pm: Hadoop Now, Next and Beyond
  •  April 11 @ 2:00pm: Roundtable Session: Use Case Patterns: Horizontal or Vertical

As a global sponsor, we’ll also be exhibiting. Look for us in the exhibit area and meet members of the Hortonworks team, who will be happy to discuss any questions you have on Hadoop and Hortonworks.…

Read More

Week in Review: Falcon, Hadoop Momentum and BFFs Forever!

More of a 2 weeks in review this time around owing to the Easter break. So what’s been happening?

Falcon bringing Data Lifecycle Management for Hadoop. The big news this week was the newly approved Apache Software Foundation incubator project – Falcon. The project was initiated by the team at InMobi and engineers from Hortonworks towers with the intent of simplifying data management through a data lifecycle management framework. Something for everyone then. More on Falcon here. Once again, it’s a great example of community driven open source driving the innovation that matters, or as Mohit Saxena of InMobi said:

Want to be BFFs with Hortonworks? According to this article on TechWorld, everyone does, and Neustar details why. We’re flattered by the sentiment and we’d love to be your friend. You can ‘Like’ us over here.

Market Momentum. So, with all of the innovation and buzz around Hadoop and Hortonworks, what does that mean for you, me, or anyone looking to dip a toe in the water?…

Read More

Integrating Apache Hadoop and SAP

With any enterprise software implementation, the challenge is often the integration of a chosen system with existing enterprise systems architecture. One such existing investment may be an ERP (and related) systems such as those provided by SAP. In this real-world instance, SAP partnered with Hortonworks to enable integration of Apache Hadoop into SAP Real-Time Data Platforms using Hortonworks Data Platform to facilitate business intelligence and analysis of Big Data.

The business challenges at hand will be familiar to everyone and are a great fit for a Hadoop solution. These are:

  • Data does not fit neatly in a relational format. The customer gathers more than one hundred million surveys each year. The most valuable data is in the “comments” field which is unstructured and therefore not analyzed.
  • The business cannot view data across departments. Customer training data, for example, is not typically joined across departments with the call center’s CRM application to help tailor a support call to the customer’s expertise.

Read More

Big Data Defined

‘Big Data’ has become a hot buzzword, but a poorly defined one. Here we will define it.

Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets:

In information technology, big data[1][2] is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.

It is better to define ‘Big Data’ in terms of opportunity, in terms of transformative economics. Big Data is the opportunity space created by new open source, distributed systems from the consumer internet space.

Specifically, a Big Data system has four properties:

  • It uses local storage to be fast but inexpensive
  • It uses clusters of commodity hardware to be inexpensive
  • It uses free software to be inexpensive
  • It is open source to avoid expensive vendor lock-in

Cheap storage means logging enormous volumes of data to many disks is easy.…

Read More

Project Falcon: Tackling Hadoop Data Lifecycle Management via Community Driven Open Source

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.

 

Falcon’s goal is to simplify data management on Hadoop and achieves this by providing important data lifecycle management services that any Hadoop application can rely on. Instead of hard-coding complex data lifecycle capabilities, apps can now rely on a proven, well-tested and extremely scalable data management system built specifically for the unique capabilities that Hadoop offers.…

Read More

Keynotes from Hadoop Summit Amsterdam 2013

The slides and videos from Hadoop Summit in Amsterdam have begun to flow so you can enjoy the sessions.

Whilst you’re thinking about which sessions to watch and read, then we suggest taking a look at the keynotes for the event:
  • What is the point of Hadoop? (VideoSlides)
  • Matt Aslett, Research Director, Data Management and Analytics, 451 Research
  • Real-World insight into Hadoop in the Enterprise (Video)
  • Panel featuring HSBC, eBay, Neustar and More
We hope you enjoy these sessions, and the content from the tracks. Let us know in the comments! And don’t forget that there is plenty of time to register for Hadoop Summit San Jose 2013.

Read More

Hadoop Market Momentum and You

On 27th March, the Wall Street Journal published an article ‘VCs Bet Big Bucks on Hadoop’ and it seems clear that the market is going to be huge. But what does that mean to you and your personal skills investment? Here’s our view:

Hadoop is HOT

Hadoop is incredibly hot right now as the number of available jobs continues to grow enormously (hey – we even have a bunch of our own right here).

Indeed’s Job Trends shows Hadoop as 7th hottest skill and it’s in great company alongside those app development skills such as iOS, Android and jQuery. I guess that’s to be expected of course: insights from big data is the fuel to smartest apps of the future.

The Hadoop trend itself is fairly clear. In growth terms, that is pretty explosive!

 

A quick search on LinkedIn will pull back around 1200 Hadoop jobs right now (it was 1281 when I checked).…

Read More

Hadoop Summit North America 2013: Community Choice Results

And the voting is over and the results are in for the Community Choice program of the Hadoop Summit San Jose 2013.

With over 300 sessions, and around 6000 users casting more than 15000 votes there was a lot of excitement to participate and influence the results - thanks to everyone for your contribution. At the end of the process, the selectees are:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

Congratulations to the selectees for each track, and a further honorable mention to Sears for winning the ‘Longest Session Title So Far’ which was a surprisingly hard fought contest!…

Read More

Understanding Hadoop 2.0

In this post, we’ll explain the difference between Hadoop 1.0 and 2.0. After all, what is Hadoop 2.0? What is YARN?

For starters – what is Hadoop and what is 1.0? The Apache Hadoop project is the core of an entire ecosystem of projects. It consists of four modules (see here):

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Hadoop 1.0 is based on the Hadoop .20.205 branch (it went 0.18 -> 0.19 -> 0.20 -> 0.20.2 -> 0.20.205 -> 1.0). Hard to follow? Check out this chart. Not hard for an open source developer, but obscure for an enterprise product – so everyone agreed to call 0.20.205 ’1.0′, the project having matured to that point.…

Read More

Week in Review: Sandboxes, HDP 2.0 Alpha 2, Hive Performance and Summits

 It’s almost time for that final drive home of the week, and what a week it has been with a few new releases, a summit, and a little bit of technical fun. Here’s what happened:

New Sandbox Release. Yes, your favorite Hadoop VM image just got even better. Cheryle took us through the new features which included Ambari integration and Russell followed up with a quick tour of Ambari. There’s still plenty of time to download Sandbox for a weekend of data crunching fun.

HDP 2.0 Alpha 2 was released. This preview release demonstrates some of the performance improvements in store for the final HDP 2.0 release via YARN, enhancements to Hive per the Stinger Initiative, and Apache Tez. Just before the release, we posted some early test results which showed a 45X (yes, that’s forty five) performance improvement for Hive interactive queries.…

Read More

Hadoop Summit 2013 Amsterdam – It’s A Wrap!

We want to take a moment to thank everyone who attended the Hadoop Summit in Amsterdam - THANK YOU! With nearly 500 people registered for the event we think we can safely say is was a big success. We’ve had overwhelming support to do it again next year – so watch this space.

The awesome Beurs Van Berlage venue set us up for a series of fantastic conversations and really well attended sessions and talks as Hadoop continues to explode onto the enterprise scene . Outside of the main tracks, there was great attendance for NLHUG and BoF talks, and kudos to the 10 presenters who ran those lightning talks. Finally, the customer panel was also well received, with great practical advice on adopting Hadoop from HSBC, Neustar and eBay.

But of course it wouldn’t be an event without a party, and we had a great time at the Heineken Experience (from what we can remember).  …

Read More

Hadoop Training on Windows, in Europe and Palo Alto

Over the last several weeks, Hortonworks has made a number of announcements regarding the Hortonworks Data Platform (HDP), including the upcoming release of HDP on Windows, the only Apache Hadoop distribution available on Microsoft Windows. We’ve been busy expanding out Hadoop training offerings: we now offer classes for HDP on Windows, you can find training in Europe through our global training partners and you can join us for Apache Hadoop courses in our new corporate headquarters, where you can have lunch with one of the committers. We continue to believe that community-driven open source is the fastest way to innovation and our the classes we offer on Apache Hadoop provide the best way to build the skills that are in demand — not skills on proprietary implementations but rather skills on open source Apache Hadoop.

Hadoop Training for HDP on Windows

Hortonworks is the only Hadoop distribution to run on Windows and we’ve released the Understanding Apache Hadoop on Microsoft Windows course.

Read More

Microsoft’s Contributions to the Stinger Initiative and Apache Hive

Guest blog post from Eric Hanson, Principal Program Manager, Microsoft

Hadoop had a crazy and collaborative beginning as an OSS project, and that legacy continues. There have been over 1,200 contributors across 80 companies since its beginning. Microsoft has been contributing to Hadoop since October 2011, and we’re committed to giving back and keeping it open.

Our first wave of contributions, in collaboration with Hortonworks, has been to port Hadoop to Windows, to enable it both for our HDInsight service on Windows Azure and for on-premises Big Data installations on Windows. Now, we’re starting to contribute to the Stinger initiative to dramatically speed up Hive and make it more enterprise-ready.

Contribution to the core of Apache Hadoop through Stinger

Our main activity in Stinger right now is around Tez, and vectorized query execution. One of our developers, Mike Liddell, has experience with DAG-based computations in Microsoft’s internal Dryad-LINQ effort, and has just joined Tez as a founding committer.…

Read More

HOWTO use Hive to SQLize your own Tweets – Part Two: Loading Hive, SQL Queries

In part one of this series, we covered how to download your tweet archive from Twitter, ETL it into json/newline format, and to extract a Hive schema. In this post, we will load our tweets into Hive and query them to learn about our little world.

To load our tweet-JSON into Hive, we’ll use the rcongiu Hive-JSON-Serde. Download and build it via:

wget http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api \
  -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
mvn package

Find the jar it generated via:

find .|grep jar
./target/json-serde-1.1.4-jar-with-dependencies.jar
./target/json-serde-1.1.4.jar

Run hive, and create our table with the following commands:

add jar /path/to/my/Hive-Json-Serde/target/json-serde-1.1.4-jar-with-dependencies.jar;

create table tweets (
   created_at string,
   entities struct <
      hashtags: array ,
            text: string>>,
      media: array ,
            media_url: string,
            media_url_https: string,
            sizes: array >,
            url: string>>,
      urls: array ,
            url: string>>,
      user_mentions: array ,
            name: string,
            screen_name: string>>>,
   geo struct <
      coordinates: array ,
      type: string>,
   id bigint,
   id_str string,
   in_reply_to_screen_name string,
   in_reply_to_status_id bigint,
   in_reply_to_status_id_str string,
   in_reply_to_user_id int,
   in_reply_to_user_id_str string,
   retweeted_status struct <
      created_at: string,
      entities: struct <
         hashtags: array ,
               text: string>>,
         media: array ,
               media_url: string,
               media_url_https: string,
               sizes: array >,
               url: string>>,
         urls: array ,
               url: string>>,
         user_mentions: array ,
               name: string,
               screen_name: string>>>,
      geo: struct <
         coordinates: array ,
         type: string>,
      id: bigint,
      id_str: string,
      in_reply_to_screen_name: string,
      in_reply_to_status_id: bigint,
      in_reply_to_status_id_str: string,
      in_reply_to_user_id: int,
      in_reply_to_user_id_str: string,
      source: string,
      text: string,
      user: struct <
         id: int,
         id_str: string,
         name: string,
         profile_image_url_https: string,
         protected: boolean,
         screen_name: string,
         verified: boolean>>,
   source string,
   text string,
   user struct <
      id: int,
      id_str: binary,
      name: string,
      profile_image_url_https: string,
      protected: boolean,
      screen_name: string,
      verified: boolean>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

Load it full of data from the tweet JSON file we created last tutorial:

LOAD DATA LOCAL INPATH '/path/to/all_tweets.json' OVERWRITE INTO TABLE tweets;

Verify our data loaded with a count:

SELECT COUNT(*) from tweets;
OK
24655

Our tweets are loaded!…

Read More

Go to page:« First...23456...10...Last »