The Hortonworks Blog

Today we are excited to see another example of the power of community at work as we highlight the newly approved Apache Software Foundation incubator project named Falcon. This incubation project was initiated by the team at InMobi together with engineers from Hortonworks. Falcon is useful to anyone building apps on Hadoop as it simplifies data management through the introduction of a data lifecycle management framework.

All About Falcon and Data Lifecycle Management

Falcon is a data lifecycle management framework for Apache Hadoop that enables users to configure, manage and orchestrate data motion, disaster recovery, and data retention workflows in support of business continuity and data governance use cases.…

The slides and videos from Hadoop Summit in Amsterdam have begun to flow so you can enjoy the sessions.

Whilst you’re thinking about which sessions to watch and read, then we suggest taking a look at the keynotes for the event:
  • What is the point of Hadoop? (VideoSlides)
  • Matt Aslett, Research Director, Data Management and Analytics, 451 Research
  • Real-World insight into Hadoop in the Enterprise (Video)
  • Panel featuring HSBC, eBay, Neustar and More
We hope you enjoy these sessions, and the content from the tracks.…

On 27th March, the Wall Street Journal published an article ‘VCs Bet Big Bucks on Hadoop’ and it seems clear that the market is going to be huge. But what does that mean to you and your personal skills investment? Here’s our view:

Hadoop is HOT

Hadoop is incredibly hot right now as the number of available jobs continues to grow enormously (hey – we even have a bunch of our own right here).…

And the voting is over and the results are in for the Community Choice program of the Hadoop Summit San Jose 2013.

With over 300 sessions, and around 6000 users casting more than 15000 votes there was a lot of excitement to participate and influence the results - thanks to everyone for your contribution. At the end of the process, the selectees are:

  • Application and Data Science Track: Watching Pigs Fly with the Netflix Hadoop Toolkit (Netflix)
  • Deployment and Operations Track: Continuous Integration for the Applications on top of Hadoop (Yahoo!)
  • Enterprise Data Architecture Track: Next Generation Analytics: A Reference Architecture (Mu Sigma)
  • Future of Apache Hadoop Track: Jubatus: Real-time and Highly-scalable Machine Learning Platform (Preferred Infrastructure, Inc.)
  • Hadoop (Disruptive) Economics Track: Move to Hadoop, Go Fast and Save Millions: Mainframe Legacy Modernization (Sears Holding Corp.)
  • Hadoop-driven Business / BI Track: Big Data, Easy BI (Yahoo!)
  • Reference Architecture Track: Genie – Hadoop Platformed as a Service at Netflix (Netflix)

Congratulations to the selectees for each track, and a further honorable mention to Sears for winning the ‘Longest Session Title So Far’ which was a surprisingly hard fought contest!…

In this post, we’ll explain the difference between Hadoop 1.0 and 2.0. After all, what is Hadoop 2.0? What is YARN?

For starters – what is Hadoop and what is 1.0? The Apache Hadoop project is the core of an entire ecosystem of projects. It consists of four modules (see here):

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

We want to take a moment to thank everyone who attended the Hadoop Summit in Amsterdam - THANK YOU! With nearly 500 people registered for the event we think we can safely say is was a big success. We’ve had overwhelming support to do it again next year – so watch this space.

The awesome Beurs Van Berlage venue set us up for a series of fantastic conversations and really well attended sessions and talks as Hadoop continues to explode onto the enterprise scene .…

Over the last several weeks, Hortonworks has made a number of announcements regarding the Hortonworks Data Platform (HDP), including the upcoming release of HDP on Windows, the only Apache Hadoop distribution available on Microsoft Windows. We’ve been busy expanding out Hadoop training offerings: we now offer classes for HDP on Windows, you can find training in Europe through our global training partners and you can join us for Apache Hadoop courses in our new corporate headquarters, where you can have lunch with one of the committers.…

Guest blog post from Eric Hanson, Principal Program Manager, Microsoft

Hadoop had a crazy and collaborative beginning as an OSS project, and that legacy continues. There have been over 1,200 contributors across 80 companies since its beginning. Microsoft has been contributing to Hadoop since October 2011, and we’re committed to giving back and keeping it open.

Our first wave of contributions, in collaboration with Hortonworks, has been to port Hadoop to Windows, to enable it both for our HDInsight service on Windows Azure and for on-premises Big Data installations on Windows.…

In part one of this series, we covered how to download your tweet archive from Twitter, ETL it into json/newline format, and to extract a Hive schema. In this post, we will load our tweets into Hive and query them to learn about our little world.

To load our tweet-JSON into Hive, we’ll use the rcongiu Hive-JSON-Serde. Download and build it via:

wget mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api \ -Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar mvn package

Find the jar it generated via:

find .|grep jar ./target/json-serde-1.1.4-jar-with-dependencies.jar ./target/json-serde-1.1.4.jar

Run hive, and create our table with the following commands:

add jar /path/to/my/Hive-Json-Serde/target/json-serde-1.1.4-jar-with-dependencies.jar; create table tweets ( created_at string, entities struct < hashtags: array , text: string>>, media: array , media_url: string, media_url_https: string, sizes: array >, url: string>>, urls: array , url: string>>, user_mentions: array , name: string, screen_name: string>>>, geo struct < coordinates: array , type: string>, id bigint, id_str string, in_reply_to_screen_name string, in_reply_to_status_id bigint, in_reply_to_status_id_str string, in_reply_to_user_id int, in_reply_to_user_id_str string, retweeted_status struct < created_at: string, entities: struct < hashtags: array , text: string>>, media: array , media_url: string, media_url_https: string, sizes: array >, url: string>>, urls: array , url: string>>, user_mentions: array , name: string, screen_name: string>>>, geo: struct < coordinates: array , type: string>, id: bigint, id_str: string, in_reply_to_screen_name: string, in_reply_to_status_id: bigint, in_reply_to_status_id_str: string, in_reply_to_user_id: int, in_reply_to_user_id_str: string, source: string, text: string, user: struct < id: int, id_str: string, name: string, profile_image_url_https: string, protected: boolean, screen_name: string, verified: boolean>>, source string, text string, user struct < id: int, id_str: binary, name: string, profile_image_url_https: string, protected: boolean, screen_name: string, verified: boolean> ) ROW FORMAT SERDE '' STORED AS TEXTFILE;

Load it full of data from the tweet JSON file we created last tutorial:

LOAD DATA LOCAL INPATH '/path/to/all_tweets.json' OVERWRITE INTO TABLE tweets;

Verify our data loaded with a count:

SELECT COUNT(*) from tweets; OK 24655

Our tweets are loaded!…

We are very pleased to announce the Alpha 2 release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha2) is now available for download!

A key focus in HDP 2.0 Alpha 2 is on performance as announced in the Stinger initiative, and includes a series of enhancements to the performance of Apache Hive for interactive SQL queries.  In fact HDP 2.0 Alpha 2 was used to perform the tests announced yesterday, showing a 45X performance increase using Hive. …

Note: Continued in part two

Your Twitter Archive

Twitter has a new feature, Your Twitter Archive, that enables any user to download their tweets as an archive. To view this feature, look at the bottom of the page at your account settings page. There should be an option for ‘Your Twitter archive,’ which will generate your tweets as a json/javascript web application and send them to you in email as a zip file.…

Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan

A few weeks back we blogged about the Stinger Initiative and set a promise to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing but are so excited about the early results of this work that we thought we’d take the time to share some of this with you.…

Hot on the heels of the release of the new version of Sandbox, I thought it would be worth a look at Ambari as it is now integrated into the Sandbox VM. You can download the Hortonworks Sandbox and try it out for yourself!

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It greatly simplifies and reduces the complexity of running Apache Hadoop. Ambari is a fully open-source, Apache project and graphical interface to Hadoop.…

We are excited to tell you about the newest release of the Hortonworks Sandbox.

The Hortonworks Sandbox provides the fastest onramp to Apache Hadoop with an easy-to-use, integrated learning environment and a functional personal Hadoop environment. The Sandbox takes the complexity out of Hadoop installation and set up by providing a fully functional virtual image. If you are evaluating Apache Hadoop or need an easy way to prove out use cases then the Sandbox is for you.…

This post co-authored by Arun Murthy.

It’s been an exciting time for the Apache Hadoop community with new and innovative projects happening around performance (Apache Tez) — part of the Stinger initiative — and security (Apache Knox). In addition Hortonworks recently announced the availability of the beta version of Hortonworks Data Platform for Windows.

One of the things we believe strongly in here at Hortonworks is community driven open source and, obviously, the bigger the community, the better.…

Go to page:« First...1020...2627282930...40...Last »