From the Dev Team

Follow the latest developments from our technical team

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalag, Oozie and Ambari, and also some Powershell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive.

We are excited that another critical Enterprise Hadoop integration requirement – NFS Gateway access to HDFS – is making progress through the main Apache Hadoop trunk.  This effort is architected and designed by Brandon Li and Suresh Srinivas, and is being delivered by the community. You can track progress in Apache JIRA HDFS-4750.

With NFS access to HDFS, you can mount the HDFS cluster as a volume on client machines and have native command line, scripts or file explorer UI to view HDFS files and load data into HDFS.  …

The following post is from Nicolas Liochon and Devaraj Das with thanks to all members of the HBase team.

HBase is an always-available service and remains available in the face of machine failures and rack failures. Machines in the cluster runs RegionServer daemons. When a RegionServer crashes or the machine goes offline, the regions it was hosting goes offline as well. The focus of the MTTR work in HBase is to be able to detect abnormalities and to be able to restore access to (failed) offlined regions as early as possible.…

Almost time to spend a relaxing weekend in the garden, or crushing some data in your garage-based homebrew Hadoop cluster – whichever you prefer. But before we choose our path, let’s take a look at the last two weeks of happenings (I was lost in Oregon last week).

Hadoop is the perfect app for OpenStack. While I was struggling with driving directions, Red Hat, Marantis and Hortonworks were announcing plans for Project Savanna which aims to automate the deployment of Hadoop on enterprise-class OpenStack-powered clouds. …

On April 2nd, Hortonworks was excited to host the very first Apache Ambari Meetup. Thanks to all those who came along in person and virtually for a lot of vibrant discussion. If you would like to get involved in future Ambari Meetups, please visit this link. We are well on the way to making Hadoop management ‘dead simple’.

We have embedded the sessions below with some notes:

Overview and Demo of Ambari, Yusaku Sako, Hortonworks

  • This session covered Apache Ambari’s mission to “Make Hadoop management dead simple”, Ambari’s 4 major roles: 1) Provision, 2) Manage, 3) Monitor, and 4) Integrate, emphasized that everything that Ambari’s Web Client does is done thru Ambari’s REST API (100% REST), presented high-level architecture, and a live demo on how to provision, manage, and monitor a Hadoop cluster using the latest Ambari 1.2.2 release.

Check out our new knowledgebase article on Ambari on EC2. With these instructions, you can boot an EC2 Apache Hadoop cluster in minutes using Ambari.

Unstructured data, semi-structured data, structured data… it is all very interesting and we are in conversations about big and small versions of each of these data types every day. We love it…  we are data geeks at Hortonworks. We passionately understand that if you want to use any piece of data for some computation, there needs to be some layer of metadata and structure to interact with it.  Within Hadoop, this critical metadata service is provided by HCatalog.…

Guest blog post from Eric Hanson, Principal Program Manager, Microsoft

Hadoop had a crazy and collaborative beginning as an OSS project, and that legacy continues. There have been over 1,200 contributors across 80 companies since its beginning. Microsoft has been contributing to Hadoop since October 2011, and we’re committed to giving back and keeping it open.

Our first wave of contributions, in collaboration with Hortonworks, has been to port Hadoop to Windows, to enable it both for our HDInsight service on Windows Azure and for on-premises Big Data installations on Windows.…

In part one of this series, we covered how to download your tweet archive from Twitter, ETL it into json/newline format, and to extract a Hive schema. In this post, we will load our tweets into Hive and query them to learn about our little world.

To load our tweet-JSON into Hive, we’ll use the rcongiu Hive-JSON-Serde. Download and build it via:

wget http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/jdo2-api-2.3-ec.jar
mvn install:install-file -DgroupId=javax.jdo -DartifactId=jdo2-api \
-Dversion=2.3-ec -Dpackaging=jar -Dfile=jdo2-api-2.3-ec.jar
mvn package

Find the jar it generated via:

find .|grep jar
./target/json-serde-1.1.4-jar-with-dependencies.jar
./target/json-serde-1.1.4.jar

Run hive, and create our table with the following commands:

add jar /path/to/my/Hive-Json-Serde/target/json-serde-1.1.4-jar-with-dependencies.jar;

create table tweets (
created_at string,
entities struct <
hashtags: array ,
text: string>>,
media: array ,
media_url: string,
media_url_https: string,
sizes: array >,
url: string>>,
urls: array ,
url: string>>,
user_mentions: array ,
name: string,
screen_name: string>>>,
geo struct <
coordinates: array ,
type: string>,
id bigint,
id_str string,
in_reply_to_screen_name string,
in_reply_to_status_id bigint,
in_reply_to_status_id_str string,
in_reply_to_user_id int,
in_reply_to_user_id_str string,
retweeted_status struct <
created_at: string,
entities: struct <
hashtags: array ,
text: string>>,
media: array ,
media_url: string,
media_url_https: string,
sizes: array >,
url: string>>,
urls: array ,
url: string>>,
user_mentions: array ,
name: string,
screen_name: string>>>,
geo: struct <
coordinates: array ,
type: string>,
id: bigint,
id_str: string,
in_reply_to_screen_name: string,
in_reply_to_status_id: bigint,
in_reply_to_status_id_str: string,
in_reply_to_user_id: int,
in_reply_to_user_id_str: string,
source: string,
text: string,
user: struct <
id: int,
id_str: string,
name: string,
profile_image_url_https: string,
protected: boolean,
screen_name: string,
verified: boolean>>,
source string,
text string,
user struct <
id: int,
id_str: binary,
name: string,
profile_image_url_https: string,
protected: boolean,
screen_name: string,
verified: boolean>
)
ROW FORMAT SERDE ‘org.openx.data.jsonserde.JsonSerDe’
STORED AS TEXTFILE;

Load it full of data from the tweet JSON file we created last tutorial:

LOAD DATA LOCAL INPATH ‘/path/to/all_tweets.json’ OVERWRITE INTO TABLE tweets;

Verify our data loaded with a count:

SELECT COUNT(*) from tweets;
OK
24655

Our tweets are loaded!…

Note: Continued in part two

Your Twitter Archive

Twitter has a new feature, Your Twitter Archive, that enables any user to download their tweets as an archive. To view this feature, look at the bottom of the page at your account settings page. There should be an option for ‘Your Twitter archive,’ which will generate your tweets as a json/javascript web application and send them to you in email as a zip file.…

Written with Vinod Kumar Vavilapalli and Gopal Vijayaraghavan

A few weeks back we blogged about the Stinger Initiative and set a promise to work within the open community to make Apache Hive 100 times faster for SQL interaction with Hadoop. We have a broad set of scenarios queued up for testing but are so excited about the early results of this work that we thought we’d take the time to share some of this with you.…

Hot on the heels of the release of the new version of Sandbox, I thought it would be worth a look at Ambari as it is now integrated into the Sandbox VM. You can download the Hortonworks Sandbox and try it out for yourself!

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It greatly simplifies and reduces the complexity of running Apache Hadoop. Ambari is a fully open-source, Apache project and graphical interface to Hadoop.…

Installing the Hortonworks Data Platform for Windows couldn’t be easier. Lets take a look at how to install a one node cluster on your Windows Server 2012 machine. // to let us know if you’d like more content like this.

/centerTo start, download the HDP for Windows MSI at http://hortonworks.com/products/hdp-windows/#install/. It is about 460MB, and will take a moment to download. Documentation for the download is available here.…

Cat Miller is an engineer at Mortar Data, a Hadoop-as-a-service provider, and creator of mortar, an open source framework for data processing.

Introduction

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.…

Apache Pig version 0.11 was released last week. An Apache Pig blog post summarized the release. New features include:

  • A DateTime datatype, documentation here.
  • A RANK function, documentation here.
  • A CUBE operator, documentation here.
  • Groovy UDFs, documentation here.

And many improvements. Oink it up for Pig 0.11! Hortonworks’ Daniel Dai gave a talk on Pig 0.11 at Strata NY, check it out:…

Go to page:« First...910111213...Last »