Category Archives: Ambari


Hadoop SDK and Tutorials for Microsoft .NET Developers

Microsoft has begun to treat its developer community to a number of Hadoop-y releases related to its HDInsight (Hadoop in the cloud) service, and it’s worth rounding up the material. It’s all Alpha and Preview so YMMV but looks like fun:

  • Microsoft .NET SDK for Hadoop. This kit provides .NET API access to aspects of HDInsight including HDFS, HCatalog, Oozie and Ambari, and also some PowerShell scripts for cluster management. There are also libraries for MapReduce and LINQ to Hive. The latter is really interesting: it builds on LINQ, the established technology .NET developers use to access most data sources, to deliver the capabilities of Hive, the de facto standard for Hadoop data query.
  • HDInsight Labs Preview. Up on GitHub, there is a series of 5 labs covering C#, JavaScript and F# coding for MapReduce jobs, using Hive, and then bringing that data into Excel. It also covers some Mahout use to build a recommendation engine.
  • Microsoft Hive ODBC Driver. The examples above use this preview driver to enable the connection from Hive to Excel.

If all of the above excites you, our Hadoop on Windows for Developers training course also covers similar content in a lot of depth.

You can read more about the partnership between Hortonworks and Microsoft here, and you can download a preview of HDP for Windows here, or sign up for HDInsight over here. And if you’re hungry for more Hadoop tutorials, grab our own Hortonworks Sandbox here.

Field Notes: Apache Ambari Meetup at Hortonworks

On April 2nd, Hortonworks was excited to host the very first Apache Ambari Meetup. Thanks to all those who came along in person and virtually for a lot of vibrant discussion. If you would like to get involved in future Ambari Meetups, please visit this link. We are well on the way to making Hadoop management ‘dead simple’.

We have embedded the sessions below with some notes:

  • Overview and Demo of Ambari. Yusaku Sako (Hortonworks)
    • This session covered Apache Ambari’s mission to “make Hadoop management dead simple” and its four major roles: 1) Provision, 2) Manage, 3) Monitor, and 4) Integrate. Yusaku emphasized that everything Ambari’s Web Client does goes through Ambari’s REST API (100% REST), presented the high-level architecture, and gave a live demo of provisioning, managing, and monitoring a Hadoop cluster using the latest Ambari 1.2.2 release.
    • The project website can be found at http://incubator.apache.org/ambari, with all the info about Ambari within.  There was encouragement for everyone to contribute to Ambari’s success through filing bugs, participating in mailing list discussions, providing feedback and direction, writing documentation, submitting patches, etc.
  • APIs and SPIs of Ambari (How to Integrate with Ambari). Tom Beerbower (Hortonworks)
    • Tom presented the details of Ambari’s REST API and gave a live demo of the API in action.
    • He explained the SPI (Service Provider Interface) and how its pluggability allows various integration scenarios.
    • This was a great lead-up to the next presentation by Teradata, who integrated Hadoop monitoring into their management software, Teradata ViewPoint, using Ambari’s REST API and SPI.
  • Teradata ViewPoint Hadoop Integration with Ambari. Steve Ratay (Teradata)
    • Steve presented on how Ambari is a key enabler for integrating Hadoop monitoring into ViewPoint. Without Ambari, integrating Hadoop monitoring into ViewPoint would have been difficult: metrics would need to be collected from a number of sources spread across different technologies and formats, such as Ganglia, Nagios, JMX, and screen scraping Hadoop’s native web UI. The Ambari REST API provides a central place with a single, consistent data format (JSON) and powerful querying capabilities. As Steve put it, “Ambari to the Rescue!”
    • The highlight of the presentation was a live demo of Teradata ViewPoint with a plethora of Hadoop metrics exposed through Ambari REST API behind the scenes.  Steve said that the integration only took a couple of months of effort by a couple of people.
  • Ambari Futures. Jeff Sposetti (Hortonworks)
    • Jeff presented what’s in store for Ambari in the future starting with what’s currently being worked on for the 1.3.0 release, as well as beyond…
    • There was a lot of interest in the room around extensible stack definitions to integrate any Hadoop ecosystem component.
    • The concept of Cluster Blueprints was also shared, which would allow “zero-touch” and headless installs of Hadoop clusters.
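As a rough illustration of the "single, consistent data format (JSON)" point from the Teradata talk, the sketch below builds a request URL and flattens a host-metrics response into a simple mapping. The server address, the cluster name `demo`, the `fields` value, and the sample payload are all assumptions made for this example, not output from a real Ambari server; check the API documentation for your Ambari release before relying on any path or field name.

```python
import json
from urllib.parse import urlencode

def hosts_url(base, cluster, fields):
    """Build a URL asking for selected fields on every host (paths are assumed)."""
    query = urlencode({"fields": ",".join(fields)}, safe="/,")
    return f"{base}/api/v1/clusters/{cluster}/hosts?{query}"

def load_by_host(payload):
    """Map host name -> one-minute load from an already parsed response body."""
    result = {}
    for item in payload.get("items", []):
        name = item["Hosts"]["host_name"]
        result[name] = item.get("metrics", {}).get("load", {}).get("load_one")
    return result

# Hand-written response body for this sketch; a real server returns more fields.
SAMPLE = json.loads("""
{"items": [
  {"Hosts": {"host_name": "node1.example.com"},
   "metrics": {"load": {"load_one": 0.42}}},
  {"Hosts": {"host_name": "node2.example.com"},
   "metrics": {"load": {"load_one": 1.17}}}
]}
""")

url = hosts_url("http://ambari.example.com:8080", "demo", ["metrics/load/load_one"])
loads = load_by_host(SAMPLE)
```

The sketch makes no network call on purpose: an integration like the one described would fetch `url` with an authenticated HTTP client and pass the decoded body to `load_by_host`.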

You can watch all 2 hours of proceedings at this link. Once again, thanks to everyone who attended and took part in a great conversation – see you next time!

Week in Review: Sandboxes, HDP 2.0 Alpha 2, Hive Performance and Summits

It’s almost time for that final drive home of the week, and what a week it has been with a few new releases, a summit, and a little bit of technical fun. Here’s what happened:

New Sandbox Release. Yes, your favorite Hadoop VM image just got even better. Cheryle took us through the new features which included Ambari integration and Russell followed up with a quick tour of Ambari. There’s still plenty of time to download Sandbox for a weekend of data crunching fun.

HDP 2.0 Alpha 2 was released. This preview release demonstrates some of the performance improvements in store for the final HDP 2.0 release via YARN, enhancements to Hive per the Stinger Initiative, and Apache Tez. Just before the release, we posted some early test results which showed a 45X (yes, that’s forty five) performance improvement for Hive interactive queries. But that’s just the beginning as we push to 100X, and Microsoft also talked about their contributions to the Stinger Initiative with the same aim in mind.

If you’ve downloaded Sandbox and are looking for some inspiration for a little fun, then Russell also posted a two part series on extracting, loading, querying and analyzing your own Twitter archive with Hive. Part 1 is here, and Part 2 is here.

And finally, there was just the small matter of the Hadoop Summit in Amsterdam. We had a great time and hope you did too. Thank you for attending, contributing to the conversation and supporting Hadoop. If you’re now really excited to learn Hadoop, we posted about available training we have in Europe and Palo Alto.

And that was the week that was. Has your Sandbox downloaded yet?

Touring Ambari

Hot on the heels of the release of the new version of Sandbox, I thought it would be worth a look at Ambari as it is now integrated into the Sandbox VM. You can download the Hortonworks Sandbox and try it out for yourself!

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It greatly reduces the complexity of running Apache Hadoop. Ambari is a fully open-source Apache project that provides a graphical interface to Hadoop.

[Screenshot: Ambari Dashboard]

The Ambari Dashboard serves as a home page for your cluster, displaying key metrics and linking you through to particular services on the cluster.

[Screenshot: Ambari Heatmaps]

Heatmaps show which parts of your cluster are the least or most active, which can help with capacity and load management.

[Screenshot: Ambari Services]

The Ambari Services interface lets you monitor cluster-wide services on your Hadoop cluster.

[Screenshot: Ambari Hosts]

The Ambari Hosts interface lets you drill down to individual hosts that make up your cluster.

[Screenshot: Ambari Jobs]

The Ambari Jobs interface lets you examine the individual applications and jobs that make up your Hadoop workload.

[Screenshot: Ambari Users]

The Ambari Users interface helps you administer new users on your Hadoop cluster. You can try it out by downloading the new Hortonworks Sandbox. We hope you enjoyed this post; please let us know by commenting!

Big Graph Data on Hortonworks Data Platform

The Hortonworks Data Platform (HDP) conveniently integrates numerous Big Data tools in the Hadoop ecosystem. As such, it provides cluster-oriented storage, processing, monitoring, and data integration services. HDP simplifies the deployment and management of a production Hadoop-based system.

In Hadoop, data is represented as key/value pairs. In HBase, data is represented as a collection of wide rows. These atomic structures make global data processing (via MapReduce) and row-specific reading/writing (via HBase) simple. However, writing queries is nontrivial if the data has a complex, interconnected structure that needs to be analyzed (see Hadoop joins and HBase joins). Without an appropriate abstraction layer, processing highly structured data is cumbersome. Indeed, choosing the right data representation and associated tools opens up otherwise unimaginable possibilities. One such data representation that naturally captures complex relationships is a graph (or network). This post presents Aurelius’ Big Graph Data technology suite in concert with Hortonworks Data Platform. Moreover, for a real-world grounding, a GitHub clone is described in this context to help the reader understand how to use these technologies for building scalable, distributed, graph-based systems.

Aurelius Graph Cluster and Hortonworks Data Platform Integration

The Aurelius Graph Cluster can be used in concert with Hortonworks Data Platform to provide users a distributed graph storage and processing system with the management and integration benefits provided by HDP. Aurelius’ graph technologies include Titan, a highly scalable graph database optimized for serving real-time results to thousands of concurrent users, and Faunus, a distributed graph analytics engine that is optimized for batch processing graphs represented across a multi-machine cluster.

In an online social system, for example, there typically exists a user base that is creating things and various relationships amongst these things (e.g. likes, authored, references, stream). Moreover, they are creating relationships amongst themselves (e.g. friend, group member). To capture and process this structure, a graph database is useful. When the graph is large and it is under heavy transactional load, then a distributed graph database such as Titan/HBase can be used to provide real-time services such as searches, recommendations, rankings, scorings, etc. Next, periodic offline global graph statistics can be leveraged. Examples include identifying the most connected users, or tracking the relative importance of particular trends. Faunus/Hadoop serves this requirement. Graph queries/traversals in Titan and Faunus are simple, one-line commands that are optimized both semantically and computationally for graph processing. They are expressed using the Gremlin graph traversal language. The roles that Titan, Faunus, and Gremlin play in HDP are diagrammed below.

[Figure: Aurelius and HDP Integration]

A Graph Representation of GitHub

GitHub is an online source code service where over 2 million people collaborate on over 4 million projects. However, GitHub provides more than just revision control. In the last 4 years, GitHub has become a massive online community for software collaboration. Some of the biggest software projects in the world use GitHub (e.g. the Linux kernel).

GitHub is growing rapidly — 10,000 to 30,000 events occur each hour (e.g. a user contributing code to a repository). Hortonworks Data Platform is suited to storing, analyzing, and monitoring the state of GitHub. However, it lacks specific tools for processing this data from a relationship-centric perspective. Representing GitHub as a graph is natural because GitHub connects people, source code, contributions, projects, and organizations in diverse ways. Thinking purely in terms of key/value pairs and wide rows obfuscates the underlying relational structure which can be leveraged for more complex real-time and batch analytic algorithms.

GitHub provides 18 event types, which range from new commits and fork events, to opening new tickets, commenting, and adding members to a project. The activity is aggregated in hourly archives, [each of which] contains a stream of JSON encoded GitHub events. (via githubarchive.org)

The aforementioned events can be represented according to the popular property graph data model. A graph schema describing the types of “things” and relationships between them is diagrammed below. A parse of the raw data according to this schema yields a graph instance.

[Figure: GitHub Schema]

Deploying a Graph-Based GitHub

To integrate the Aurelius Graph Cluster with HDP, Whirr is used to launch a cluster of four m1.xlarge machines on Amazon EC2. Detailed instructions for this process are provided on the Aurelius Blog, with the exception that a modified Whirr properties file must be used for HDP. A complete HDP Whirr solution is currently in development. To add Aurelius technologies to an existing HDP cluster, simply download Titan and Faunus, which interface with installed components such as Hadoop and HBase without further configuration.

5830 hourly GitHub Archive files between mid-March 2012 and mid-November 2012 contain 31 million GitHub events. The archive files are parsed to generate a graph. For example, when a GitHub push event is parsed, vertices with the types user, commit, and repository are generated. An edge with label pushed links the user to the commit and an edge with label to links the commit to the repository. The user vertex has properties such as user name and email address, the commit vertex has properties such as the unique sha sum identifier for the commit and its timestamp, and the repository vertex has properties like its URL and the programming language used. In this way, the 31 million events give rise to 27 million vertices and 79 million edges (a relatively small graph). Complete instructions for parsing the data are in the githubarchive-parser documentation. Once the configuration options are reviewed, launching the automated parallel parser is simple.

$ export LC_ALL="C"
$ export JAVA_OPTIONS="-Xmx1G"
$ python AutomatedParallelParser.py batch
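The push-event mapping described above (a user vertex, commit vertices, a repository vertex, and the pushed and to edges between them) can be sketched as follows. This is not the githubarchive-parser code: the event field names (`actor`, `repository`, `shas`, `created_at`) and the vertex ID scheme are assumptions chosen for illustration.

```python
def parse_push_event(event):
    """Turn one push event into (vertices, edges) using the schema in the text."""
    user_id = "user:" + event["actor"]
    repo_id = "repo:" + event["repository"]["url"]
    vertices = {
        user_id: {"type": "User", "name": event["actor"]},
        repo_id: {
            "type": "Repository",
            "url": event["repository"]["url"],
            "language": event["repository"].get("language"),
        },
    }
    edges = []
    for sha in event["shas"]:
        commit_id = "commit:" + sha
        vertices[commit_id] = {
            "type": "Commit",
            "sha": sha,
            "timestamp": event["created_at"],
        }
        edges.append((user_id, "pushed", commit_id))  # user -pushed-> commit
        edges.append((commit_id, "to", repo_id))      # commit -to-> repository
    return vertices, edges

# A made-up event with two commits.
event = {
    "actor": "okram",
    "created_at": "2012-11-01T12:00:00Z",
    "repository": {"url": "https://github.com/example/repo", "language": "Java"},
    "shas": ["a1b2c3", "d4e5f6"],
}
vertices, edges = parse_push_event(event)
```

Keying vertices by a stable ID means a user or repository seen in many events collapses to a single vertex, which is what lets 31 million events produce fewer vertices than events.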

The generated vertex and edge data is imported into the Titan/HBase cluster using the BatchGraph wrapper of the Blueprints graph API (a simple, single-threaded insertion tool).

$ export JAVA_OPTIONS="-Xmx12G"
$ gremlin -e ImportGitHubArchive.groovy vertices.txt edges.txt

Titan: Distributed Graph Database

Titan is a distributed graph database that leverages existing storage systems for its persistence. Currently, Titan provides out-of-the-box support for Apache HBase and Cassandra (see documentation). Graph storage and processing in a clustered environment is made possible by numerous techniques that both efficiently represent a graph within a BigTable-style data system and efficiently process that graph using linked-list walking and vertex-centric indices. Moreover, for the developer, Titan provides native support for the Gremlin graph traversal language. This section will demonstrate various Gremlin traversals over the parsed GitHub data.

The following Gremlin snippet determines which repositories Marko Rodriguez (okram) has committed to the most. The query first locates the vertex with name okram and then takes outgoing pushed-edges to his commits. For each of those commits, the outgoing to-edges are traversed to the repository that commit was pushed to. Next, the name of the repository is retrieved and those names are grouped and counted. The side-effect count map is outputted, sorted in decreasing order, and displayed. A graphical example demonstrating gremlins walking is diagrammed below.

gremlin> g = TitanFactory.open('bin/hbase.local')                
==>titangraph[hbase:127.0.0.1]
gremlin> g.V('name','okram').out('pushed').out('to').github_name.groupCount.cap.next().sort{-it.value}
==>blueprints=413
==>gremlin=69
==>titan=49
==>pipes=49
==>rexster=40
==>frames=26
==>faunus=23
==>furnace=9
==>tinkubator=5
==>homepage=1

[Figure: GitHub Gremlin traversal]
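For readers unfamiliar with Gremlin, the shape of the traversal above (walk pushed-edges, walk to-edges, then group and count) can be mimicked on a toy in-memory graph. The adjacency data and the `out` helper below are invented for this sketch and have nothing to do with Titan's actual storage or API.

```python
from collections import Counter

# Toy adjacency lists keyed by (vertex, edge label); all data is made up.
OUT = {
    ("okram", "pushed"): ["c1", "c2", "c3"],
    ("c1", "to"): ["blueprints"],
    ("c2", "to"): ["blueprints"],
    ("c3", "to"): ["gremlin"],
}

def out(vertices, label):
    """Follow outgoing edges with the given label from each vertex, keeping duplicates."""
    return [target for v in vertices for target in OUT.get((v, label), [])]

# okram -> commits -> repositories, then count how often each repository is reached.
repos = out(out(["okram"], "pushed"), "to")
counts = Counter(repos).most_common()
```

Duplicates are kept deliberately: each path that reaches a repository corresponds to one commit, so the counts are commits per repository.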

The above query can be taken two steps further to determine Marko’s collaborators. If two people have pushed commits to the same repository, then they are collaborators. Because many people may commit to a repository, and a collaborator has typically pushed numerous commits, the search is capped at roughly 2500 such collaborator paths. One of the most important aspects of graph traversing is understanding the combinatorial path explosions that can occur when traversing multiple hops through a graph (see Loopy Lattices).

gremlin> g.V('name','okram').out('pushed').out('to').in('to').in('pushed').hasNot('name','okram')[0..2500]
   .name.groupCount.cap.next().sort{-it.value}[0..4]
==>lvca=877
==>spmallette=504
==>sgomezvillamor=424
==>mbroecheler=356
==>joshsh=137

Complex traversals are easy to formulate with the data in this representation. For example, Titan can be used to generate followship recommendations. There are numerous ways to express a recommendation (with varying semantics). A simple one is: “Recommend me people to follow based on people who watch the same repositories as me. The more repositories I watch in common with someone, the higher they should be ranked.” The traversal below starts at Marko and traverses to all the repositories that Marko watches. It then moves to everyone else (not Marko) who watches those repositories, counts those people, and returns the top 5 names of the sorted result set. In fact, Marko and Stephen (spmallette) are long-time collaborators and thus have similar tastes in software.

gremlin> g.V('name','okram').out('watched').in('watched').hasNot('name','okram').name.groupCount
   .cap.next().sort{-it.value}[0..4]
==>spmallette=3
==>alex-wajam=3
==>crimeminister=2
==>redgetan=2
==>snicaise=2
gremlin> g.V('name','okram').out('created').has('type','Comment').count()
==>159
gremlin> g.V('name','okram').out('created').has('type','Issue').count()  
==>176
gremlin> g.V('name','okram').out('edited').count()                     
==>85
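The co-watch recommendation described earlier ("the more repositories I watch in common with someone, the higher they should be ranked") boils down to counting shared repositories. A minimal sketch, assuming the watched-edges are available as plain (user, repository) pairs; the pairs and the `recommend` function are hypothetical:

```python
from collections import Counter, defaultdict

WATCHED = [  # (user, repository) pairs; invented for illustration
    ("okram", "titan"), ("okram", "faunus"), ("okram", "pipes"),
    ("spmallette", "titan"), ("spmallette", "faunus"),
    ("joshsh", "titan"),
    ("someone", "rails"),
]

def recommend(me, watched, top=5):
    """Rank other users by how many repositories they watch in common with `me`."""
    repos_by_user = defaultdict(set)
    users_by_repo = defaultdict(set)
    for user, repo in watched:
        repos_by_user[user].add(repo)
        users_by_repo[repo].add(user)
    scores = Counter()
    for repo in repos_by_user[me]:         # out('watched')
        for other in users_by_repo[repo]:  # in('watched')
            if other != me:                # hasNot('name', me)
                scores[other] += 1
    return scores.most_common(top)

ranking = recommend("okram", WATCHED)
```

Unlike the graph database, this sketch holds every edge in memory, which is only reasonable for small examples.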

A few self-describing traversals are presented above that are rooted at okram. Finally, note that Titan is optimized for local/ego-centric traversals. That is, from a particular source vertex (or small set of vertices), use some path description to yield a computation based on the explicit paths walked. For doing global graph analyses (where the source vertex set is the entire graph), a batch processing framework such as Faunus is used.

Faunus: Graph Analytics Engine

Every Titan traversal begins at a small set of vertices (or edges). Titan is not designed for global analyses which involve processing the entire graph structure. The Hadoop component of Hortonworks Data Platform provides a reliable backend for global queries via Faunus. Gremlin traversals in Faunus are compiled down to MapReduce jobs, where the first job’s InputFormat is Titan/HBase. In order to not interfere with the production Titan/HBase instance, a snapshot of the live graph is typically generated and stored in Hadoop’s distributed file system (HDFS) as a SequenceFile available for repeated analysis. The most general SequenceFile (with all vertices, edges, and properties) is created below (i.e. a full graph dump).

faunus$ cat bin/titan-seq.properties 
faunus.graph.input.format=com.thinkaurelius.faunus.formats.titan.hbase.TitanHBaseInputFormat
hbase.zookeeper.quorum=10.68.65.161
hbase.mapreduce.inputtable=titan
hbase.mapreduce.scan.cachedrows=75
faunus.graph.output.format=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
faunus.sideeffect.output.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
faunus.output.location=full-seq
faunus.output.location.overwrite=true

faunus$ bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = FaunusFactory.open('bin/titan-seq.properties')
==>faunusgraph[titanhbaseinputformat]
gremlin> g._().toString()
==>[IdentityMap]
gremlin> g._()
12/12/13 09:19:53 INFO mapreduce.FaunusCompiler: Compiled to 1 MapReduce job(s)
12/12/13 09:19:55 INFO mapred.JobClient:  map 0% reduce 0%
12/12/13 09:21:26 INFO mapred.JobClient:  map 1% reduce 0%
12/12/13 09:21:36 INFO mapred.JobClient:  map 2% reduce 0%
12/12/13 09:21:43 INFO mapred.JobClient:  map 3% reduce 0%
...
gremlin> hdfs.ls()
==>rwx------ ubuntu supergroup 0 (D) .staging
==>rwxr-xr-x ubuntu supergroup 0 (D) full-seq
gremlin> hdfs.ls('full-seq/job-0')
==>rw-r--r-- ubuntu supergroup 0 _SUCCESS
==>rwxr-xr-x ubuntu supergroup 0 (D) _logs
==>rw-r--r-- ubuntu supergroup 243768636 part-m-00000
==>rw-r--r-- ubuntu supergroup 125250887 part-m-00001
==>rw-r--r-- ubuntu supergroup 331912876 part-m-00002
==>rw-r--r-- ubuntu supergroup 431617929 part-m-00003
...

Given the generated SequenceFile, the vertices and edges are counted by type and label, which is by definition a global operation.

gremlin> g.V.type.groupCount
==>Gist         780626
==>Issue        1298935
==>Organization 36281
==>Comment      2823507
==>Commit       20338926
==>Repository   2075934
==>User         983384
==>WikiPage     252915
gremlin> g.E.label.groupCount                                           
==>deleted        170139
==>on             7014052
==>owns           180092
==>pullRequested  930796
==>pushed         27538088
==>to             27719774
==>added          181609
==>created        10063346
==>downloaded     122157
==>edited         276609
==>forked         1015435
==>of             536816
==>appliedForkTo  1791
==>followed       753451
==>madePublic     26602
==>watched        2784640

Since GitHub is collaborative in a way similar to Wikipedia, there are a few users who contribute a lot, and many users who contribute little or none at all. To determine the distribution of contributions, Faunus can be used to compute the out degree distribution of pushed-edges, which correspond to users pushing commits to repositories. This is equivalent to Gremlin visiting each user vertex, counting all of the outgoing pushed-edges, and returning the distribution of counts.

gremlin> g.V.sideEffect('{it.degree = it.outE("pushed").count()}').degree.groupCount
==>1	57423
==>10	8856
==>100	527
==>1000	9
==>1004	5
==>1008	6
==>1011	6
==>1015	6
==>1019	3
==>1022	9
==>1026	2
==>1033	6
==>1037	4
==>104	462
==>1040	3
==>...
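The out-degree distribution computed above can be illustrated on a handful of edges. This sketch assumes the pushed-edges are available as (user, commit) pairs, all invented here. Unlike the Faunus job it only sees vertices that have at least one pushed-edge, so there is no bucket for degree 0.

```python
from collections import Counter

# (user, commit) pairs standing in for pushed-edges; made up for the example.
PUSHED = [("alice", "c1"), ("alice", "c2"), ("bob", "c3"),
          ("carol", "c4"), ("alice", "c5")]

def out_degree_distribution(edges):
    """Return {out-degree: number of vertices with that out-degree}."""
    degree = Counter(source for source, _ in edges)  # edges per source vertex
    return Counter(degree.values())                  # vertices per degree value

dist = out_degree_distribution(PUSHED)
rows = sorted(dist.items())  # numeric order, ready for plotting
```

The Faunus output listed above appears to be ordered as text (1, 10, 100, 1000, ...), so a numeric sort like the one in `rows` would be needed before plotting it.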

When the degree distribution is plotted using log-scaled axes, the results are similar to the Wikipedia contribution distribution, as expected. This is a common theme in most natural graphs — real-world graphs are not random structures and are composed of few “hubs” and numerous “satellites.”
[Figure: GitHub pushed-edge out-degree distribution, log-scaled axes]

More sophisticated queries can be performed by first extracting a slice of the original graph that only contains relevant information. These slices can be saved to HDFS for subsequent traversals. For example, to calculate the most central co-watched project on GitHub, the primary graph is stripped down to only watched-edges between users and repositories. The final traversal below walks the “co-watched” graph twice and counts the number of paths that have gone through each repository. The repositories are sorted by their path counts in order to express which repositories are most central/important/respected according to the watches subgraph.

gremlin> g.E.has('label','watched').keep.V.has('type','Repository','User').keep
...
12/12/13 11:08:13 INFO mapred.JobClient:   com.thinkaurelius.faunus.mapreduce.sideeffect.CommitVerticesMapReduce$Counters
12/12/13 11:08:13 INFO mapred.JobClient:     VERTICES_DROPPED=19377850
12/12/13 11:08:13 INFO mapred.JobClient:     VERTICES_KEPT=2074099
12/12/13 11:08:13 INFO mapred.JobClient:   com.thinkaurelius.faunus.mapreduce.sideeffect.CommitEdgesMap$Counters
12/12/13 11:08:13 INFO mapred.JobClient:     OUT_EDGES_DROPPED=55971128
12/12/13 11:08:13 INFO mapred.JobClient:     OUT_EDGES_KEPT=1934706
...
gremlin> g = g.getNextGraph()
gremlin> g.V.in('watched').out('watched').in('watched').out('watched').property('_count',Long.class)
   .order(F.decr,'github_name')
==>backbone	4173578345
==>html5-boilerplate	4146508400
==>normalize.css	3255207281
==>django	3168825839
==>three.js	3078851951
==>Modernizr	2971383230
==>rails	2819031209
==>httpie	2697798869
==>phantomjs	2589138977
==>homebrew	2528483507
...
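To see why the path counts above grow so large, the same two-round-trip walk can be run on a tiny watches graph. The sketch below propagates a path count per vertex instead of enumerating paths; that is a choice made for this example, and the edge list is invented.

```python
from collections import Counter, defaultdict

# (user, repository) watched-edges; made up for illustration.
WATCHED = [("u1", "backbone"), ("u1", "django"), ("u2", "backbone"),
           ("u3", "django"), ("u3", "rails")]

watchers = defaultdict(list)  # repository -> users   (in('watched'))
watching = defaultdict(list)  # user -> repositories  (out('watched'))
for user, repo in WATCHED:
    watchers[repo].append(user)
    watching[user].append(repo)

def step(paths, adjacency):
    """Advance a {vertex: path count} map one hop along the adjacency lists."""
    nxt = Counter()
    for vertex, count in paths.items():
        for neighbour in adjacency.get(vertex, []):
            nxt[neighbour] += count
    return nxt

paths = Counter({repo: 1 for repo in watchers})  # one path starts at each repository
for _ in range(2):                               # two repo -> user -> repo round trips
    paths = step(step(paths, watchers), watching)
central = paths.most_common()
```

Even with five edges the counts reach double digits after two round trips, which gives a feel for how a few million watched-edges produce path counts in the billions.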

Conclusion

This post discussed the use of Hortonworks Data Platform in concert with the Aurelius Graph Cluster to store and process the graph data generated by the online social coding system GitHub. The example data set used throughout was provided by GitHub Archive, an ongoing record of events in GitHub. While the dataset currently afforded by GitHub Archive is relatively small, it continues to grow each day. The Aurelius Graph Cluster has been demonstrated in practice to support graphs with hundreds of billions of edges. As more organizations realize the graph structure within their Big Data, the Aurelius Graph Cluster is there to provide both real-time and batch graph analytics.

Acknowledgments

The authors wish to thank Steve Loughran for his help with Whirr and HDP. Moreover, Russell Jurney requested this post and, in a steadfast manner, ensured it was delivered.

Related Material

Hawkins, P., Aiken, A., Fisher, K., Rinard, M., Sagiv, M., “Data Representation Synthesis,” PLDI’11, June 2011.

Pham, R., Singer, L., Liskin, O., Filho, F. F., Schneider, K., “Creating a Shared Understanding of Testing Culture on a Social Coding Site,” Leibniz Universität Hannover, Software Engineering Group: Technical Report, September 2012.

Adler, B. T., de Alfaro, L., Pye, I., Raman, V., “Measuring Author Contributions to the Wikipedia,” WikiSym ’08 Proceedings of the 4th International Symposium on Wikis, Article No. 15, September 2008.

Rodriguez, M.A., Mallette, S.P., Gintautas, V., Broecheler, M., “Faunus Provides Big Graph Data Analytics,” Aurelius Blog, November 2012.

Rodriguez, M.A., LaRocque, D., “Deploying the Aurelius Graph Cluster,” Aurelius Blog, October 2012.

Ho, R., “Graph Processing in Map Reduce,” Pragmatic Programming Techniques Blog, July 2010.

Authors


Vadas Gintautas, Marko A. Rodriguez

Meet the Committer: Mahadev Konar

We had another amazing turnout for our Ambari webinar with Matt Foley a couple of weeks back. This series was meant to educate Hadoop enthusiasts and help them gain a better understanding of the value of Hadoop, and I think we’re on the right track. If you missed our last two webinars (Pig and Ambari) or would like a refresher, you can find the recordings here: https://hortonworks.com/webinars/

We’re starting the third installment of the “Future of Apache Hadoop” series next Wednesday with “Scaling Apache Zookeeper to the Next Generation Applications”, presented by Mahadev Konar (@mahadevkonar), Hortonworks co-founder and a core contributor and PMC member of Apache ZooKeeper.

Get to know Mahadev in this third installment of our “Meet the Committer” series.

Kim: Tell us about your current role and how you interact with Apache Hadoop?

Mahadev: Currently I am leading the effort on Apache Ambari. I have spent the last 5 to 6 years of my life working on Apache Hadoop and its ecosystem.

Kim: How did the Zookeeper project come about?

Mahadev: Apache ZooKeeper was started by a couple of my colleagues in research, Flavio and Ben, both brilliant researchers from Yahoo! (Ben has since moved on to a different opportunity). I started working with them in the early days of ZooKeeper. We first open-sourced ZooKeeper on SourceForge and later moved it to be a subproject of Hadoop.

Kim: Can you provide a sneak peek of your presentation and tell us what you expect will be the key takeaway for folks attending this webinar?

Mahadev: I’ll be going through a couple of use cases for Apache ZooKeeper and a basic tutorial on what ZooKeeper is. The talk will also focus on the upcoming features in Apache ZooKeeper.

If you haven’t already, register now and join us next Wednesday (October 17, 2012) at 10am PDT / 1pm EDT to discuss Apache ZooKeeper: http://info.hortonworks.com/FutureofHadoopSeries.html

Meet the Committer, Part Two: Matt Foley

I hope you had fun pigging out to Hadoop with Alan Gates. We had interesting questions during the webinar and as always, your participation in these discussions will help us understand different use cases of Apache Pig and the growing community around this project. The recording is now available on our webinar site.

For the next installment of the “Future of Apache Hadoop” webinar series, I would like to introduce you to Matt Foley and Ambari. Matt is a member of the Hortonworks technical staff and a Committer and PMC member for the Apache Hadoop core project. He will be our guest speaker for the September 26, 2012 @10am PDT / 1pm EDT webinar: Deployment and Management of Hadoop Clusters with AMBARI.

Get to know Matt in this second installment of our “Meet the Committer” series.

Kim: Tell us about your role with Apache Hadoop.

Matt: I’m a Committer and PMC member for Apache Hadoop. I’ve also been the Release Manager for the last several releases of Hadoop-1. I want Hadoop and HBase to be used by more and more companies, and to make that easier I’ve become very interested in deployment and monitoring issues, and have contributed to the Ambari project.

Kim: What’s an Ambari?

Matt: An Ambari is the platform or shelter that sits on top of the elephant, for a royal passenger to ride in comfort.  Also known as a “howdah”.

Kim: How did this project come about?

Matt: While the Hortonworks engineers were still part of Yahoo’s Cloud Computing group, they saw the need for an Apache open source project to make it easier to deploy, monitor, and manage Hadoop clusters.  These clusters can be multiple thousands of nodes, and it’s hard to deploy and manage clusters that large!  So we started Ambari, as an Apache “incubator” project, to meet those needs.

Kim: Can you provide a brief use case on why people should want to use/deploy Ambari?

Matt: Suppose you have a serious Big Data application that needs a cluster of even a hundred servers.  You can’t possibly want to log in to all those servers and individually install Hadoop on each of them.  And you don’t just want to install Hadoop, you also need HBase and Hive and Pig and Oozie and HCatalog, etc.  You have to install them all, and you have to get the right versions of each so they’ll work together, and you need to start the various services in the right order, on all 100 servers.  Furthermore, before you can install Hadoop, you have to set up quite a bit of configuration on each server, so that the service user IDs will exist and have the right permissions, and so the “install master” server, from which you’re doing all this work, has privileges to push the software to each of the other servers.  Basically, to install manually would take you about half an hour per server, after you get good at it!  So it’s obvious that you need an automation tool to do the deployment.  Ambari can install a whole 100-node cluster in about 20 minutes, and a 1000-node cluster in less than an hour.

Then, after you’ve installed and started up your cluster, you have to monitor it.  In a cluster of a few thousand servers, you can expect to have a server or disk failure per day (although Hadoop will robustly adapt to such failures and keep running fine).  You need a monitoring system to alert you when something goes wrong and tell you what the problem is, or the cluster will degrade over time.  Also, you need to be aware of the load on the system, and whether your Hadoop and HBase jobs are being run efficiently, and whether you’ve provisioned the cluster appropriately.  For all these things, Ambari will automatically set up a monitoring and alerting system, based on open source monitoring tools called Nagios and Ganglia, but configured specifically to monitor Hadoop clusters.  There’s a lot of distilled expertise in Ambari, about how to monitor big Hadoop clusters.

Kim: Can you provide a sneak peek of your presentation, and what do you expect the key takeaways will be for folks attending this webinar?

Matt: This presentation will be very similar to the talk I gave at the Hadoop Summit in June.  I’ll present:

  1. A brief history of Ambari, and how its architecture has evolved and will continue growing;
  2. An in-depth discussion of the Install, Monitor, and Management features, illustrated with screenshots of Ambari being used with an actual cluster.

After the presentation, participants should feel comfortable applying Ambari to create new Hadoop and HBase clusters, and will understand the value of the monitoring and alerting capabilities.

Get ready to geek out on Ambari with Matt: join us on September 26, 2012 at 10am PDT / 1pm EDT for “Deployment and Management of Hadoop Clusters with Ambari”.

 

Four New Installments in ‘The Future of Apache Hadoop’ Webinar Series

During the ‘Future of Apache Hadoop’ webinar series, Hortonworks founders and core committers will discuss the future of Hadoop and related projects, including Apache Pig, Apache Ambari, Apache ZooKeeper and Apache Hadoop YARN.

Apache Hadoop has rapidly evolved to become the leading platform for managing, processing and analyzing big data. Consequently, there is a thirst for knowledge about the future direction of Hadoop-related projects. The Hortonworks webinar series will feature core committers of the Apache projects discussing the essential components required in a Hadoop platform, current advances in Apache Hadoop, relevant use cases and best practices for getting started with the open source platform. Each webinar will include a live Q&A with the individuals at the center of the Apache Hadoop movement.

This four-part webinar series is now open for registration, and the schedule will include:

  • Wednesday, September 12 at 10:00 a.m. PT / 1:00 p.m. ET
  • Pig Out on Hadoop
    With: Alan Gates, Hortonworks founder and contributor to Apache Pig and HCatalog projects.
    Register here.

  • Wednesday, September 26 at 10:00 a.m. PT / 1:00 p.m. ET
  • Deployment and Management of Hadoop Clusters with Ambari
    With: Matt Foley, committer and PMC member of the Apache Hadoop Project and member of Technical Staff at Hortonworks.
    Register here.

  • Wednesday, October 17 at 10:00 a.m. PT / 1:00 p.m. ET
  • Scaling Apache ZooKeeper for the Next Generation of Hadoop Applications
    With: Mahadev Konar, Hortonworks founder and committer on the Apache ZooKeeper project.
    Register here.

  • Wednesday, October 31 at 10:00 a.m. PT / 1:00 p.m. ET
  • YARN: The Future of Data Processing with Apache Hadoop
    With: Arun C. Murthy, Hortonworks founder, VP of Apache Hadoop at the Apache Software Foundation, and lead of the MapReduce project and YARN.
    Register here.

For more information, please register.

Previous webinars on “The Future of Apache Hadoop” are available here.

A press release is available here.


Hortonworks Data Platform v1.0 Download Now Available

If you haven’t yet noticed, we have made Hortonworks Data Platform v1.0 available for download from our website. Previously, Hortonworks Data Platform was available for evaluation only to members of the Technology Preview Program or via our Virtual Sandbox (hosted on Amazon Web Services). Effective immediately, Hortonworks Data Platform is available to the general public.

Hortonworks Data Platform is a 100% open source data management platform, built on Apache Hadoop. As we have stated on many occasions, we are absolutely committed to the Apache Hadoop community and the Apache development process. As such, all code developed by Hortonworks has been contributed back to the respective Apache projects.

Version 1.0 of Hortonworks Data Platform includes Apache Hadoop 1.0.3, the latest stable line of Hadoop as defined by the Apache Hadoop community. In addition to the core Hadoop components (including MapReduce and HDFS), we have included the latest stable releases of essential projects, including HBase 0.92.1, Hive 0.9.0, Pig 0.9.2, Sqoop 1.4.1, Oozie 3.1.3 and ZooKeeper 3.3.4. All of the components have been tested and certified to work together. We have also added tools that simplify the installation and configuration steps in order to improve the experience of getting started with Apache Hadoop.

Read More

Introducing Hortonworks Data Platform v1.0

I wanted to take this opportunity to share some important news. Today, Hortonworks announced version 1.0 of the Hortonworks Data Platform, a 100% open source data management platform based on Apache Hadoop. We believe strongly that Apache Hadoop, and therefore Hortonworks Data Platform, will become the foundation for the next-generation enterprise data architecture, helping companies to load, store, process, manage and ultimately benefit from the growing volume and variety of data entering and flowing throughout their organizations. The imminent release of Hortonworks Data Platform v1.0 represents a major step toward achieving this vision.

You can read the full press release here. You can also read what many of our partners have to say about this announcement here. We were extremely pleased that industry leaders such as Attunity, Dataguise, Datameer, Karmasphere, Kognitio, MarkLogic, Microsoft, NetApp, StackIQ, Syncsort, Talend, 10gen, Teradata and VMware all expressed their support and excitement for Hortonworks Data Platform.

Those who have followed Hortonworks since our initial launch already know that we are absolutely committed to open source and the Apache Software Foundation. You will be glad to know that our commitment remains the same today. We don’t hold anything back. No proprietary code is being developed at Hortonworks.

Read More

Executive Video Series: Overview of Hortonworks Data Platform

We just released the second video in the Hortonworks Executive Series. This one features Matt Foley, Test and Release Engineering Manager for Hortonworks.

In this video, Matt provides an overview of Hortonworks Data Platform (HDP), including a summary of the Apache Hadoop components included in the distribution and the testing involved in the release process. Matt also provides an overview of Apache Ambari, an open source project that is adding monitoring and management capabilities to Apache Hadoop.

Read More