Category Archives: Hadoop Ecosystem


Hadoop Summit Expands to Europe in 2013!

This will be the first and the largest European conference focused exclusively on accelerating the enterprise adoption of Apache Hadoop. The event will be a gathering for the vibrant Apache Hadoop community of developers, data scientists, data professionals and solution providers and will be held at the historic Beurs van Berlage in Amsterdam on March 20-21, 2013.

Call for papers now open!

Apache Hadoop practitioners, enthusiasts and solution providers with an idea for a talk at the event can submit their ideas now on the call for papers page. All accepted speakers will receive complimentary admission to the event.

For more information on Hadoop Summit Europe, go to: http://hadoopsummit.org/amsterdam.

Remember to follow us on Twitter and Facebook for future updates!

We hope to see you there!

Apache Hadoop YARN Meetup at Hortonworks – ReCap!

Introduction

The Apache Hadoop YARN meetup at Hortonworks on October 12, 2012, which we previously announced, was a resounding success. We had a very good turnout of around seventy people from the community.

Meetup sessions

Deployments at Yahoo!

The meetup kicked off with YARN committers from Yahoo presenting on current Hadoop 2.0 deployments at Yahoo. In their presentation, they:

  • described scenarios where YARN advanced the state of the art, such as its scalability, its current stability, the power of the YARN web-services, and its superior performance compared to previous versions.
  • covered the efforts undertaken to battle-test YARN, including application validation and performance benchmarking.
  • summed up with suggestions for improving issues such as UI loading and the lack of a generic history server.

Chris Riccomini on “Building Applications on YARN”


Chris Riccomini from LinkedIn then presented about his experience in “Building Applications on YARN”. He briefly covered the anatomy of a YARN application and then jumped into various dimensions a YARN application developer should think about – deployment, metrics, logging, application specific configuration to name a few.

The most interesting bits about his presentation include how, pre-production, small instances of YARN clusters can be utilized to develop applications in an agile manner. For example, one could start with using local file system and avoiding HDFS to minimize the operational effort, and then switch over to a full-blown distributed file system when the desire for scalability crosses a threshold. Also worth attention is how YARN’s web-service APIs can be exploited to build custom dashboards.

Chris posted his notes from the meetup and slides on his blog.
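To make the dashboard idea concrete, here is a minimal sketch of the kind of aggregation a custom dashboard might do. It assumes a JSON payload shaped like the ResourceManager's cluster applications web-service response (an "apps" object holding an "app" list); the application ids, names and states in the sample are made up for illustration, and a real dashboard would fetch the payload over HTTP rather than from a string.

```python
import json
from collections import Counter

# Sample payload shaped like the ResourceManager's /ws/v1/cluster/apps response.
# The ids, names and states below are invented for illustration only.
SAMPLE_RESPONSE = """
{"apps": {"app": [
  {"id": "application_0001", "name": "wordcount", "state": "RUNNING"},
  {"id": "application_0002", "name": "etl", "state": "FINISHED"},
  {"id": "application_0003", "name": "sessionize", "state": "RUNNING"}
]}}
"""


def count_apps_by_state(payload: str) -> Counter:
    """Count applications per state from a JSON payload."""
    doc = json.loads(payload)
    # Assumption: "apps" may be null when no applications are known.
    apps = (doc.get("apps") or {}).get("app", [])
    return Counter(app["state"] for app in apps)


if __name__ == "__main__":
    # Print one line per state, the raw material for a dashboard tile.
    for state, count in sorted(count_apps_by_state(SAMPLE_RESPONSE).items()):
        print(f"{state}: {count}")
```

Keeping the parsing separate from the fetching makes it easy to test the dashboard logic against captured responses before pointing it at a live cluster.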

YARN API Discussion

After that, Arun recapped YARN’s powerful scheduling API, which application developers use to request cluster resources. He walked us through the scheduling concepts, and rounded it off with how scheduling happens in the context of an example MapReduce job.

Bikas and I then gave a brief overview of the APIs available to application developers. We described some of the pain points with the APIs that various users have raised in the recent past, and the efforts underway to address some of them. To enumerate a few:

  • How to make the scheduling logic explicit – e.g., that the scheduler looks for free resources on a node, then proceeds to the rack and then off-rack
  • Multiple ways to release and reject containers
  • Use-cases which require resources on specific nodes and/or racks
  • Applications that want to avoid/blacklist some nodes and/or racks
  • Limitations on the number of threads making resource requests
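The node, then rack, then off-rack fallback in the first point can be sketched in a few lines. This is an illustration of the idea only, not YARN's scheduler code or API: the function name, the topology map and the list of free nodes are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def pick_node(preferred_node: str,
              node_to_rack: Dict[str, str],
              free_nodes: List[str]) -> Optional[Tuple[str, str]]:
    """Pick a free node using node -> rack -> off-rack fallback.

    Returns (node, locality) or None when nothing is free.
    All names here are hypothetical; this is not the YARN API.
    """
    # 1. Best case: the requested node itself has free resources.
    if preferred_node in free_nodes:
        return preferred_node, "node-local"

    # 2. Next best: any free node on the same rack as the requested node.
    rack = node_to_rack.get(preferred_node)
    if rack is not None:
        for node in free_nodes:
            if node_to_rack.get(node) == rack:
                return node, "rack-local"

    # 3. Last resort: any free node anywhere in the cluster.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None


if __name__ == "__main__":
    topology = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node("n1", topology, ["n3", "n2"]))
```

Spelling the fallback out like this is what "making the scheduling logic explicit" amounts to: the application can see, and reason about, which locality level it was given.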

We opened the API discussion for further feedback. This exercise was very fulfilling. We discovered how various users were experimenting with the APIs and what pitfalls and limitations they ran into. Some concrete suggestions include:

  • Libraries for recovering AMs, launching containers
  • A generic framework for applications to expose specific data via http or web-services.
  • A generic application history server
  • Tagging nodes with labels such as GPU and using these labels for scheduling. This is an extension of data locality

Our slides are available here.

Efforts Underway

After a short break, Alejandro Abdelnur from Cloudera briefly talked about the efforts underway to augment YARN with cpu-isolation using cgroups.

Finally, Siddharth Seth from Hortonworks talked about his work on modifying the MR application to reuse containers for jobs both large and small. This exciting development opens the door to new innovations in MapReduce land, such as intermediate output aggregation. You can read through Sid’s presentation below. The core points covered are:

  • Decoupling the TaskAttempt and Container concepts inside MR AM
  • Adding new first-class concepts of Container, Node and Scheduler
  • The current state of the effort
  • New avenues this transition opens up – custom task types, output aggregation, performance optimizations.

His slides are available here.

Conclusion

The success of this meetup reaffirmed the excitement of the community about YARN. This also strengthened our desire to make it a recurring event. We look forward to the next one, with hopefully more turnout, extended brainstorming, and of course, more pizza and beer :)

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and deliver big data value in hours.

There is more to this appliance than meets the eye…  it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with.  It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this
This is an engineered solution.  Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL.  This is a great approach but it lacks integration of metadata and metadata exchange.  With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product.  SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata.  Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze.  All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration
In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics.  With this package, these valuable functions can now be applied to big data in Hadoop.  This shortens the time it takes for an analyst to explore and discover value in big data.  And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Lighting up operations
Often overlooked when an organization considers Hadoop is the impact on IT operations.  They are tasked with making sure a cluster is functional.  Well, these guys have countless tools to perform their job and for Teradata they use Viewpoint Teradata Vital Infrastructure.  In this release, we have integrated the management and monitoring communications used by Ambari with these monitoring tools. Now, the ops guy has a true single pane of glass to monitor the Teradata environment AND the Hadoop cluster used to provide the big data analytics.

Some details on the appliance
The Teradata Aster Big Analytics Appliance runs on proven Teradata hardware, leverages the most current Intel® processor chip technology, SUSE® Linux operating system, and market-leading enterprise-class storage. It can be configured to store a maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of uncompressed user data for Hadoop.

“The Teradata Aster Big Analytics Appliance offers the faster path from diverse big data acquisition to big insights, and seamlessly delivers these insights to the business owners. Unmatched by any other stack in the industry, it enables organizations to overcome the barriers to big data analytics and provides a high-definition view of the business to optimize operations.”– Scott Gnau, president, Teradata Labs.

This is unique and it ushers in a new approach to big data analytics.

Big Data in London – Thoughts From the Tube

Hortonworks sponsored the O’Reilly Strata conference earlier this month at the Hilton Metropole in London. It was great meeting big data enthusiasts at the conference. We had fun giving away our little green mascot and came away pleasantly surprised at the state of interest in Big Data in the UK and Europe. There were over 500 attendees, which for a first-time conference is a very good result. Conversations ranged from the introductory “What is Apache Hadoop?” to deep discussions regarding how Hadoop is being used in production today. After talking to other vendors, attendees and organizers, it appears that the market is somewhere between 12 and 18 months less mature than the Big Data market in the US. That said, we think adoption could occur more quickly than it did in the US as the state of the technology and ecosystem evolves heading into 2013. Below are some perspectives from our team at this conference.

Inspiration from the Tube

Riding the tube around London, we couldn’t help but take some guidance and inspiration from the prominently placed signs for the “Way Out” and the frequent announcements warning travelers to “Mind the Gap”. These signs and notices serve as informal guidance for approaching the Big Data market.

Way Out

As more and more organizations realize that their current systems are at risk of being buried underground by the onslaught of Big Data, many are starting to realize that Hadoop offers a Way Out.  How, you ask? Because it gives them a low-cost, scale-out infrastructure to capture, process and exchange data. With Hadoop they can now cluster commodity servers and storage together to capture, process and exchange data with existing systems. At the same time, a modern enterprise-ready Hadoop platform like the Hortonworks Data Platform enables them to operate these clusters efficiently and effectively, but that is for another post.

Mind the Gap

That said when selecting a Hadoop platform it is important to Mind the Gaps in the technology and look for a platform that is being deeply integrated with existing enterprise architecture systems. The best solutions to rely on are those that are created through engineering level engagements to maximize performance and optimize the interaction between the systems.

Deep technical interest and curiosity

Many of the visitors had technical questions, for which we pulled in our UK R&D person, Steve Loughran, armed with copies of the Hadoop 1.x and trunk source trees. The content of those discussions showed that people are already using Hadoop at scale in parts of Europe and nearby. Indeed, we had conversations with people as far away as Finland and Israel, showing that this conference drew a wide audience – and that those people were building up their skills in the technology and applications of Big Data.

There was also the London-and-South of England Hadoop community, who tend to know each other from the London HUG events and other workshops. Many of these are drawn from various startups – Last.fm being one of the earliest adopters of Hadoop; Datasift, Mendeley and others now becoming well known. Alongside them: the enterprises with datasets that historically were too big to store cost-effectively: the telcos, the media companies with their advert click throughs, and the like. These people have the data – and are ramping up the skills to make use of it. For these organizations, bringing up large Hadoop clusters matters – and they’ve realized that Hadoop internals aren’t something they need to know themselves – any more than they need Linux kernel skills. What they do need is Data Science skills: people who know the right questions to ask of that data, how to ask Hadoop for the data to provide the answers, how to interpret those answers – and how to present them.

Many of the Strata topics looked at these problems: cleaning up data, conducting effective A/B tests, and examples of highly effective visualizations of large and near-real-time data sources. One memorable talk from the Formula 1 race team McLaren covered how they had transformed their organization to be data-driven; to use the answers from their in-race telemetry and information gleaned about competitors from public sources to shape their thinking. This shows a future for organizations – to copy McLaren, Google and others to not only collect and analyze data – but to embrace it.

Exciting future for Big Data in Europe

Overall we had many great conversations with attendees regarding their current and more commonly future plans for use of Hadoop and other Big Data technologies. Many of the sessions were packed including a standing room only Microsoft talk on current Hadoop related integration and future plans.

Awareness of Apache Hadoop as a technology was respectable but certainly below that in the US.

Interest in technical and business benefits of Hadoop

Shaun Connolly’s sessions on Hadoop and data warehousing were well attended, as was Steve Loughran’s session on High Availability Hadoop including a live demo.

Finally, Transport for London are themselves participants in the Big Data revolution – their live data feeds of tube, bus and bike-sharing are all there for analysis and integration with other data sources: http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx. If anyone wants some interesting datasets to learn Pig on, these could be them.

Overall, this was a well-run event that featured interesting keynotes. It was vibrant and ripe for growth, and we were very honored to be approached by multiple user groups seeking speakers from Hortonworks to talk about big data experiences and expertise from this conference.

Thanks to those who attended our sessions and visited and chatted with us at our booth. For copies of Shaun Connolly’s and Steve Loughran’s presentations, you can access them here and here.

Until next time London, mind the gap.

Apache Hadoop 2.0.2-alpha Released!

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second (alpha) release of the next generation release of Apache Hadoop 2.x and comes with significant enhancements to both the major components of Hadoop:

  • HDFS HA has undergone significant enhancements since the previous release for NameNode High Availability
  • YARN has undergone significant testing, stabilization and validation, as it has been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us. At the time of release, YARN has already been deployed on super-sized clusters with 2,000 nodes and 3,600 nodes (totaling nearly 6,000 nodes) at Yahoo alone*.

Going forward, I have no doubt that we are well on our way to signing off on hadoop-2.x early next year. Since we have a reasonably stable base, we are now heads down wrapping up the last of the feature work, such as:

  • HDFS HA without need for shared storage (already merged into Hadoop trunk sans a couple of design enhancements).
  • YARN ResourceManager availability.
  • YARN scheduling enhancements such as multi-resource scheduling (nearly complete, should be committed soon) and preemption.

Having said that, it’s critical for the developer community to get feedback on hadoop-2.x from the user community to ensure we continue to deliver great software – so, please, do go ahead, download the bits from the Apache Hadoop releases page, try the release and give us your valuable feedback – it’s a personal request! Of course, if you prefer a fully packaged and integrated stack you can browse to the Hortonworks Downloads page to try Hortonworks Data Platform 2.0 Alpha, which integrates Hadoop 2.0.2-alpha with other important components such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie.

For more information about the HDP 2.0 alpha, you can check out our blog post from yesterday.

Acknowledgements
I’d like to thank everyone who has contributed or continues to contribute to Apache Hadoop – everyone in the community. A special mention for Todd Lipcon for his contributions to HDFS HA and the Yahoo Hadoop team (Robert Evans, Thomas Graves, Daryn Sharp, Jason Lowe and everyone else) for their help in getting YARN to stability and large-scale deployments on their clusters.

*Yahoo is currently running hadoop-0.23.4 release which essentially is hadoop-2.0.2-alpha without HDFS high availability.

Big Data Security Part One: Introducing PacketPig

Series Introduction

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines a toolkit based on Apache Pig, Packetpig @packetpig (available on github), for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

Introducing Packetpig

Intrusion detection is the analysis of network traffic to detect intruders on your network. Most intrusion detection systems (IDS) look for signatures of known attacks and identify them in real-time. Packetpig is different. Packetpig analyzes full packet captures – that is, logs of every single packet sent across your network – after the fact. In contrast to existing IDS systems, this means that using Hadoop on full packet captures, Packetpig can detect ‘zero day’ or unknown exploits on historical data as new exploits are discovered. Which is to say that Packetpig can determine whether intruders are already in your network, for how long, and what they’ve stolen or abused.

Packetpig is a Network Security Monitoring (NSM) Toolset where the ‘Big Data’ is full packet captures. Like a Tivo for your network, through its integration with Snort, p0f and custom java loaders, Packetpig does deep packet inspection, file extraction, feature extraction, operating system detection, and other deep network analysis. Packetpig’s analysis of full packet captures focuses on providing as much context as possible to the analyst. Context they have never had before. This is a ‘Big Data’ opportunity.

Full Packet Capture: A Big Data Opportunity

What makes full packet capture possible is cheap storage – the driving factor behind ‘big data.’ A standard 100Mbps internet connection can be cheaply logged for months with a 3TB disk. Apache Hadoop is optimized around cheap storage and data locality: putting spindles next to processor cores. And so what better way to analyze full packet captures than with Apache Pig – a dataflow scripting interface on top of Hadoop.

In the enterprise today, there is no single location or system to provide a comprehensive view of a network in terms of threats, sessions, protocols and files. This information is generally distributed across domain-specific systems such as IDS Correlation Engines and data stores, Netflow repositories, Bandwidth optimisation systems or Data Loss Prevention tools. Security Information and Event Monitoring systems offer to consolidate this information but they operate on logs – a digest or snippet of the original information. They don’t provide full fidelity information that can be queried using the exact copy of the original incident.

Packet captures are a standard binary format for storing network data. They are cheap to perform and the data can be stored in the cloud or on low-cost disk in the Enterprise network. The length of retention can be based on the amount of data flowing through the network each day and the window of time you want to be able to peer into the past.
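The retention trade-off can be worked through with simple arithmetic. The sketch below is illustrative only: it uses decimal terabytes, ignores capture-file overhead and compression, and the function names are our own. It shows that a 3TB disk holds under three days of a fully saturated 100Mbps link, so logging for weeks or months implies a much lower average utilization; filling 3TB over six weeks, as in the customer data set described later in this post, works out to an average of about 6.6 Mbps.

```python
SECONDS_PER_DAY = 86_400


def retention_days(disk_bytes: float, avg_mbps: float) -> float:
    """Days of full packet capture that fit on a disk at an average link rate."""
    bytes_per_day = avg_mbps * 1_000_000 / 8 * SECONDS_PER_DAY
    return disk_bytes / bytes_per_day


def avg_mbps_for(disk_bytes: float, days: float) -> float:
    """Average link rate (in Mbps) that fills the disk in the given number of days."""
    return disk_bytes * 8 / (days * SECONDS_PER_DAY) / 1_000_000


if __name__ == "__main__":
    three_tb = 3e12  # decimal terabytes; capture-file overhead is ignored

    # A 100Mbps link running flat out, all day.
    print(f"saturated link: {retention_days(three_tb, 100):.2f} days")

    # Average rate implied by 3TB captured over six weeks.
    print(f"six-week capture: {avg_mbps_for(three_tb, 42):.2f} Mbps average")
```

Plugging in your own daily traffic volume and the look-back window you care about gives a first estimate of how much disk to provision.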

Pig, Packetpig and Open Source Tools

In developing Packetpig, Packetloop wanted to provide free tools for the analysis of network packet captures that span weeks, months or even years. The simple questions of capture and storage of network data had been solved, but no one had addressed the fundamental problem of analysis. Packetpig utilizes the Hadoop stack for analysis, which solves this problem.

For us, wrapping Snort and p0f was a bit of a homage to how much security professionals value and rely on open source tools. We felt that if we didn’t offer an open source way of analysing full packet captures we had missed a real opportunity to pioneer in this area. We wanted it to be simple, turn key and easy for people to take our work and expand on it. This is why Apache Pig was selected for the project.

Understanding your Network

One of the first data sets we were given to analyse was a 3TB data set from a customer. It was every packet in and out of their 100Mbps internet connection for 6 weeks. It contained approximately 500,000 attacks. Making sense of this volume of information is incredibly difficult with current tooling. Even Network Security Monitoring (NSM) tools have difficulty with this size of data. However, it’s not just size and scale. No existing toolset allowed you to provide the same level of context. Packetpig allows you to join together information related to threats, sessions, protocols (deep packet inspection) and files as well as Geolocation and Operating system detection information.

We are currently logging all packets for a website for six months. This data set is currently around 0.6TB, and because all the packet captures are stored in S3 we can quickly scan through the dataset. More importantly, we can run a job nightly or every 15 minutes to correlate attack information with other data from Packetpig and provide as much context as possible for security events.

Items of interest include:

  • Detecting anomalies and intrusion signatures
  • Learning the timeframe and identity of an attacker
  • Triaging incidents
  • “Show me packet captures I’ve never seen before.”

“Never before seen” is a powerful filter and isn’t limited to attack information. First introduced by Marcus Ranum, “never before seen” can be used to rule out normal network behaviour and only show sources, attacks, and traffic flows that are truly anomalous. For example, think in terms of the outbound communications from a Web Server. What attacks, clients and outbound communications are new or have never been seen before? In an instant you get an understanding that you don’t need to look for the normal, you are straight away looking for the abnormal or signs of misuse.
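At its core, a “never before seen” filter is a set difference between what has been observed historically and what appears in the current window. The sketch below is a toy, in-memory illustration of that idea and not Packetpig's implementation; the flow tuple layout and the example addresses (taken from documentation-reserved ranges) are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

# Hypothetical flow record: (source IP, destination IP, destination port).
Flow = Tuple[str, str, int]


def never_before_seen(history: Iterable[Flow], window: Iterable[Flow]) -> Set[Flow]:
    """Return flows in the current window that never appear in the history."""
    return set(window) - set(history)


if __name__ == "__main__":
    # Outbound connections a web server has made in the past.
    history = [
        ("192.0.2.10", "198.51.100.5", 53),   # DNS lookups
        ("192.0.2.10", "198.51.100.7", 443),  # calls to a payment API
    ]
    # Outbound connections seen in the latest capture window.
    window = [
        ("192.0.2.10", "198.51.100.5", 53),
        ("192.0.2.10", "203.0.113.99", 6667),  # new destination and port
    ]
    for flow in sorted(never_before_seen(history, window)):
        print(flow)
```

On real packet captures the same comparison would run as a distributed join over months of history, but the question asked of the data is the same: what is in today's traffic that was never there before?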

Agile Data

Packetloop uses the stack and iterative prototyping techniques outlined in the forthcoming book by Hortonworks’ own Russell Jurney, Agile Data (O’Reilly, March 2013). We use Hadoop, Pig, Mongo and Cassandra to explore datasets and help us encode important information into d3 visualisations. Currently we use all of these tools to aid in our research before we add functionality to Packetloop. These prototypes become the palette our product is built from.

Miss Piggy Takes Manhattan: Pig Meetup at Strata NYC on Wed, Oct 24th

There will be a Pig meetup at Strata NYC/Hadoop World, at 6:30PM on Wed, Oct 24th in the Bryant Room of the Hilton New York. This will also be the inaugural meeting of the NYC Pig User Group, which Doug Daniels of Pig contributor Mortar Data was good enough to organize. We look forward to future Pig meetups in NYC!

Hortonworks’ own Daniel Dai @daijy, VP of Apache Pig, will present on new features in Pig 0.11. You can view a summary of JIRA tickets for Pig 0.11 here. New features include the CUBE operator, a new RANK operator, the addition of a DateTime type, speed improvements via SchemaTuple, and many others.

More information is available on the Pig meetup page: http://www.meetup.com/PigUser/events/85047782/.

Those of you too young to understand the Miss Piggy reference should look here.

YARN Meetup at Hortonworks on Friday, Oct 12

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open-source and otherwise, are being ported to run on YARN, such as Storm and S4, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status and its medium-term and long-term roadmap.

Agenda includes:

  • YARN committers from Yahoo will present on current YARN deployments at Yahoo, including lessons learned, stability, etc.
  • Hortonworks YARN committers will talk about upcoming features such as RM Restart, Container Re-use for MR, Multi-resource scheduling etc.
  • Chris Riccomini from LinkedIn will talk about his experiences building new applications on top of YARN.

A WebEx session will be available, so attendees from all over the world can participate. Follow the meetup page for more information and updates to the agenda.

If you would like to add to the agenda, please get in touch with Arun, or leave a comment in the meetup page.

More information is available on meetup.com here: http://www.meetup.com/Hadoop-Contributors/events/85353562/.

Alan Gates CHUGs HCatalog in Windy City (Chicago Hadoop User Group)

Alan Gates presented HCatalog to the Chicago Hadoop User Group (CHUG) on 9/17/12. There was a great turnout, and the strength of CHUG is evidence that Chicago is a Hadoop city. Below are some kind words from the host, Mark Slusar.

“On 9/17/12, the Chicago Hadoop User Group (CHUG) was delighted to host Hortonworks Co-Founder Alan Gates to give an overview of HCatalog. In addition to downtown Chicago meetups, Allstate Insurance Company in Northbrook, IL hosts regular Chicago Hadoop User Group Meetups. After noshing on refreshments provided by Hortonworks, attendees were treated to an in-depth overview of HCatalog, its history, as well as how and when to use it. Alan’s experience and expertise were an excellent contribution to CHUG. Alan made a great connection with every attendee. With his detailed lecture, he answered many questions, and also joined a handful of attendees for drinks after the meetup. CHUG would be thrilled to have Alan & Hortonworks team return in the future!” – Mark Slusar

Thanks Mark, and anytime you would like us to come to the windy city, let us know! For those of you who couldn’t be there, I have a treat for you, the recording!

Thanks Chicago Hadoop Community! Stay Classy!

Search Hadoop with Search-Hadoop.com

As the Hadoop ecosystem has exploded into many projects, searching for the right answers when questions arise can be a challenge. That’s why I was thrilled to hear about search-hadoop.com, from Sematext. It has a sister site called search-lucene where you can… search lucene!

Search-Hadoop.com searches across projects – JIRAs, source code, mailing lists, wikis, etc. so you can see design and API docs, as well as questions, answers and general documentation. Filtering by project is a big help – but search-hadoop also lets you see the similarities between projects.

Search Hadoop runs on Solr 3.6.1, but will be moving to Solr 4.0 this Fall. Solr 4.0, aka SolrCloud, is a fully distributed version of Solr (indices are sharded and replicated) that uses ZooKeeper for coordination.

The autocomplete feature is particularly cool. It offers several groups of suggestions separated by a lovely thin pink line, so one can easily pick the suggestion to follow. The motivation is that people searching for info often have an idea what type of content they want to see – issues, ML messages, wiki pages, etc.

A couple of cool features: You can also search by author by clicking on the author name in search results. e.g. http://search-hadoop.com/?q=&fc_author=Russell+Jurney. Queries starting with project names are automatically limited to the project name, e.g. http://search-hadoop.com/?q=pig+join will show only results from Pig.
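Both URL patterns above are plain query strings, so they are easy to generate from a script. Here is a small sketch using only Python's standard library; the parameter names q and fc_author come from the example URLs above, while the helper function itself is hypothetical and the site's query interface may of course change.

```python
from urllib.parse import urlencode

BASE_URL = "http://search-hadoop.com/"


def search_url(query: str = "", author: str = "") -> str:
    """Build a search-hadoop.com URL for a text query and/or an author filter."""
    params = {"q": query}
    if author:
        # fc_author is the author filter seen in the example URL above.
        params["fc_author"] = author
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_url(author="Russell Jurney"))
    print(search_url("pig join"))
```

The first call reproduces the author-filter link and the second the project-scoped Pig query shown above.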

Answer Big Questions with Big Data

Partner Webinar Series

On September 18 at 10am PT/1pm ET we join our partner Datameer in a webcast aimed at providing answers to some common questions we hear in the industry. Specifically, what are some of the use cases that big data analytics is perfect for?

By looking at some common uses we are seeing, you’ll be able to envision how you can leverage the analytics results from your own data. Ultimately these analytics will lead to uncovering ideas for new business approaches you can use for a huge competitive advantage.

Obviously you need to weigh in the costs required so you can determine if the payoff is worth the investment for your business. What should you be considering when you are trying to decide if Hadoop and big data analytics are going to pay off?

These questions will be the topic for our webinar on September 18 at 10am PT. Join our speakers Matt Schumpert, Director of Solutions Engineering at Datameer and Jim Walker, Director of Product Marketing at Hortonworks in this Big Data Analytics webcast.

Register here.

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

How To Take Big Data to the Cloud

Partner Webinar Series

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One option to consider is using the cloud as a practical and economical way to go. The cloud can also be used to provide extra capacity for an existing cluster or to test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks. These experts will discuss and demo:

  • How to spin up large, fully configured Hadoop clusters, quickly, consistently, and reliably.
  • How simple it can be to manage a large cluster of virtual machines in the cloud.
  • How to select specific software components to meet your specific needs.

The live demonstration will use StackIQ Enterprise Data, Powered by Hortonworks Data Platform, to create and manage a virtual cluster on Amazon Web Services. It will include creating, configuring, and provisioning a Hadoop cluster, including loading it up and running real jobs on it.

Register for the webinar today, and join us on Thursday, September 13 at 10 AM Pacific time, as we explore the world of Big Data in the Cloud with StackIQ and Amazon Web Services.

You may even be able to try it out yourself for free with a $150 coupon to access Amazon’s EC2 if you’re one of the first 25 to register. https://www3.gotomeeting.com/register/942019582

Twitter Analytics Presents Hadoop and Pig at UC Berkeley

Twitter Analytics presented their distributed infrastructure, including Hadoop and Pig, at a UC Berkeley iSchool special course called INFO 290: Analyzing Big Data with Twitter. Twitter is a major contributor to many Apache projects. The course was over-subscribed and was a great success, as students got to learn from practicing data scientists using Hadoop on truly massive datasets. The entire lecture series is available here.

Bill Graham @billgraham, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented an Introduction to Hadoop. His slides are available here. His presentation gives a comprehensive introduction to Apache Hadoop including its history, motivation, practice and operation.

Jonathan Coveney @jco, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented Pig at Twitter. Slides for this presentation are available here. His presentation gives a comprehensive explanation of Apache Pig’s philosophy, use and intricacies. It is one of the most thorough introductions to Pig I’ve seen and will serve as excellent documentation for beginners and intermediate Pig users alike.

Hats off to Twitter for their contribution to Apache open source and education. More Pig talks and papers are available on the Pig Confluence here.

Recap of the August Pig Hackathon at Hortonworks

The August Pig Hackathon brought Pig users from Hortonworks, Yahoo, Cloudera, Visa, Kaiser Permanente, and LinkedIn to Hortonworks HQ in Sunnyvale, CA to talk and work on Apache Pig.

Hackers hacking away at the August 2012 Pig Hackathon at Hortonworks in Sunnyvale, CA

Jonathan Coveney and Bill Graham from Twitter walked newer Pig users through how Pig translates a Pig Latin script into MapReduce jobs and went over how to read the output of EXPLAIN. For those interested, Hortonworks founder Alan Gates covers this in Chapter 1 of Programming Pig.

Thejas Nair walked through how to contribute patches to Pig and how to work with committers to get the patches in. You can learn more about this on the Pig Wiki.

The group talked through the proposal for a new EvalFunc interface that would make it much easier to write user-defined functions (UDFs) for Pig. Part of what makes Pig so powerful is its extensibility, and making that even easier would make Pig a better tool. A discussion is available in JIRA ticket PIG-2421 if you want to collaborate on improving Pig’s eval funcs.

Alan Gates presented some thoughts on building a generic DAG (directed acyclic graph) execution and optimization engine that could be used by Pig and Hive and that would take advantage of new features in Hadoop 2.0. This would reduce duplication between the projects as well as allow users to share UDFs between them. We covered using Pig and Hive together and via HCatalog in previous posts.

You don’t have to be a Pig expert to attend a Pig meetup – all levels of proficiency are invited. Committers love to meet new users who appreciate their work. One attendee said, “There were many pig commiters at the meetup. The Twitter and HortonWorks people were very helpful.”

To find out about more Pig meetups, join the Pig User group on Meetup. We can’t wait to see you there!

City Hall is Getting Schooled

Nothing happens in a vacuum anymore.  Cities now have the ability to use information collected from a massive variety of sources in order to help solve common city problems.  The information can arise from anywhere – tweets, blog posts, and meter readings can all serve to inform public officials (and citizens as a whole) about how to better interact in a data-drenched world.

Most famously, IBM’s Smarter Cities initiative looks at how city governments can meet the needs of their expanding populations by using available resources more efficiently.  This is in direct contrast to the older practice of extracting ever-greater amounts of natural resources.  For example, optimizing how power plants distribute energy to city grids can alleviate power concerns during the summer months, when A/C usage creates huge power demands.  Data-driven insight into how to do this is always better than guessing blindly.

(IBM has a white paper about their Smarter Cities initiative.)

Yet even a single person can make a difference.  The Gothamist has an article about one observant filmmaker who decided to record a video of NYC subway goers tripping over the same staircase step in the course of a single day.  He then uploaded the video to YouTube, where it immediately went viral.  What’s more impressive is that city workers went on to repair the step later that same day.

The same can be said for StreetBump, a smartphone app reviewed by the Huffington Post.  The app uses a smartphone’s accelerometer to detect the jolt when a driver passes over a pothole or crack in the road, and records the GPS location where it happened.  This information can be relayed back to cities so they can improve road conditions with more timely and detailed data than would otherwise be possible.
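To make the idea concrete, here is a minimal Python sketch of accelerometer-based bump detection.  It is not StreetBump’s actual algorithm, which the article does not describe; the `Sample` type, the `detect_bumps` function, the 0.5 g threshold, and the readings are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One hypothetical sensor reading from a phone in a moving car."""
    lat: float      # GPS latitude at the time of the reading
    lon: float      # GPS longitude at the time of the reading
    accel_z: float  # vertical acceleration in g (about 1.0 on a smooth road)


def detect_bumps(samples: List[Sample],
                 threshold_g: float = 0.5) -> List[Tuple[float, float]]:
    """Return the (lat, lon) of every reading whose vertical acceleration
    deviates from resting gravity (1 g) by more than threshold_g.

    The threshold is an assumed value, not one taken from any real app.
    """
    return [
        (s.lat, s.lon)
        for s in samples
        if abs(s.accel_z - 1.0) > threshold_g
    ]


# Made-up readings: the second one simulates the jolt from a pothole.
readings = [
    Sample(42.3601, -71.0589, 1.02),
    Sample(42.3605, -71.0592, 1.90),
    Sample(42.3609, -71.0595, 0.97),
]
bumps = detect_bumps(readings)
print(bumps)
```

A real app would likely need to filter out noise such as speed bumps, door slams, or a dropped phone, and to combine reports from many drivers before flagging a location.  This sketch only shows the basic pairing of a jolt with a GPS position.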

Mayors have also taken the lead in communicating with their constituents using big data-enabled technologies.  New Jersey’s Star-Ledger recently ran a report on Cory Booker, the mayor of Newark, and his persistent use of technology to directly (and personally) address the needs of individual Newarkers.  In the past, he has accepted requests by tweet to fix potholes and repair stoplights in an effort to make the position of mayor more accessible to the average person.

All of these data points can be used to improve the way we interact with our increasingly connected world.  Officials can use this information to help improve the lives of everyone and work toward creating more livable cities.
