Category Archives: Industry Happenings


Apache Hadoop YARN Meetup at Hortonworks – ReCap!

Introduction

The Apache Hadoop YARN meetup we previously announced, held at Hortonworks on October 12, 2012, was a resounding success. We had a very good turnout of around seventy people from the community.

Meetup sessions

Deployments at Yahoo!

The meetup kicked off with YARN committers from Yahoo presenting on current Hadoop 2.0 deployments at Yahoo. In their presentation, they:

  • described scenarios where YARN advanced the state of the art, including its scalability, its current stability, the power of the YARN web services, and its superior performance compared to previous versions;
  • covered the efforts undertaken to battle-test YARN, including application validation and performance benchmarking;
  • summed up with suggestions for improving issues such as UI loading and the lack of a generic history server.

Chris Riccomini on “Building Applications on YARN”


Chris Riccomini from LinkedIn then presented his experience in “Building Applications on YARN”. He briefly covered the anatomy of a YARN application and then jumped into the various dimensions a YARN application developer should think about: deployment, metrics, logging, and application-specific configuration, to name a few.

The most interesting parts of his presentation covered how, pre-production, small YARN clusters can be used to develop applications in an agile manner. For example, one could start by using the local file system and avoiding HDFS to minimize operational effort, and then switch over to a full-blown distributed file system once the need for scalability crosses a threshold. Also worth noting is how YARN’s web-service APIs can be used to build custom dashboards.
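As a rough illustration of the dashboard idea, here is a minimal Python sketch that tallies applications by state from the ResourceManager's REST interface. It is not taken from Chris's talk. The host, port and sample payload are placeholders, and the response shape (an `apps` object wrapping an `app` list) is an assumption that should be checked against the Hadoop version in use.

```python
import json
import urllib.request
from collections import Counter

# Assumption: a ResourceManager reachable here; host and port are placeholders.
RM_APPS_URL = "http://localhost:8088/ws/v1/cluster/apps"

# Hypothetical sample shaped like the JSON the apps endpoint is assumed to return.
SAMPLE = """
{"apps": {"app": [
  {"id": "application_0001", "name": "wordcount", "state": "RUNNING", "queue": "default"},
  {"id": "application_0002", "name": "etl", "state": "FINISHED", "queue": "default"},
  {"id": "application_0003", "name": "sessionize", "state": "RUNNING", "queue": "adhoc"}
]}}
"""


def fetch_apps(url=RM_APPS_URL):
    """Fetch the application list as parsed JSON (needs a live cluster)."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_apps(payload):
    """Count applications per state, tolerating an empty or null app list."""
    apps = (payload.get("apps") or {}).get("app") or []
    return Counter(app["state"] for app in apps)


if __name__ == "__main__":
    # Uses the canned sample so the sketch runs without a cluster.
    for state, count in sorted(summarize_apps(json.loads(SAMPLE)).items()):
        print(f"{state:10s} {count}")
```

Swapping `json.loads(SAMPLE)` for `fetch_apps()` would point the same summary at a real cluster. A dashboard would poll this periodically and chart the counts.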

Chris posted his notes from the meetup and slides on his blog.

YARN API Discussion

After that, Arun recapped YARN’s powerful scheduling API, which application developers use to request cluster resources. He walked us through the scheduling concepts and rounded it off with how scheduling happens in the context of an example MapReduce job.

Bikas and I then gave a brief overview of the APIs available to application developers. We described some of the pain points with the APIs that various users have reported in the recent past, and the efforts underway to address some of them. To enumerate a few:

  • How to make the scheduling logic explicit, for example that the scheduler looks for free resources on a node, then on its rack, and then off-rack
  • Multiple ways to release and reject containers
  • Use-cases which require resources on specific nodes and/or racks
  • Applications that want to avoid/blacklist some nodes and/or racks
  • Limitations on the number of threads making resource requests
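The node, then rack, then off-rack fallback, along with blacklisting, can be sketched in a few lines of Python. This is a toy model for illustration only, not YARN's scheduler code or API. The `Node` class, the `place` function and the memory-only resource model are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    rack: str
    free_mb: int


def place(nodes, preferred_node, memory_mb, blacklist=frozenset()):
    """Return (node, locality) for a request, or None if nothing fits.

    Tries the preferred node first, then other nodes on the same rack,
    then any node on a different rack. Blacklisted nodes are never used.
    """
    def usable(node):
        return node.name not in blacklist and node.free_mb >= memory_mb

    target = next((n for n in nodes if n.name == preferred_node), None)

    # 1. Node-local: the exact node that was asked for.
    if target is not None and usable(target):
        return target, "node-local"

    # 2. Rack-local: another node on the preferred node's rack.
    if target is not None:
        for node in nodes:
            if node is not target and node.rack == target.rack and usable(node):
                return node, "rack-local"

    # 3. Off-rack: any remaining node with enough free memory.
    for node in nodes:
        if (target is None or node.rack != target.rack) and usable(node):
            return node, "off-rack"

    return None


if __name__ == "__main__":
    cluster = [Node("n1", "r1", 512), Node("n2", "r1", 2048), Node("n3", "r2", 4096)]
    for size in (256, 1024, 3000, 8000):
        print(size, place(cluster, "n1", size))
```

Making the three steps separate, named stages is one way to read the request for "explicit" scheduling logic: an application can see which locality level it was given and decide whether to accept the container or release it.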

We opened the API discussion for further feedback. This exercise was very fulfilling. We discovered how various users were experimenting with the APIs and what pitfalls and limitations they ran into. Some concrete suggestions include:

  • Libraries for recovering AMs and launching containers
  • A generic framework for applications to expose specific data via HTTP or web services
  • A generic application history server
  • Tagging nodes with labels such as GPU and using these labels for scheduling, as an extension of data locality

Our slides are available here.

Efforts Underway

After a short break, Alejandro Abdelnur from Cloudera briefly talked about the efforts underway to augment YARN with CPU isolation using cgroups.

Finally, Siddharth Seth from Hortonworks talked about his work on modifying the MR application to reuse containers for jobs both large and small. This exciting development opens the door to new innovations in MapReduce land, such as intermediate output aggregation. You can read through Sid’s presentation below. The core points covered are:

  • Decoupling the TaskAttempt and Container concepts inside the MR AM
  • Adding new first-class concepts of Container, Node and Scheduler
  • The current state of the effort
  • New avenues this transition opens up: custom task types, output aggregation, performance optimizations

His slides are available here.

Conclusion

The success of this meetup reaffirmed the community's excitement about YARN. It also strengthened our desire to make it a recurring event. We look forward to the next one, hopefully with an even bigger turnout, extended brainstorming, and of course, more pizza and beer :)

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and deliver big data value in hours.

There is more to this appliance than meets the eye… it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months, tying our solutions together and optimizing them for an appliance. This solution gives analysts the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis, all while using the tools they are already familiar with. It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this
This is an engineered solution. Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL. This is a great approach, but it lacks integration of metadata and metadata exchange. With the appliance we have taken a new approach using HCatalog and the Teradata SQL-H product. SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata. Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze. All of this is made possible by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration
In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics. With this package, these valuable functions can now be applied to big data in Hadoop. This shortens the time it takes for an analyst to explore and discover value in big data. And if the pre-packaged functions aren’t specific enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Lighting up operations
Often overlooked when an organization considers Hadoop is the impact on IT operations, the team tasked with making sure a cluster is functional. These folks have countless tools to perform their job, and for Teradata they use Viewpoint and Teradata Vital Infrastructure. In this release, we have integrated the management and monitoring communications used by Ambari with these monitoring tools. Now the ops team has a true single pane of glass to monitor the Teradata environment AND the Hadoop cluster used to provide the big data analytics.

Some details on the appliance
The Teradata Aster Big Analytics Appliance runs on proven Teradata hardware, leverages the most current Intel® processor chip technology, SUSE® Linux operating system, and market-leading enterprise-class storage. It can be configured to store a maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of uncompressed user data for Hadoop.

“The Teradata Aster Big Analytics Appliance offers the faster path from diverse big data acquisition to big insights, and seamlessly delivers these insights to the business owners. Unmatched by any other stack in the industry, it enables organizations to overcome the barriers to big data analytics and provides a high-definition view of the business to optimize operations.”– Scott Gnau, president, Teradata Labs.

This is unique and it ushers in a new approach to big data analytics.

Big Data in London – Thoughts From the Tube

Hortonworks sponsored the O’Reilly Strata conference earlier this month at the Hilton Metropole in London. It was great meeting big data enthusiasts at the conference. We had fun giving away our little green mascot and came away pleasantly surprised at the state of interest in Big Data in the UK and Europe. There were over 500 attendees, which for a first-time conference is a very good result. Conversations ranged from the introductory “What is Apache Hadoop?” to deep discussions about how Hadoop is being used in production today. After talking to other vendors, attendees and organizers, it appears that the market is somewhere between 12 and 18 months less mature than the Big Data market in the US. That said, we think adoption could occur more quickly than it did in the US as the state of the technology and ecosystem evolves heading into 2013. Below are some perspectives from our team at this conference.

Inspiration from the Tube

Riding the tube around London, we couldn’t help but take some guidance and inspiration from the prominently placed signs for the “Way Out” and frequent announcements warning travelers to “Mind the Gap”. These signs and notices serve as informal guidance for approaching the Big Data market.

Way Out

As more and more organizations realize that their current systems are at risk of being buried underground by the onslaught of Big Data, many are starting to realize that Hadoop offers a Way Out. How, you ask? It gives them a low-cost, scale-out infrastructure: they can now cluster commodity servers and storage together to capture, process and exchange data with existing systems. At the same time, a modern, enterprise-ready Hadoop platform like the Hortonworks Data Platform enables them to operate these clusters efficiently and effectively, but that is for another post.

Mind the Gap

That said, when selecting a Hadoop platform it is important to Mind the Gaps in the technology and look for a platform that is being deeply integrated with existing enterprise architecture systems. The best solutions to rely on are those created through engineering-level engagements to maximize performance and optimize the interaction between the systems.

Deep technical interest and curiosity

Many of the visitors had technical questions, for which we pulled in our UK R&D person, Steve Loughran, armed with copies of the Hadoop 1.x and trunk source trees. The content of those discussions showed that people are already using Hadoop at scale in parts of Europe and nearby. Indeed, we had conversations with people as far away as Finland and Israel, showing that this conference drew a wide audience – and that those people were building up their skills in the technology and applications of Big Data.

There was also the London and South of England Hadoop community, who tend to know each other from the London HUG events and other workshops. Many of these are drawn from various startups, Last.fm being one of the earliest adopters of Hadoop, with Datasift, Mendeley and others now becoming well known. Alongside them are the enterprises with datasets that historically were too big to store cost-effectively: the telcos, the media companies with their advert click-throughs, and the like. These people have the data and are ramping up the skills to make use of it. For these organizations, bringing up large Hadoop clusters matters, and they’ve realized that Hadoop internals aren’t something they need to know themselves, any more than they need Linux kernel skills. What they do need is Data Science skills: people who know the right questions to ask of that data, how to ask Hadoop for the data to provide the answers, how to interpret those answers, and how to present them.

Many of the Strata topics looked at these problems: cleaning up data, conducting effective A/B tests, and examples of highly effective visualizations of large and near-real-time data sources. One memorable talk from the Formula 1 race team McLaren covered how they had transformed their organization to be data-driven, using the answers from their in-race telemetry and information gleaned about competitors from public sources to shape their thinking. This shows a future for organizations: to copy McLaren, Google and others and not only collect and analyze data, but embrace it.

Exciting future for Big Data in Europe

Overall we had many great conversations with attendees regarding their current and, more commonly, future plans for use of Hadoop and other Big Data technologies. Many of the sessions were packed, including a standing-room-only Microsoft talk on current Hadoop-related integration and future plans.

Awareness of Apache Hadoop as a technology was respectable but certainly below that in the US.

Interest in technical and business benefits of Hadoop

Shaun Connolly’s sessions on Hadoop and data warehousing were well attended, as was Steve Loughran’s session on High Availability Hadoop including a live demo.

Finally, Transport for London are themselves participants in the Big Data revolution: their live data feeds of tube, bus and bike-sharing are all there for analysis and integration with other data sources: http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx. If anyone wants some interesting datasets to learn Pig on, these could be them.

Overall, this was a well-run event that featured interesting keynotes. The community was vibrant and ripe for growth, and we were very honored to be approached at this conference by multiple user groups seeking speakers from Hortonworks to talk about big data experiences and expertise.

Thanks to those who attended our sessions and visited and chatted with us at our booth. For copies of Shaun Connolly’s and Steve Loughran’s presentations, you can access them here and here.

Until next time London, mind the gap.

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview!

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

Learn More
Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 2.0 Alpha.

To learn more about Apache Hadoop YARN, see the great four-part blog series on the technology by Arun Murthy (Chair of the Apache Hadoop PMC and YARN/MapReduce lead) and the rest of the Hortonworks YARN development team: one, two, three and four.

Download It
You can download the HDP 2.0 Alpha bits from the Hortonworks Download site.

Tell Us About It
Please visit the HDP 2.0 Alpha Forum to ask questions, get help, provide feedback and hear what others are doing with HDP.

Note: This Alpha release is early access and not for production use, so you might find some incomplete features or a bit of instability. Support is only available via Forums.

We are excited about the opportunities that Hadoop 2.0 provides for the future of Hadoop and Big Data. The HDP 2.0 Alpha release is just the beginning. Enjoy!

Teradata Webinar: Business Value with Big Analytics

Back in June we joined Teradata Aster in a webcast, “Back to the Future – MapReduce, Hadoop and the Data Scientist”, to highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. You can check out the replay here. The discussion focused on how big data architectures could bring more value to businesses using relational DBMS technology and Hadoop, and how the two can coexist.

On October 17th at 10am PDT, Teradata will host a webcast that builds on the important theme of Hadoop and business value, recognizing that many are deeply involved with discovering the easiest and best way to bring their data to life. Teradata Aster plans to show how executives, analysts and IT managers can leverage breakthrough enterprise-class big analytics solutions to inject innovative analytics into business processes for better data-driven decisions, all while minimizing risk, maximizing ROI and accelerating time-to-value.

Read more or register for this webcast and join speakers Scott Gnau, President, Teradata Labs, Teradata Corporation, and Tasso Argyros, Co-President, Teradata Aster and get the inside scoop on Teradata Aster’s newest big analytics technology.

Big Data Security Part One: Introducing PacketPig

Series Introduction

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)‘ at Blackhat Europe earlier this year. The paper outlines Packetpig (@packetpig, available on GitHub), a toolkit based on Apache Pig for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

Introducing Packetpig

Intrusion detection is the analysis of network traffic to detect intruders on your network. Most intrusion detection systems (IDS) look for signatures of known attacks and identify them in real time. Packetpig is different. Packetpig analyzes full packet captures – that is, logs of every single packet sent across your network – after the fact. In contrast to existing IDS, this means that by using Hadoop on full packet captures, Packetpig can detect ‘zero day’ or unknown exploits in historical data as new exploits are discovered. Which is to say that Packetpig can determine whether intruders are already in your network, for how long, and what they’ve stolen or abused.

Packetpig is a Network Security Monitoring (NSM) toolset where the ‘Big Data’ is full packet captures. Like a TiVo for your network, through its integration with Snort, p0f and custom Java loaders, Packetpig does deep packet inspection, file extraction, feature extraction, operating system detection, and other deep network analysis. Packetpig’s analysis of full packet captures focuses on providing as much context as possible to the analyst. Context they have never had before. This is a ‘Big Data’ opportunity.

Full Packet Capture: A Big Data Opportunity

What makes full packet capture possible is cheap storage – the driving factor behind ‘big data.’ A standard 100Mbps internet connection can be cheaply logged for months with a 3TB disk. Apache Hadoop is optimized around cheap storage and data locality: putting spindles next to processor cores. And so what better way to analyze full packet captures than with Apache Pig – a dataflow scripting interface on top of Hadoop.

In the enterprise today, there is no single location or system to provide a comprehensive view of a network in terms of threats, sessions, protocols and files. This information is generally distributed across domain-specific systems such as IDS Correlation Engines and data stores, Netflow repositories, Bandwidth optimisation systems or Data Loss Prevention tools. Security Information and Event Monitoring systems offer to consolidate this information but they operate on logs – a digest or snippet of the original information. They don’t provide full fidelity information that can be queried using the exact copy of the original incident.

Packet captures are a standard binary format for storing network data. They are cheap to perform and the data can be stored in the cloud or on low-cost disk in the Enterprise network. The length of retention can be based on the amount of data flowing through the network each day and the window of time you want to be able to peer into the past.
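To make the retention sizing concrete, here is a back-of-the-envelope sketch in Python using the figures from this post: a 100Mbps link and a 3TB disk. The function names are hypothetical. It assumes decimal units (1TB = 10^12 bytes) and ignores per-packet capture-file overhead. The key variable is average link utilization, which the post does not state.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day(link_mbps, avg_utilization):
    """Bytes captured per day on a link running at the given average utilization."""
    return link_mbps * 1e6 / 8 * avg_utilization * SECONDS_PER_DAY


def days_of_retention(disk_tb, link_mbps, avg_utilization):
    """How many days of full packet capture fit on the disk."""
    return disk_tb * 1e12 / bytes_per_day(link_mbps, avg_utilization)


def implied_utilization(captured_tb, link_mbps, days):
    """Average utilization implied by a capture of known size and duration."""
    return captured_tb * 1e12 / (bytes_per_day(link_mbps, 1.0) * days)


if __name__ == "__main__":
    # A saturated 100Mbps link fills 3TB in under three days...
    print(f"{days_of_retention(3, 100, 1.0):.1f} days at 100% utilization")
    # ...while a lightly used one lasts for months.
    print(f"{days_of_retention(3, 100, 0.03):.0f} days at 3% utilization")
    # Six weeks of capture totalling 3TB implies roughly 6-7% average utilization.
    print(f"{implied_utilization(3, 100, 42):.1%} average utilization")
```

The last figure corresponds to the customer data set described later in this post (3TB over six weeks on a 100Mbps connection), which suggests that "months on a 3TB disk" holds for links that are, on average, only lightly loaded.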

Pig, Packetpig and Open Source Tools

In developing Packetpig, Packetloop wanted to provide free tools for the analysis of network packet captures spanning weeks, months or even years. The simple questions of capture and storage of network data had been solved, but no one had addressed the fundamental problem of analysis. Packetpig uses the Hadoop stack for analysis, which solves this problem.

For us, wrapping Snort and p0f was a bit of a homage to how much security professionals value and rely on open source tools. We felt that if we didn’t offer an open source way of analysing full packet captures we had missed a real opportunity to pioneer in this area. We wanted it to be simple, turn key and easy for people to take our work and expand on it. This is why Apache Pig was selected for the project.

Understanding your Network

One of the first data sets we were given to analyse was a 3TB data set from a customer. It was every packet in and out of their 100Mbps internet connection for 6 weeks. It contained approximately 500,000 attacks. Making sense of this volume of information is incredibly difficult with current tooling. Even Network Security Monitoring (NSM) tools have difficulty with this size of data. However, it’s not just size and scale. No existing toolset allowed you to provide the same level of context. Packetpig allows you to join together information related to threats, sessions, protocols (deep packet inspection) and files, as well as geolocation and operating system detection information.

We are currently logging all packets for a website for six months. This data set is currently around 0.6TB, and because all the packet captures are stored in S3 we can quickly scan through the dataset. More importantly, we can run a job nightly or every 15 minutes to correlate attack information with other data from Packetpig to provide as much context as possible for security events.

Items of interest include:

  • Detecting anomalies and intrusion signatures
  • Learning the timeframe and identity of an attacker
  • Triaging incidents
  • “Show me packet captures I’ve never seen before.”

“Never before seen” is a powerful filter and isn’t limited to attack information. First introduced by Marcus Ranum, “never before seen” can be used to rule out normal network behaviour and only show sources, attacks, and traffic flows that are truly anomalous. For example, think in terms of the outbound communications from a Web Server. What attacks, clients and outbound communications are new or have never been seen before? In an instant you get an understanding that you don’t need to look for the normal, you are straight away looking for the abnormal or signs of misuse.
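At its core, a “never before seen” filter is a set difference between what a baseline period contained and what a new period contains. The sketch below shows the idea in plain Python for a web server's outbound connections. It is not Packetpig's implementation (Packetpig is built on Pig), and the record layout and sample addresses are hypothetical.

```python
def never_before_seen(baseline, current, key=lambda record: record):
    """Return records from `current` whose key never appeared in `baseline`.

    Each new key is reported once, in the order it first shows up.
    """
    seen = {key(record) for record in baseline}
    fresh = []
    for record in current:
        k = key(record)
        if k not in seen:
            seen.add(k)
            fresh.append(record)
    return fresh


if __name__ == "__main__":
    # Hypothetical outbound flows from a web server: (source, destination, port).
    last_month = [
        ("10.0.0.5", "10.0.0.20", 3306),    # database
        ("10.0.0.5", "10.0.0.53", 53),      # DNS
    ]
    today = [
        ("10.0.0.5", "10.0.0.20", 3306),
        ("10.0.0.5", "203.0.113.9", 6667),  # new destination and port
        ("10.0.0.5", "203.0.113.9", 6667),  # repeat, reported only once
    ]
    # Key on destination and port so that only new kinds of traffic surface.
    for flow in never_before_seen(last_month, today, key=lambda f: (f[1], f[2])):
        print("never before seen:", flow)
```

The choice of key controls what counts as "new": keying on destination and port surfaces new outbound services, while keying on an attack signature or client fingerprint would surface new attacks or new clients instead.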

Agile Data

Packetloop uses the stack and iterative prototyping techniques outlined in the forthcoming book by Hortonworks’ own Russell Jurney, Agile Data (O’Reilly, March 2013). We use Hadoop, Pig, Mongo and Cassandra to explore datasets and help us encode important information into d3 visualisations. Currently we use all of these tools to aid in our research before we add functionality to Packetloop. These prototypes become the palette our product is built from.

Insights from DataWeek: San Francisco

I spent some time at the first ever DataWeek in San Francisco last week.  It is a brand new show and it was very well-run, spread across a few cool spaces with an interesting mix of novice to experienced data professionals.  They had a good blend of labs, speakers, panels and great networking opportunities.  In all, it was great and a big thanks and kudos to the organizers.

I took part in a panel and also presented a three-hour overview of Hadoop.  There were some good questions thrown at the panel but more interesting was the discussion over the three sessions.  Before each presentation, I ran an informal survey of the room to get a sense of audience and there was an even mix of complete novice, those new to Hadoop and experienced practitioners.

Each session had lively discussion and great engagement.  There were three segments to the presentation: Hadoop market overview, Intro to Hadoop, Hadoop usage patterns.  I would also say that, in general there were three key points that the audience really seemed to focus on.

Forest/Trees :: Distribution/Project
There are Hadoop distributions and there is the Apache Hadoop project. When you are new to this world and learning through all the media, you can get lost in this terminology, and clarifying this point seemed important to some of the DataWeek crowd.

The conversation went a little like this… the Apache Hadoop project comprises MapReduce and HDFS. Sometimes we refer to this as “core Hadoop” as it is the central focus of a Hadoop project. It provides redundant and reliable storage and distributed processing or compute. In order for Hadoop, the project, to become a more complete data platform, we, the community, have created several related projects that make Hadoop more useful and dependable. When we package these projects (Hive, HBase, Pig, HCatalog, Ambari, ZooKeeper, Oozie, etc…) with core Hadoop, this becomes a “distribution”.

A distribution came about because each project has its own release cycle and getting the right versions together is sometimes difficult.  Also, a distribution will package the projects and provide an installer to make deployment much easier.

Insatiable Thirst for Use Cases
Design Patterns by Gamma et al. is and always will be one of the best developer books written. I like design patterns because they take a lot of data and boil it down to its naturally occurring state. They make sense of chaos.

In the third hour of our overview, we presented some reusable patterns of use for Hadoop, namely Refine, Explore and Enrich. With refine, we apply a known process to a set of big data to extract results and use them in a business process. With explore, we use Hadoop to discover new information that was not attainable before. Often with explore, we will operationalize findings to be used in the refine pattern. Finally, with enrich, we use big data to supplement and improve the user experience of an online application.

This session was scheduled for 45 minutes and went the full hour and beyond.  There were a LOT of questions and interactions.  The material was well received by the experienced professionals as it made sense of their projects and for those new to Hadoop it provided a good sense of where to start or how to approach this big data thing.

We Face Challenges
It seemed everyone wants to get started but is faced with challenges. There were really three areas of focus in this discussion: acquiring skills, managing a cluster and building a business case. The business case and validation of a project was interesting, as some said you should just start with a project and run with it, while others advocated careful planning and a formal process. I guess in the end both sides were right.

It depends on your org and what they can stomach, really. I will add my two cents, however… Hadoop is open source and available to you today, so use it and start addressing all three of the challenges in the immediate future.

As noted, DataWeek was a huge success and I am honored to have taken part in what surely will be a regular event. Congrats to the organizers on the birth of a new show.

Miss Piggy Takes Manhattan: Pig Meetup at Strata NYC on Wed, Oct 24th

There will be a Pig meetup at Strata NYC/Hadoop World, at 6:30PM on Wed, Oct 24th in the Bryant Room of the Hilton New York. This will also be the inaugural meeting of the NYC Pig User Group, which Doug Daniels of Pig contributor Mortar Data was good enough to organize. We look forward to future Pig meetups in NYC!

Hortonworks’ own Daniel Dai @daijy, VP of Apache Pig, will present on new features in Pig 0.11. You can view a summary of JIRA tickets for Pig 0.11 here. New features include the CUBE operator, a new RANK operator, the addition of a DateTime type, speed improvements via SchemaTuple, and many others.

More information is available on the Pig meetup page: http://www.meetup.com/PigUser/events/85047782/.

Those of you too young to understand the Miss Piggy reference should look here.

InfoQ: Hadoop and Metadata (Removing the Impedance Mis-match)

InfoQ has an article out today on HCatalog by Hortonworks’ own Alan Gates and Russell Jurney.

Apache Hadoop enables a revolution in how organizations process data, with the freedom and scale Hadoop provides enabling new kinds of applications, building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has, however, posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make up the data center as a computer?

Check out the article at InfoQ: http://www.infoq.com/articles/HadoopMetadata

Hadoop Features Large at Stanford XLDB

Hadoop featured prominently at Stanford’s annual XLDB conference last week, as representatives from academia and industry gathered to discuss Extremely Large Databases. The conference program, with slides, is available: http://www-conf.slac.stanford.edu/xldb2012/ProgramC.asp. A highly technical lineup presented on Big Data in biology and physics, with cloud computing and Hive in particular featuring as topic areas.

Hortonworks’ own Ashutosh Chauhan @ashutoshchauhan, an Apache Pig, Hive and HCatalog committer, presented ‘Hive vs Pig: Similarities and Differences‘ (slides).

Twitter Analytics Presents Hadoop and Pig at UC Berkeley

Twitter Analytics presented their distributed infrastructure, including Hadoop and Pig, at a UC Berkeley iSchool special course called INFO 290: Analyzing Big Data with Twitter. Twitter is a major contributor to many Apache projects. The course was over-subscribed and was a great success, as students got to learn from practicing data scientists using Hadoop on truly massive datasets. The entire lecture series is available here.

Bill Graham @billgraham, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented an Introduction to Hadoop. His slides are available here. His presentation gives a comprehensive introduction to Apache Hadoop including its history, motivation, practice and operation.

Jonathan Coveney @jco, a Data Systems Engineer at Twitter Analytics and Apache Pig committer, presented Pig at Twitter. Slides for this presentation are available here. His presentation gives a comprehensive explanation of Apache Pig‘s philosophy, use and intricacies. It is one of the most thorough introductions to Pig I’ve seen and will serve as excellent documentation for beginners and intermediate Pig users alike.

Hats off to Twitter for their contribution to Apache open source and education. More Pig talks and papers are available on the Pig Confluence here.

Recap of Hadoop Summit 2012

I wanted to take this opportunity to say thanks to the more than 2,200 attendees, speakers and sponsors who helped make Hadoop Summit 2012 a great success. There was tremendous buzz throughout the conference, exceeding the excitement levels of all past Hadoop conferences. It’s a great indicator for the future of Apache Hadoop and the broader big data ecosystem.

The content from this conference was outstanding, from the opening keynotes to the last round of breakout sessions. I wanted to thank the track chairs (Abhishek Mehta, Ashish Thusoo, Avik Dey, Ben Reed, Peter Sirota and Val Bercovici) for making the hard decisions that led to such an outstanding agenda. I thought the group did a great job selecting the right mix of technical, use case and best practices sessions for developers, operators and analysts. I would also like to thank the more than 110 speakers for putting in the time and effort to share their Apache Hadoop experiences.

All of the sessions at this year’s conference were recorded, and we are in the process of editing these videos for placement on the Hadoop Summit website. We have also posted most of the slides. Simply visit the Sessions page to access the slides and recordings.

I am pleased to announce that all of the keynote session recordings are now available. These include compelling presentations from the following speakers:

Geoffrey Moore (author of “Crossing the Chasm” and “Escape Velocity”)

Scott Burke (SVP, Advertising & Data, Yahoo!)

Dr. Philip Shelley (CTO, Sears)

Scott Gnau (VP and GM of R&D, Teradata)

Shaun Connolly (VP of Corporate Strategy, Hortonworks)

Eric Baldeschwieler (CTO, Hortonworks)

Also, if you have not yet seen the introductory video from Hadoop Summit, I strongly encourage you to watch it now (below). I have heard from quite a few folks that this video got them even more excited about the role they have played in the Apache Hadoop ecosystem.

(click HERE for a full screen version on Vimeo)

On behalf of this year’s co-hosts Hortonworks and Yahoo!, let me again thank everyone for their role in making Hadoop Summit 2012 such a success. Because of the emergence of Apache Hadoop as the foundation of the next generation enterprise data architecture, I have no doubt that next year’s conference will be even bigger and better. I can’t wait.

~ John Kreisa

Hortonworks @ TheCUBE

By any measure, last week’s Hadoop Summit was a tremendous success. It brought together more than 2,200 people from throughout the Apache Hadoop ecosystem to share Hadoop knowledge, ideas, best practices, and interesting use cases. It was also a great chance for big data vendors to make announcements and demonstrate new and exciting solutions.

For those of you who missed the conference, or missed a particularly interesting presentation, we have some good news. Each of the 90+ keynotes and breakout sessions was recorded, and we will be posting these sessions online at hadoopsummit.org over the coming days once the editing is completed.

In the meantime, I would like to draw your attention to TheCUBE videos featured on SiliconAngle TV. As conference organizers, we were very fortunate to be able to support the team from TheCUBE, including John Furrier (@furrier) and Jeff Kelly (@jeffreyfkelly). They did an outstanding job of streaming interviews with many of the industry thought leaders and providing some excellent insight into the conference happenings for those who could not attend. These sessions are all now available via their website.

My Review of Hadoop Summit 2012

The fifth annual Hadoop Summit drew to a close last week, with over 2200 Hadoopniks in attendance. While there were many innovations demonstrated, for me the best action was about Pig, HCatalog and Hive from Hortonworks and Twitter.

At the Hadoop Summit Pig Meetup, Twitter announced Ambrose, which now includes an excellent graph layout of Pig EXPLAIN data. This visualization can be used to debug and better understand your Pig scripts.

An Advance Look at Hadoop Summit

Hadoop Summit is just around the corner, and by that I mean next week! There is still time to register for the conference, but please do it soon as the conference is filling up quickly. Today is also the last day on which online registration will remain open. After today, you will need to register on-site at the conference itself.

This year’s Hadoop Summit conference, now in its fifth year, promises to be the biggest and best yet. In fact, there are already more people registered for Hadoop Summit 2012 than any other Hadoop conference ever!

I wanted to take this opportunity to share some of the highlights for next week’s conference:

Geoffrey Moore and Other Compelling Keynote Speakers:

Geoffrey Moore, author of “Crossing the Chasm” and “Escape Velocity”, will share his views on “Digitizing the World, the Driving Force Behind Apache Hadoop’s Adoption Life Cycle”. You will also hear from other industry luminaries, who will share their vision for where Apache Hadoop is going and how it is destined to become the foundation for the next generation enterprise data platform.
