Category Archives: Hortonworks Topics


Hadoop Summit Europe Call for Papers Ends this Friday, November 30th

The Hadoop Summit Europe official call for papers ends this Friday, November 30th – so be sure to get your session submissions in this week!

Hadoop Summit Europe is March 20, 21 at the Beurs van Berlage in Amsterdam, Netherlands. You still have time to submit an abstract now!

The four content tracks are:

Applied Hadoop

Sessions in this track focus on applications, tools, algorithms and data science as well as areas of advanced research and emerging applications that use and extend the Hadoop platform. Sessions will cover examples of innovative data processing applications and algorithms for performing the most common statistical analysis as well as supporting the latest advances in artificial intelligence and machine learning.

Operating Hadoop

This track focuses on the deployment and operations of Hadoop clusters with an emphasis on tips, tricks, and best practices. Sessions will cover the full deployment lifecycle from installation, configuration, and initial production deployment to large-scale roll out. Reference architectures that maximize performance while minimizing costs will also be covered.

Hadoop Futures

This track takes a technical look at the key open source projects and research efforts driving innovation in and around the Hadoop platform. Attendees will hear from the technical leads, committers, and expert users who are actively driving the roadmaps, key features, and advanced technology research.

Integrating Hadoop

For many, Hadoop success will largely depend on the ability to integrate with existing data-driven and data management technologies. Whether the interaction is streaming, batch or real time, these integration points are what expose the value of Hadoop to the rest of the enterprise. This track focuses on Hadoop + enterprise (in particular databases, data warehouses, NoSQL, etc.). Sessions will explore these key integration points and will provide deployment and production examples of successful Hadoop integration within the enterprise today.

Agile Data European Megatour, then Home to Atlanta!

Agile Data hits the road this month, crossing Europe with the good news about Hadoop and teaching Hadoop users how to build value from data by using Hadoop to build analytics applications.

We’ll be giving out discount coupons to Hadoop Summit Europe, which is March 20-21st in Amsterdam!

  1. 11/3 – Agile Data @ The Warsaw Hadoop Users Group
  2. 11/5 to 11/6 – Attending ApacheCon Europe 2012 in Sinsheim, Germany. Say hello!
  3. 11/7 – Agile Data @ The France Hadoop Users Group in Paris
  4. 11/8 – Agile Data @ Netherlands Hadoop Users Group in Utrecht
  5. 11/12 – Agile Data @ Hadoop Users Group UK in London.
  6. 11/13 – Agile Data @ HP Labs in Bristol, England.
  7. 11/15 – Agile Data @ The combined Data Science ATL / Atlanta Hadoop Users Group
  8. 11/16 – Agile Data @ The Emory Library
  9. 11/19 – Agile Data @ The Atlanta MongoDB Users Group

I’m writing this from Warsaw, the first stop on my tour. This is my first time in Poland, and I’m excited to be speaking tonight at the Warsaw HUG and look forward to hearing about Hadoop in Poland. Tomorrow I’ll be checking out the sights, so let me know if you’d like to volunteer as tour guide in exchange for free, on-the-spot consulting!

You can view the incomplete book on O’Reilly OFPS here – I’ll be updating it daily for the next three weeks, so check Chapter 10, where I use Pig to build a graphical model in an attempt to improve my wife’s response rate to my emails :). Code examples for the book are available here on GitHub.

If you can’t make one of the talks, check out the slides below from my DC-HUG presentation, and help spread the good news!



DINOSAURS ARE REAL: Microsoft WOWs audience with HDInsight at Strata NYC (Hortonworks Inside)

You don’t see many demos like the one given by Shawn Bice (Microsoft) today in the Regent Parlor of the New York Hilton, at Strata NYC. “Drive Smarter Decisions with Microsoft Big Data,” was different.

For starters – everything worked like clockwork. Live demos of new products are notorious for failing on-stage, even if they work in production. And although Microsoft was presenting about a Java-based platform at a largely open-source event… it was standing room only, with the crowd overflowing out the doors.

Shawn demonstrated working with Apache Hadoop from Excel, through Power Pivot, to Hive (with sampling-driven early results!?) and out to import third party data-sets. To get the full effect of what he did, you’re going to have to view a screencast or try it out but to give you the idea of what the first proper interface on Hadoop feels like…

There was a comedian who had a bit about… remember when you saw Jurassic Park for the first time? No matter how old you were, your child-like response was, “DINOSAURS ARE REAL!!!!!!$!!$##!” Our reaction to Jurassic Park was CGI technology disrupting cinema, provoking the same kind of reaction early cinema had on viewers who felt real concern that the horse or train approaching would run them over. At least that’s what I learned wasting a lottery-funded academic scholarship on film classes at a state university before having the good sense to fail out and use my time productively.

That feeling you got when you saw your first CGI raptor is what Microsoft’s demo was like, except it went… “HADOOP IS IN EXCEL!!$%!%!%!$????!!!”

This is a serious thing for me, because I hooked up Pig and Excel years ago:

Which is a crappy demo of Hadoop connecting to Excel, but which gives me mucho moral authority to state that Microsoft’s demo was the right way to hook data to Excel. Take it from someone that spent half of his twenties trying to build web applications that could compete against Excel: until data is in Excel… it ain’t real. With Microsoft’s new offering… big data just got real.

To put this into perspective:

And just so you know I’m not bullshitting you about Hadoop and Big Data and Raptors and next thing you know you’re checking for your wallet and nodding awkwardly and trying to find a pause in this lunatic rant to get the hell out of here, I’ll just come out and tell you:

I have a raptor named lame-o-saurus in a Cowboy Curtis hat permanently tattooed on my body. Again, we resort to visualization (mind the hair):

To summarize:

  1. I am the world’s primary authority on the wrong way to hook Hadoop to Excel.
  2. I have strange tattoos which affirm the validity of my metaphors.
  3. Microsoft has fundamentally altered Big Data with their HDInsight offering.
  4. Yesterday, a breakthrough happened in the Regent Parlor of the Hilton, NYC.

Visicalc… we’ve come such a long way.

Why Microsoft is committed to Hadoop and Hortonworks

This guest blog post is from Microsoft’s Dave Campbell, providing more details on why they chose Hortonworks for HDInsight.

Last February at Strata Conference in Santa Clara we shared Microsoft’s progress on Big Data, specifically working to broaden the adoption of Hadoop with the simplicity and manageability of Windows and enabling customers to easily derive insights from their structured and unstructured data through familiar tools like Excel.

Hortonworks is a recognized pioneer in the Hadoop Community and a leading contributor to the Apache Hadoop project, and that’s why we’re excited to announce our expanded partnership with Hortonworks to give customers access to an enterprise-ready distribution of Hadoop that is 100 percent compatible with Windows Server and Windows Azure.  To provide customers with access to this Hadoop compatibility, yesterday we also released new previews of Microsoft HDInsight Server for Windows and Windows Azure HDInsight Service, our Hadoop-based solutions for Windows Server and Windows Azure.

With this expanded partnership, the Hadoop community will reap the following benefits of Hadoop on Windows:

  • Insights to all users from all data: Analyze unstructured Hadoop data with familiar tools like Excel.  Through integration with award-winning Microsoft BI tools such as PowerPivot and Power View, HDInsight enables analysis of all your data (structured or unstructured), including data on Linux.
  • Enterprise-ready Hadoop with HDInsight: Offering the most reliable, innovative and trusted distribution available.  Microsoft and Hortonworks together deliver tighter security through integration with Windows Server Active Directory, ease of management through System Center integration, and built-in high availability with Hortonworks Data Platform 1.1. Additionally, harness your existing .NET and JavaScript developers with rich developer frameworks that enable them to write and deploy MapReduce jobs.
  • Simplicity of Windows for Hadoop: Microsoft HDInsight Server for Windows Server significantly simplifies setup and provisioning of Hadoop through streamlined packaging, so you don’t need to choose and test the right Hadoop projects on your own.  In the cloud, Windows Azure HDInsight Service simplifies deployment so much that you can now set up a 16-node Hadoop cluster in only 10 minutes!  System Center simplifies management through integration with the Apache Ambari project.  With this integration, IT Operators can manage their Hadoop clusters side-by-side with their databases, applications and other IT assets on a single pane of glass.
  • Extend your data warehouse with Hadoop: HDP 1.1 improves integration of Hadoop with relational Data Warehouses with HCatalog.  This provides SQL-like language access to Hadoop so that customers can enrich their analysis by including insights from Hadoop environments into the Enterprise Data Warehouse and BI systems.  Additionally, Microsoft enables customers to extend their Enterprise Data Warehouses with Hadoop connectors for SQL Server and Parallel Data Warehouse appliance.
  • Seamless Scale and Elasticity of the Cloud: Microsoft offers HDInsight both in the cloud and on-premise, with seamless migration across the two environments based on your needs. The cloud service offers elastic scalability, a simplified deployment and management experience and a low-cost way to experiment with Hadoop. Deploying Microsoft HDInsight Server on Windows Server provides enterprise-class security through integration with Active Directory, simplified management with System Center management and availability with a trusted and reliable Hadoop distribution.

This is a very exciting milestone, and we hope you’ll join us for the ride as we continue partnering with Hortonworks to democratize big data.  Download HDInsight today at Microsoft.com/BigData.

Strata NYC Reporting: Monday @ Big Data Camp, Tuesday @ Strata Retrospective

This is Russell Jurney, your Big Data reporter on the ground here at Strata NYC/Hadoop World at the New York Hilton. Monday night’s main event was Big Data Camp. As in any unconference, the best action was in the hallway, meeting people you only know by reputation or from twitter. Highlights were:

  • Microsoft’s demonstration of Excel -> Power Pivot -> Hortonworks Data Platform
      • In light of today’s announcement – the Hadoop market just got MUCH bigger :)
  • Druid: Real-Time Analytics at a Billion Rows Per Second by Eric Tschetter, Co-founder of Metamarkets
      • In-RAM stores are an interesting new development as RAM becomes cheaper and cheaper, and can augment a Hadoop-centric workload.
  • The Horrors Hidden in Your Models by Steven Hillion
      • This talk stressed the importance of unit testing your statistical models and finding areas where they fall over, then working with customers to understand the problem. A humorous use-case involving a hoax ‘finger-in-chili’ incident was examined.

Tuesday’s tutorial sessions were great. My favorites were:

Check back tomorrow for coverage of Wednesday’s technical sessions!

Enabling Big Data Insight for Millions of Windows Developers

At Hortonworks, we fundamentally believe that, in the not-so-distant future, Apache Hadoop will process over half the world’s data flowing through businesses. We realize this is a BOLD vision that will take a lot of hard work by not only Hortonworks and the open source community, but also software, hardware, and solution vendors focused on the Hadoop ecosystem, as well as end users deploying platforms powered by Hadoop.

If the vision is to be achieved, we need to accelerate the process of enabling the masses to benefit from the power and value of Apache Hadoop in ways where they are virtually oblivious to the fact that Hadoop is under the hood. Doing so will help ensure time and energy is spent on enabling insights to be derived from big data, rather than on the IT infrastructure details required to capture, process, exchange, and manage this multi-structured data.

So how can we accelerate the path to this vision? Simply put, we focus on enabling the largest communities of users interested in deriving value from big data.

Since one of the world’s most widely used business intelligence tools is Microsoft Excel, and since Microsoft is arguably one of the best companies at enabling and mobilizing large and vibrant developer communities, needless to say we at Hortonworks are excited and bullish on the expansion of our partnership with Microsoft.

Today Microsoft unveiled previews of Microsoft HDInsight Server and Windows Azure HDInsight Service, big data solutions that are built on Hortonworks Data Platform (HDP) for Windows Server and Windows Azure respectively. These new offerings aim to provide a simplified and consistent experience across on-premise and cloud deployment that is fully compatible with Apache Hadoop.

This news represents a significant inflection point for the big data market in general and for the importance of open source Apache Hadoop in particular. Unlocking the Windows Server and Windows Azure markets for Hadoop means more businesses will be able to tap into its benefits.

Moreover, these new offerings represent months of joint engineering work across both the Microsoft and Hortonworks engineering and product teams. Microsoft’s commitment to doing this work in a way that improves open source Apache Hadoop and related Apache projects has been unwavering, which translates into goodness for the open source community.

I encourage you to try out the fruits of our labors in one of two ways:

• Download Microsoft HDInsight Server and play with Hadoop on your own Windows machine.
• Access Windows Azure HDInsight Service and play with Hadoop in the cloud.

I encourage you to go to http://hortonworks.com/partners/microsoft/ in order to learn more and get started!

Finally, check out Microsoft’s announcement for more information! http://blogs.technet.com/b/dataplatforminsider/archive/2012/10/22/simplifying-big-data-for-the-enterprise.aspx

HBase Futures

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Significant work has happened in each of the areas outlined above in the 0.94 and 0.96 (currently trunk) branches. For example, the MTTR (mean time to recover) work happening in HBASE-5843 will improve data availability significantly. HBASE-5305 addresses wire compatibility. HBASE-6055 is the work underway on Snapshots. We believe that by solving the above problems, HBase will gain much wider adoption in the enterprise, and will be considered a very viable option for the use cases it supports.

Doing the above would open HBase to many of the enterprise users, and going forward, we envisage the need for:

  1. Better and improved clients (asynchronous clients, and clients in multiple languages)
  2. Cell-level security (access control for every cell in a table)
  3. Multi-tenancy (HBase becomes a viable shared platform for multiple applications using it)
  4. Secondary indexing functionality

The above are some of the areas that Hortonworks is investing in as well. Stay tuned for further updates on these topics.

HBase at Hortonworks: An Update

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform.  HBase enables a host of low latency Hadoop use-cases. As a publishing platform, HBase exposes data refined in Hadoop to outside systems. As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways.  We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh.  This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0). Here are some notable ones:

  1. HBASE-4128 – Data Block Encoding of KeyValues (aka delta encoding / prefix compression) [PERFORMANCE]
  2. HBASE-4465 – Lazy-seek optimization for StoreFile scanners [PERFORMANCE]
  3. HBASE-5074 – support checksums in HBase block cache [PERFORMANCE]
  4. HBASE-5128 – [uber hbck] Online automated repair of table integrity and region consistency problems [OPERABILITY]
  5. HBASE-3584 – Allow atomic put/delete in one call [FEATURE]
  6. HBASE-5229 – Provide basic building blocks for “multi-row” local transactions [FEATURE]
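To give a feel for what prefix compression of KeyValues (HBASE-4128) is after, here is a minimal Python sketch. It is not HBase's encoder or its on-disk format; the function names `prefix_encode` and `prefix_decode` and the sample keys are hypothetical. It only shows the general idea: sorted keys tend to share long prefixes, so each key can be stored as the length of the prefix it shares with the previous key plus the remaining suffix.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    previous = b""
    for key in sorted_keys:
        shared = common_prefix_len(previous, key)
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    previous = b""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


# Hypothetical row/column keys, sorted the way a store file keeps them.
keys = sorted([b"user123/cf:email", b"user123/cf:name", b"user124/cf:email"])
encoded = prefix_encode(keys)
decoded = prefix_decode(encoded)
```

In this toy example the second key is stored as just `(11, b"name")`, because it shares the 11-byte prefix `user123/cf:` with the key before it.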

And 0.94 is only the start.  Expect to see a huge set of additional features, bug fixes, performance and operational improvements to HBase in the coming months.  As more of our customers have deployed HBase it has become an increasingly important component of HDP 1.  As a result, we’ve really been ramping up our investment in HBase this year, with a focus on enhancing HBase stability and operability.  What follows is a summary of Hortonworkers’ recent HBase contributions.

1. Reliability improvements

We have established an automated test harness for testing HBase on a nightly basis. The harness involves automated deployment of HBase with a ‘production like’ configuration. After the cluster has been set up, a few heavy duty jobs are run. This has uncovered numerous bugs in the 0.92.x line.

Some of them are:

  • HBASE-5986: Clients can see holes in the META table when regions are being split
  • HBASE-6160: META entries from daughters can be deleted before parent entries
  • HBASE-6679: RegionServer aborts due to race between compaction and split
  • HBASE-6060: Regions’s in OPENING state from failed regionservers takes a long time to recover
  • HBASE-6649: TestReplication.queueFailover occasionally fails [Part-1]
  • HBASE-6758: The replication-executor should make sure the file that it is replicating is closed before declaring success on that file

2. Test Infrastructure Improvements

One of the biggest needs in the community is a good testing framework for HBase. As HBase is becoming more popular as a NoSQL data store, we need to make sure that the system is highly available and reliable in the face of common node failures, and that it is able to withstand the intense, high stress workloads users expect in production environments.

Towards this end we have been building an automated test framework inspired by Netflix’s ChaosMonkey tool. It can run a series of tests while killing and restarting HBase servers, and validate that the test results are correct. This brings to the fore the availability and reliability aspects of the system. For example, if a RegionServer is killed, another RegionServer or a set of RegionServers should pick up the data that the killed RegionServer was serving.
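The kill-and-validate loop described above can be illustrated with a toy simulation. This is a sketch only, in Python rather than the framework's own code: `ToyCluster`, `chaos_run` and the round-robin reassignment are hypothetical stand-ins, not part of HBase or its test framework. The point is the shape of the test: kill a server, check that every region is still served by a live server, restart, repeat.

```python
import random


class ToyCluster:
    """A toy stand-in for a cluster: maps each region to the server hosting it."""

    def __init__(self, servers, regions):
        self.live_servers = set(servers)
        self.assignment = {}
        for i, region in enumerate(regions):
            self.assignment[region] = servers[i % len(servers)]

    def kill(self, server):
        """Kill a server and hand its regions to the surviving servers."""
        self.live_servers.discard(server)
        if not self.live_servers:
            raise RuntimeError("no live servers left")
        survivors = sorted(self.live_servers)
        orphaned = sorted(r for r, s in self.assignment.items() if s == server)
        for i, region in enumerate(orphaned):
            self.assignment[region] = survivors[i % len(survivors)]

    def restart(self, server):
        """Bring a server back; it rejoins with no regions assigned."""
        self.live_servers.add(server)

    def all_regions_served(self):
        """True if every region is currently assigned to a live server."""
        return all(s in self.live_servers for s in self.assignment.values())


def chaos_run(cluster, rounds, seed=0):
    """Repeatedly kill a random live server, validate, then restart it."""
    rng = random.Random(seed)
    for _ in range(rounds):
        victim = rng.choice(sorted(cluster.live_servers))
        cluster.kill(victim)
        if not cluster.all_regions_served():
            return False
        cluster.restart(victim)
    return True


cluster = ToyCluster(["rs1", "rs2", "rs3"], [f"region-{i}" for i in range(9)])
ok = chaos_run(cluster, rounds=20)
```

A real run would of course drive an actual cluster and check data correctness, not just assignment; the simulation only captures the invariant being tested.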

Using the APIs provided by this testing framework, one can convert many of the tests in the HBase codebase to run in either unit test mode or in this new challenging “real cluster mode”. The test framework is part of the HBase codebase (via HBASE-6241), and many candidate tests have been identified that can be ported to use the new framework.

For details, please visit HBASE-6201. Slides are available here.

3. Windows Port of HBase

The Microsoft Windows port and certification of HBase is an ongoing joint development effort involving Hortonworks and Microsoft engineers.  We recently reached an important milestone, getting all of the hbase-0.94 unit tests passing on Windows. Work is underway to commit all the patches to HBase mainline under the umbrella jira HBASE-6814. We are well on the way to our goal of having HBase run equally well on Windows and Unix, opening up the Apache HBase community to a whole new universe of potential users and contributors.

4. HBase with NameNode HA setup and validation

We’ve been working to validate that HBase runs well with the new Apache Hadoop 1.0 HA features.  The HBase HA testing blog is here.

5. The wire-compatibility work targeted for 0.96.x release.

We have done substantial work to move all protocols in HBase including the RPC implementation to use Google’s Protocol Buffers. Most of the work is captured in this umbrella jira – HBASE-5305.

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base.  When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase.  HBase is maturing fast, becoming both more operationally reliable and more feature rich.

Hadoop Summit Expands to Europe in 2013!

This will be the first and the largest European conference focused exclusively on accelerating the enterprise adoption of Apache Hadoop. The event will be a gathering for the vibrant Apache Hadoop community of developers, data scientists, data professionals and solution providers and will be held at the historic Beurs van Berlage in Amsterdam on March 20-21, 2013.

Call for papers now open!

Apache Hadoop practitioners, enthusiasts and solution providers with an idea for a talk at the event can submit their ideas now on the call for papers page. All accepted speakers will receive complimentary admission to the event.

For more information on Hadoop Summit Europe, go to: http://hadoopsummit.org/amsterdam.

Remember to follow us on Twitter and Facebook for future updates!

We hope to see you there!

Apache Hadoop YARN Meetup at Hortonworks – ReCap!

Introduction

The Apache Hadoop YARN meetup at Hortonworks on October 12, 2012 we previously announced was a resounding success. We had a very good turnout of around seventy people from the community.

Meetup sessions

Deployments at Yahoo!

The meetup kicked off with YARN committers from Yahoo presenting on current Hadoop 2.0 deployments at Yahoo. In the presentation, they:

  • described scenarios where YARN advanced the state of the art, covering its scalability, its current stability, the power of the YARN web-services, and its superlative performance compared to previous versions.
  • described the efforts undertaken to battle-test YARN, including application validation and performance benchmarking.
  • summed up with suggestions for improving issues such as UI loading and the lack of a generic history server.

Chris Riccomini on “Building Applications on YARN”


Chris Riccomini from LinkedIn then presented about his experience in “Building Applications on YARN”. He briefly covered the anatomy of a YARN application and then jumped into various dimensions a YARN application developer should think about – deployment, metrics, logging, application specific configuration to name a few.

The most interesting bits about his presentation include how, pre-production, small instances of YARN clusters can be utilized to develop applications in an agile manner. For example, one could start with using local file system and avoiding HDFS to minimize the operational effort, and then switch over to a full-blown distributed file system when the desire for scalability crosses a threshold. Also worth attention is how YARN’s web-service APIs can be exploited to build custom dashboards.

Chris posted his notes from the meetup and slides on his blog.

YARN API Discussion

After that, Arun recapped YARN’s powerful scheduling API, which application developers use to request cluster resources. He walked us through the scheduling concepts, and rounded it off with how scheduling happens in the context of an example MapReduce job.

Bikas and I then proceeded to give a brief overview of the APIs available to application developers. We described some of the pain points with the APIs that various users have reported recently, and the efforts underway to address some of them. To enumerate a few:

  • How to make the scheduling logic explicit – e.g., that the scheduler looks for free resources on a node, then proceeds to a rack and then off-rack
  • Multiple ways to release and reject containers
  • Use-cases which require resources on specific nodes and/or racks
  • Applications that want to avoid/blacklist some nodes and/or racks
  • Limitations on the number of threads making resource requests
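The node, then rack, then off-rack fallback mentioned in the first point can be sketched in a few lines. This is an illustration in Python, not YARN's scheduler code or API: `pick_node`, the `rack_of` mapping and the `free_containers` counts are all hypothetical names, and a real scheduler weighs many other factors (queues, capacities, delays) that are ignored here.

```python
from typing import Dict, Optional, Tuple


def pick_node(
    preferred_node: str,
    rack_of: Dict[str, str],
    free_containers: Dict[str, int],
) -> Optional[Tuple[str, str]]:
    """Return (node, locality) for one container request, or None if nothing is free."""
    # 1. Node-local: the preferred node itself has a free container.
    if free_containers.get(preferred_node, 0) > 0:
        return preferred_node, "node-local"

    # 2. Rack-local: any other node on the same rack as the preferred node.
    preferred_rack = rack_of[preferred_node]
    for node in sorted(free_containers):
        if rack_of[node] == preferred_rack and free_containers[node] > 0:
            return node, "rack-local"

    # 3. Off-rack: any node in the cluster with a free container.
    for node in sorted(free_containers):
        if free_containers[node] > 0:
            return node, "off-rack"

    return None


# Hypothetical three-node cluster: n1 and n2 share a rack, n3 is on another.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
free = {"n1": 0, "n2": 1, "n3": 2}
first = pick_node("n1", rack_of, free)
```

Here the request prefers `n1`, which is full, so the sketch falls back to `n2` on the same rack before considering `n3`.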

We opened the API discussion for further feedback. This exercise was very fulfilling. We discovered how various users were experimenting with the APIs and what pitfalls and limitations they ran into. Some concrete suggestions include:

  • Libraries for recovering AMs, launching containers
  • A generic framework for applications to expose specific data via http or web-services.
  • A generic application history server
  • Tagging nodes with labels like GPU etc. and using these labels for scheduling. This is an extension of data locality

Our slides are available here.

Efforts Underway

After a short break, Alejandro Abdelnur from Cloudera briefly talked about the efforts underway to augment YARN with cpu-isolation using cgroups.

Finally, Siddharth Seth from Hortonworks talked about his work on modifying the MR application to reuse containers for jobs both large and small. This exciting development opens up new innovations in MapReduce land, like intermediate output aggregation. You can read through Sid’s presentation below. The core points covered are:

  • Decoupling the TaskAttempt and Container concepts inside MR AM
  • Add new first class concepts of Container, Node and Scheduler
  • The current state of the effort
  • New avenues this transition opens up – custom task types, output aggregation, performance optimizations.

His slides are available here.

Conclusion

The success of this meetup reaffirmed the excitement of the community about YARN. This also strengthened our desire to make it a recurring event. We look forward to the next one, with hopefully more turnout, extended brainstorming, and of course, more pizza and beer :)

Hortonworks & Teradata: More Than Just an Elephant in a Box

Today our partner, Teradata, announced availability of the Teradata Aster Big Analytics Appliance, which packages our Hortonworks Data Platform (HDP) with Teradata Aster on a machine that is ready to plug in and bring big data value in hours.

There is more to this appliance than meets the eye…  it is not just a simple packaging of software on hardware. Teradata and Hortonworks engineers have been working together for months tying our solutions together and optimizing them for an appliance. This solution gives an analyst the ability to leverage big data (social media, Web clickstream, call center, and other types of customer interaction data) in their analysis and all the while use the tools they are already familiar with.  It is analytics and data discovery/exploration with big data (or HDP) inside… all on an appliance that can be operational in hours.

Not just anyone can do this

This is an engineered solution.  Many analytics tools are building their solutions on top of Hadoop using Hive and HiveQL.  This is a great approach but it lacks integration of metadata and metadata exchange.  With the appliance we have extended a new approach using HCatalog and the Teradata SQL-H product.  SQL-H is a conduit that allows new analysis to be created and schema changes to be adopted within Hadoop from Teradata.  Analysts are abstracted completely from the Hadoop environment so they can focus on what they do best… analyze.  All of this is enabled by an innovation provided by HCatalog, which enables this metadata exchange.

Shortcut to Big Data Exploration

In the appliance, Aster provides over 50 pre-built functions that allow analysts to perform segmentation, transformations and even pre-packaged marketing analytics.  With this package, these valuable functions can now be applied to big data in Hadoop.  This shortens the time it takes for an analyst to explore and discover value in big data.  And if the pre-packaged functions aren’t explicit enough, Teradata Aster also provides an environment to create MapReduce functions that can be executed in HDP.

Lighting up operations

Often overlooked when an organization considers Hadoop is the impact on IT operations.  They are tasked with making sure a cluster is functional.  Well, these guys have countless tools to perform their job, and for Teradata they use Teradata Viewpoint and Teradata Vital Infrastructure.  In this release, we have integrated the management and monitoring communications used by Ambari with these monitoring tools. Now, the ops guy has a true single pane of glass to monitor the Teradata environment AND the Hadoop cluster used to provide the big data analytics.

Some details on the appliance

The Teradata Aster Big Analytics Appliance runs on proven Teradata hardware, leverages the most current Intel® processor chip technology, SUSE® Linux operating system, and market-leading enterprise-class storage. It can be configured to store a maximum of 5 petabytes of uncompressed user data for Aster and up to 10 petabytes of uncompressed user data for Hadoop.

“The Teradata Aster Big Analytics Appliance offers the faster path from diverse big data acquisition to big insights, and seamlessly delivers these insights to the business owners. Unmatched by any other stack in the industry, it enables organizations to overcome the barriers to big data analytics and provides a high-definition view of the business to optimize operations.” – Scott Gnau, president, Teradata Labs.

This is unique and it ushers in a new approach to big data analytics.

Big Data in London – Thoughts From the Tube

Hortonworks sponsored the O’Reilly Strata conference earlier this month at the Hilton Metropole in London. It was great meeting big data enthusiasts at the conference. We had fun giving away our little green mascot and came away pleasantly surprised at the state of interest in Big Data in the UK and Europe. There were over 500 attendees, which for a first-time conference is a very good result. Conversations ranged from introductory “What is Apache Hadoop?” to deep discussions regarding how Hadoop was being used in production today. After talking to other vendors, attendees and organizers, it appears that the market is somewhere between 12 and 18 months less mature than the Big Data market in the US. That said, we think adoption could occur more quickly than in the US as the state of the technology and ecosystem evolves heading into 2013. Below are some perspectives from our team at this conference.

Inspiration from the Tube

Riding the tube around London we couldn’t help but take some guidance and inspiration from the prominently placed signs for the “Way Out” and frequent announcements warning travelers to “Mind the Gap”. These signs and notices serve as informal guidance for approaching the Big Data market.

Way Out

As more and more organizations realize that their current systems are at risk of being buried underground by the onslaught of Big Data, many are starting to see that Hadoop offers a Way Out. How, you ask? It gives them a low-cost, scale-out infrastructure: with Hadoop they can cluster commodity servers and storage together to capture, process and exchange data with existing systems. At the same time, a modern enterprise-ready Hadoop platform like the Hortonworks Data Platform enables them to operate these clusters efficiently and effectively, but that is for another post.

Mind the Gap

That said, when selecting a Hadoop platform it is important to Mind the Gaps in the technology and look for a platform that is being deeply integrated with existing enterprise architecture systems. The best solutions to rely on are those created through engineering-level engagements that maximize performance and optimize the interaction between the systems.

Deep technical interest and curiosity

Many of the visitors had technical questions, for which we pulled in our UK R&D person, Steve Loughran, armed with copies of the Hadoop 1.x and trunk source trees. The content of those discussions showed that people are already using Hadoop at scale in parts of Europe and nearby. Indeed, we had conversations with people as far away as Finland and Israel, showing that this conference drew a wide audience – and that those people were building up their skills in the technology and applications of Big Data.

There was also the London and South of England Hadoop community, who tend to know each other from the London HUG events and other workshops. Many of these are drawn from various startups – Last.fm being one of the earliest adopters of Hadoop; Datasift, Mendeley and others now becoming well known. Alongside them: the enterprises with datasets that historically were too big to store cost-effectively: the telcos, the media companies with their advert click-throughs, and the like. These people have the data – and are ramping up the skills to make use of it. For these organizations, bringing up large Hadoop clusters matters – and they’ve realized that Hadoop internals aren’t something they need to know themselves, any more than they need Linux kernel skills. What they do need is Data Science skills: people who know the right questions to ask of that data, how to ask Hadoop for the data to provide the answers, how to interpret those answers – and how to present them.

Many of the Strata topics looked at these problems: cleaning up data, conducting effective A/B tests, and examples of highly effective visualizations of large and near-real-time data sources. One memorable talk from the Formula 1 race team McLaren covered how they had transformed their organization to be data-driven, using the answers from their in-race telemetry and information gleaned about competitors from public sources to shape their thinking. This shows a future for organizations: to follow McLaren, Google and others in not only collecting and analyzing data, but embracing it.

Exciting future for Big Data in Europe

Overall we had many great conversations with attendees regarding their current and, more commonly, future plans for use of Hadoop and other Big Data technologies. Many of the sessions were packed, including a standing-room-only Microsoft talk on current Hadoop-related integration and future plans.

Awareness of Apache Hadoop as a technology was respectable but certainly below that in the US.

Interest in technical and business benefits of Hadoop

Shaun Connolly’s sessions on Hadoop and data warehousing were well attended, as was Steve Loughran’s session on High Availability Hadoop including a live demo.

Finally, Transport for London are themselves participants in the Big Data revolution: their live data feeds of tube, bus and bike-sharing are all there for analysis and integration with other data sources: http://www.tfl.gov.uk/businessandpartners/syndication/16493.aspx. If anyone wants some interesting datasets to learn Pig on, these could be them.

Overall, this was a well-run event that featured interesting keynotes. The community was vibrant and ripe for growth, and we were very honored to be approached by multiple user groups seeking speakers from Hortonworks to talk about big data experiences and expertise from this conference.

Thanks to those who attended our sessions and visited and chatted with us at our booth. For copies of Shaun Connolly’s and Steve Loughran’s presentations, you can access them here and here.

Until next time London, mind the gap.

Apache Hadoop 2.0.2-alpha Released!

It gives me great pleasure to announce that the Apache Hadoop community has voted to release Apache Hadoop 2.0.2-alpha.

This is the second (alpha) release of the next generation release of Apache Hadoop 2.x and comes with significant enhancements to both the major components of Hadoop:

  • HDFS HA (NameNode High Availability) has undergone significant enhancements since the previous release.
  • YARN has undergone significant testing, stabilization and validation, and has been heavily battle-tested since the previous release.

These are exciting times indeed for the Apache Hadoop community – personally, this is very reminiscent of the period in 2009 when we finally saw the light at the end of the tunnel during the stabilization of Apache Hadoop 1.x (then called Apache Hadoop 0.20.x). A déjà vu, if you will – albeit of the pleasant kind! Yes, we have a few miles to clock, but it feels like the hardest part is already behind us. At the time of release, YARN has already been deployed on super-sized clusters with 2,000 nodes and 3,600 nodes (a combined total of about 5,600 nodes) at Yahoo alone*.

Going forward, I have no doubt that we are well on our way to signing off on hadoop-2.x early next year. Now that we have a reasonably stable base, we are heads down wrapping up the last of the feature work, such as:

  • HDFS HA without need for shared storage (already merged into Hadoop trunk sans a couple of design enhancements).
  • YARN ResourceManager availability.
  • YARN scheduling enhancements such as multi-resource scheduling (nearly complete, should be committed soon) and preemption.

Having said that, it’s critical for the developer community to get feedback on hadoop-2.x from the user community to ensure we continue to deliver great software – so, please, do go ahead, download the bits from the Apache Hadoop releases page, try the release and give us your valuable feedback – it’s a personal request! Of course, if you prefer a fully packaged and integrated stack, you can browse to the Hortonworks Downloads page to try Hortonworks Data Platform 2.0 Alpha, which integrates Hadoop 2.0.2-alpha with other important components such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie.

For more information about the HDP 2.0 alpha, you can check out our blog post from yesterday.

Acknowledgements
I’d like to thank everyone who has contributed or continues to contribute to Apache Hadoop – everyone in the community. A special mention for Todd Lipcon for his contributions to HDFS HA and the Yahoo Hadoop team (Robert Evans, Thomas Graves, Daryn Sharp, Jason Lowe and everyone else) for their help in getting YARN to stability and large-scale deployments on their clusters.

*Yahoo is currently running the hadoop-0.23.4 release, which is essentially hadoop-2.0.2-alpha without HDFS high availability.

Hortonworks Data Platform 2.0 Alpha is Now Available for Preview!

We are very excited to announce the Alpha release of the Hortonworks Data Platform 2.0 (HDP 2.0 Alpha).

HDP 2.0 Alpha is built around Apache Hadoop 2.0, which improves availability of HDFS with High Availability for the NameNode along with several performance and reliability enhancements. Apache Hadoop 2.0 also significantly advances data processing in the Hadoop ecosystem with the introduction of YARN, a generic resource-management and application framework to support MapReduce and other paradigms such as real-time processing and graph processing.

In addition to Apache Hadoop 2.0, this release includes the essential Hadoop ecosystem projects such as Apache HBase, Apache Pig, Apache Hive, Apache HCatalog, Apache ZooKeeper and Apache Oozie to provide a fully integrated and verified Apache Hadoop 2.0 stack.

Apache Hadoop 2.0 is well on the path to General Availability, and is already deployed at scale in several organizations; but it won’t get to the current maturity levels of the Hadoop 1.0 stack (available in Hortonworks Data Platform 1.x) without feedback and contributions from the community.

Hortonworks strongly believes that for open source technologies to mature and become widely adopted in the enterprise, you must balance innovation with stability. With HDP 2.0 Alpha, Hortonworks provides organizations an easy way to evaluate and gain experience with the Apache Hadoop 2.0 technology stack, and it presents the perfect opportunity to help bring stability to the platform and influence the future of the technology.

Learn More
Please take a look at the Hortonworks Documentation to learn more about installing and using HDP 2.0 Alpha.

To learn more about Apache Hadoop YARN, read the four-part blog series on the technology by Arun Murthy (Chair of the Apache Hadoop PMC and YARN/MapReduce lead) and the rest of the Hortonworks YARN development team: one, two, three and four.

Download It
You can download the HDP 2.0 Alpha bits from the Hortonworks Download site.

Tell Us About It
Please visit the HDP 2.0 Alpha Forum to ask questions, get help, provide feedback and hear what others are doing with HDP.

Note: This Alpha release is early access and not for production use; you might find some incomplete features or a bit of instability. Support is only available via the Forums.

We are excited about the opportunities that Hadoop 2.0 provides for the future of Hadoop and Big Data. The HDP 2.0 Alpha release is just the beginning. Enjoy!

Teradata Webinar: Business Value with Big Analytics

Back in June we joined Teradata Aster in a webcast “Back to the Future – MapReduce, Hadoop and the Data Scientist” to highlight the benefits of Apache Hadoop and the role that data scientists are playing in big data. You can check out the replay here. The discussion focused on how big data architectures could bring more value to businesses using relational DBMS technology and Hadoop, and how the two can coexist.

On October 17th at 10am PDT, Teradata will host a webcast that builds on the important theme of Hadoop and business value, recognizing that many are deeply involved with discovering the easiest and best way to bring their data to life. Teradata Aster plans to show how executives, analysts and IT managers can leverage breakthrough enterprise-class big analytics solutions to inject innovative analytics into business processes for better data-driven decisions, all while minimizing risk, maximizing ROI and accelerating time-to-value.

Read more or register for this webcast to join speakers Scott Gnau, President, Teradata Labs, Teradata Corporation, and Tasso Argyros, Co-President, Teradata Aster, and get the inside scoop on Teradata Aster’s newest big analytics technology.
